mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
synced 2025-01-15 21:23:23 +00:00
43733 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Linus Torvalds
|
7f73ba68cf |
Thermal control updates for 6.8-rc1
- Add dynamic thresholds for trip point crossing detection to prevent trip point crossing notifications from being sent at incorrect times or not at all in some cases (Rafael J. Wysocki). - Fix synchronization issues related to the resume of thermal zones during a system-wide resume and allow thermal zones to be resumed concurrently (Rafael J. Wysocki). - Modify the thermal zone unregistration to wait for the given zone to go away completely before returning to the caller and rework the sysfs interface for trip points on top of that (Rafael J. Wysocki). - Fix a possible NULL pointer dereference in thermal zone registration error path (Rafael J. Wysocki). - Clean up the IPA thermal governor and modify it (with the help of a new governor callback) to avoid allocating and freeing memory every time its throttling callback is invoked (Lukasz Luba). - Make the IPA thermal governor handle thermal instance weight changes via sysfs correctly (Lukasz Luba). - Update the thermal netlink code to avoid sending messages if there are no recipients (Stanislaw Gruszka). - Convert Mediatek Thermal to the json-schema (Rafał Miłecki). - Fix thermal DT bindings issue on Loongson (Binbin Zhou). - Fix returning NULL instead of -ENODEV during thermal probe on Loogsoon (Binbin Zhou). - Add thermal DT binding for tsens on the SM8650 platform (Neil Armstrong). - Add reboot on the critical trip point crossing option feature (Fabio Estevam). - Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal suspend/resume on AmLogic (Uwe Kleine-König) - Add D1/T113s THS controller support to the Sun8i thermal control driver (Maxim Kiselev) - Fix example in the thermal DT binding for QCom SPMI (Johan Hovold). - Fix compilation warning in the tmon utility (Florian Eckert). - Add support for interrupt-based thermal configuration on Exynos along with a set of related cleanups (Mateusz Majewski). - Make the Intel HFI thermal driver enable an HFI instance (eg. processor package) from its first online CPU and disable it when the last CPU in it goes offline (Ricardo Neri). - Fix a kernel-doc warning and a spello in the cpuidle_cooling thermal driver (Randy Dunlap). - Move the .get_temp() thermal zone callback presence check to the thermal zone registration code (Daniel Lezcano). - Use the for_each_trip() macro for trip points table walks in a few places in the thermal core (Rafael J. Wysocki). - Make all trip point updates (via sysfs as well as from the platform firmware) trigger trip change notifications (Rafael J. Wysocki). - Drop redundant code from the thermal core and make one function in it take a const pointer argument (Rafael J. Wysocki). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8iUSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxKLkP/iDsuDwmhZAjbAu2iftk/8ad8Trm2VoK +9eZ5Eqa8lKEcJLb0RxueTnFT4ppvT/hY99HOG4FM+mCnWeH/Z32N697DhqiUg4v GZUpeOPzxYgsfOOTeuL5XgfrVMgBjJrJunTXmzgAd8lIhTmRbAMVmFVJ18CJO11O RHgqvYznYFi5cywA9/NkG2xkhFB0VDoiTuIiuMMV+pMjqF0d5ooBMkhmjvPQ5Rp9 FjNJ7hqiTamAsDPdULAFqhIGGhKZWWFbh4+S+JPCwBW8nqvxyJpemsm20vrwctJR bSXWQkgkDpWEeg9yrEAOO/Uk9yGd3jiLfkvPBKbK0x/YxGZ4hOYHcbF3cOUvmPYP 5K3ZJ61DNrzB/5S3LY54VYrWmTVRdK6Lk3HYNvfAUYFJZMZ5oMYZLCUmo4SswUdy UUEIY27H7L18eLhP9zCcKo4njdaVG+vXQn/rJIFOpG0k9OElzPs1X8Dp/m9pKQDR rDUsMXqB34NUVrIEhjAgqvwF5xHooW8gykpuJgxwBetA9w8Pls2A/mzLsDY3wgdQ htiANGpKTDqBQSn+HrjzYckv9/R+1tDyTJmEDNZwllA1DJfrOlpCRD2VHRpgTZEA Ldnq0bhyq6RQnousqxhgpYkIAoGaebs9XasRH0YtBG5gIumeWfqeVzmTcM5xdsNB yf6RdQy8QunS =QVyh -----END PGP SIGNATURE----- Merge tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control updates from Rafael Wysocki: "These add support for the D1/T113s THS controller to the sun8i driver and a DT-based mechanism for platforms to indicate a preference to reboot (instead of shutting down) on crossing a critical trip point, fix issues, make other improvements (in the IPA governor, the Intel HFI driver, the exynos driver and the thermal netlink interface among other places) and clean up code. One long-standing issue addressed here is that trip point crossing notifications sent to user space might be unreliable due to the incorrect handling of trip point hysteresis in the thermal core: multiple notifications might be sent for the same event or there might be events without any notification at all. Specifics: - Add dynamic thresholds for trip point crossing detection to prevent trip point crossing notifications from being sent at incorrect times or not at all in some cases (Rafael J. Wysocki) - Fix synchronization issues related to the resume of thermal zones during a system-wide resume and allow thermal zones to be resumed concurrently (Rafael J. Wysocki) - Modify the thermal zone unregistration to wait for the given zone to go away completely before returning to the caller and rework the sysfs interface for trip points on top of that (Rafael J. Wysocki) - Fix a possible NULL pointer dereference in thermal zone registration error path (Rafael J. Wysocki) - Clean up the IPA thermal governor and modify it (with the help of a new governor callback) to avoid allocating and freeing memory every time its throttling callback is invoked (Lukasz Luba) - Make the IPA thermal governor handle thermal instance weight changes via sysfs correctly (Lukasz Luba) - Update the thermal netlink code to avoid sending messages if there are no recipients (Stanislaw Gruszka) - Convert Mediatek Thermal to the json-schema (Rafał Miłecki) - Fix thermal DT bindings issue on Loongson (Binbin Zhou) - Fix returning NULL instead of -ENODEV during thermal probe on Loogsoon (Binbin Zhou) - Add thermal DT binding for tsens on the SM8650 platform (Neil Armstrong) - Add reboot on the critical trip point crossing option feature (Fabio Estevam) - Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal suspend/resume on AmLogic (Uwe Kleine-König) - Add D1/T113s THS controller support to the Sun8i thermal control driver (Maxim Kiselev) - Fix example in the thermal DT binding for QCom SPMI (Johan Hovold) - Fix compilation warning in the tmon utility (Florian Eckert) - Add support for interrupt-based thermal configuration on Exynos along with a set of related cleanups (Mateusz Majewski) - Make the Intel HFI thermal driver enable an HFI instance (eg. processor package) from its first online CPU and disable it when the last CPU in it goes offline (Ricardo Neri) - Fix a kernel-doc warning and a spello in the cpuidle_cooling thermal driver (Randy Dunlap) - Move the .get_temp() thermal zone callback presence check to the thermal zone registration code (Daniel Lezcano) - Use the for_each_trip() macro for trip points table walks in a few places in the thermal core (Rafael J. Wysocki) - Make all trip point updates (via sysfs as well as from the platform firmware) trigger trip change notifications (Rafael J. Wysocki) - Drop redundant code from the thermal core and make one function in it take a const pointer argument (Rafael J. Wysocki)" * tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits) thermal: trip: Constify thermal zone argument of thermal_zone_trip_id() thermal: intel: hfi: Disable an HFI instance when all its CPUs go offline thermal: intel: hfi: Enable an HFI instance from its first online CPU thermal: intel: hfi: Refactor enabling code into helper functions thermal/drivers/exynos: Use set_trips ops thermal/drivers/exynos: Use BIT wherever possible thermal/drivers/exynos: Split initialization of TMU and the thermal zone thermal/drivers/exynos: Stop using the threshold mechanism on Exynos 4210 thermal/drivers/exynos: Simplify regulator (de)initialization thermal/drivers/exynos: Handle devm_regulator_get_optional return value correctly thermal/drivers/exynos: Wwitch from workqueue-driven interrupt handling to threaded interrupts thermal/drivers/exynos: Drop id field thermal/drivers/exynos: Remove an unnecessary field description tools/thermal/tmon: Fix compilation warning for wrong format dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Clean up examples dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Fix example node names thermal/drivers/sun8i: Add D1/T113s THS controller support dt-bindings: thermal: sun8i: Add binding for D1/T113s THS controller thermal: amlogic: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions thermal: amlogic: Make amlogic_thermal_disable() return void ... |
||
Linus Torvalds
|
063a7ce32d |
lsm/stable-6.8 PR 20240105
-----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKUIUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNyHw/+IKnqL1MZ5QS+/HtSzi4jCL47N9yZ OHLol6XswyEGHH9myKPPGnT5lVA93v98v4ty2mws7EJUSGZQQUntYBPbU9Gi40+B XDzYSRocoj96sdlKeOJMgaWo3NBRD9HYSoGPDNWZixy6m+bLPk/Dqhn3FabKf1lo 2qQSmstvChFRmVNkmgaQnBCAtWVqla4EJEL0EKX6cspHbuzRNTeJdTPn6Q/zOUVL O2znOZuEtSVpYS7yg3uJT0hHD8H0GnIciAcDAhyPSBL5Uk5l6gwJiACcdRfLRbgp QM5Z4qUFdKljV5XBCzYnfhhrx1df08h1SG84El8UK8HgTTfOZfYmawByJRWNJSQE TdCmtyyvEbfb61CKBFVwD7Tzb9/y8WgcY5N3Un8uCQqRzFIO+6cghHri5NrVhifp nPFlP4klxLHh3d7ZVekLmCMHbpaacRyJKwLy+f/nwbBEID47jpPkvZFIpbalat+r QaKRBNWdTeV+GZ+Yu0uWsI029aQnpcO1kAnGg09fl6b/dsmxeKOVWebir25AzQ++ a702S8HRmj80X+VnXHU9a64XeGtBH7Nq0vu0lGHQPgwhSx/9P6/qICEPwsIriRjR I9OulWt4OBPDtlsonHFgDs+lbnd0Z0GJUwYT8e9pjRDMxijVO9lhAXyglVRmuNR8 to2ByKP5BO+Vh8Y= =Py+n -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull security module updates from Paul Moore: - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and lsm_set_self_attr(). The first syscall simply lists the LSMs enabled, while the second and third get and set the current process' LSM attributes. Yes, these syscalls may provide similar functionality to what can be found under /proc or /sys, but they were designed to support multiple, simultaneaous (stacked) LSMs from the start as opposed to the current /proc based solutions which were created at a time when only one LSM was allowed to be active at a given time. We have spent considerable time discussing ways to extend the existing /proc interfaces to support multiple, simultaneaous LSMs and even our best ideas have been far too ugly to support as a kernel API; after +20 years in the kernel, I felt the LSM layer had established itself enough to justify a handful of syscalls. Support amongst the individual LSM developers has been nearly unanimous, with a single objection coming from Tetsuo (TOMOYO) as he is worried that the LSM_ID_XXX token concept will make it more difficult for out-of-tree LSMs to survive. Several members of the LSM community have demonstrated the ability for out-of-tree LSMs to continue to exist by picking high/unused LSM_ID values as well as pointing out that many kernel APIs rely on integer identifiers, e.g. syscalls (!), but unfortunately Tetsuo's objections remain. My personal opinion is that while I have no interest in penalizing out-of-tree LSMs, I'm not going to penalize in-tree development to support out-of-tree development, and I view this as a necessary step forward to support the push for expanded LSM stacking and reduce our reliance on /proc and /sys which has occassionally been problematic for some container users. Finally, we have included the linux-api folks on (all?) recent revisions of the patchset and addressed all of their concerns. - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit ioctls on 64-bit systems problem. This patch includes support for all of the existing LSMs which provide ioctl hooks, although it turns out only SELinux actually cares about the individual ioctls. It is worth noting that while Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this patch, they did both indicate they are okay with the changes. - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled at boot. While it's good that we are fixing this, I doubt this is something users are seeing in the wild as you need to both disable IPv6 and then attempt to configure IPv6 labeled networking via NetLabel/CALIPSO; that just doesn't make much sense. Normally this would go through netdev, but Jakub asked me to take this patch and of all the trees I maintain, the LSM tree seemed like the best fit. - Update the LSM MAINTAINERS entry with additional information about our process docs, patchwork, bug reporting, etc. I also noticed that the Lockdown LSM is missing a dedicated MAINTAINERS entry so I've added that to the pull request. I've been working with one of the major Lockdown authors/contributors to see if they are willing to step up and assume a Lockdown maintainer role; hopefully that will happen soon, but in the meantime I'll continue to look after it. - Add a handful of mailmap entries for Serge Hallyn and myself. * tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits) lsm: new security_file_ioctl_compat() hook lsm: Add a __counted_by() annotation to lsm_ctx.ctx calipso: fix memory leak in netlbl_calipso_add_pass() selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test MAINTAINERS: add an entry for the lockdown LSM MAINTAINERS: update the LSM entry mailmap: add entries for Serge Hallyn's dead accounts mailmap: update/replace my old email addresses lsm: mark the lsm_id variables are marked as static lsm: convert security_setselfattr() to use memdup_user() lsm: align based on pointer length in lsm_fill_user_ctx() lsm: consolidate buffer size handling into lsm_fill_user_ctx() lsm: correct error codes in security_getselfattr() lsm: cleanup the size counters in security_getselfattr() lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation lsm: drop LSM_ID_IMA LSM: selftests for Linux Security Module syscalls SELinux: Add selfattr hooks AppArmor: Add selfattr hooks Smack: implement setselfattr and getselfattr hooks ... |
||
Linus Torvalds
|
eab23bc8a8 |
audit/stable-6.8 PR 20240105
-----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKJAUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXML3hAAyP/6RwJjUMM9Gsi9ZJNRz79X8uIp /MYONbzy1xKq/d78jZhjsJm9yIlLk3muVdd3oRcXdmahA5zs3jOKRaM+OfNLOrt6 nuNwS+yaMUYKsKNh/A8TLoxcmBuNAN6lubCKwbccR6hvugqrZuZFkAqCIkiWUDeb N64u1rL1q/tLI+jI76GIiK4SMMQihF3MMVVTmBWYDiIdrfPhFIHxipLgZaEBUqZM 43+2Y/blV75jcqPTZRgT9tk0LVLkiFtO97qUp9j+pYZbeoJ7CAaDH5A8NVm38yIX tyzYiTV2lGS3qf/HdLc3OpJQlBVkhbq6cRiLGvyiKQp60xiqYffoL7iFP4/DJMoT JKzoqXCixINRqdHWYbVY9hHBGg6R5c+1QqZzsnEy2MnBF++iLwJQAMz5JO9Qdh8F tD6fD82QzvfoNPuP0lBA67preqN3wiH1Zsv6cstoI/6QKCAMeTMZt/ywniBTKhX6 WMmhdmMQKTwGrnCosydAOonYesieiYPhxz6oSeRIqoHRHtNow8rjnFh7DR7yi8uc nv1x5bDqEI+QTrDys0cAq6fvdUZT2B9joqSovzXUGllRRS7w17WNf1Cu16jMTrHH FeZ2P1BvKE7YIFkqxcE/RY5NHX3ylxA4unFM8UgIheYiWbWLm5+xrwZdNL30KQJ4 4Hvvy3Buq6kb4HE= =908g -----END PGP SIGNATURE----- Merge tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "The audit updates are fairly minor with only two patches: - Send an audit ACK to userspace immediately upon receiving an auditd registration event as opposed to waiting until the registration has been fully processed and the audit backlog starts filling the netlink buffers. Sending the ACK earlier, as done here, is still safe as the operation should not fail at the point when the ACK is done, and doing so helps avoid the ACK being dropped in extreme situations. - Update the audit MAINTAINERS entry with additional information. There isn't anything in this update that should be new to regular contributors or list subscribers, but I'm pushing to start documenting our processes, conventions, etc. and this seems like an important part of that" * tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: MAINTAINERS: update the audit entry audit: Send netlink ACK before setting connection in auditd_set |
||
Linus Torvalds
|
9f2a635235 |
Quite a lot of kexec work this time around. Many singleton patches in
many places. The notable patch series are: - nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio conversions for file paths". - Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2: Folio conversions for directory paths". - IA64 remnant removal in Heiko Carstens's "Remove unused code after IA-64 removal". - Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere in "Treewide: enable -Wmissing-prototypes". This had some followup fixes: - Nathan Chancellor has cleaned up the hexagon build in the series "hexagon: Fix up instances of -Wmissing-prototypes". - Nathan also addressed some s390 warnings in "s390: A couple of fixes for -Wmissing-prototypes". - Arnd Bergmann addresses the same warnings for MIPS in his series "mips: address -Wmissing-prototypes warnings". - Baoquan He has made kexec_file operate in a top-down-fitting manner similar to kexec_load in the series "kexec_file: Load kernel at top of system RAM if required" - Baoquan He has also added the self-explanatory "kexec_file: print out debugging message if required". - Some checkstack maintenance work from Tiezhu Yang in the series "Modify some code about checkstack". - Douglas Anderson has disentangled the watchdog code's logging when multiple reports are occurring simultaneously. The series is "watchdog: Better handling of concurrent lockups". - Yuntao Wang has contributed some maintenance work on the crash code in "crash: Some cleanups and fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk= =kQw+ -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Quite a lot of kexec work this time around. Many singleton patches in many places. The notable patch series are: - nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio conversions for file paths'. - Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2: Folio conversions for directory paths'. - IA64 remnant removal in Heiko Carstens's 'Remove unused code after IA-64 removal'. - Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere in 'Treewide: enable -Wmissing-prototypes'. This had some followup fixes: - Nathan Chancellor has cleaned up the hexagon build in the series 'hexagon: Fix up instances of -Wmissing-prototypes'. - Nathan also addressed some s390 warnings in 's390: A couple of fixes for -Wmissing-prototypes'. - Arnd Bergmann addresses the same warnings for MIPS in his series 'mips: address -Wmissing-prototypes warnings'. - Baoquan He has made kexec_file operate in a top-down-fitting manner similar to kexec_load in the series 'kexec_file: Load kernel at top of system RAM if required' - Baoquan He has also added the self-explanatory 'kexec_file: print out debugging message if required'. - Some checkstack maintenance work from Tiezhu Yang in the series 'Modify some code about checkstack'. - Douglas Anderson has disentangled the watchdog code's logging when multiple reports are occurring simultaneously. The series is 'watchdog: Better handling of concurrent lockups'. - Yuntao Wang has contributed some maintenance work on the crash code in 'crash: Some cleanups and fixes'" * tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits) crash_core: fix and simplify the logic of crash_exclude_mem_range() x86/crash: use SZ_1M macro instead of hardcoded value x86/crash: remove the unused image parameter from prepare_elf_headers() kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE scripts/decode_stacktrace.sh: strip unexpected CR from lines watchdog: if panicking and we dumped everything, don't re-enable dumping watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting watchdog/hardlockup: adopt softlockup logic avoiding double-dumps kexec_core: fix the assignment to kimage->control_page x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init() lib/trace_readwrite.c:: replace asm-generic/io with linux/io nilfs2: cpfile: fix some kernel-doc warnings stacktrace: fix kernel-doc typo scripts/checkstack.pl: fix no space expression between sp and offset x86/kexec: fix incorrect argument passed to kexec_dprintk() x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs nilfs2: add missing set_freezable() for freezable kthread kernel: relay: remove relay_file_splice_read dead code, doesn't work docs: submit-checklist: remove all of "make namespacecheck" ... |
||
Linus Torvalds
|
fb46e22a9e |
Many singleton patches against the MM code. The patch series which
are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series "maple_tree: add mt_free_one() and mt_attr() helpers" "Some cleanups of maple tree" - In the series "mm: use memmap_on_memory semantics for dax/kmem" Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series "Add folio_zero_tail() and folio_fill_tail()" "Make folio_start_writeback return void" "Fix fault handler's handling of poisoned tail pages" "Convert aops->error_remove_page to ->error_remove_folio" "Finish two folio conversions" "More swap folio conversions" - Kefeng Wang has also contributed folio-related work in the series "mm: cleanup and use more folio in page fault" - Jim Cromie has improved the kmemleak reporting output in the series "tweak kmemleak report format". - In the series "stackdepot: allow evicting stack traces" Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series "mm: page_alloc: fixes for high atomic reserve caluculations". - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series "samples: introduce cgroup events listeners". - Some mapletree maintanance work from Liam Howlett in the series "maple_tree: iterator state changes". - Nhat Pham has improved zswap's approach to writeback in the series "workload-specific and memory pressure-driven zswap writeback". - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series "mm/damon: let users feed and tame/auto-tune DAMOS" "selftests/damon: add Python-written DAMON functionality tests" "mm/damon: misc updates for 6.8" - Yosry Ahmed has improved memcg's stats flushing in the series "mm: memcg: subtree stats flushing and thresholds". - In the series "Multi-size THP for anonymous memory" Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series "More buffer_head cleanups". - Suren Baghdasaryan has done work on Andrea Arcangeli's series "userfaultfd move option". UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a "KSM Advisor", in the series "mm/ksm: Add ksm advisor". This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series "mm/zswap: dstmem reuse optimizations and cleanups". - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is "Clean up the writeback paths". - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series "kasan: save mempool stack traces". - Andrey also performed some KASAN maintenance work in the series "kasan: assorted clean-ups". - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series "mm/rmap: interface overhaul". - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series "mm/mglru: Kconfig cleanup". - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series "Remove some lruvec page accounting functions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU= =0NHs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ... |
||
Linus Torvalds
|
d30e51aa7b |
slab updates for 6.8
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmWWu9EACgkQu+CwddJF iJpXvQf/aGL7uEY57VpTm0t4gPwoZ9r2P89HxI/nQs9XgVzDcBmVp/cC0LDvSdcm t91kJO538KeGjMgvlhLMTEuoShH5FlPs6cOwrGAYUoAGa4NwiOpGvliGky+nNHqY w887ZgSzVLq0UOuSvn86N6enumMvewt4V+872+OWo6O1HWOJhC0SgHTIa8QPQtwb yZ9BghO5IqMRXiZEsSIwyO+tQHcaU6l2G5huFXzgMFUhkQqAB9KTFc3h6rYI+i80 L4ppNXo2KNPGTDRb9dA8LNMWgvmfjhCb7chs8o1zSY2PwZlkzOix7EUBLCAIbc/2 EIaFC8AsZjfT47D1t72r8QpHB+C14Q== =J+E7 -----END PGP SIGNATURE----- Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLUB: delayed freezing of CPU partial slabs (Chengming Zhou) Freezing is an operation involving double_cmpxchg() that makes a slab exclusive for a particular CPU. Chengming noticed that we use it also in situations where we are not yet installing the slab as the CPU slab, because freezing also indicates that the slab is not on the shared list. This results in redundant freeze/unfreeze operation and can be avoided by marking separately the shared list presence by reusing the PG_workingset flag. This approach neatly avoids the issues described in 9b1ea29bc0d7 ("Revert "mm, slub: consider rest of partial list if acquire_slab() fails"") as we can now grab a slab from the shared list in a quick and guaranteed way without the cmpxchg_double() operation that amplifies the lock contention and can fail. As a result, lkp has reported 34.2% improvement of stress-ng.rawudp.ops_per_sec - SLAB removal and SLUB cleanups (Vlastimil Babka) The SLAB allocator has been deprecated since 6.5 and nobody has objected so far. We agreed at LSF/MM to wait until the next LTS, which is 6.6, so we should be good to go now. This doesn't yet erase all traces of SLAB outside of mm/ so some dead code, comments or documentation remain, and will be cleaned up gradually (some series are already in the works). Removing the choice of allocators has already allowed to simplify and optimize the code wiring up the kmalloc APIs to the SLUB implementation. * tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits) mm/slub: free KFENCE objects in slab_free_hook() mm/slub: handle bulk and single object freeing separately mm/slub: introduce __kmem_cache_free_bulk() without free hooks mm/slub: fix bulk alloc and free stats mm/slub: optimize free fast path code layout mm/slub: optimize alloc fastpath code layout mm/slub: remove slab_alloc() and __kmem_cache_alloc_lru() wrappers mm/slab: move kmalloc() functions from slab_common.c to slub.c mm/slab: move kmalloc_slab() to mm/slab.h mm/slab: move kfree() from slab_common.c to slub.c mm/slab: move struct kmem_cache_node from slab.h to slub.c mm/slab: move memcg related functions from slab.h to slub.c mm/slab: move pre/post-alloc hooks from slab.h to slub.c mm/slab: consolidate includes in the internal mm/slab.h mm/slab: move the rest of slub_def.h to mm/slab.h mm/slab: move struct kmem_cache_cpu declaration to slub.c mm/slab: remove mm/slab.c and slab_def.h mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs mm/slab: remove CONFIG_SLAB code from slab common code cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks ... |
||
ZhangPeng
|
3dc2f20920 |
swiotlb: check alloc_size before the allocation of a new memory pool
The allocation request for swiotlb contiguous memory greater than 128*2KB cannot be fulfilled because it exceeds the maximum contiguous memory limit. If the swiotlb memory we allocate is larger than 128*2KB, swiotlb_find_slots() will still schedule the allocation of a new memory pool, which will increase memory overhead. Fix it by adding a check with alloc_size no more than 128*2KB before scheduling the allocation of a new memory pool in swiotlb_find_slots(). Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Linus Torvalds
|
9f8413c4a6 |
cgroup: Changes for v6.8
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1 is mostly in maintenance mode, this addition is very small while having an outsized usefulness for users who are still on cgroup1. Yafang also optimized root cgroup list access by making it RCU protected in the process. - Waiman Long optimized rstat operation leading to substantially lower and more consistent lock hold time while flushing the hierarchical statistics. As the lock can be acquired briefly in various hot paths, this reduction has cascading benefits. - Waiman also improved the quality of isolation for cpuset's isolated partitions. CPUs which are allocated to isolated partitions are now excluded from running unbound work items and cpu_is_isolated() test which is used by vmstat and memcg to reduce interference now includes cpuset isolated CPUs. While it isn't there yet, the hope is eventually reaching parity with the isolation level provided by the `isolcpus` boot param but in a dynamic manner. This involved a couple workqueue patches which were applied directly to cgroup/for-6.8 rather than ping-ponged through the wq tree. This was because the wq code change was small and the area is usually very static and unlikely to cause conflicts. However, luck had it that there was a wq bug fix in the area during the 6.7 cycle which caused a conflict. The conflict is contextual but can be a bit confusing to resolve, so there is one merge from wq/for-6.7-fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZYnuJg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGQ5kAP9nMMWqi+R1HeG7+hWROTVjQZ0OM9KRcpZ1TmjF FNbkJgEAzt+sPnoWwYDTSI7pkNeZ/IM7x1qkkKGvENNtUXrz0Ac= =PyYN -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1 is mostly in maintenance mode, this addition is very small while having an outsized usefulness for users who are still on cgroup1. Yafang also optimized root cgroup list access by making it RCU protected in the process. - Waiman Long optimized rstat operation leading to substantially lower and more consistent lock hold time while flushing the hierarchical statistics. As the lock can be acquired briefly in various hot paths, this reduction has cascading benefits. - Waiman also improved the quality of isolation for cpuset's isolated partitions. CPUs which are allocated to isolated partitions are now excluded from running unbound work items and cpu_is_isolated() test which is used by vmstat and memcg to reduce interference now includes cpuset isolated CPUs. While it isn't there yet, the hope is eventually reaching parity with the isolation level provided by the `isolcpus` boot param but in a dynamic manner. * tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Move rcu_head up near the top of cgroup_root cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check cgroup: Avoid false cacheline sharing of read mostly rstat_cpu cgroup/rstat: Optimize cgroup_rstat_updated_list() cgroup: Fix documentation for cpu.idle cgroup/cpuset: Expose cpuset.cpus.isolated workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask cgroup/cpuset: Keep track of CPUs in isolated partitions selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask selftests: cgroup: Fixes a typo in a comment cgroup: Add a new helper for cgroup1 hierarchy cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root() cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show() cgroup: Make operations on the cgroup root_list RCU safe cgroup: Remove unnecessary list_empty() |
||
Linus Torvalds
|
bfe8eb3b85 |
Scheduler changes for v6.8:
- Energy scheduling: - Consolidate how the max compute capacity is used in the scheduler and how we calculate the frequency for a level of utilization. - Rework interface between the scheduler and the schedutil governor - Simplify the util_est logic - Deadline scheduler: - Work more towards reducing SCHED_DEADLINE starvation of low priority tasks (e.g., SCHED_OTHER) tasks when higher priority tasks monopolize CPU cycles, via the introduction of 'deadline servers' (nested/2-level scheduling). "Fair servers" to make use of this facility are not introduced yet. - EEVDF: - Introduce O(1) fastpath for EEVDF task selection - NUMA balancing: - Tune the NUMA-balancing vma scanning logic some more, to better distribute the probability of a particular vma getting scanned. - Plus misc fixes, cleanups and updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWcASMRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jLbg/+NOwF18M6klF1/3jUaV1PU09vRzYnnA7w oF7Tru7JLV+/vZK+rwI1zxzj5Nj3sVBQPIyp1embEHx7Z/QH8MIaIVpcSFsDDCYY Q8n6ZVRB+lKWEo5+Ti6JEJftDAWuLHXwFWDa57oWPuR0Tc736+zYHUfj7jdKk0RI nT/lnOT6hXU8q26O4QFrBrrhvCCxc4byo7buKPQfqie0bDA70ppIWkFQoQME6mvQ US9jvOyUipOiPV06DPwFvPDJUQBGq2VdJNk+5zCEtcqEfLREuo/Xq1Ww1x1BWaZI 761532EuDo73iMK4IFZrvVmj1ioz957qbje11MSSkDdKj692xxjXyvnY0NBvZuho Ueog/jQ4D4I2qu7pPSCF8UfnI/Hw4Q+KJ89j3pcywRm4hmCTf9k3MGpAaVLVxH7G e5REZ5MSsFZi4Cs+zF87Of5KCKLhTr1qSetNtShinKahg06WZ+MZ8tW4jb52qy0j F8PMlvfBI3f7SOtA8s2P26mDGQ21YQehN2d5P+Fbwj/U3fjIlSTOyx6NwLpFwYaS Vf+fctchGFV1Sh7c2JjCh+ecYfXx3ghT/pvyPOImJtxtCKSRUQ8c26ApC1OsWfOE FdHv4f2dPqcyswCZzIv/2fyDXc9eaS2E05EMDNqVuMCGnzidzSs81n7hBioNMrnH ZgHK90TmEbw= =wTVh -----END PGP SIGNATURE----- Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Energy scheduling: - Consolidate how the max compute capacity is used in the scheduler and how we calculate the frequency for a level of utilization. - Rework interface between the scheduler and the schedutil governor - Simplify the util_est logic Deadline scheduler: - Work more towards reducing SCHED_DEADLINE starvation of low priority tasks (e.g., SCHED_OTHER) tasks when higher priority tasks monopolize CPU cycles, via the introduction of 'deadline servers' (nested/2-level scheduling). "Fair servers" to make use of this facility are not introduced yet. EEVDF: - Introduce O(1) fastpath for EEVDF task selection NUMA balancing: - Tune the NUMA-balancing vma scanning logic some more, to better distribute the probability of a particular vma getting scanned. Plus misc fixes, cleanups and updates" * tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) sched/fair: Fix tg->load when offlining a CPU sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair() sched/fair: Use all little CPUs for CPU-bound workloads sched/fair: Simplify util_est sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true) arm64/amu: Use capacity_ref_freq() to set AMU ratio cpufreq/cppc: Set the frequency used for computing the capacity cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}() energy_model: Use a fixed reference frequency cpufreq/schedutil: Use a fixed reference frequency cpufreq: Use the fixed and coherent frequency for scaling capacity sched/topology: Add a new arch_scale_freq_ref() method freezer,sched: Clean saved_state when restoring it during thaw sched/fair: Update min_vruntime for reweight_entity() correctly sched/doc: Update documentation after renames and synchronize Chinese version sched/cpufreq: Rework iowait boost sched/cpufreq: Rework schedutil governor performance estimation sched/pelt: Avoid underestimation of task utilization sched/timers: Explain why idle task schedules out on remote timer enqueue sched/cpuidle: Comment about timers requirements VS idle handler ... |
||
Linus Torvalds
|
aac4de465a |
Performance events changes for v6.8 are:
- Add branch stack counters ABI extension to better capture the growing amount of information the PMU exposes via branch stack sampling. There's matching tooling support. - Fix race when creating the nr_addr_filters sysfs file - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support. - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU support. - Misc cleanups & fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb4lURHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jlnQ/+NSzrPQ9hEiS5a1iMMxdwC6IoXCmeFVsv s5NsGaVC7FEgjm3oCfvQlP63HolMO9R7TNLZsgINzOda5IHtE7WUcgBK7gbZr+NT WabdTyFrdmUr+Br0rLrEe0bxDSQU7r41ptqKE5HZRM9/3SbLhWgaXSJbfFAG2JV0 xboZ/2qzb7Puch6VTWv1YhuIpr1Pi817As4SOo7JR4V8jBB2bh2eZ7XBN1z23aw2 xuglbYml5gs4dOaFTqkRLWyn2PmrZ9wYKcdp63FVUscZ4LxvSw749BxEcNpTbxLp PT6uXIKw9PnStNfscfrsk6fDocVJzqrOK71blgiOKbmhWTE0UimEpFf1Hd3ooewg hFp3hmkE5Bc2MTUnwivkBxj96fz5rXH+3+Cue/5NsvDNlhlkswIIxzDw8M1G4rOI KQMDUYFOhQPa3Hi1lSp2SgHI5AcYHudepr/Z3QMxD3iLs+Wo2cmDcp8d2VrMLfb7 GHSITG592iYcZPYsJosxby8CSFaUPxIl9l3AODQwWuEjd4PcOYa6iB2HbEa/mC3R wXcs8mFIMAaH/HRYUlqUDA5pOqN5chb13iDtS4JqJqBKyWgdrDLCVxoZSQvB64+I bldyy1e5oQSVVwJ42WLkUK3Eld2x75ki1JLZFwMgYuOgQv3jfu2VNenUWJ5ig0La dPpHP8PwOoc= =2O/5 -----END PGP SIGNATURE----- Merge tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: - Add branch stack counters ABI extension to better capture the growing amount of information the PMU exposes via branch stack sampling. There's matching tooling support. - Fix race when creating the nr_addr_filters sysfs file - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU support - Misc cleanups & fixes * tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Factor out topology_gidnid_map() perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology() perf/x86/amd: Reject branch stack for IBS events perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge perf/x86/intel/uncore: Support IIO free-running counters on GNR perf/x86/intel/uncore: Support Granite Rapids perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR perf: Fix the nr_addr_filters fix perf/x86/intel/cstate: Add Grand Ridge support perf/x86/intel/cstate: Add Sierra Forest support x86/smp: Export symbol cpu_clustergroup_mask() perf/x86/intel/cstate: Cleanup duplicate attr_groups perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file perf/x86/intel: Support branch counters logging perf/x86/intel: Reorganize attrs and is_visible perf: Add branch_sample_call_stack perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag perf: Add branch stack counters |
||
Linus Torvalds
|
f24dc33f8e |
Timer subsystem changes for v6.8:
- Various preparatory cleanups & enhancements of the timer-wheel code, in preparation for the WIP 'pull timers at expiry' timer migration model series (which will replace the current 'push timers at enqueue' migration model), by Anna-Maria Behnsen: - Update comments and clean up confusing variable names - Add debug check to warn about time travel - Improve/expand timer-wheel tracepoints - Optimize away unnecessary IPIs for deferrable timers - Restructure & clean up next_expiry_recalc() - Clean up forward_timer_base() - Introduce __forward_timer_base() and use it to simplify and micro-optimize get_next_timer_interrupt() - Restructure the get_next_timer_interrupt()'s idle logic for better readability and to enable a minor optimization. - Fix the nextevt calculation when no timers are pending - Fix the sysfs_get_uname() prototype declaration Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb0XIRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1h9kg/9FpjbiogIKrDXb/pJHyhYkK6jzN4aNrQo wsOz4FDKyvioqLfr5ndpFE++DwsyzUibPfHJzfwD5IilTyolm2jW44VSCBzNdm72 lI6NGIcIxmIeCuO4bLmJj/fuQAugQ+ajmA2pyC/aBSO4Q2jtnxjYMGiV9zMWmOsa E816CK5zp6IVx+w0GWwK5yW5YR5dscSQCD+mBYVAdTWYoRNTy6xonsMTRuNi0ePx donetpu0eWG9NGwUdax/65oKVLZMR/rKAI/3pInhkOS9BsL2o8Rt4o2Y+9aBFi2t 2m+ZbFg5hngJwhP8Mfc7A+I3qiWgCOMGNGrebyzlwb+0PnNBPzrwnNPveW3R9QRx LMxSU3aH66bXeX+YCF4y2tjWSmYooAnztPstUGrs+sq36+NF0wyY6ip/36S6MRGk zjedqWnrHQeeZlzOLiKNzB+FIBnOt6bhZEh1Wk1/zwi9UWxw+7+I6tR0b57NqRxZ VHq38fp+O2OEAj5JvwJ6FomOd+onqQ2wwveG5OMCa+hwM2ZCuVXQRYgM2ohMfwl3 BMSd3KMZsBiHT0zyun3k/uJ7CaIjArPh016baSS10ArSl9sE64aJj7ELtuSLqtaD idJFXu3tv6VgDT2rMhLWNHvzQoK+gb8+/qnms4Ea+wY2f7nubi0aH20qHfugkgis 4KOkw9cQw0U= =n40J -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer subsystem updates from Ingo Molnar: - Various preparatory cleanups & enhancements of the timer-wheel code, in preparation for the WIP 'pull timers at expiry' timer migration model series (which will replace the current 'push timers at enqueue' migration model), by Anna-Maria Behnsen: - Update comments and clean up confusing variable names - Add debug check to warn about time travel - Improve/expand timer-wheel tracepoints - Optimize away unnecessary IPIs for deferrable timers - Restructure & clean up next_expiry_recalc() - Clean up forward_timer_base() - Introduce __forward_timer_base() and use it to simplify and micro-optimize get_next_timer_interrupt() - Restructure the get_next_timer_interrupt()'s idle logic for better readability and to enable a minor optimization. - Fix the nextevt calculation when no timers are pending - Fix the sysfs_get_uname() prototype declaration * tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix nextevt calculation when no timers are pending timers: Rework idle logic timers: Use already existing function for forwarding timer base timers: Split out forward timer base functionality timers: Clarify check in forward_timer_base() timers: Move store of next event into __next_timer_interrupt() timers: Do not IPI for deferrable timers tracing/timers: Add tracepoint for tracking timer base is_idle flag tracing/timers: Enhance timer_start tracepoint tick-sched: Warn when next tick seems to be in the past tick/sched: Cleanup confusing variables tick-sched: Fix function names in comments time: Make sysfs_get_uname() function visible in header |
||
Linus Torvalds
|
cdc202281a |
Move various entry functions from kernel/entry/common.c to <linux/entry-common.h>,
and always-inline them, to improve syscall entry performance on s390 by ~11%. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbxQ4RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1h5Cw//TEIlWPCLpIeiDsOCKb5g2e4U+AatNIGt ysmCvTWsKOiBEItDbZpwdpcdv/Ed41UXkS7Zmwetw81P50rz/i+kIJZW4gdl9GiV qhjj0gbhGQ43myQkGdYIcmdVaHl9fuyDGZSai6c17zgdOoL5CvCGGiL5Dn4Cn36x skm8P66r9DuM9cLTnhqQHMKp7cf4HQAX+awhFeppCquhzh3M2I8GsUVrT7tZV+Jw zOMLVjsI8Va4JyGsl07DoqWlyFWcoYvJ5ayzvDCaBxgeFIK9uZgwkKV0HT9q5tvg RhsHQK4zbxgkaMMCgEt/WdT14YesO2+5+ml91Zkjp2NMud0O0gmd2YXZju1aOQQw XCL3pm6DB4oN+IkW9lo6k3rqo9PEip9rt/FAfkNLeb50elHfSZSvE1ZxXSQwx5N4 pHDNMcK6SMsJhEdJInNotViKrpXX0Rjr7x1pY/2DA9bMP/jX/9+J3ODuGCDZrvjp eq4JM15VSq6tVmg+LMcszThWz+9gIaLFAqQwFt3G082ANDkOvg0mK7T65gccDuyA Gl6f/p3tAYHYxOI9KOBN6Daq3QAqlMT+M4YgNbbv8fanWYIzRd3U/Y+YrUCnryu8 Db+8FHlUkbJb/clUofJ5nj0Ene7xReJX/8m8XxA95Cc/UGYU8w0cachXDoPyKZUP xtFW3xn8K4M= =VHW9 -----END PGP SIGNATURE----- Merge tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic syscall updates from Ingo Molnar: "Move various entry functions from kernel/entry/common.c to a header file, and always-inline them, to improve syscall entry performance on s390 by ~11%" * tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Move syscall_enter_from_user_mode() to header file entry: Move enter_from_user_mode() to header file entry: Move exit to usermode functions to header file |
||
Linus Torvalds
|
6cbf5b3105 |
Locking changes for v6.8:
- lock guards: - Use lock guards in the ptrace code - Introduce conditional guards to extend to conditional lock primitives like mutex_trylock()/mutex_lock_interruptible()/etc. - lockdep: - Optimize 'struct lock_class' to be smaller - Update file patterns in MAINTAINERS - mutexes: Document mutex lifetime rules a bit more Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbur0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1i19xAAtAZs8Xi5X3IuLoygfMg81k91GvavLpv/ sVYGbKLm6+r0dOr4w3w76jVyWXNrVtTUFJysfK6nTNInEO+P0stL6aDmQVDYIaCw jaQDO/UgPfxUxnBgMy7dUvsnmxBw4GO3zcQpx8GuAuuuEtNcQcnMoP2t/RvBpBEI K2xCCkIT5LQPKbu9LkVZ/fAhZHtMypipuIMtEpfVYEKCMEwDmXoHuj3SNo4LGt04 0wZ5hHVhTcOQDm1/tjSXKsmxwQRVhI6OCcjXJ8hxDiPXg9vWO0+CzOibujl/jfhs Dw45D7wwiCHFUsmKHKz335Jtk8wBpgWUtlxZ+GB/TVfAQkLv1xBdoE1F8yLcBEfx yBNxh+0wecPWSrIsRLZEotRRu7obmBsIW04qUVP3oLWIu3tzhcud9gR43CeeplkT RW1lkdLt5SzN//MLx1cWPqKjfi6wiolaveD1RrqIbVXRhhrFH3o7+EDEOqGNwOvu 0D9yx4Su7SYlAYXgGnGLnnmcmDt2cEoOXkY3K608sYW45dZOpNIu56+EfHiVH1fI Q0lZNHXNDkX85Zzxoam6Y0SYo74lXYBmtL+RSYvvaKjRYgrJGxGcOciK+/c8kmY2 +Eazx5sxoR5nKDMP8MrZCZ5CQtqdPB9IYk7hUvQ31BYL1LSKhA5Mi/xjsFwMSNwv pEBazHxty58= =fOTd -----END PGP SIGNATURE----- Merge tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molar: "Lock guards: - Use lock guards in the ptrace code - Introduce conditional guards to extend to conditional lock primitives like mutex_trylock()/mutex_lock_interruptible()/etc. lockdep: - Optimize 'struct lock_class' to be smaller - Update file patterns in MAINTAINERS mutexes: - Document mutex lifetime rules a bit more" * tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked locking/mutex: Document that mutex_unlock() is non-atomic ptrace: Convert ptrace_attach() to use lock guards locking/lockdep: Slightly reorder 'struct lock_class' to save some memory MAINTAINERS: Add include/linux/lockdep*.h cleanup: Add conditional guard support |
||
Kirill A. Shutemov
|
5e0a760b44 |
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kirill A. Shutemov
|
fd37721803 |
mm, treewide: introduce NR_PAGE_ORDERS
NR_PAGE_ORDERS defines the number of page orders supported by the page allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total. NR_PAGE_ORDERS assists in defining arrays of page orders and allows for more natural iteration over them. [kirill.shutemov@linux.intel.com: fixup for kerneldoc warning] Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
5db8752c3b |
vfs-6.8.iov_iter
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI= =p6PB -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iov_iter cleanups from Christian Brauner: "This contains a minor cleanup. The patches drop an unused argument from import_single_range() allowing to replace import_single_range() with import_ubuf() and dropping import_single_range() completely" * tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iov_iter: replace import_single_range() with import_ubuf() iov_iter: remove unused 'iov' argument from import_single_range() |
||
Linus Torvalds
|
c604110e66 |
vfs-6.8.misc
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc= =0bRl -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ... |
||
Steven Rostedt (Google)
|
4f1991a92c |
tracing histograms: Simplify parse_actions() function
The parse_actions() function uses 'len = str_has_prefix()' to test which action is in the string being parsed. But then it goes and repeats the logic for each different action. This logic can be simplified and duplicate code can be removed as 'len' contains the length of the found prefix which should be used for all actions. Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Rafael J. Wysocki
|
f1e5e46397 |
Merge branch 'pm-sleep'
Merge system-wide power management updates for 6.8-rc1: - Fix possible deadlocks in the core system-wide PM code that occur if device-handling functions cannot be executed asynchronously during resune from system-wide suspend (Rafael J. Wysocki). - Clean up unnecessary local variable initializations in multiple places in the hibernation code (Wang chaodong, Li zeming). - Adjust core hibernation code to avoid missing wakeup events that occur after saving an image to persistent storage (Chris Feng). - Update hibernation code to enforce correct ordering during image compression and decompression (Hongchen Zhang). - Use kmap_local_page() instead of kmap_atomic() in copy_data_page() during hibernation and restore (Chen Haonan). - Adjust documentation and code comments to reflect recent task freezer changes (Kevin Hao). - Repair excess function parameter description warning in the hibernation image-saving code (Randy Dunlap). * pm-sleep: PM: sleep: Fix possible deadlocks in core system-wide PM code async: Introduce async_schedule_dev_nocall() async: Split async_schedule_node_domain() PM: hibernate: Repair excess function parameter description warning PM: sleep: Remove obsolete comment from unlock_system_sleep() Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes PM: hibernate: Use kmap_local_page() in copy_data_page() PM: hibernate: Enforce ordering during image compression/decompression PM: hibernate: Avoid missing wakeup events during hibernation PM: hibernate: Do not initialize error in snapshot_write_next() PM: hibernate: Do not initialize error in swap_write_page() PM: hibernate: Drop unnecessary local variable initialization |
||
Ingo Molnar
|
cdb3033e19 |
Merge branch 'sched/urgent' into sched/core, to pick up pending v6.7 fixes for the v6.8 merge window
This fix didn't make it upstream in time, pick it up for the v6.8 merge window. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Jakub Kicinski
|
8158a50f90 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZZgrfgAKCRDbK58LschI g87JAQDu+oUG3aWnRJi+lJTK8vGnKRuBwUxgnI5Ze99N0tuPmAEAz1gpXLYP+fKE eqRhZGGhujdHC9if3Le+nG6nvf8Gvw0= =KPkZ -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-01-05 We've added 40 non-merge commits during the last 2 day(s) which contain a total of 73 files changed, 1526 insertions(+), 951 deletions(-). The main changes are: 1) Fix a memory leak when streaming AF_UNIX sockets were inserted into multiple sockmap slots/maps, from John Fastabend. 2) Fix gotol in s390 BPF JIT with large offsets, from Ilya Leoshkevich. 3) Fix reattachment branch in bpf_tracing_prog_attach() and reject the request if there is no valid attach_btf, from Jiri Olsa. 4) Remove deprecated bpfilter kernel leftovers given the project is developed in user space (https://github.com/facebook/bpfilter), from Quentin Deslandes. 5) Relax tracing BPF program recursive attach rules given right now it is not possible to create tracing program call cycles, from Dmitrii Dolgov. 6) Fix excessive memory consumption for the bpf_global_percpu_ma for systems with a large number of CPUs, from Yonghong Song. 7) Small x86 BPF JIT cleanup to reuse emit_nops instead of open-coding memcpy of x86_nops, from Leon Hwang. 8) Follow-up for libbpf to support __arg_ctx global function argument tag semantics to complement the merged kernel side, from Andrii Nakryiko. 9) Introduce "volatile compare" macros for BPF selftests in order to make the latter more robust against compiler optimization, from Alexei Starovoitov. 10) Small simplification in verifier's size checking of helper accesses along with additional selftests, from Andrei Matei. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (40 commits) selftests/bpf: Test re-attachment fix for bpf_tracing_prog_attach bpf: Fix re-attachment branch in bpf_tracing_prog_attach selftests/bpf: Add test for recursive attachment of tracing progs bpf: Relax tracing prog recursive attach rules bpf, x86: Use emit_nops to replace memcpy x86_nops selftests/bpf: Test gotol with large offsets selftests/bpf: Double the size of test_loader log s390/bpf: Fix gotol with large offsets bpfilter: remove bpfilter bpf: Remove unnecessary cpu == 0 check in memalloc selftests/bpf: add __arg_ctx BTF rewrite test selftests/bpf: add arg:ctx cases to test_global_funcs tests libbpf: implement __arg_ctx fallback logic libbpf: move BTF loading step after relocation step libbpf: move exception callbacks assignment logic into relocation step libbpf: use stable map placeholder FDs libbpf: don't rely on map->fd as an indicator of map being created libbpf: use explicit map reuse flag to skip map creation steps libbpf: make uniform use of btf__fd() accessor inside libbpf selftests/bpf: Add a selftest with > 512-byte percpu allocation size ... ==================== Link: https://lore.kernel.org/r/20240105170105.21070-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Yuntao Wang
|
6dff315972 |
crash_core: fix and simplify the logic of crash_exclude_mem_range()
The purpose of crash_exclude_mem_range() is to remove all memory ranges that overlap with [mstart-mend]. However, the current logic only removes the first overlapping memory range. Commit a2e9a95d2190 ("kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges") attempted to address this issue, but it did not fix all error cases. Let's fix and simplify the logic of crash_exclude_mem_range(). Link: https://lkml.kernel.org/r/20240102144905.110047-4-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kinsey Ho
|
61dd3f246b |
mm/mglru: add CONFIG_LRU_GEN_WALKS_MMU
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that walks page tables to promote pages into the youngest generation will not be built. Also improves code readability by adding two helper functions get_mm_state() and get_next_mm(). Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.vnet.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jiri Olsa
|
715d82ba63 |
bpf: Fix re-attachment branch in bpf_tracing_prog_attach
The following case can cause a crash due to missing attach_btf: 1) load rawtp program 2) load fentry program with rawtp as target_fd 3) create tracing link for fentry program with target_fd = 0 4) repeat 3 In the end we have: - prog->aux->dst_trampoline == NULL - tgt_prog == NULL (because we did not provide target_fd to link_create) - prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X) - the program was loaded for tgt_prog but we have no way to find out which one BUG: kernel NULL pointer dereference, address: 0000000000000058 Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x15b/0x430 ? fixup_exception+0x22/0x330 ? exc_page_fault+0x6f/0x170 ? asm_exc_page_fault+0x22/0x30 ? bpf_tracing_prog_attach+0x279/0x560 ? btf_obj_id+0x5/0x10 bpf_tracing_prog_attach+0x439/0x560 __sys_bpf+0x1cf4/0x2de0 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x41/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 Return -EINVAL in this situation. Fixes: f3a95075549e0 ("bpf: Allow trampoline re-attach for tracing and lsm programs") Cc: stable@vger.kernel.org Signed-off-by: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <olsajiri@gmail.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Dmitrii Dolgov
|
19bfcdf949 |
bpf: Relax tracing prog recursive attach rules
Currently, it's not allowed to attach an fentry/fexit prog to another one fentry/fexit. At the same time it's not uncommon to see a tracing program with lots of logic in use, and the attachment limitation prevents usage of fentry/fexit for performance analysis (e.g. with "bpftool prog profile" command) in this case. An example could be falcosecurity libs project that uses tp_btf tracing programs. Following the corresponding discussion [1], the reason for that is to avoid tracing progs call cycles without introducing more complex solutions. But currently it seems impossible to load and attach tracing programs in a way that will form such a cycle. The limitation is coming from the fact that attach_prog_fd is specified at the prog load (thus making it impossible to attach to a program loaded after it in this way), as well as tracing progs not implementing link_detach. Replace "no same type" requirement with verification that no more than one level of attachment nesting is allowed. In this way only one fentry/fexit program could be attached to another fentry/fexit to cover profiling use case, and still no cycle could be formed. To implement, add a new field into bpf_prog_aux to track nested attachment for tracing programs. [1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/ Acked-by: Jiri Olsa <olsajiri@gmail.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jakub Kicinski
|
e63c1822ac |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()") 0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL") https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Yonghong Song
|
9ddf872b47 |
bpf: Remove unnecessary cpu == 0 check in memalloc
After merging the patch set [1] to reduce memory usage for bpf_global_percpu_ma, Alexei found a redundant check (cpu == 0) in function bpf_mem_alloc_percpu_unit_init() ([2]). Indeed, the check is unnecessary since c->unit_size will be all NULL or all non-NULL for all cpus before for_each_possible_cpu() loop. Removing the check makes code less confusing. [1] https://lore.kernel.org/all/20231222031729.1287957-1-yonghong.song@linux.dev/ [2] https://lore.kernel.org/all/20231222031745.1289082-1-yonghong.song@linux.dev/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240104165744.702239-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Greg Kroah-Hartman
|
86438841e4 |
dma-debug: make dma_debug_add_bus take a const pointer
The driver core now can handle a const struct bus_type pointer, and the dma_debug_add_bus() call just passes on the pointer give to it to the driver core, so make this pointer const as well to allow everyone to use read-only struct bus_type pointers going forward. Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: <iommu@lists.linux.dev> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/2023121941-dejected-nugget-681e@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Yonghong Song
|
5c1a376532 |
bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
For percpu data structure allocation with bpf_global_percpu_ma, the maximum data size is 4K. But for a system with large number of cpus, bigger data size (e.g., 2K, 4K) might consume a lot of memory. For example, the percpu memory consumption with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB memory. We should discourage such usage. Let us limit the maximum data size to be 512 for bpf_global_percpu_ma allocation. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
0e2ba9f96f |
bpf: Use smaller low/high marks for percpu allocation
Currently, refill low/high marks are set with the assumption of normal non-percpu memory allocation. For example, for an allocation size 256, for non-percpu memory allocation, low mark is 32 and high mark is 96, resulting in the batch allocation of 48 elements and the allocated memory will be 48 * 256 = 12KB for this particular cpu. Assuming an 128-cpu system, the total memory consumption across all cpus will be 12K * 128 = 1.5MB memory. This might be okay for non-percpu allocation, but may not be good for percpu allocation, which will consume 1.5MB * 128 = 192MB memory in the worst case if every cpu has a chance of memory allocation. In practice, percpu allocation is very rare compared to non-percpu allocation. So let us have smaller low/high marks which can avoid unnecessary memory consumption. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
5b95e638f1 |
bpf: Refill only one percpu element in memalloc
Typically for percpu map element or data structure, once allocated, most operations are lookup or in-place update. Deletion are really rare. Currently, for percpu data strcture, 4 elements will be refilled if the size is <= 256. Let us just do with one element for percpu data. For example, for size 256 and 128 cpus, the potential saving will be 3 * 256 * 128 * 128 = 12MB. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
c39aa3b289 |
bpf: Allow per unit prefill for non-fix-size percpu memory allocator
Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") added support for non-fix-size percpu memory allocation. Such allocation will allocate percpu memory for all buckets on all cpus and the memory consumption is in the order to quadratic. For example, let us say, 4 cpus, unit size 16 bytes, so each cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. Then let us say, 8 cpus with the same unit size, each cpu has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. So if the number of cpus doubles, the number of memory consumption will be 4 times. So for a system with large number of cpus, the memory consumption goes up quickly with quadratic order. For example, for 4KB percpu allocation, 128 cpus. The total memory consumption will 4KB * 128 * 128 = 64MB. Things will become worse if the number of cpus is bigger (e.g., 512, 1024, etc.) In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is done in boot time, so for system with large number of cpus, the initial percpu memory consumption is very visible. For example, for 128 cpu system, the total percpu memory allocation will be at least (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) * 128 * 128 = ~138MB. which is pretty big. It will be even bigger for larger number of cpus. Note that the current prefill also allocates 4 entries if the unit size is less than 256. So on top of 138MB memory consumption, this will add more consumption with 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. Next patch will try to reduce this memory consumption. Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init stage") moved the non-fix-size percpu memory allocation to bpf verificaiton stage. Once a particular bpf_percpu_obj_new() is called by bpf program, the memory allocator will try to fill in the cache with all sizes, causing the same amount of percpu memory consumption as in the boot stage. To reduce the initial percpu memory consumption for non-fix-size percpu memory allocation, instead of filling the cache with all supported allocation sizes, this patch intends to fill the cache only for the requested size. As typically users will not use large percpu data structure, this can save memory significantly. For example, the allocation size is 64 bytes with 128 cpus. Then total percpu memory amount will be 64 * 128 * 128 = 1MB, much less than previous 138MB. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
9fc8e80204 |
bpf: Add objcg to bpf_mem_alloc
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's are with the same objcg. This patch made such a property explicit. The next patch will use this property to save and restore objcg for percpu unit allocator. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
9beda16c25 |
bpf: Avoid unnecessary extra percpu memory allocation
Currently, for percpu memory allocation, say if the user requests allocation size to be 32 bytes, the actually calculated size will be 40 bytes and it further rounds to 64 bytes, and eventually 64 bytes are allocated, wasting 32-byte memory. Change bpf_mem_alloc() to calculate the cache index based on the user-provided allocation size so unnecessary extra memory can be avoided. Suggested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Andrei Matei
|
8a021e7fa1 |
bpf: Simplify checking size of helper accesses
This patch simplifies the verification of size arguments associated to pointer arguments to helpers and kfuncs. Many helpers take a pointer argument followed by the size of the memory access performed to be performed through that pointer. Before this patch, the handling of the size argument in check_mem_size_reg() was confusing and wasteful: if the size register's lower bound was 0, then the verification was done twice: once considering the size of the access to be the lower-bound of the respective argument, and once considering the upper bound (even if the two are the same). The upper bound checking is a super-set of the lower-bound checking(*), except: the only point of the lower-bound check is to handle the case where zero-sized-accesses are explicitly not allowed and the lower-bound is zero. This static condition is now checked explicitly, replacing a much more complex, expensive and confusing verification call to check_helper_mem_access(). Error messages change in this patch. Before, messages about illegal zero-size accesses depended on the type of the pointer and on other conditions, and sometimes the message was plain wrong: in some tests that changed you'll see that the old message was something like "R1 min value is outside of the allowed memory range", where R1 is the pointer register; the error was wrongly claiming that the pointer was bad instead of the size being bad. Other times the information that the size came for a register with a possible range of values was wrong, and the error presented the size as a fixed zero. Now the errors refer to the right register. However, the old error messages did contain useful information about the pointer register which is now lost; recovering this information was deemed not important enough. (*) Besides standing to reason that the checks for a bigger size access are a super-set of the checks for a smaller size access, I have also mechanically verified this by reading the code for all types of pointers. I could convince myself that it's true for all but PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking line-by-line does not immediately prove what we want. If anyone has any qualms, let me know. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com |
||
Rafael J. Wysocki
|
7d4b5d7a37 |
async: Introduce async_schedule_dev_nocall()
In preparation for subsequent changes, introduce a specialized variant of async_schedule_dev() that will not invoke the argument function synchronously when it cannot be scheduled for asynchronous execution. The new function, async_schedule_dev_nocall(), will be used for fixing possible deadlocks in the system-wide power management core code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series. Tested-by: Youngmin Nam <youngmin.nam@samsung.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> |
||
Rafael J. Wysocki
|
6aa09a5bcc |
async: Split async_schedule_node_domain()
In preparation for subsequent changes, split async_schedule_node_domain() in two pieces so as to allow the bottom part of it to be called from a somewhat different code path. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Tested-by: Youngmin Nam <youngmin.nam@samsung.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> |
||
Joerg Roedel
|
75f74f85a4 | Merge branches 'apple/dart', 'arm/rockchip', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd' and 'core' into next | ||
Ingo Molnar
|
67a1723344 |
Linux 6.7-rc8
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmWR1E0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYb8H/j5KAXUH3rvY3I7z pLZkdQiFX/ix2p25VYCNEes/h9gRLNTX0ndr+qd/LSjunaF17x+j0odHpjaK7dV0 YwSiP+ffNY8zDjHJP3mAbNfyov37kZjrtjciEs9/Ldk0w9Swp91iJZC6mRsJ6LbV kneZ/5nQIfZn0KFgqNCrrfCvUf3TcRskNFDvguDLCXf+lg5nJmhJnbGRExDqe5vs VIbfI9nnIjZd0Pt5rsXSQfwk0RaK/85unz6Dfshny9vAHnfTC5tX1B0sAqga+Zji nQtHQ/UT/gLSY0b/PoRXjA4wenv5UdTwKs7XaKaGKpyaDlD35CxiWXi8xhdgw+qG X4LFgDY= =N/Fv -----END PGP SIGNATURE----- Merge tag 'v6.7-rc8' into locking/core, to pick up dependent changes Pick up these commits from Linus's tree: b106bcf0f99a ("locking/osq_lock: Clarify osq_wait_next()") 563adbfc351b ("locking/osq_lock: Clarify osq_wait_next() calling convention") 7c2230982129 ("locking/osq_lock: Move the definition of optimistic_spin_node into osq_lock.c") Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Fabio Estevam
|
79fa723ba8 |
reboot: Introduce thermal_zone_device_critical_reboot()
Introduce thermal_zone_device_critical_reboot() to trigger an emergency reboot. It is a counterpart of thermal_zone_device_critical() with the difference that it will force a reboot instead of shutdown. The motivation for doing this is to allow the thermal subystem to trigger a reboot when the temperature reaches the critical temperature. Signed-off-by: Fabio Estevam <festevam@denx.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/20231129124330.519423-3-festevam@gmail.com |
||
Fabio Estevam
|
5a0e241003 |
thermal/core: Prepare for introduction of thermal reboot
Add some helper functions to make it easier introducing the support for thermal reboot. No functional change. Signed-off-by: Fabio Estevam <festevam@denx.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/20231129124330.519423-2-festevam@gmail.com |
||
David S. Miller
|
240436c06c |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYVEqQAKCRDbK58LschI gzH6AP9hVXLpHFTWMT0+2GK2lx69VX8zW1C0SmN7WHaxUbPN9QEAwzGnELfKk00P 0IKRHSl5abhVMX7JOM3sSOhCILeKjQg= =wRLJ -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next-for-netdev The following pull-request contains BPF updates for your *net-next* tree. We've added 22 non-merge commits during the last 3 day(s) which contain a total of 23 files changed, 652 insertions(+), 431 deletions(-). The main changes are: 1) Add verifier support for annotating user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience, from Andrii Nakryiko. These tags are: - Ability to annotate a special PTR_TO_CTX argument - Ability to annotate a generic PTR_TO_MEM as non-NULL 2) Support BPF verifier tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like, from Menglong Dong. 3) Fix a warning in bpf_mem_cache's check_obj_size() as reported by LKP, from Hou Tao. 4) Re-support uid/gid options when mounting bpffs which had to be reverted with the prior token series revert to avoid conflicts, from Daniel Borkmann. 5) Fix a libbpf NULL pointer dereference in bpf_object__collect_prog_relos() found from fuzzing the library with malformed ELF files, from Mingyi Zhang. 6) Skip DWARF sections in libbpf's linker sanity check given compiler options to generate compressed debug sections can trigger a rejection due to misalignment, from Alyssa Ross. 7) Fix an unnecessary use of the comma operator in BPF verifier, from Simon Horman. 8) Fix format specifier for unsigned long values in cpustat sample, from Colin Ian King. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Linus Torvalds
|
453f5db061 |
tracing fixes for v6.7-rc7:
- Fix readers that are blocked on the ring buffer when buffer_percent is 100%. They are supposed to wake up when the buffer is full, but because the sub-buffer that the writer is on is never considered "dirty" in the calculation, dirty pages will never equal nr_pages. Add +1 to the dirty count in order to count for the sub-buffer that the writer is on. - When a reader is blocked on the "snapshot_raw" file, it is to be woken up when a snapshot is done and be able to read the snapshot buffer. But because the snapshot swaps the buffers (the main one with the snapshot one), and the snapshot reader is waiting on the old snapshot buffer, it was not woken up (because it is now on the main buffer after the swap). Worse yet, when it reads the buffer after a snapshot, it's not reading the snapshot buffer, it's reading the live active main buffer. Fix this by forcing a wakeup of all readers on the snapshot buffer when a new snapshot happens, and then update the buffer that the reader is reading to be back on the snapshot buffer. - Fix the modification of the direct_function hash. There was a race when new functions were added to the direct_function hash as when it moved function entries from the old hash to the new one, a direct function trace could be hit and not see its entry. This is fixed by allocating the new hash, copy all the old entries onto it as well as the new entries, and then use rcu_assign_pointer() to update the new direct_function hash with it. This also fixes a memory leak in that code. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZAzTxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9IAP9e6wZ74aEjMED9nsbC49EpyCNTqa72 y0uDS/p9ppv52gD7Be+l+kJQzYNh6bZU0+B19hNC2QVn38jb7sOadfO/1Q8= =NDkf -----END PGP SIGNATURE----- Merge tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix readers that are blocked on the ring buffer when buffer_percent is 100%. They are supposed to wake up when the buffer is full, but because the sub-buffer that the writer is on is never considered "dirty" in the calculation, dirty pages will never equal nr_pages. Add +1 to the dirty count in order to count for the sub-buffer that the writer is on. - When a reader is blocked on the "snapshot_raw" file, it is to be woken up when a snapshot is done and be able to read the snapshot buffer. But because the snapshot swaps the buffers (the main one with the snapshot one), and the snapshot reader is waiting on the old snapshot buffer, it was not woken up (because it is now on the main buffer after the swap). Worse yet, when it reads the buffer after a snapshot, it's not reading the snapshot buffer, it's reading the live active main buffer. Fix this by forcing a wakeup of all readers on the snapshot buffer when a new snapshot happens, and then update the buffer that the reader is reading to be back on the snapshot buffer. - Fix the modification of the direct_function hash. There was a race when new functions were added to the direct_function hash as when it moved function entries from the old hash to the new one, a direct function trace could be hit and not see its entry. This is fixed by allocating the new hash, copy all the old entries onto it as well as the new entries, and then use rcu_assign_pointer() to update the new direct_function hash with it. This also fixes a memory leak in that code. - Fix eventfs ownership * tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix modification of direct_function hash while in use tracing: Fix blocked reader of snapshot buffer ring-buffer: Fix wake ups when buffer_percent is set to 100 eventfs: Fix file and directory uid and gid ownership |
||
David Laight
|
b106bcf0f9 |
locking/osq_lock: Clarify osq_wait_next()
Directly return NULL or 'next' instead of breaking out of the loop. Signed-off-by: David Laight <david.laight@aculab.com> [ Split original patch into two independent parts - Linus ] Link: https://lore.kernel.org/lkml/7c8828aec72e42eeb841ca0ee3397e9a@AcuMS.aculab.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Laight
|
563adbfc35 |
locking/osq_lock: Clarify osq_wait_next() calling convention
osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock() but only needs the 'cpu' value to write to lock->tail. Just pass prev->cpu or OSQ_UNLOCKED_VAL instead. Should have no effect on the generated code since gcc manages to assume that 'prev != NULL' due to an earlier dereference. Signed-off-by: David Laight <david.laight@aculab.com> [ Changed 'old' to 'old_cpu' by request from Waiman Long - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Laight
|
7c22309821 |
locking/osq_lock: Move the definition of optimistic_spin_node into osq_lock.c
struct optimistic_spin_node is private to the implementation. Move it into the C file to ensure nothing is accessing it. Signed-off-by: David Laight <david.laight@aculab.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Steven Rostedt (Google)
|
d05cb47066 |
ftrace: Fix modification of direct_function hash while in use
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where if the number of new entries are added is large enough to cause two allocations in the loop: for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { new = ftrace_add_rec_direct(entry->ip, addr, &free_hash); if (!new) goto out_remove; entry->direct = addr; } } Where ftrace_add_rec_direct() has: if (ftrace_hash_empty(direct_functions) || direct_functions->count > 2 * (1 << direct_functions->size_bits)) { struct ftrace_hash *new_hash; int size = ftrace_hash_empty(direct_functions) ? 0 : direct_functions->count + 1; if (size < 32) size = 32; new_hash = dup_hash(direct_functions, size); if (!new_hash) return NULL; *free_hash = direct_functions; direct_functions = new_hash; } The "*free_hash = direct_functions;" can happen twice, losing the previous allocation of direct_functions. But this also exposed a more serious bug. The modification of direct_functions above is not safe. As direct_functions can be referenced at any time to find what direct caller it should call, the time between: new_hash = dup_hash(direct_functions, size); and direct_functions = new_hash; can have a race with another CPU (or even this one if it gets interrupted), and the entries being moved to the new hash are not referenced. That's because the "dup_hash()" is really misnamed and is really a "move_hash()". It moves the entries from the old hash to the new one. Now even if that was changed, this code is not proper as direct_functions should not be updated until the end. That is the best way to handle function reference changes, and is the way other parts of ftrace handles this. The following is done: 1. Change add_hash_entry() to return the entry it created and inserted into the hash, and not just return success or not. 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove the former. 3. Allocate a "new_hash" at the start that is made for holding both the new hash entries as well as the existing entries in direct_functions. 4. Copy (not move) the direct_function entries over to the new_hash. 5. Copy the entries of the added hash to the new_hash. 6. If everything succeeds, then use rcu_pointer_assign() to update the direct_functions with the new_hash. This simplifies the code and fixes both the memory leak as well as the race condition mentioned above. Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/ Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Douglas Anderson
|
55efe4abf9 |
watchdog: if panicking and we dumped everything, don't re-enable dumping
If, as part of handling a hardlockup or softlockup, we've already dumped all CPUs and we're just about to panic, don't reenable dumping and give some other CPU a chance to hop in there and add some confusing logs right as the panic is happening. Link: https://lkml.kernel.org/r/20231220131534.4.Id3a9c7ec2d7d83e4080da6f8662ba2226b40543f@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Douglas Anderson
|
ee6bdb3f4b |
watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
If two CPUs end up reporting a hardlockup at the same time then their logs could get interleaved which is hard to read. The interleaving problem was especially bad with the "perf" hardlockup detector where the locked up CPU is always the same as the running CPU and we end up in show_regs(). show_regs() has no inherent serialization so we could mix together two crawls if two hardlockups happened at the same time (and if we didn't have `sysctl_hardlockup_all_cpu_backtrace` set). With this change we'll fully serialize hardlockups when using the "perf" hardlockup detector. The interleaving problem was less bad with the "buddy" hardlockup detector. With "buddy" we always end up calling `trigger_single_cpu_backtrace(cpu)` on some CPU other than the running one. trigger_single_cpu_backtrace() always at least serializes the individual stack crawls because it eventually uses printk_cpu_sync_get_irqsave(). Unfortunately the fact that trigger_single_cpu_backtrace() eventually calls printk_cpu_sync_get_irqsave() (on a different CPU) means that we have to drop the "lock" before calling it and we can't fully serialize all printouts associated with a given hardlockup. However, we still do get the advantage of serializing the output of print_modules() and print_irqtrace_events(). Aside from serializing hardlockups from each other, this change also has the advantage of serializing hardlockups and softlockups from each other if they happen to happen at the same time since they are both using the same "lock". Even though nobody is expected to hang while holding the lock associated with printk_cpu_sync_get_irqsave(), out of an abundance of caution, we don't call printk_cpu_sync_get_irqsave() until after we print out about the hardlockup. This makes extra sure that, even if printk_cpu_sync_get_irqsave() somehow never runs we at least print that we saw the hardlockup. This is different than the choice made for softlockup because hardlockup is really our last resort. Link: https://lkml.kernel.org/r/20231220131534.3.I6ff691b3b40f0379bc860f80c6e729a0485b5247@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Douglas Anderson
|
896260a6d6 |
watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
Instead of introducing a spinlock, use printk_cpu_sync_get_irqsave() and printk_cpu_sync_put_irqrestore() to serialize softlockup reporting. Alone this doesn't have any real advantage over the spinlock, but this will allow us to use the same function in a future change to also serialize hardlockup crawls. NOTE: for the most part this serialization is important because we often end up in the show_regs() path and that has no built-in serialization if there are multiple callers at once. However, even in the case where we end up in the dump_stack() path this still has some advantages because the stack will be guaranteed to be together in the logs with the lockup message with no interleaving. NOTE: the fact that printk_cpu_sync_get_irqsave() is allowed to be called multiple times on the same CPU is important here. Specifically we hold the "lock" while calling dump_stack() which also gets the same "lock". This is explicitly documented to be OK and means we don't need to introduce a variant of dump_stack() that doesn't grab the lock. Link: https://lkml.kernel.org/r/20231220131534.2.Ia5906525d440d8e8383cde31b7c61c2aadc8f907@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Li Zhe <lizhe.67@bytedance.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |