From f903663a8dcd6e1656e52856afbf706cc14cbe6d Mon Sep 17 00:00:00 2001
From: Johan Hovold
Date: Sun, 1 Sep 2024 11:30:24 +0200
Subject: [PATCH 001/235] clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for
 vcodec GDSCs

A recent change in the venus driver results in a stuck clock on the
Lenovo ThinkPad X13s, for example, when streaming video in firefox:

    video_cc_mvs0_clk status stuck at 'off'
    WARNING: CPU: 6 PID: 2885 at drivers/clk/qcom/clk-branch.c:87 clk_branch_wait+0x144/0x15c
    ...
    Call trace:
     clk_branch_wait+0x144/0x15c
     clk_branch2_enable+0x30/0x40
     clk_core_enable+0xd8/0x29c
     clk_enable+0x2c/0x4c
     vcodec_clks_enable.isra.0+0x94/0xd8 [venus_core]
     coreid_power_v4+0x464/0x628 [venus_core]
     vdec_start_streaming+0xc4/0x510 [venus_dec]
     vb2_start_streaming+0x6c/0x180 [videobuf2_common]
     vb2_core_streamon+0x120/0x1dc [videobuf2_common]
     vb2_streamon+0x1c/0x6c [videobuf2_v4l2]
     v4l2_m2m_ioctl_streamon+0x30/0x80 [v4l2_mem2mem]
     v4l_streamon+0x24/0x30 [videodev]

using the out-of-tree sm8350/sc8280xp venus support. [1]

Also update the sm8350/sc8280xp GDSC definitions so that the hw control
mode can be changed at runtime, as the venus driver now requires.

Fixes: ec9a652e5149 ("venus: pm_helpers: Use dev_pm_genpd_set_hwmode to switch GDSC mode on V6")
Link: https://lore.kernel.org/lkml/20230731-topic-8280_venus-v1-0-8c8bbe1983a5@linaro.org/ # [1]
Cc: Jagadeesh Kona
Cc: Taniya Das
Cc: Abel Vesa
Cc: Konrad Dybcio
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold
Tested-by: Steev Klimaszewski
Link: https://lore.kernel.org/r/20240901093024.18841-1-johan+linaro@kernel.org
Signed-off-by: Bjorn Andersson
---
 drivers/clk/qcom/videocc-sm8350.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/qcom/videocc-sm8350.c b/drivers/clk/qcom/videocc-sm8350.c
index 5bd6fe3e1298..874d4da95ff8 100644
--- a/drivers/clk/qcom/videocc-sm8350.c
+++ b/drivers/clk/qcom/videocc-sm8350.c
@@ -452,7 +452,7 @@ static struct gdsc mvs0_gdsc = {
 	.pd = {
 		.name = "mvs0_gdsc",
 	},
-	.flags = HW_CTRL | RETAIN_FF_ENABLE,
+	.flags = HW_CTRL_TRIGGER | RETAIN_FF_ENABLE,
 	.pwrsts = PWRSTS_OFF_ON,
 };
 
@@ -461,7 +461,7 @@ static struct gdsc mvs1_gdsc = {
 	.pd = {
 		.name = "mvs1_gdsc",
 	},
-	.flags = HW_CTRL | RETAIN_FF_ENABLE,
+	.flags = HW_CTRL_TRIGGER | RETAIN_FF_ENABLE,
 	.pwrsts = PWRSTS_OFF_ON,
 };

From 923168a0631bc42fffd55087b337b1b6c54dcff5 Mon Sep 17 00:00:00 2001
From: Samasth Norway Ananda
Date: Wed, 7 Aug 2024 10:27:13 -0700
Subject: [PATCH 002/235] ima: fix buffer overrun in
 ima_eventdigest_init_common

Function ima_eventdigest_init() calls ima_eventdigest_init_common()
with HASH_ALGO__LAST, which is then used to access the array
hash_digest_size[], leading to a buffer overrun. Add a conditional
statement to handle this case.
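The pattern applied by the fix (treat the enum sentinel as "no
algorithm specified" and fall back to a fixed size) can be shown with
a minimal user-space sketch; the names mirror the kernel's, but the
table contents below are illustrative, not the kernel's:

  #include <stdio.h>

  enum hash_algo { HASH_ALGO_MD5, HASH_ALGO_SHA1, HASH_ALGO__LAST };

  /* One size per valid algorithm; the sentinel has no slot. */
  static const unsigned int hash_digest_size[HASH_ALGO__LAST] = { 16, 20 };

  #define IMA_DIGEST_SIZE 20 /* fits SHA1 or MD5 */

  static unsigned int digest_room(enum hash_algo algo)
  {
          /* Indexing the table with the sentinel would read one
           * element past the end, so fall back to a fixed size. */
          if (algo < HASH_ALGO__LAST)
                  return hash_digest_size[algo];
          return IMA_DIGEST_SIZE;
  }

  int main(void)
  {
          printf("%u\n", digest_room(HASH_ALGO_SHA1));  /* 20 */
          printf("%u\n", digest_room(HASH_ALGO__LAST)); /* 20, not out of bounds */
          return 0;
  }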
Fixes: 9fab303a2cb3 ("ima: fix violation measurement list record") Signed-off-by: Samasth Norway Ananda Tested-by: Enrico Bravi (PhD at polito.it) Cc: stable@vger.kernel.org # 5.19+ Signed-off-by: Mimi Zohar --- security/integrity/ima/ima_template_lib.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/security/integrity/ima/ima_template_lib.c b/security/integrity/ima/ima_template_lib.c index 4183956c53af..0e627eac9c33 100644 --- a/security/integrity/ima/ima_template_lib.c +++ b/security/integrity/ima/ima_template_lib.c @@ -318,15 +318,21 @@ static int ima_eventdigest_init_common(const u8 *digest, u32 digestsize, hash_algo_name[hash_algo]); } - if (digest) + if (digest) { memcpy(buffer + offset, digest, digestsize); - else + } else { /* * If digest is NULL, the event being recorded is a violation. * Make room for the digest by increasing the offset by the - * hash algorithm digest size. + * hash algorithm digest size. If the hash algorithm is not + * specified increase the offset by IMA_DIGEST_SIZE which + * fits SHA1 or MD5 */ - offset += hash_digest_size[hash_algo]; + if (hash_algo < HASH_ALGO__LAST) + offset += hash_digest_size[hash_algo]; + else + offset += IMA_DIGEST_SIZE; + } return ima_write_template_field_data(buffer, offset + digestsize, fmt, field_data); From 699ae6241920b0fa837fa57e61f7d5b0e2e65b58 Mon Sep 17 00:00:00 2001 From: Mateusz Guzik Date: Tue, 6 Aug 2024 15:36:07 +0200 Subject: [PATCH 003/235] evm: stop avoidably reading i_writecount in evm_file_release The EVM_NEW_FILE flag is unset if the file already existed at the time of open and this can be checked without looking at i_writecount. Not accessing it reduces traffic on the cacheline during parallel open of the same file and drop the evm_file_release routine from second place to bottom of the profile. Fixes: 75a323e604fc ("evm: Make it independent from 'integrity' LSM") Signed-off-by: Mateusz Guzik Reviewed-by: Roberto Sassu Cc: stable@vger.kernel.org # 6.9+ Signed-off-by: Mimi Zohar --- security/integrity/evm/evm_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c index 6924ed508ebd..377e57e9084f 100644 --- a/security/integrity/evm/evm_main.c +++ b/security/integrity/evm/evm_main.c @@ -1084,7 +1084,8 @@ static void evm_file_release(struct file *file) if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE)) return; - if (iint && atomic_read(&inode->i_writecount) == 1) + if (iint && iint->flags & EVM_NEW_FILE && + atomic_read(&inode->i_writecount) == 1) iint->flags &= ~EVM_NEW_FILE; } From 08ae3e5f5fc8edb9bd0c7ef9696ff29ef18b26ef Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 8 Aug 2024 16:04:59 -0600 Subject: [PATCH 004/235] integrity: Use static_assert() to check struct sizes Commit 38aa3f5ac6d2 ("integrity: Avoid -Wflex-array-member-not-at-end warnings") introduced tagged `struct evm_ima_xattr_data_hdr` and `struct ima_digest_data_hdr`. We want to ensure that when new members need to be added to the flexible structures, they are always included within these tagged structs. So, we use `static_assert()` to ensure that the memory layout for both the flexible structure and the tagged struct is the same after any changes. Signed-off-by: Gustavo A. R. 
Silva Tested-by: Roberto Sassu Reviewed-by: Roberto Sassu Signed-off-by: Mimi Zohar --- security/integrity/integrity.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h index 660f76cb69d3..c2c2da691123 100644 --- a/security/integrity/integrity.h +++ b/security/integrity/integrity.h @@ -37,6 +37,8 @@ struct evm_ima_xattr_data { ); u8 data[]; } __packed; +static_assert(offsetof(struct evm_ima_xattr_data, data) == sizeof(struct evm_ima_xattr_data_hdr), + "struct member likely outside of __struct_group()"); /* Only used in the EVM HMAC code. */ struct evm_xattr { @@ -65,6 +67,8 @@ struct ima_digest_data { ); u8 digest[]; } __packed; +static_assert(offsetof(struct ima_digest_data, digest) == sizeof(struct ima_digest_data_hdr), + "struct member likely outside of __struct_group()"); /* * Instead of wrapping the ima_digest_data struct inside a local structure From 9803787a23c57328cd70c393a661266c396d12fb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micka=C3=ABl=20Sala=C3=BCn?= Date: Fri, 4 Oct 2024 17:31:20 +0200 Subject: [PATCH 005/235] landlock: Improve documentation of previous limitations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Improve consistency of previous limitations' subsection titles, and expand a bit the IOCTL section. This changes some HTML anchors and may break some external links though. Cc: Konstantin Meskhidze Cc: Tahera Fahimi Reviewed-by: Günther Noack Link: https://lore.kernel.org/r/20241004153122.501775-1-mic@digikod.net Signed-off-by: Mickaël Salaün --- Documentation/userspace-api/landlock.rst | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst index c8d3e46badc5..bb7480a05e2c 100644 --- a/Documentation/userspace-api/landlock.rst +++ b/Documentation/userspace-api/landlock.rst @@ -8,7 +8,7 @@ Landlock: unprivileged access control ===================================== :Author: Mickaël Salaün -:Date: September 2024 +:Date: October 2024 The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem or network access) for a set of processes. Because Landlock @@ -563,33 +563,34 @@ always allowed when using a kernel that only supports the first or second ABI. Starting with the Landlock ABI version 3, it is now possible to securely control truncation thanks to the new ``LANDLOCK_ACCESS_FS_TRUNCATE`` access right. -Network support (ABI < 4) -------------------------- +TCP bind and connect (ABI < 4) +------------------------------ Starting with the Landlock ABI version 4, it is now possible to restrict TCP bind and connect actions to only a set of allowed ports thanks to the new ``LANDLOCK_ACCESS_NET_BIND_TCP`` and ``LANDLOCK_ACCESS_NET_CONNECT_TCP`` access rights. -IOCTL (ABI < 5) ---------------- +Device IOCTL (ABI < 5) +---------------------- IOCTL operations could not be denied before the fifth Landlock ABI, so :manpage:`ioctl(2)` is always allowed when using a kernel that only supports an earlier ABI. Starting with the Landlock ABI version 5, it is possible to restrict the use of -:manpage:`ioctl(2)` using the new ``LANDLOCK_ACCESS_FS_IOCTL_DEV`` right. +:manpage:`ioctl(2)` on character and block devices using the new +``LANDLOCK_ACCESS_FS_IOCTL_DEV`` right. 
-Abstract UNIX socket scoping (ABI < 6) --------------------------------------- +Abstract UNIX socket (ABI < 6) +------------------------------ Starting with the Landlock ABI version 6, it is possible to restrict connections to an abstract :manpage:`unix(7)` socket by setting ``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET`` to the ``scoped`` ruleset attribute. -Signal scoping (ABI < 6) ------------------------- +Signal (ABI < 6) +---------------- Starting with the Landlock ABI version 6, it is possible to restrict :manpage:`signal(7)` sending by setting ``LANDLOCK_SCOPE_SIGNAL`` to the From e02bfea4d7ef587bb285ad5825da4e1973ac8263 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Barnab=C3=A1s=20Cz=C3=A9m=C3=A1n?= Date: Sun, 6 Oct 2024 22:51:58 +0200 Subject: [PATCH 006/235] clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Many qcom clock drivers do not have .width set. In that case value of (p)->width - 1 will be negative which breaks clock tree. Fix this by checking if width is zero, and pass 3 to GENMASK if that's the case. Fixes: 1c3541145cbf ("clk: qcom: support for 2 bit PLL post divider") Signed-off-by: Barnabás Czémán Reviewed-by: Dmitry Baryshkov Reviewed-by: Christopher Obbard Tested-by: Christopher Obbard Link: https://lore.kernel.org/r/20241006-fix-postdiv-mask-v3-1-160354980433@mainlining.org Signed-off-by: Bjorn Andersson --- drivers/clk/qcom/clk-alpha-pll.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/clk/qcom/clk-alpha-pll.c b/drivers/clk/qcom/clk-alpha-pll.c index f9105443d7db..be9bee6ab65f 100644 --- a/drivers/clk/qcom/clk-alpha-pll.c +++ b/drivers/clk/qcom/clk-alpha-pll.c @@ -40,7 +40,7 @@ #define PLL_USER_CTL(p) ((p)->offset + (p)->regs[PLL_OFF_USER_CTL]) # define PLL_POST_DIV_SHIFT 8 -# define PLL_POST_DIV_MASK(p) GENMASK((p)->width - 1, 0) +# define PLL_POST_DIV_MASK(p) GENMASK((p)->width ? (p)->width - 1 : 3, 0) # define PLL_ALPHA_MSB BIT(15) # define PLL_ALPHA_EN BIT(24) # define PLL_ALPHA_MODE BIT(25) From fa88dc7db176c79b50adb132a56120a1d4d9d18b Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Tue, 1 Oct 2024 11:01:34 +0200 Subject: [PATCH 007/235] media: dvb-core: add missing buffer index check dvb_vb2_expbuf() didn't check if the given buffer index was for a valid buffer. Add this check. 
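The shape of the fix is a common one: resolve the untrusted index
through a lookup helper that returns NULL for out-of-range or unused
slots, and only touch the buffer once the lookup succeeded. A small
sketch of that shape (generic names, not the vb2 API):

  #include <errno.h>
  #include <stddef.h>

  #define NUM_BUFS 8
  static void *bufs[NUM_BUFS]; /* sparse table, slots may be NULL */

  /* Return the buffer for a caller-supplied index, or NULL when the
   * index is out of range or the slot was never allocated. */
  static void *get_buffer(unsigned int index)
  {
          if (index >= NUM_BUFS)
                  return NULL;
          return bufs[index];
  }

  static int export_buffer(unsigned int index)
  {
          void *buf = get_buffer(index);

          if (!buf)
                  return -EINVAL; /* reject before dereferencing */
          /* ... hand buf to the exporter ... */
          return 0;
  }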
Signed-off-by: Hans Verkuil
Reported-by: Chenyuan Yang
Closes: https://lore.kernel.org/linux-media/?q=WARNING+in+vb2_core_reqbufs
Fixes: 7dc866df4012 ("media: dvb-core: Use vb2_get_buffer() instead of directly access to buffers array")
Reviewed-by: Benjamin Gaignard
Cc:
Signed-off-by: Mauro Carvalho Chehab
---
 drivers/media/dvb-core/dvb_vb2.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/media/dvb-core/dvb_vb2.c b/drivers/media/dvb-core/dvb_vb2.c
index 192a8230c4aa..29edaaff7a5c 100644
--- a/drivers/media/dvb-core/dvb_vb2.c
+++ b/drivers/media/dvb-core/dvb_vb2.c
@@ -366,9 +366,15 @@ int dvb_vb2_querybuf(struct dvb_vb2_ctx *ctx, struct dmx_buffer *b)
 int dvb_vb2_expbuf(struct dvb_vb2_ctx *ctx, struct dmx_exportbuffer *exp)
 {
 	struct vb2_queue *q = &ctx->vb_q;
+	struct vb2_buffer *vb2 = vb2_get_buffer(q, exp->index);
 	int ret;
 
-	ret = vb2_core_expbuf(&ctx->vb_q, &exp->fd, q->type, q->bufs[exp->index],
+	if (!vb2) {
+		dprintk(1, "[%s] invalid buffer index\n", ctx->name);
+		return -EINVAL;
+	}
+
+	ret = vb2_core_expbuf(&ctx->vb_q, &exp->fd, q->type, vb2,
 			      0, exp->flags);
 	if (ret) {
 		dprintk(1, "[%s] index=%d errno=%d\n", ctx->name,

From bf0a800415a7397617765fe5f5278a645195c75a Mon Sep 17 00:00:00 2001
From: Qiang Yu
Date: Fri, 11 Oct 2024 03:41:39 -0700
Subject: [PATCH 008/235] clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2
 clocks

The pipediv2 clocks source from the same mux as the pipe clocks, so
they share the same limitation: the PHY sequence requires these local
CBCs to be enabled before the PHY actually outputs a clock to them.
This means the clocks won't actually turn on when we vote for them.
Hence, let's skip the halt bit check of the pipediv2 clocks, otherwise
they may get stuck in the off state during bootup.
Cc: stable@vger.kernel.org Fixes: 161b7c401f4b ("clk: qcom: Add Global Clock controller (GCC) driver for X1E80100") Suggested-by: Mike Tipton Signed-off-by: Qiang Yu Reviewed-by: Konrad Dybcio Reviewed-by: Johan Hovold Link: https://lore.kernel.org/r/20241011104142.1181773-6-quic_qianyu@quicinc.com Signed-off-by: Bjorn Andersson --- drivers/clk/qcom/gcc-x1e80100.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/clk/qcom/gcc-x1e80100.c b/drivers/clk/qcom/gcc-x1e80100.c index 0f578771071f..81ba5ceab342 100644 --- a/drivers/clk/qcom/gcc-x1e80100.c +++ b/drivers/clk/qcom/gcc-x1e80100.c @@ -3123,7 +3123,7 @@ static struct clk_branch gcc_pcie_3_pipe_clk = { static struct clk_branch gcc_pcie_3_pipediv2_clk = { .halt_reg = 0x58060, - .halt_check = BRANCH_HALT_VOTED, + .halt_check = BRANCH_HALT_SKIP, .clkr = { .enable_reg = 0x52020, .enable_mask = BIT(5), @@ -3248,7 +3248,7 @@ static struct clk_branch gcc_pcie_4_pipe_clk = { static struct clk_branch gcc_pcie_4_pipediv2_clk = { .halt_reg = 0x6b054, - .halt_check = BRANCH_HALT_VOTED, + .halt_check = BRANCH_HALT_SKIP, .clkr = { .enable_reg = 0x52010, .enable_mask = BIT(27), @@ -3373,7 +3373,7 @@ static struct clk_branch gcc_pcie_5_pipe_clk = { static struct clk_branch gcc_pcie_5_pipediv2_clk = { .halt_reg = 0x2f054, - .halt_check = BRANCH_HALT_VOTED, + .halt_check = BRANCH_HALT_SKIP, .clkr = { .enable_reg = 0x52018, .enable_mask = BIT(19), @@ -3511,7 +3511,7 @@ static struct clk_branch gcc_pcie_6a_pipe_clk = { static struct clk_branch gcc_pcie_6a_pipediv2_clk = { .halt_reg = 0x31060, - .halt_check = BRANCH_HALT_VOTED, + .halt_check = BRANCH_HALT_SKIP, .clkr = { .enable_reg = 0x52018, .enable_mask = BIT(28), @@ -3649,7 +3649,7 @@ static struct clk_branch gcc_pcie_6b_pipe_clk = { static struct clk_branch gcc_pcie_6b_pipediv2_clk = { .halt_reg = 0x8d060, - .halt_check = BRANCH_HALT_VOTED, + .halt_check = BRANCH_HALT_SKIP, .clkr = { .enable_reg = 0x52010, .enable_mask = BIT(28), From 4c76f331a9a173ac8fe1297a9231c2a38f88e368 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 14:23:38 +0200 Subject: [PATCH 009/235] media: v4l2-ctrls-api: fix error handling for v4l2_g_ctrl() As detected by Coverity, the error check logic at get_ctrl() is broken: if ptr_to_user() fails to fill a control due to an error, no errors are returned and v4l2_g_ctrl() returns success on a failed operation, which may cause applications to fail. Add an error check at get_ctrl() and ensure that it will be returned to userspace without filling the control value if get_ctrl() fails. 
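The fix follows the usual chain-of-fallible-steps shape: each helper's
result feeds the final return value, and the caller only consumes the
output when the whole chain succeeded. A condensed, runnable sketch of
that shape (toy helpers, not the v4l2 API):

  #include <errno.h>
  #include <stdio.h>

  struct ctrl { int hw_ok; int value; };

  /* Toy helpers returning 0 on success or a negative errno. */
  static int fetch_hw_value(struct ctrl *c)
  {
          return c->hw_ok ? 0 : -EIO;
  }

  static int copy_to_caller(int *out, const struct ctrl *c)
  {
          *out = c->value;
          return 0;
  }

  static int read_ctrl(struct ctrl *c, int *out)
  {
          int ret = fetch_hw_value(c);          /* step 1 may fail */

          if (!ret)
                  ret = copy_to_caller(out, c); /* step 2 may fail too */
          return ret; /* 0 only if every step succeeded */
  }

  int main(void)
  {
          struct ctrl c = { .hw_ok = 0, .value = 42 };
          int v = -1;
          int ret = read_ctrl(&c, &v);

          /* On failure, v is left untouched instead of holding junk. */
          printf("ret=%d v=%d\n", ret, v); /* ret=-5 v=-1 */
          return 0;
  }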
Fixes: 71c689dc2e73 ("media: v4l2-ctrls: split up into four source files")
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab
---
 drivers/media/v4l2-core/v4l2-ctrls-api.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/media/v4l2-core/v4l2-ctrls-api.c b/drivers/media/v4l2-core/v4l2-ctrls-api.c
index e5a364efd5e6..95a2202879d8 100644
--- a/drivers/media/v4l2-core/v4l2-ctrls-api.c
+++ b/drivers/media/v4l2-core/v4l2-ctrls-api.c
@@ -753,9 +753,10 @@ static int get_ctrl(struct v4l2_ctrl *ctrl, struct v4l2_ext_control *c)
 		for (i = 0; i < master->ncontrols; i++)
 			cur_to_new(master->cluster[i]);
 		ret = call_op(master, g_volatile_ctrl);
-		new_to_user(c, ctrl);
+		if (!ret)
+			ret = new_to_user(c, ctrl);
 	} else {
-		cur_to_user(c, ctrl);
+		ret = cur_to_user(c, ctrl);
 	}
 	v4l2_ctrl_unlock(master);
 	return ret;
@@ -770,7 +771,10 @@ int v4l2_g_ctrl(struct v4l2_ctrl_handler *hdl, struct v4l2_control *control)
 	if (!ctrl || !ctrl->is_int)
 		return -EINVAL;
 	ret = get_ctrl(ctrl, &c);
-	control->value = c.value;
+
+	if (!ret)
+		control->value = c.value;
+
 	return ret;
 }
 EXPORT_SYMBOL(v4l2_g_ctrl);
@@ -811,10 +815,11 @@ static int set_ctrl_lock(struct v4l2_fh *fh, struct v4l2_ctrl *ctrl,
 	int ret;
 
 	v4l2_ctrl_lock(ctrl);
-	user_to_new(c, ctrl);
-	ret = set_ctrl(fh, ctrl, 0);
+	ret = user_to_new(c, ctrl);
 	if (!ret)
-		cur_to_user(c, ctrl);
+		ret = set_ctrl(fh, ctrl, 0);
+	if (!ret)
+		ret = cur_to_user(c, ctrl);
 	v4l2_ctrl_unlock(ctrl);
 	return ret;
 }

From e6a3ea83fbe15d4818d01804e904cbb0e64e543b Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Wed, 16 Oct 2024 11:53:15 +0200
Subject: [PATCH 010/235] media: v4l2-tpg: prevent the risk of a division by
 zero

As reported by Coverity, the logic at tpg_precalculate_line() blindly
rescales the buffer even when scaled_width is equal to zero. If this
ever happens, it will cause a division by zero.

Instead, add a WARN_ON_ONCE() to catch such cases and return without
doing any precalculation.

Fixes: 63881df94d3e ("[media] vivid: add the Test Pattern Generator")
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab
---
 drivers/media/common/v4l2-tpg/v4l2-tpg-core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/media/common/v4l2-tpg/v4l2-tpg-core.c b/drivers/media/common/v4l2-tpg/v4l2-tpg-core.c
index 642c48e8c1f5..ded11cd8dbf7 100644
--- a/drivers/media/common/v4l2-tpg/v4l2-tpg-core.c
+++ b/drivers/media/common/v4l2-tpg/v4l2-tpg-core.c
@@ -1795,6 +1795,9 @@ static void tpg_precalculate_line(struct tpg_data *tpg)
 	unsigned p;
 	unsigned x;
 
+	if (WARN_ON_ONCE(!tpg->src_width || !tpg->scaled_width))
+		return;
+
 	switch (tpg->pattern) {
 	case TPG_PAT_GREEN:
 		contrast = TPG_COLOR_100_RED;

From 972e63e895abbe8aa1ccbdbb4e6362abda7cd457 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 15 Oct 2024 15:23:01 +0200
Subject: [PATCH 011/235] media: dvbdev: prevent the risk of out of memory
 access

The dvbdev contains a static variable used to store dvb minors. Its
behavior depends on whether CONFIG_DVB_DYNAMIC_MINORS is set or not.

When not set, dvb_register_device() won't check for boundaries, as it
will rely on a previous call to dvb_register_adapter() already
enforcing them. In a similar way, dvb_device_open() uses the
assumption that the register functions already did the needed checks.

This can be fragile if some device ends up using different calls, and
it also generates warnings on static check analysers like Coverity.

So, add explicit guards to prevent the potential risk of OOM issues.
Fixes: 5dd3f3071070 ("V4L/DVB (9361): Dynamic DVB minor allocation") Signed-off-by: Mauro Carvalho Chehab --- drivers/media/dvb-core/dvbdev.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/media/dvb-core/dvbdev.c b/drivers/media/dvb-core/dvbdev.c index b43695bc51e7..14f323fbada7 100644 --- a/drivers/media/dvb-core/dvbdev.c +++ b/drivers/media/dvb-core/dvbdev.c @@ -86,10 +86,15 @@ static DECLARE_RWSEM(minor_rwsem); static int dvb_device_open(struct inode *inode, struct file *file) { struct dvb_device *dvbdev; + unsigned int minor = iminor(inode); + + if (minor >= MAX_DVB_MINORS) + return -ENODEV; mutex_lock(&dvbdev_mutex); down_read(&minor_rwsem); - dvbdev = dvb_minors[iminor(inode)]; + + dvbdev = dvb_minors[minor]; if (dvbdev && dvbdev->fops) { int err = 0; @@ -525,7 +530,7 @@ int dvb_register_device(struct dvb_adapter *adap, struct dvb_device **pdvbdev, for (minor = 0; minor < MAX_DVB_MINORS; minor++) if (!dvb_minors[minor]) break; - if (minor == MAX_DVB_MINORS) { + if (minor >= MAX_DVB_MINORS) { if (new_node) { list_del(&new_node->list_head); kfree(dvbdevfops); @@ -540,6 +545,14 @@ int dvb_register_device(struct dvb_adapter *adap, struct dvb_device **pdvbdev, } #else minor = nums2minor(adap->num, type, id); + if (minor >= MAX_DVB_MINORS) { + dvb_media_device_free(dvbdev); + list_del(&dvbdev->list_head); + kfree(dvbdev); + *pdvbdev = NULL; + mutex_unlock(&dvbdev_register_lock); + return ret; + } #endif dvbdev->minor = minor; dvb_minors[minor] = dvb_device_get(dvbdev); From 9883a4d41aba7612644e9bb807b971247cea9b9d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 16:05:16 +0200 Subject: [PATCH 012/235] media: dvb_frontend: don't play tricks with underflow values fepriv->auto_sub_step is unsigned. Setting it to -1 is just a trick to avoid calling continue, as reported by Coverity. It relies to have this code just afterwards: if (!ready) fepriv->auto_sub_step++; Simplify the code by simply setting it to zero and use continue to return to the while loop. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Mauro Carvalho Chehab --- drivers/media/dvb-core/dvb_frontend.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/media/dvb-core/dvb_frontend.c b/drivers/media/dvb-core/dvb_frontend.c index 4f78f30b3646..a05aa271a1ba 100644 --- a/drivers/media/dvb-core/dvb_frontend.c +++ b/drivers/media/dvb-core/dvb_frontend.c @@ -443,8 +443,8 @@ static int dvb_frontend_swzigzag_autotune(struct dvb_frontend *fe, int check_wra default: fepriv->auto_step++; - fepriv->auto_sub_step = -1; /* it'll be incremented to 0 in a moment */ - break; + fepriv->auto_sub_step = 0; + continue; } if (!ready) fepriv->auto_sub_step++; From 2aee207e5b3c94ef859316008119ea06d6798d49 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 10:33:10 +0200 Subject: [PATCH 013/235] media: mgb4: protect driver against spectre MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Frequency range is set from sysfs via frequency_range_store(), being vulnerable to spectre, as reported by smatch: drivers/media/pci/mgb4/mgb4_cmt.c:231 mgb4_cmt_set_vin_freq_range() warn: potential spectre issue 'cmt_vals_in' [r] drivers/media/pci/mgb4/mgb4_cmt.c:238 mgb4_cmt_set_vin_freq_range() warn: possible spectre second half. 'reg_set' Fix it. 
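The standard mitigation, which the fix below applies, is
array_index_nospec() from <linux/nospec.h>: after the architectural
bounds check, it clamps the index so that a speculatively
out-of-bounds value can never be used to index the table. The usual
calling pattern looks like this (a sketch, not the mgb4 code itself):

  #include <linux/nospec.h>

  /* idx comes from user space (e.g. sysfs); table has size entries. */
  static u32 table_lookup(const u32 *table, size_t size, size_t idx)
  {
          if (idx >= size)
                  return 0;                    /* architectural check */
          idx = array_index_nospec(idx, size); /* speculation-safe clamp */
          return table[idx];
  }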
Fixes: 0ab13674a9bd ("media: pci: mgb4: Added Digiteq Automotive MGB4 driver") Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Martin Tůma --- drivers/media/pci/mgb4/mgb4_cmt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/media/pci/mgb4/mgb4_cmt.c b/drivers/media/pci/mgb4/mgb4_cmt.c index 70dc78ef193c..a25b68403bc6 100644 --- a/drivers/media/pci/mgb4/mgb4_cmt.c +++ b/drivers/media/pci/mgb4/mgb4_cmt.c @@ -227,6 +227,8 @@ void mgb4_cmt_set_vin_freq_range(struct mgb4_vin_dev *vindev, u32 config; size_t i; + freq_range = array_index_nospec(freq_range, ARRAY_SIZE(cmt_vals_in)); + addr = cmt_addrs_in[vindev->config->id]; reg_set = cmt_vals_in[freq_range]; From 458ea1c0be991573ec436aa0afa23baacfae101a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 09:24:24 +0200 Subject: [PATCH 014/235] media: av7110: fix a spectre vulnerability As warned by smatch: drivers/staging/media/av7110/av7110_ca.c:270 dvb_ca_ioctl() warn: potential spectre issue 'av7110->ci_slot' [w] (local cap) There is a spectre-related vulnerability at the code. Fix it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab --- drivers/staging/media/av7110/av7110.h | 4 +++- drivers/staging/media/av7110/av7110_ca.c | 25 ++++++++++++++++-------- 2 files changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/staging/media/av7110/av7110.h b/drivers/staging/media/av7110/av7110.h index ec461fd187af..b584754f4be0 100644 --- a/drivers/staging/media/av7110/av7110.h +++ b/drivers/staging/media/av7110/av7110.h @@ -88,6 +88,8 @@ struct infrared { u32 ir_config; }; +#define MAX_CI_SLOTS 2 + /* place to store all the necessary device information */ struct av7110 { /* devices */ @@ -163,7 +165,7 @@ struct av7110 { /* CA */ - struct ca_slot_info ci_slot[2]; + struct ca_slot_info ci_slot[MAX_CI_SLOTS]; enum av7110_video_mode vidmode; struct dmxdev dmxdev; diff --git a/drivers/staging/media/av7110/av7110_ca.c b/drivers/staging/media/av7110/av7110_ca.c index 6ce212c64e5d..fce4023c9dea 100644 --- a/drivers/staging/media/av7110/av7110_ca.c +++ b/drivers/staging/media/av7110/av7110_ca.c @@ -26,23 +26,28 @@ void CI_handle(struct av7110 *av7110, u8 *data, u16 len) { + unsigned slot_num; + dprintk(8, "av7110:%p\n", av7110); if (len < 3) return; switch (data[0]) { case CI_MSG_CI_INFO: - if (data[2] != 1 && data[2] != 2) + if (data[2] != 1 && data[2] != MAX_CI_SLOTS) break; + + slot_num = array_index_nospec(data[2] - 1, MAX_CI_SLOTS); + switch (data[1]) { case 0: - av7110->ci_slot[data[2] - 1].flags = 0; + av7110->ci_slot[slot_num].flags = 0; break; case 1: - av7110->ci_slot[data[2] - 1].flags |= CA_CI_MODULE_PRESENT; + av7110->ci_slot[slot_num].flags |= CA_CI_MODULE_PRESENT; break; case 2: - av7110->ci_slot[data[2] - 1].flags |= CA_CI_MODULE_READY; + av7110->ci_slot[slot_num].flags |= CA_CI_MODULE_READY; break; } break; @@ -262,15 +267,19 @@ static int dvb_ca_ioctl(struct file *file, unsigned int cmd, void *parg) case CA_GET_SLOT_INFO: { struct ca_slot_info *info = (struct ca_slot_info *)parg; + unsigned int slot_num; if (info->num < 0 || info->num > 1) { mutex_unlock(&av7110->ioctl_mutex); return -EINVAL; } - av7110->ci_slot[info->num].num = info->num; - av7110->ci_slot[info->num].type = FW_CI_LL_SUPPORT(av7110->arm_app) ? 
- CA_CI_LINK : CA_CI; - memcpy(info, &av7110->ci_slot[info->num], sizeof(struct ca_slot_info)); + slot_num = array_index_nospec(info->num, MAX_CI_SLOTS); + + av7110->ci_slot[slot_num].num = info->num; + av7110->ci_slot[slot_num].type = FW_CI_LL_SUPPORT(av7110->arm_app) ? + CA_CI_LINK : CA_CI; + memcpy(info, &av7110->ci_slot[slot_num], + sizeof(struct ca_slot_info)); break; } From 14a22762c3daeac59a5a534e124acbb4d7a79b3a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 11:10:31 +0200 Subject: [PATCH 015/235] media: s5p-jpeg: prevent buffer overflows The current logic allows word to be less than 2. If this happens, there will be buffer overflows, as reported by smatch. Add extra checks to prevent it. While here, remove an unused word = 0 assignment. Fixes: 6c96dbbc2aa9 ("[media] s5p-jpeg: add support for 5433") Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Jacek Anaszewski --- .../media/platform/samsung/s5p-jpeg/jpeg-core.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/media/platform/samsung/s5p-jpeg/jpeg-core.c b/drivers/media/platform/samsung/s5p-jpeg/jpeg-core.c index d2c4a0178b3c..1db4609b3557 100644 --- a/drivers/media/platform/samsung/s5p-jpeg/jpeg-core.c +++ b/drivers/media/platform/samsung/s5p-jpeg/jpeg-core.c @@ -775,11 +775,14 @@ static void exynos4_jpeg_parse_decode_h_tbl(struct s5p_jpeg_ctx *ctx) (unsigned long)vb2_plane_vaddr(&vb->vb2_buf, 0) + ctx->out_q.sos + 2; jpeg_buffer.curr = 0; - word = 0; - if (get_word_be(&jpeg_buffer, &word)) return; - jpeg_buffer.size = (long)word - 2; + + if (word < 2) + jpeg_buffer.size = 0; + else + jpeg_buffer.size = (long)word - 2; + jpeg_buffer.data += 2; jpeg_buffer.curr = 0; @@ -1058,6 +1061,7 @@ static int get_word_be(struct s5p_jpeg_buffer *buf, unsigned int *word) if (byte == -1) return -1; *word = (unsigned int)byte | temp; + return 0; } @@ -1145,7 +1149,7 @@ static bool s5p_jpeg_parse_hdr(struct s5p_jpeg_q_data *result, if (get_word_be(&jpeg_buffer, &word)) break; length = (long)word - 2; - if (!length) + if (length <= 0) return false; sof = jpeg_buffer.curr; /* after 0xffc0 */ sof_len = length; @@ -1176,7 +1180,7 @@ static bool s5p_jpeg_parse_hdr(struct s5p_jpeg_q_data *result, if (get_word_be(&jpeg_buffer, &word)) break; length = (long)word - 2; - if (!length) + if (length <= 0) return false; if (n_dqt >= S5P_JPEG_MAX_MARKER) return false; @@ -1189,7 +1193,7 @@ static bool s5p_jpeg_parse_hdr(struct s5p_jpeg_q_data *result, if (get_word_be(&jpeg_buffer, &word)) break; length = (long)word - 2; - if (!length) + if (length <= 0) return false; if (n_dht >= S5P_JPEG_MAX_MARKER) return false; @@ -1214,6 +1218,7 @@ static bool s5p_jpeg_parse_hdr(struct s5p_jpeg_q_data *result, if (get_word_be(&jpeg_buffer, &word)) break; length = (long)word - 2; + /* No need to check underflows as skip() does it */ skip(&jpeg_buffer, length); break; } From 438d3085ba5b8b5bfa5290faa594e577f6ac9aa7 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 11:38:10 +0200 Subject: [PATCH 016/235] media: ar0521: don't overflow when checking PLL values The PLL checks are comparing 64 bit integers with 32 bit ones, as reported by Coverity. Depending on the values of the variables, this may underflow. Fix it ensuring that both sides of the expression are u64. 
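The hazard is plain C integer promotion: a product of two 32-bit
operands is computed in 32 bits and can wrap before it is compared
against the 64-bit left-hand side. A small runnable demonstration
(generic values, not the driver's constants):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint32_t a = 600000000, b = 8; /* plausible clock-sized values */

          /* 32-bit multiply: 4800000000 wraps modulo 2^32 first. */
          uint64_t wrong = a * b;

          /* Casting one operand forces the multiply into 64 bits. */
          uint64_t right = (uint64_t)a * b;

          printf("wrong=%llu right=%llu\n",
                 (unsigned long long)wrong, (unsigned long long)right);
          return 0;
  }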
Fixes: 852b50aeed15 ("media: On Semi AR0521 sensor driver")
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab
Acked-by: Sakari Ailus
---
 drivers/media/i2c/ar0521.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/media/i2c/ar0521.c b/drivers/media/i2c/ar0521.c
index fc27238dd4d3..24873149096c 100644
--- a/drivers/media/i2c/ar0521.c
+++ b/drivers/media/i2c/ar0521.c
@@ -255,10 +255,10 @@ static u32 calc_pll(struct ar0521_dev *sensor, u32 freq, u16 *pre_ptr, u16 *mult
 			continue; /* Minimum value */
 		if (new_mult > 254)
 			break; /* Maximum, larger pre won't work either */
-		if (sensor->extclk_freq * (u64)new_mult < AR0521_PLL_MIN *
+		if (sensor->extclk_freq * (u64)new_mult < (u64)AR0521_PLL_MIN *
 		    new_pre)
 			continue;
-		if (sensor->extclk_freq * (u64)new_mult > AR0521_PLL_MAX *
+		if (sensor->extclk_freq * (u64)new_mult > (u64)AR0521_PLL_MAX *
 		    new_pre)
 			break; /* Larger pre won't work either */
 		new_pll = div64_round_up(sensor->extclk_freq * (u64)new_mult,

From 576a307a7650bd544fbb24df801b9b7863b85e2f Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 15 Oct 2024 12:14:11 +0200
Subject: [PATCH 017/235] media: cx24116: prevent overflows on SNR calculus

As reported by Coverity, if reading the SNR registers fails, a
negative number will be returned, causing an underflow in the SNR
calculus. Prevent that.

Fixes: 8953db793d5b ("V4L/DVB (9178): cx24116: Add module parameter to return SNR as ESNO.")
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab
---
 drivers/media/dvb-frontends/cx24116.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/media/dvb-frontends/cx24116.c b/drivers/media/dvb-frontends/cx24116.c
index 8b978a9f74a4..f5dd3a81725a 100644
--- a/drivers/media/dvb-frontends/cx24116.c
+++ b/drivers/media/dvb-frontends/cx24116.c
@@ -741,6 +741,7 @@ static int cx24116_read_snr_pct(struct dvb_frontend *fe, u16 *snr)
 {
 	struct cx24116_state *state = fe->demodulator_priv;
 	u8 snr_reading;
+	int ret;
 	static const u32 snr_tab[] = { /* 10 x Table (rounded up) */
 		0x00000, 0x0199A, 0x03333, 0x04ccD, 0x06667,
 		0x08000, 0x0999A, 0x0b333, 0x0cccD, 0x0e667,
@@ -749,7 +750,11 @@ static int cx24116_read_snr_pct(struct dvb_frontend *fe, u16 *snr)
 
 	dprintk("%s()\n", __func__);
 
-	snr_reading = cx24116_readreg(state, CX24116_REG_QUALITY0);
+	ret = cx24116_readreg(state, CX24116_REG_QUALITY0);
+	if (ret < 0)
+		return ret;
+
+	snr_reading = ret;
 
 	if (snr_reading >= 0xa0 /* 100% */)
 		*snr = 0xffff;

From 50b9fa751d1aef5d262bde871c70a7f44262f0bc Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 15 Oct 2024 12:25:09 +0200
Subject: [PATCH 018/235] media: adv7604: prevent underflow condition when
 reporting colorspace

Currently, adv76xx_log_status() reads some data using io_read(), which
may return negative values. The current logic doesn't check for such
errors, causing the colorspace to be reported in a wrong way at
adv76xx_log_status(), as reported by Coverity.

If an I/O error happens there, print a different message instead of
reporting bogus messages to userspace.
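io_read() follows a calling convention that is common in the kernel: a
single int multiplexes a negative errno and a byte value, so the
caller must test the sign before narrowing the result to u8. A
condensed user-space sketch of that convention (a toy register read,
not the adv7604 code):

  #include <errno.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Toy stand-in for a register read: negative errno or 0..255. */
  static int reg_read(int fail)
  {
          return fail ? -EIO : 0x5a;
  }

  int main(void)
  {
          int ret = reg_read(0);

          if (ret < 0) {
                  printf("can't read register: %d\n", ret);
          } else {
                  uint8_t val = ret; /* safe narrowing: ret is 0..255 */
                  printf("value: 0x%02x\n", val);
          }
          return 0;
  }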
Fixes: 54450f591c99 ("[media] adv7604: driver for the Analog Devices ADV7604 video decoder") Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Hans Verkuil --- drivers/media/i2c/adv7604.c | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/drivers/media/i2c/adv7604.c b/drivers/media/i2c/adv7604.c index 48230d5109f0..272945a878b3 100644 --- a/drivers/media/i2c/adv7604.c +++ b/drivers/media/i2c/adv7604.c @@ -2519,10 +2519,10 @@ static int adv76xx_log_status(struct v4l2_subdev *sd) const struct adv76xx_chip_info *info = state->info; struct v4l2_dv_timings timings; struct stdi_readback stdi; - u8 reg_io_0x02 = io_read(sd, 0x02); + int ret; + u8 reg_io_0x02; u8 edid_enabled; u8 cable_det; - static const char * const csc_coeff_sel_rb[16] = { "bypassed", "YPbPr601 -> RGB", "reserved", "YPbPr709 -> RGB", "reserved", "RGB -> YPbPr601", "reserved", "RGB -> YPbPr709", @@ -2621,13 +2621,21 @@ static int adv76xx_log_status(struct v4l2_subdev *sd) v4l2_info(sd, "-----Color space-----\n"); v4l2_info(sd, "RGB quantization range ctrl: %s\n", rgb_quantization_range_txt[state->rgb_quantization_range]); - v4l2_info(sd, "Input color space: %s\n", - input_color_space_txt[reg_io_0x02 >> 4]); - v4l2_info(sd, "Output color space: %s %s, alt-gamma %s\n", - (reg_io_0x02 & 0x02) ? "RGB" : "YCbCr", - (((reg_io_0x02 >> 2) & 0x01) ^ (reg_io_0x02 & 0x01)) ? - "(16-235)" : "(0-255)", - (reg_io_0x02 & 0x08) ? "enabled" : "disabled"); + + ret = io_read(sd, 0x02); + if (ret < 0) { + v4l2_info(sd, "Can't read Input/Output color space\n"); + } else { + reg_io_0x02 = ret; + + v4l2_info(sd, "Input color space: %s\n", + input_color_space_txt[reg_io_0x02 >> 4]); + v4l2_info(sd, "Output color space: %s %s, alt-gamma %s\n", + (reg_io_0x02 & 0x02) ? "RGB" : "YCbCr", + (((reg_io_0x02 >> 2) & 0x01) ^ (reg_io_0x02 & 0x01)) ? + "(16-235)" : "(0-255)", + (reg_io_0x02 & 0x08) ? "enabled" : "disabled"); + } v4l2_info(sd, "Color space conversion: %s\n", csc_coeff_sel_rb[cp_read(sd, info->cp_csc) >> 4]); From 2d861977e7314f00bf27d0db17c11ff5e85e609a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 15 Oct 2024 13:29:43 +0200 Subject: [PATCH 019/235] media: stb0899_algo: initialize cfr before using it The loop at stb0899_search_carrier() starts with a random value for cfr, as reported by Coverity. Initialize it to zero, just like stb0899_dvbs_algo() to ensure that carrier search won't bail out. Fixes: 8bd135bab91f ("V4L/DVB (9375): Add STB0899 support") Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab --- drivers/media/dvb-frontends/stb0899_algo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/dvb-frontends/stb0899_algo.c b/drivers/media/dvb-frontends/stb0899_algo.c index df89c33dac23..40537c4ccb0d 100644 --- a/drivers/media/dvb-frontends/stb0899_algo.c +++ b/drivers/media/dvb-frontends/stb0899_algo.c @@ -269,7 +269,7 @@ static enum stb0899_status stb0899_search_carrier(struct stb0899_state *state) short int derot_freq = 0, last_derot_freq = 0, derot_limit, next_loop = 3; int index = 0; - u8 cfr[2]; + u8 cfr[2] = {0}; u8 reg; internal->status = NOCARRIER; From eba6a8619d2b988f9b3a34e6b552a34fa2057d61 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 16 Oct 2024 11:03:26 +0200 Subject: [PATCH 020/235] media: cec: extron-da-hd-4k-plus: don't use -1 as an error code The logic at get_edid_tag_location() returns either an offset or an error condition. However, the error condition uses a non-standard "-1" value. 
This hits a Coverity bug, as Coverity assumes that positive values are underflow. While this is a false positive, returning error codes as -1 is an issue. So, instead, use -ENOENT to indicate that the tag was not found. Fixes: 056f2821b631 ("media: cec: extron-da-hd-4k-plus: add the Extron DA HD 4K Plus CEC driver") Signed-off-by: Mauro Carvalho Chehab --- .../cec/usb/extron-da-hd-4k-plus/extron-da-hd-4k-plus.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/media/cec/usb/extron-da-hd-4k-plus/extron-da-hd-4k-plus.c b/drivers/media/cec/usb/extron-da-hd-4k-plus/extron-da-hd-4k-plus.c index 8526f613a40e..cfbfc4c1b2e6 100644 --- a/drivers/media/cec/usb/extron-da-hd-4k-plus/extron-da-hd-4k-plus.c +++ b/drivers/media/cec/usb/extron-da-hd-4k-plus/extron-da-hd-4k-plus.c @@ -348,12 +348,12 @@ static int get_edid_tag_location(const u8 *edid, unsigned int size, /* Return if not a CTA-861 extension block */ if (size < 256 || edid[0] != 0x02 || edid[1] != 0x03) - return -1; + return -ENOENT; /* search tag */ d = edid[0x02] & 0x7f; if (d <= 4) - return -1; + return -ENOENT; i = 0x04; end = 0x00 + d; @@ -371,7 +371,7 @@ static int get_edid_tag_location(const u8 *edid, unsigned int size, return offset + i; i += len + 1; } while (i < end); - return -1; + return -ENOENT; } static void extron_edid_crc(u8 *edid) From ba9cf6b430433e57bfc8072364e944b7c0eca2a4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 16 Oct 2024 11:24:15 +0200 Subject: [PATCH 021/235] media: pulse8-cec: fix data timestamp at pulse8_setup() As pointed by Coverity, there is a hidden overflow condition there. As date is signed and u8 is unsigned, doing: date = (data[0] << 24) With a value bigger than 07f will make all upper bits of date 0xffffffff. This can be demonstrated with this small code: typedef int64_t time64_t; typedef uint8_t u8; int main(void) { u8 data[] = { 0xde ,0xad , 0xbe, 0xef }; time64_t date; date = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]; printf("Invalid data = 0x%08lx\n", date); date = ((unsigned)data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]; printf("Expected data = 0x%08lx\n", date); return 0; } Fix it by converting the upper bit calculation to unsigned. Fixes: cea28e7a55e7 ("media: pulse8-cec: reorganize function order") Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab --- drivers/media/cec/usb/pulse8/pulse8-cec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/cec/usb/pulse8/pulse8-cec.c b/drivers/media/cec/usb/pulse8/pulse8-cec.c index ba67587bd43e..171366fe3544 100644 --- a/drivers/media/cec/usb/pulse8/pulse8-cec.c +++ b/drivers/media/cec/usb/pulse8/pulse8-cec.c @@ -685,7 +685,7 @@ static int pulse8_setup(struct pulse8 *pulse8, struct serio *serio, err = pulse8_send_and_wait(pulse8, cmd, 1, cmd[0], 4); if (err) return err; - date = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]; + date = ((unsigned)data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]; dev_info(pulse8->dev, "Firmware build date %ptT\n", &date); dev_dbg(pulse8->dev, "Persistent config:\n"); From 404b739e895522838f1abdc340c554654d671dde Mon Sep 17 00:00:00 2001 From: Umang Jain Date: Wed, 16 Oct 2024 18:32:24 +0530 Subject: [PATCH 022/235] staging: vchiq_arm: Use devm_kzalloc() for vchiq_arm_state allocation The struct vchiq_arm_state 'platform_state' is currently allocated dynamically using kzalloc(). 
Unfortunately, it is never freed and is subjected to memory leaks in the error handling paths of the probe() function. To address the issue, use device resource management helper devm_kzalloc(), to ensure cleanup after its allocation. Fixes: 71bad7f08641 ("staging: add bcm2708 vchiq driver") Cc: stable@vger.kernel.org Signed-off-by: Umang Jain Reviewed-by: Dan Carpenter Link: https://lore.kernel.org/r/20241016130225.61024-2-umang.jain@ideasonboard.com Signed-off-by: Greg Kroah-Hartman --- drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c index 3dbeffc650d3..0d8d5555e8af 100644 --- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c +++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c @@ -593,7 +593,7 @@ vchiq_platform_init_state(struct vchiq_state *state) { struct vchiq_arm_state *platform_state; - platform_state = kzalloc(sizeof(*platform_state), GFP_KERNEL); + platform_state = devm_kzalloc(state->dev, sizeof(*platform_state), GFP_KERNEL); if (!platform_state) return -ENOMEM; From 807babf69027b4f1c55e72b06879658e83830880 Mon Sep 17 00:00:00 2001 From: Umang Jain Date: Wed, 16 Oct 2024 18:32:25 +0530 Subject: [PATCH 023/235] staging: vchiq_arm: Use devm_kzalloc() for drv_mgmt allocation The struct drv_mgmt 'mgmt' is currently allocated dynamically using kzalloc(). Unfortunately, it is subjected to memory leaks in the error handling paths of the probe() function. To address this issue, use device resource management helper devm_kzalloc(), to ensure cleanup after the allocation. Cc: stable@vger.kernel.org Fixes: 1c9e16b73166 ("staging: vc04_services: vchiq_arm: Split driver static and runtime data") Signed-off-by: Umang Jain Reviewed-by: Dan Carpenter Link: https://lore.kernel.org/r/20241016130225.61024-3-umang.jain@ideasonboard.com Signed-off-by: Greg Kroah-Hartman --- drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c index 0d8d5555e8af..6c488b1e2624 100644 --- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c +++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c @@ -1731,7 +1731,7 @@ static int vchiq_probe(struct platform_device *pdev) return -ENOENT; } - mgmt = kzalloc(sizeof(*mgmt), GFP_KERNEL); + mgmt = devm_kzalloc(&pdev->dev, sizeof(*mgmt), GFP_KERNEL); if (!mgmt) return -ENOMEM; @@ -1789,8 +1789,6 @@ static void vchiq_remove(struct platform_device *pdev) arm_state = vchiq_platform_get_arm_state(&mgmt->state); kthread_stop(arm_state->ka_thread); - - kfree(mgmt); } static struct platform_driver vchiq_driver = { From dad2f20715163e80aab284fb092efc8c18bf97c7 Mon Sep 17 00:00:00 2001 From: Daniel Burgener Date: Tue, 15 Oct 2024 13:26:46 -0400 Subject: [PATCH 024/235] landlock: Fix grammar issues in documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Improve user space and kernel documentation. 
Signed-off-by: Daniel Burgener Link: https://lore.kernel.org/r/20241015172647.2007644-1-dburgener@linux.microsoft.com [mic: Extend commit message, reword ptrace restriction as discussed in the thread] Signed-off-by: Mickaël Salaün --- Documentation/security/landlock.rst | 14 ++--- Documentation/userspace-api/landlock.rst | 69 ++++++++++++------------ 2 files changed, 41 insertions(+), 42 deletions(-) diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst index 36f26501fd15..59ecdb1c0d4d 100644 --- a/Documentation/security/landlock.rst +++ b/Documentation/security/landlock.rst @@ -11,18 +11,18 @@ Landlock LSM: kernel documentation Landlock's goal is to create scoped access-control (i.e. sandboxing). To harden a whole system, this feature should be available to any process, -including unprivileged ones. Because such process may be compromised or +including unprivileged ones. Because such a process may be compromised or backdoored (i.e. untrusted), Landlock's features must be safe to use from the kernel and other processes point of view. Landlock's interface must therefore expose a minimal attack surface. Landlock is designed to be usable by unprivileged processes while following the system security policy enforced by other access control mechanisms (e.g. DAC, -LSM). Indeed, a Landlock rule shall not interfere with other access-controls -enforced on the system, only add more restrictions. +LSM). A Landlock rule shall not interfere with other access-controls enforced +on the system, only add more restrictions. Any user can enforce Landlock rulesets on their processes. They are merged and -evaluated according to the inherited ones in a way that ensures that only more +evaluated against inherited rulesets in a way that ensures that only more constraints can be added. User space documentation can be found here: @@ -43,7 +43,7 @@ Guiding principles for safe access controls only impact the processes requesting them. * Resources (e.g. file descriptors) directly obtained from the kernel by a sandboxed process shall retain their scoped accesses (at the time of resource - acquisition) whatever process use them. + acquisition) whatever process uses them. Cf. `File descriptor access rights`_. Design choices @@ -71,7 +71,7 @@ the same results, when they are executed under the same Landlock domain. Taking the ``LANDLOCK_ACCESS_FS_TRUNCATE`` right as an example, it may be allowed to open a file for writing without being allowed to :manpage:`ftruncate` the resulting file descriptor if the related file -hierarchy doesn't grant such access right. The following sequences of +hierarchy doesn't grant that access right. The following sequences of operations have the same semantic and should then have the same result: * ``truncate(path);`` @@ -81,7 +81,7 @@ Similarly to file access modes (e.g. ``O_RDWR``), Landlock access rights attached to file descriptors are retained even if they are passed between processes (e.g. through a Unix domain socket). Such access rights will then be enforced even if the receiving process is not sandboxed by Landlock. Indeed, -this is required to keep a consistent access control over the whole system, and +this is required to keep access controls consistent over the whole system, and this avoids unattended bypasses through file descriptor passing (i.e. confused deputy attack). 
diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst index bb7480a05e2c..d639c61cb472 100644 --- a/Documentation/userspace-api/landlock.rst +++ b/Documentation/userspace-api/landlock.rst @@ -10,11 +10,11 @@ Landlock: unprivileged access control :Author: Mickaël Salaün :Date: October 2024 -The goal of Landlock is to enable to restrict ambient rights (e.g. global +The goal of Landlock is to enable restriction of ambient rights (e.g. global filesystem or network access) for a set of processes. Because Landlock -is a stackable LSM, it makes possible to create safe security sandboxes as new -security layers in addition to the existing system-wide access-controls. This -kind of sandbox is expected to help mitigate the security impact of bugs or +is a stackable LSM, it makes it possible to create safe security sandboxes as +new security layers in addition to the existing system-wide access-controls. +This kind of sandbox is expected to help mitigate the security impact of bugs or unexpected/malicious behaviors in user space applications. Landlock empowers any process, including unprivileged ones, to securely restrict themselves. @@ -86,8 +86,8 @@ to be explicit about the denied-by-default access rights. LANDLOCK_SCOPE_SIGNAL, }; -Because we may not know on which kernel version an application will be -executed, it is safer to follow a best-effort security approach. Indeed, we +Because we may not know which kernel version an application will be executed +on, it is safer to follow a best-effort security approach. Indeed, we should try to protect users as much as possible whatever the kernel they are using. @@ -129,7 +129,7 @@ version, and only use the available subset of access rights: LANDLOCK_SCOPE_SIGNAL); } -This enables to create an inclusive ruleset that will contain our rules. +This enables the creation of an inclusive ruleset that will contain our rules. .. code-block:: c @@ -219,42 +219,41 @@ If the ``landlock_restrict_self`` system call succeeds, the current thread is now restricted and this policy will be enforced on all its subsequently created children as well. Once a thread is landlocked, there is no way to remove its security policy; only adding more restrictions is allowed. These threads are -now in a new Landlock domain, merge of their parent one (if any) with the new -ruleset. +now in a new Landlock domain, which is a merger of their parent one (if any) +with the new ruleset. Full working code can be found in `samples/landlock/sandboxer.c`_. Good practices -------------- -It is recommended setting access rights to file hierarchy leaves as much as +It is recommended to set access rights to file hierarchy leaves as much as possible. For instance, it is better to be able to have ``~/doc/`` as a read-only hierarchy and ``~/tmp/`` as a read-write hierarchy, compared to ``~/`` as a read-only hierarchy and ``~/tmp/`` as a read-write hierarchy. Following this good practice leads to self-sufficient hierarchies that do not depend on their location (i.e. parent directories). This is particularly relevant when we want to allow linking or renaming. Indeed, having consistent -access rights per directory enables to change the location of such directory +access rights per directory enables changing the location of such directories without relying on the destination directory access rights (except those that are required for this operation, see ``LANDLOCK_ACCESS_FS_REFER`` documentation). 
Having self-sufficient hierarchies also helps to tighten the required access rights to the minimal set of data. This also helps avoid sinkhole directories, -i.e. directories where data can be linked to but not linked from. However, +i.e. directories where data can be linked to but not linked from. However, this depends on data organization, which might not be controlled by developers. In this case, granting read-write access to ``~/tmp/``, instead of write-only -access, would potentially allow to move ``~/tmp/`` to a non-readable directory +access, would potentially allow moving ``~/tmp/`` to a non-readable directory and still keep the ability to list the content of ``~/tmp/``. Layers of file path access rights --------------------------------- Each time a thread enforces a ruleset on itself, it updates its Landlock domain -with a new layer of policy. Indeed, this complementary policy is stacked with -the potentially other rulesets already restricting this thread. A sandboxed -thread can then safely add more constraints to itself with a new enforced -ruleset. +with a new layer of policy. This complementary policy is stacked with any +other rulesets potentially already restricting this thread. A sandboxed thread +can then safely add more constraints to itself with a new enforced ruleset. One policy layer grants access to a file path if at least one of its rules encountered on the path grants the access. A sandboxed thread can only access @@ -265,7 +264,7 @@ etc.). Bind mounts and OverlayFS ------------------------- -Landlock enables to restrict access to file hierarchies, which means that these +Landlock enables restricting access to file hierarchies, which means that these access rights can be propagated with bind mounts (cf. Documentation/filesystems/sharedsubtree.rst) but not with Documentation/filesystems/overlayfs.rst. @@ -278,21 +277,21 @@ access to multiple file hierarchies at the same time, whether these hierarchies are the result of bind mounts or not. An OverlayFS mount point consists of upper and lower layers. These layers are -combined in a merge directory, result of the mount point. This merge hierarchy -may include files from the upper and lower layers, but modifications performed -on the merge hierarchy only reflects on the upper layer. From a Landlock -policy point of view, each OverlayFS layers and merge hierarchies are -standalone and contains their own set of files and directories, which is -different from bind mounts. A policy restricting an OverlayFS layer will not -restrict the resulted merged hierarchy, and vice versa. Landlock users should -then only think about file hierarchies they want to allow access to, regardless -of the underlying filesystem. +combined in a merge directory, and that merged directory becomes available at +the mount point. This merge hierarchy may include files from the upper and +lower layers, but modifications performed on the merge hierarchy only reflect +on the upper layer. From a Landlock policy point of view, all OverlayFS layers +and merge hierarchies are standalone and each contains their own set of files +and directories, which is different from bind mounts. A policy restricting an +OverlayFS layer will not restrict the resulted merged hierarchy, and vice versa. +Landlock users should then only think about file hierarchies they want to allow +access to, regardless of the underlying filesystem. Inheritance ----------- Every new thread resulting from a :manpage:`clone(2)` inherits Landlock domain -restrictions from its parent. 
This is similar to the seccomp inheritance (cf. +restrictions from its parent. This is similar to seccomp inheritance (cf. Documentation/userspace-api/seccomp_filter.rst) or any other LSM dealing with task's :manpage:`credentials(7)`. For instance, one process's thread may apply Landlock rules to itself, but they will not be automatically applied to other @@ -311,8 +310,8 @@ Ptrace restrictions A sandboxed process has less privileges than a non-sandboxed process and must then be subject to additional restrictions when manipulating another process. To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target -process, a sandboxed process should have a subset of the target process rules, -which means the tracee must be in a sub-domain of the tracer. +process, a sandboxed process should have a superset of the target process's +access rights, which means the tracee must be in a sub-domain of the tracer. IPC scoping ----------- @@ -322,7 +321,7 @@ interactions between sandboxes. Each Landlock domain can be explicitly scoped for a set of actions by specifying it on a ruleset. For example, if a sandboxed process should not be able to :manpage:`connect(2)` to a non-sandboxed process through abstract :manpage:`unix(7)` sockets, we can -specify such restriction with ``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET``. +specify such a restriction with ``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET``. Moreover, if a sandboxed process should not be able to send a signal to a non-sandboxed process, we can specify this restriction with ``LANDLOCK_SCOPE_SIGNAL``. @@ -394,7 +393,7 @@ Backward and forward compatibility Landlock is designed to be compatible with past and future versions of the kernel. This is achieved thanks to the system call attributes and the associated bitflags, particularly the ruleset's ``handled_access_fs``. Making -handled access right explicit enables the kernel and user space to have a clear +handled access rights explicit enables the kernel and user space to have a clear contract with each other. This is required to make sure sandboxing will not get stricter with a system update, which could break applications. @@ -606,9 +605,9 @@ Build time configuration Landlock was first introduced in Linux 5.13 but it must be configured at build time with ``CONFIG_SECURITY_LANDLOCK=y``. Landlock must also be enabled at boot -time as the other security modules. The list of security modules enabled by +time like other security modules. The list of security modules enabled by default is set with ``CONFIG_LSM``. The kernel configuration should then -contains ``CONFIG_LSM=landlock,[...]`` with ``[...]`` as the list of other +contain ``CONFIG_LSM=landlock,[...]`` with ``[...]`` as the list of other potentially useful security modules for the running system (see the ``CONFIG_LSM`` help). @@ -670,7 +669,7 @@ Questions and answers What about user space sandbox managers? --------------------------------------- -Using user space process to enforce restrictions on kernel resources can lead +Using user space processes to enforce restrictions on kernel resources can lead to race conditions or inconsistent evaluations (i.e. `Incorrect mirroring of the OS code and state `_). 
From 387285530d1d4bdba8c5dff5aeabd8d71638173f Mon Sep 17 00:00:00 2001 From: Matthieu Buffet Date: Sat, 19 Oct 2024 17:15:32 +0200 Subject: [PATCH 025/235] samples/landlock: Fix port parsing in sandboxer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If you want to specify that no port can be bind()ed, you would think (looking quickly at both help message and code) that setting LL_TCP_BIND="" would do it. However the code splits on ":" then applies atoi(), which does not allow checking for errors. Passing an empty string returns 0, which is interpreted as "allow bind(0)", which means bind to any ephemeral port. This bug occurs whenever passing an empty string or when leaving a trailing/leading colon, making it impossible to completely deny bind(). To reproduce: export LL_FS_RO="/" LL_FS_RW="" LL_TCP_BIND="" ./sandboxer strace -e bind nc -n -vvv -l -p 0 Executing the sandboxed command... bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 Listening on 0.0.0.0 37629 Use strtoull(3) instead, which allows error checking. Check that the entire string has been parsed correctly without overflows/underflows, but not that the __u64 (the type of struct landlock_net_port_attr.port) is a valid __u16 port: that is already done by the kernel. Fixes: 5e990dcef12e ("samples/landlock: Support TCP restrictions") Signed-off-by: Matthieu Buffet Link: https://lore.kernel.org/r/20241019151534.1400605-2-matthieu@buffet.re Signed-off-by: Mickaël Salaün --- samples/landlock/sandboxer.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/samples/landlock/sandboxer.c b/samples/landlock/sandboxer.c index f847e832ba14..4cbef9d2f15b 100644 --- a/samples/landlock/sandboxer.c +++ b/samples/landlock/sandboxer.c @@ -60,6 +60,25 @@ static inline int landlock_restrict_self(const int ruleset_fd, #define ENV_SCOPED_NAME "LL_SCOPED" #define ENV_DELIMITER ":" +static int str2num(const char *numstr, __u64 *num_dst) +{ + char *endptr = NULL; + int err = 0; + __u64 num; + + errno = 0; + num = strtoull(numstr, &endptr, 10); + if (errno != 0) + err = errno; + /* Was the string empty, or not entirely parsed successfully? 
*/ + else if ((*numstr == '\0') || (*endptr != '\0')) + err = EINVAL; + else + *num_dst = num; + + return err; +} + static int parse_path(char *env_path, const char ***const path_list) { int i, num_paths = 0; @@ -160,7 +179,6 @@ static int populate_ruleset_net(const char *const env_var, const int ruleset_fd, char *env_port_name, *env_port_name_next, *strport; struct landlock_net_port_attr net_port = { .allowed_access = allowed_access, - .port = 0, }; env_port_name = getenv(env_var); @@ -171,7 +189,17 @@ static int populate_ruleset_net(const char *const env_var, const int ruleset_fd, env_port_name_next = env_port_name; while ((strport = strsep(&env_port_name_next, ENV_DELIMITER))) { - net_port.port = atoi(strport); + __u64 port; + + if (strcmp(strport, "") == 0) + continue; + + if (str2num(strport, &port)) { + fprintf(stderr, "Failed to parse port at \"%s\"\n", + strport); + goto out_free_name; + } + net_port.port = port; if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT, &net_port, 0)) { fprintf(stderr, From f51e55a0892bd2030c847d4583c12498bb93f812 Mon Sep 17 00:00:00 2001 From: Matthieu Buffet Date: Sat, 19 Oct 2024 17:15:33 +0200 Subject: [PATCH 026/235] samples/landlock: Refactor help message MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Help message is getting larger with each new supported feature (scopes, and soon UDP). Also the large number of calls to fprintf with environment variables make it hard to read. Refactor it away into a single simpler constant format string. Signed-off-by: Matthieu Buffet Link: https://lore.kernel.org/r/20241019151534.1400605-3-matthieu@buffet.re [mic: Move the small cleanups in the next commit] Signed-off-by: Mickaël Salaün --- samples/landlock/sandboxer.c | 79 +++++++++++++++++------------------- 1 file changed, 38 insertions(+), 41 deletions(-) diff --git a/samples/landlock/sandboxer.c b/samples/landlock/sandboxer.c index 4cbef9d2f15b..dba143f62bf5 100644 --- a/samples/landlock/sandboxer.c +++ b/samples/landlock/sandboxer.c @@ -290,6 +290,43 @@ static bool check_ruleset_scope(const char *const env_var, #define LANDLOCK_ABI_LAST 6 +#define XSTR(s) #s +#define STR(s) XSTR(s) + +/* clang-format off */ + +static const char help[] = + "usage: " + ENV_FS_RO_NAME "=\"...\" " + ENV_FS_RW_NAME "=\"...\" " + ENV_TCP_BIND_NAME "=\"...\" " + ENV_TCP_CONNECT_NAME "=\"...\" " + ENV_SCOPED_NAME "=\"...\" %1$s [args]...\n" + "\n" + "Execute a command in a restricted environment.\n" + "\n" + "Environment variables containing paths and ports each separated by a colon:\n" + "* " ENV_FS_RO_NAME ": list of paths allowed to be used in a read-only way.\n" + "* " ENV_FS_RW_NAME ": list of paths allowed to be used in a read-write way.\n" + "\n" + "Environment variables containing ports are optional and could be skipped.\n" + "* " ENV_TCP_BIND_NAME ": list of ports allowed to bind (server).\n" + "* " ENV_TCP_CONNECT_NAME ": list of ports allowed to connect (client).\n" + "* " ENV_SCOPED_NAME ": list of scoped IPCs.\n" + "\n" + "example:\n" + ENV_FS_RO_NAME "=\"${PATH}:/lib:/usr:/proc:/etc:/dev/urandom\" " + ENV_FS_RW_NAME "=\"/dev/null:/dev/full:/dev/zero:/dev/pts:/tmp\" " + ENV_TCP_BIND_NAME "=\"9418\" " + ENV_TCP_CONNECT_NAME "=\"80:443\" " + ENV_SCOPED_NAME "=\"a:s\" " + "%1$s bash -i\n" + "\n" + "This sandboxer can use Landlock features up to ABI version " + STR(LANDLOCK_ABI_LAST) ".\n"; + +/* clang-format on */ + int main(const int argc, char *const argv[], char *const *const envp) { const char *cmd_path; @@ -308,47 
+345,7 @@ int main(const int argc, char *const argv[], char *const *const envp) }; if (argc < 2) { - fprintf(stderr, - "usage: %s=\"...\" %s=\"...\" %s=\"...\" %s=\"...\" %s=\"...\" %s " - " [args]...\n\n", - ENV_FS_RO_NAME, ENV_FS_RW_NAME, ENV_TCP_BIND_NAME, - ENV_TCP_CONNECT_NAME, ENV_SCOPED_NAME, argv[0]); - fprintf(stderr, - "Execute a command in a restricted environment.\n\n"); - fprintf(stderr, - "Environment variables containing paths and ports " - "each separated by a colon:\n"); - fprintf(stderr, - "* %s: list of paths allowed to be used in a read-only way.\n", - ENV_FS_RO_NAME); - fprintf(stderr, - "* %s: list of paths allowed to be used in a read-write way.\n\n", - ENV_FS_RW_NAME); - fprintf(stderr, - "Environment variables containing ports are optional " - "and could be skipped.\n"); - fprintf(stderr, - "* %s: list of ports allowed to bind (server).\n", - ENV_TCP_BIND_NAME); - fprintf(stderr, - "* %s: list of ports allowed to connect (client).\n", - ENV_TCP_CONNECT_NAME); - fprintf(stderr, "* %s: list of scoped IPCs.\n", - ENV_SCOPED_NAME); - fprintf(stderr, - "\nexample:\n" - "%s=\"${PATH}:/lib:/usr:/proc:/etc:/dev/urandom\" " - "%s=\"/dev/null:/dev/full:/dev/zero:/dev/pts:/tmp\" " - "%s=\"9418\" " - "%s=\"80:443\" " - "%s=\"a:s\" " - "%s bash -i\n\n", - ENV_FS_RO_NAME, ENV_FS_RW_NAME, ENV_TCP_BIND_NAME, - ENV_TCP_CONNECT_NAME, ENV_SCOPED_NAME, argv[0]); - fprintf(stderr, - "This sandboxer can use Landlock features " - "up to ABI version %d.\n", - LANDLOCK_ABI_LAST); + fprintf(stderr, help, argv[0]); return 1; } From 53b9d789df983790015ef04b0283ac5a33917cad Mon Sep 17 00:00:00 2001 From: Matthieu Buffet Date: Sat, 19 Oct 2024 17:15:34 +0200 Subject: [PATCH 027/235] samples/landlock: Clarify option parsing behaviour MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Clarify the distinction between filesystem variables (mandatory) and all others (optional). For optional variables, explain the difference between unset variables (no access check performed) and empty variables (nothing allowed for lists of allowed paths/ports, or no effect for lists of scopes). List the known LL_SCOPED values and their effect. 
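The unset-versus-empty distinction described above reduces to how getenv(3) is interpreted; a minimal sketch of that policy (the helper name is hypothetical, not from the patch):

  #include <stdbool.h>
  #include <stdlib.h>

  /* Unset variable: the access type is not handled at all, so the
   * corresponding checks stay always-allowed.  Set but empty: the access
   * type is handled with an empty allow-list, denying everything. */
  static bool env_handles_access(const char *name)
  {
          return getenv(name) != NULL;
  }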
Signed-off-by: Matthieu Buffet Link: https://lore.kernel.org/r/20241019151534.1400605-4-matthieu@buffet.re [mic: Add a missing colon] Signed-off-by: Mickaël Salaün --- samples/landlock/sandboxer.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/samples/landlock/sandboxer.c b/samples/landlock/sandboxer.c index dba143f62bf5..57565dfd74a2 100644 --- a/samples/landlock/sandboxer.c +++ b/samples/landlock/sandboxer.c @@ -296,25 +296,26 @@ static bool check_ruleset_scope(const char *const env_var, /* clang-format off */ static const char help[] = - "usage: " - ENV_FS_RO_NAME "=\"...\" " - ENV_FS_RW_NAME "=\"...\" " - ENV_TCP_BIND_NAME "=\"...\" " - ENV_TCP_CONNECT_NAME "=\"...\" " - ENV_SCOPED_NAME "=\"...\" %1$s [args]...\n" + "usage: " ENV_FS_RO_NAME "=\"...\" " ENV_FS_RW_NAME "=\"...\" " + "[other environment variables] %1$s [args]...\n" "\n" - "Execute a command in a restricted environment.\n" + "Execute the given command in a restricted environment.\n" + "Multi-valued settings (lists of ports, paths, scopes) are colon-delimited.\n" "\n" - "Environment variables containing paths and ports each separated by a colon:\n" - "* " ENV_FS_RO_NAME ": list of paths allowed to be used in a read-only way.\n" - "* " ENV_FS_RW_NAME ": list of paths allowed to be used in a read-write way.\n" + "Mandatory settings:\n" + "* " ENV_FS_RO_NAME ": paths allowed to be used in a read-only way\n" + "* " ENV_FS_RW_NAME ": paths allowed to be used in a read-write way\n" "\n" - "Environment variables containing ports are optional and could be skipped.\n" - "* " ENV_TCP_BIND_NAME ": list of ports allowed to bind (server).\n" - "* " ENV_TCP_CONNECT_NAME ": list of ports allowed to connect (client).\n" - "* " ENV_SCOPED_NAME ": list of scoped IPCs.\n" + "Optional settings (when not set, their associated access check " + "is always allowed, which is different from an empty string which " + "means an empty list):\n" + "* " ENV_TCP_BIND_NAME ": ports allowed to bind (server)\n" + "* " ENV_TCP_CONNECT_NAME ": ports allowed to connect (client)\n" + "* " ENV_SCOPED_NAME ": actions denied on the outside of the landlock domain\n" + " - \"a\" to restrict opening abstract unix sockets\n" + " - \"s\" to restrict sending signals\n" "\n" - "example:\n" + "Example:\n" ENV_FS_RO_NAME "=\"${PATH}:/lib:/usr:/proc:/etc:/dev/urandom\" " ENV_FS_RW_NAME "=\"/dev/null:/dev/full:/dev/zero:/dev/pts:/tmp\" " ENV_TCP_BIND_NAME "=\"9418\" " From e7f37a7d16310d3c9474825de26a67f00983ebea Mon Sep 17 00:00:00 2001 From: Abel Vesa Date: Mon, 21 Oct 2024 15:46:25 +0300 Subject: [PATCH 028/235] clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags Allowing these GDSCs to collapse makes the QMP combo PHYs lose their configuration on machine suspend. Currently, the QMP combo PHY driver doesn't reinitialise the HW on resume. Under such conditions, the USB SuperSpeed support is broken. To avoid this, mark the pwrsts flags with RET_ON. This has already been done for the USB 0 and 1 SS PHY GDSCs; do this also for the USB MP SS1 PHY GDSC config. The USB MP SS0 PHY GDSC already has it.
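For context, a GDSC's pwrsts field is a bitmask of the power states the domain may enter; the relevant definitions, paraphrased here from drivers/clk/qcom/gdsc.h as an assumption about its current layout, look like:

  #include <linux/bits.h>

  #define PWRSTS_OFF      BIT(0)
  #define PWRSTS_RET      BIT(1)                   /* retention: state kept */
  #define PWRSTS_ON       BIT(2)
  #define PWRSTS_OFF_ON   (PWRSTS_OFF | PWRSTS_ON) /* may fully collapse */
  #define PWRSTS_RET_ON   (PWRSTS_RET | PWRSTS_ON) /* lowest state is retention */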
Fixes: 161b7c401f4b ("clk: qcom: Add Global Clock controller (GCC) driver for X1E80100") Reviewed-by: Johan Hovold Signed-off-by: Abel Vesa Link: https://lore.kernel.org/r/20241021-x1e80100-clk-gcc-fix-usb-mp-phy-gdsc-pwrsts-flags-v2-1-0bfd64556238@linaro.org Signed-off-by: Bjorn Andersson --- drivers/clk/qcom/gcc-x1e80100.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/clk/qcom/gcc-x1e80100.c b/drivers/clk/qcom/gcc-x1e80100.c index 81ba5ceab342..8ea25aa25dff 100644 --- a/drivers/clk/qcom/gcc-x1e80100.c +++ b/drivers/clk/qcom/gcc-x1e80100.c @@ -6155,7 +6155,7 @@ static struct gdsc gcc_usb3_mp_ss1_phy_gdsc = { .pd = { .name = "gcc_usb3_mp_ss1_phy_gdsc", }, - .pwrsts = PWRSTS_OFF_ON, + .pwrsts = PWRSTS_RET_ON, .flags = POLL_CFG_GDSCR | RETAIN_FF_ENABLE, }; From 2feb023110843acce790e9089e72e9a9503d9fa5 Mon Sep 17 00:00:00 2001 From: ChiYuan Huang Date: Fri, 25 Oct 2024 13:59:18 +0800 Subject: [PATCH 029/235] regulator: rtq2208: Fix uninitialized use of regulator_config Fix an uninitialized use of regulator_config in the rtq2208 driver that causes a kernel error. Fixes: 85a11f55621a ("regulator: rtq2208: Add Richtek RTQ2208 SubPMIC") Signed-off-by: ChiYuan Huang Link: https://patch.msgid.link/00d691cfcc0eae9ce80a37b62e99851e8fdcffe2.1729829243.git.cy_huang@richtek.com Signed-off-by: Mark Brown --- drivers/regulator/rtq2208-regulator.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/regulator/rtq2208-regulator.c b/drivers/regulator/rtq2208-regulator.c index a5c126afc648..5925fa7a9a06 100644 --- a/drivers/regulator/rtq2208-regulator.c +++ b/drivers/regulator/rtq2208-regulator.c @@ -568,7 +568,7 @@ static int rtq2208_probe(struct i2c_client *i2c) struct regmap *regmap; struct rtq2208_regulator_desc *rdesc[RTQ2208_LDO_MAX]; struct regulator_dev *rdev; - struct regulator_config cfg; + struct regulator_config cfg = {}; struct rtq2208_rdev_map *rdev_map; int i, ret = 0, idx, n_regulator = 0; unsigned int regulator_idx_table[RTQ2208_LDO_MAX], From 3abab905b14f4ba756d413f37f1fb02b708eee93 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Mon, 28 Oct 2024 08:28:30 +0900 Subject: [PATCH 030/235] ksmbd: Fix the missing xa_store error check xa_store() can fail: it returns xa_err(-EINVAL) if the entry cannot be stored in an XArray, or xa_err(-ENOMEM) if memory allocation fails, so check the result of xa_store() for errors.
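The general pattern for checking xa_store() failures is worth spelling out; a minimal sketch, not the ksmbd code itself:

  #include <linux/xarray.h>

  static int store_checked(struct xarray *xa, unsigned long id, void *entry)
  {
          /* xa_store() returns the previous entry on success, or an
           * xa_err()-encoded pointer (-EINVAL or -ENOMEM) on failure. */
          void *old = xa_store(xa, id, entry, GFP_KERNEL);

          if (xa_is_err(old))
                  return xa_err(old);
          return 0;
  }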
Cc: stable@vger.kernel.org Fixes: b685757c7b08 ("ksmbd: Implements sess->rpc_handle_list as xarray") Signed-off-by: Jinjie Ruan Acked-by: Namjae Jeon Signed-off-by: Steve French --- fs/smb/server/mgmt/user_session.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/smb/server/mgmt/user_session.c b/fs/smb/server/mgmt/user_session.c index 1e4624e9d434..9756a4bbfe54 100644 --- a/fs/smb/server/mgmt/user_session.c +++ b/fs/smb/server/mgmt/user_session.c @@ -90,7 +90,7 @@ static int __rpc_method(char *rpc_name) int ksmbd_session_rpc_open(struct ksmbd_session *sess, char *rpc_name) { - struct ksmbd_session_rpc *entry; + struct ksmbd_session_rpc *entry, *old; struct ksmbd_rpc_command *resp; int method; @@ -106,16 +106,19 @@ int ksmbd_session_rpc_open(struct ksmbd_session *sess, char *rpc_name) entry->id = ksmbd_ipc_id_alloc(); if (entry->id < 0) goto free_entry; - xa_store(&sess->rpc_handle_list, entry->id, entry, GFP_KERNEL); + old = xa_store(&sess->rpc_handle_list, entry->id, entry, GFP_KERNEL); + if (xa_is_err(old)) + goto free_id; resp = ksmbd_rpc_open(sess, entry->id); if (!resp) - goto free_id; + goto erase_xa; kvfree(resp); return entry->id; -free_id: +erase_xa: xa_erase(&sess->rpc_handle_list, entry->id); +free_id: ksmbd_rpc_id_free(entry->id); free_entry: kfree(entry); From 96d8569563916fe2f8fe17317e20e43f54f9ba4b Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Thu, 24 Oct 2024 10:21:30 +0200 Subject: [PATCH 031/235] media: vivid: fix buffer overwrite when using > 32 buffers The maximum number of buffers that can be requested was increased to 64 for the video capture queue. But video capture used a must_blank array that was still sized for 32 (VIDEO_MAX_FRAME). This caused an out-of-bounds write when using buffer indices >= 32. Create a new define MAX_VID_CAP_BUFFERS that is used to access the must_blank array and set max_num_buffers for the video capture queue. This solves a crash reported by: https://bugzilla.kernel.org/show_bug.cgi?id=219258 Signed-off-by: Hans Verkuil Fixes: cea70ed416b4 ("media: test-drivers: vivid: Increase max supported buffers for capture queues") Cc: stable@vger.kernel.org --- drivers/media/test-drivers/vivid/vivid-core.c | 2 +- drivers/media/test-drivers/vivid/vivid-core.h | 4 +++- drivers/media/test-drivers/vivid/vivid-ctrls.c | 2 +- drivers/media/test-drivers/vivid/vivid-vid-cap.c | 2 +- 4 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/media/test-drivers/vivid/vivid-core.c b/drivers/media/test-drivers/vivid/vivid-core.c index 00e0d08af357..4f330f4fc6be 100644 --- a/drivers/media/test-drivers/vivid/vivid-core.c +++ b/drivers/media/test-drivers/vivid/vivid-core.c @@ -910,7 +910,7 @@ static int vivid_create_queue(struct vivid_dev *dev, * videobuf2-core.c to MAX_BUFFER_INDEX. 
*/ if (buf_type == V4L2_BUF_TYPE_VIDEO_CAPTURE) - q->max_num_buffers = 64; + q->max_num_buffers = MAX_VID_CAP_BUFFERS; if (buf_type == V4L2_BUF_TYPE_SDR_CAPTURE) q->max_num_buffers = 1024; if (buf_type == V4L2_BUF_TYPE_VBI_CAPTURE) diff --git a/drivers/media/test-drivers/vivid/vivid-core.h b/drivers/media/test-drivers/vivid/vivid-core.h index cc18a3bc6dc0..d2d52763b119 100644 --- a/drivers/media/test-drivers/vivid/vivid-core.h +++ b/drivers/media/test-drivers/vivid/vivid-core.h @@ -26,6 +26,8 @@ #define MAX_INPUTS 16 /* The maximum number of outputs */ #define MAX_OUTPUTS 16 +/* The maximum number of video capture buffers */ +#define MAX_VID_CAP_BUFFERS 64 /* The maximum up or down scaling factor is 4 */ #define MAX_ZOOM 4 /* The maximum image width/height are set to 4K DMT */ @@ -481,7 +483,7 @@ struct vivid_dev { /* video capture */ struct tpg_data tpg; unsigned ms_vid_cap; - bool must_blank[VIDEO_MAX_FRAME]; + bool must_blank[MAX_VID_CAP_BUFFERS]; const struct vivid_fmt *fmt_cap; struct v4l2_fract timeperframe_vid_cap; diff --git a/drivers/media/test-drivers/vivid/vivid-ctrls.c b/drivers/media/test-drivers/vivid/vivid-ctrls.c index 8bb38bc7b8cc..2b5c8fbcd0a2 100644 --- a/drivers/media/test-drivers/vivid/vivid-ctrls.c +++ b/drivers/media/test-drivers/vivid/vivid-ctrls.c @@ -553,7 +553,7 @@ static int vivid_vid_cap_s_ctrl(struct v4l2_ctrl *ctrl) break; case VIVID_CID_PERCENTAGE_FILL: tpg_s_perc_fill(&dev->tpg, ctrl->val); - for (i = 0; i < VIDEO_MAX_FRAME; i++) + for (i = 0; i < MAX_VID_CAP_BUFFERS; i++) dev->must_blank[i] = ctrl->val < 100; break; case VIVID_CID_INSERT_SAV: diff --git a/drivers/media/test-drivers/vivid/vivid-vid-cap.c b/drivers/media/test-drivers/vivid/vivid-vid-cap.c index 69620e0a35a0..6a790ac8cbe6 100644 --- a/drivers/media/test-drivers/vivid/vivid-vid-cap.c +++ b/drivers/media/test-drivers/vivid/vivid-vid-cap.c @@ -213,7 +213,7 @@ static int vid_cap_start_streaming(struct vb2_queue *vq, unsigned count) dev->vid_cap_seq_count = 0; dprintk(dev, 1, "%s\n", __func__); - for (i = 0; i < VIDEO_MAX_FRAME; i++) + for (i = 0; i < MAX_VID_CAP_BUFFERS; i++) dev->must_blank[i] = tpg_g_perc_fill(&dev->tpg) < 100; if (dev->start_streaming_error) { dev->start_streaming_error = false; From bf791751162ac875a9439426d13f8d4d18151549 Mon Sep 17 00:00:00 2001 From: Mika Westerberg Date: Thu, 24 Oct 2024 12:26:53 +0300 Subject: [PATCH 032/235] thunderbolt: Add only on-board retimers when !CONFIG_USB4_DEBUGFS_MARGINING Normally there is no need to enumerate retimers on the other side of the cable. This is only needed in special cases where the user wants to run receiver lane margining against the downstream facing port of a retimer. Furthermore, this might confuse userspace tools such as fwupd because they cannot read the information they expect from these retimers. Fix this by changing the retimer enumeration code to add only on-board retimers when CONFIG_USB4_DEBUGFS_MARGINING is not enabled.
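The fix hinges on IS_ENABLED(), which evaluates at compile time to 1 when the option is built-in or modular and to 0 otherwise, so the disabled branch is dropped entirely; a generic sketch of the idiom (function and parameter names hypothetical):

  #include <linux/kconfig.h>
  #include <linux/minmax.h>

  /* Without margining support, never scan past the last on-board retimer. */
  static int clamp_scan_range(int last_on_board, int max_found)
  {
          if (!IS_ENABLED(CONFIG_USB4_DEBUGFS_MARGINING))
                  return min(last_on_board, max_found);
          return max_found;
  }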
Reported-by: AceLan Kao Tested-by: AceLan Kao Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219420 Cc: stable@vger.kernel.org Fixes: ff6ab055e070 ("thunderbolt: Add receiver lane margining support for retimers") Signed-off-by: Mika Westerberg --- drivers/thunderbolt/retimer.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/thunderbolt/retimer.c b/drivers/thunderbolt/retimer.c index 721319329afa..bdf641489f82 100644 --- a/drivers/thunderbolt/retimer.c +++ b/drivers/thunderbolt/retimer.c @@ -531,6 +531,8 @@ int tb_retimer_scan(struct tb_port *port, bool add) max = i; ret = 0; + if (!IS_ENABLED(CONFIG_USB4_DEBUGFS_MARGINING)) + max = min(last_idx, max); /* Add retimers if they do not exist already */ for (i = 1; i <= max; i++) { From 393c74ccbd847bacf18865a01b422586fc7341cf Mon Sep 17 00:00:00 2001 From: Reinhard Speyerer Date: Fri, 18 Oct 2024 23:07:06 +0200 Subject: [PATCH 033/235] USB: serial: option: add Fibocom FG132 0x0112 composition Add Fibocom FG132 0x0112 composition: T: Bus=03 Lev=02 Prnt=06 Port=01 Cnt=02 Dev#= 10 Spd=12 MxCh= 0 D: Ver= 2.01 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=2cb7 ProdID=0112 Rev= 5.15 S: Manufacturer=Fibocom Wireless Inc. S: Product=Fibocom Module S: SerialNumber=xxxxxxxx C:* #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA I:* If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan E: Ad=82(I) Atr=03(Int.) MxPS= 8 Ivl=32ms E: Ad=81(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=01(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=85(I) Atr=03(Int.) MxPS= 10 Ivl=32ms E: Ad=84(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=03(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=86(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=04(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms Signed-off-by: Reinhard Speyerer Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold --- drivers/usb/serial/option.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c index 4f18f189f309..4eaab8b843c1 100644 --- a/drivers/usb/serial/option.c +++ b/drivers/usb/serial/option.c @@ -2320,6 +2320,9 @@ static const struct usb_device_id option_ids[] = { { USB_DEVICE_AND_INTERFACE_INFO(0x2cb7, 0x010b, 0xff, 0xff, 0x30) }, /* Fibocom FG150 Diag */ { USB_DEVICE_AND_INTERFACE_INFO(0x2cb7, 0x010b, 0xff, 0, 0) }, /* Fibocom FG150 AT */ { USB_DEVICE_INTERFACE_CLASS(0x2cb7, 0x0111, 0xff) }, /* Fibocom FM160 (MBIM mode) */ + { USB_DEVICE_AND_INTERFACE_INFO(0x2cb7, 0x0112, 0xff, 0xff, 0x30) }, /* Fibocom FG132 Diag */ + { USB_DEVICE_AND_INTERFACE_INFO(0x2cb7, 0x0112, 0xff, 0xff, 0x40) }, /* Fibocom FG132 AT */ + { USB_DEVICE_AND_INTERFACE_INFO(0x2cb7, 0x0112, 0xff, 0, 0) }, /* Fibocom FG132 NMEA */ { USB_DEVICE_INTERFACE_CLASS(0x2cb7, 0x0115, 0xff), /* Fibocom FM135 (laptop MBIM) */ .driver_info = RSVD(5) }, { USB_DEVICE_INTERFACE_CLASS(0x2cb7, 0x01a0, 0xff) }, /* Fibocom NL668-AM/NL652-EU (laptop MBIM) */ From 3b05949ba39f305b585452d0e177470607842165 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Beno=C3=AEt=20Monin?= Date: Thu, 24 Oct 2024 17:09:19 +0200 Subject: [PATCH 034/235] USB: serial: option: add Quectel RG650V MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add support for Quectel RG650V which is based on Qualcomm SDX65 chip. 
The composition is DIAG / NMEA / AT / AT / QMI. T: Bus=02 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 4 Spd=5000 MxCh= 0 D: Ver= 3.20 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 9 #Cfgs= 1 P: Vendor=2c7c ProdID=0122 Rev=05.15 S: Manufacturer=Quectel S: Product=RG650V-EU S: SerialNumber=xxxxxxx C: #Ifs= 5 Cfg#= 1 Atr=a0 MxPwr=896mA I: If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=01(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=82(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=04(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan E: Ad=05(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=88(I) Atr=03(Int.) MxPS= 8 Ivl=9ms Signed-off-by: Benoît Monin Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold --- drivers/usb/serial/option.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c index 4eaab8b843c1..9ba5584061c8 100644 --- a/drivers/usb/serial/option.c +++ b/drivers/usb/serial/option.c @@ -251,6 +251,7 @@ static void option_instat_callback(struct urb *urb); #define QUECTEL_VENDOR_ID 0x2c7c /* These Quectel products use Quectel's vendor ID */ #define QUECTEL_PRODUCT_EC21 0x0121 +#define QUECTEL_PRODUCT_RG650V 0x0122 #define QUECTEL_PRODUCT_EM061K_LTA 0x0123 #define QUECTEL_PRODUCT_EM061K_LMS 0x0124 #define QUECTEL_PRODUCT_EC25 0x0125 @@ -1273,6 +1274,8 @@ static const struct usb_device_id option_ids[] = { { USB_DEVICE_AND_INTERFACE_INFO(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_EG912Y, 0xff, 0, 0) }, { USB_DEVICE_AND_INTERFACE_INFO(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_EG916Q, 0xff, 0x00, 0x00) }, { USB_DEVICE_AND_INTERFACE_INFO(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_RM500K, 0xff, 0x00, 0x00) }, + { USB_DEVICE_AND_INTERFACE_INFO(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_RG650V, 0xff, 0xff, 0x30) }, + { USB_DEVICE_AND_INTERFACE_INFO(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_RG650V, 0xff, 0, 0) }, { USB_DEVICE(CMOTECH_VENDOR_ID, CMOTECH_PRODUCT_6001) }, { USB_DEVICE(CMOTECH_VENDOR_ID, CMOTECH_PRODUCT_CMU_300) }, From 37bb5628379295c1254c113a407cab03a0f4d0b4 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 31 Oct 2024 12:48:30 +0300 Subject: [PATCH 035/235] USB: serial: io_edgeport: fix use after free in debug printk The "dev_dbg(&urb->dev->dev, ..." which happens after usb_free_urb(urb) is a use after free of the "urb" pointer. Store the "dev" pointer at the start of the function to avoid this issue. 
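Reduced to its essence, the pattern looks like this (a sketch, not the driver code; the usb_device is refcounted separately, so the saved pointer stays valid after the urb is freed):

  #include <linux/usb.h>

  static void completion_cb(struct urb *urb)
  {
          /* Save the device pointer before the urb can be freed... */
          struct device *dev = &urb->dev->dev;

          usb_free_urb(urb);
          /* ...so later logging never dereferences the freed urb. */
          dev_dbg(dev, "urb completed\n");
  }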
Fixes: 984f68683298 ("USB: serial: io_edgeport.c: remove dbg() usage") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter Signed-off-by: Johan Hovold --- drivers/usb/serial/io_edgeport.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/usb/serial/io_edgeport.c b/drivers/usb/serial/io_edgeport.c index c7d6b5e3f898..28c71d99e857 100644 --- a/drivers/usb/serial/io_edgeport.c +++ b/drivers/usb/serial/io_edgeport.c @@ -770,11 +770,12 @@ static void edge_bulk_out_data_callback(struct urb *urb) static void edge_bulk_out_cmd_callback(struct urb *urb) { struct edgeport_port *edge_port = urb->context; + struct device *dev = &urb->dev->dev; int status = urb->status; atomic_dec(&CmdUrbs); - dev_dbg(&urb->dev->dev, "%s - FREE URB %p (outstanding %d)\n", - __func__, urb, atomic_read(&CmdUrbs)); + dev_dbg(dev, "%s - FREE URB %p (outstanding %d)\n", __func__, urb, + atomic_read(&CmdUrbs)); /* clean up the transfer buffer */ @@ -784,8 +785,7 @@ static void edge_bulk_out_cmd_callback(struct urb *urb) usb_free_urb(urb); if (status) { - dev_dbg(&urb->dev->dev, - "%s - nonzero write bulk status received: %d\n", + dev_dbg(dev, "%s - nonzero write bulk status received: %d\n", __func__, status); return; } From c9363bbb0f68dd1ddb8be7bbfe958cdfcd38d851 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jaros=C5=82aw=20Janik?= Date: Wed, 30 Oct 2024 18:18:12 +0100 Subject: [PATCH 036/235] Revert "ALSA: hda/conexant: Mute speakers at suspend / shutdown" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 4f61c8fe3520 ("ALSA: hda/conexant: Mute speakers at suspend / shutdown") mutes speakers on system shutdown or whenever HDA controller is suspended by PM; this however interacts badly with Thinkpad's ACPI firmware behavior which uses beeps to signal various events (enter/leave suspend or hibernation, AC power connect/disconnect, low battery, etc.); now those beeps are either muted altogether (for suspend/hibernate/ shutdown related events) or work more or less randomly (eg. AC plug/unplug is only audible when you are playing music at the moment, because HDA device is likely in suspend mode otherwise). Since the original bug report mentioned in 4f61c8fe3520 complained about Lenovo's Thinkpad laptop - revert this commit altogether. Fixes: 4f61c8fe3520 ("ALSA: hda/conexant: Mute speakers at suspend / shutdown") Signed-off-by: Jarosław Janik Link: https://patch.msgid.link/20241030171813.18941-2-jaroslaw.janik@gmail.com Signed-off-by: Takashi Iwai --- sound/pci/hda/patch_conexant.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/sound/pci/hda/patch_conexant.c b/sound/pci/hda/patch_conexant.c index c74f6742c359..b2bcdf76da30 100644 --- a/sound/pci/hda/patch_conexant.c +++ b/sound/pci/hda/patch_conexant.c @@ -205,8 +205,6 @@ static void cx_auto_shutdown(struct hda_codec *codec) { struct conexant_spec *spec = codec->spec; - snd_hda_gen_shutup_speakers(codec); - /* Turn the problematic codec into D3 to avoid spurious noises from the internal speaker during (and after) reboot */ cx_auto_turn_eapd(codec, spec->num_eapds, spec->eapds, false); From 5e53e4a66bc7430dd2d11c18a86410e3a38d2940 Mon Sep 17 00:00:00 2001 From: Mikhail Rudenko Date: Thu, 17 Oct 2024 21:37:28 +0300 Subject: [PATCH 037/235] regulator: rk808: Add apply_bit for BUCK3 on RK809 Currently, RK809's BUCK3 regulator is modelled in the driver as a configurable regulator with 0.5-2.4V voltage range. 
But the voltage setting is not actually applied, because when bit 6 of PMIC_POWER_CONFIG register is set to 0 (default), BUCK3 output voltage is determined by the external feedback resistor. Fix this, by setting bit 6 when voltage selection is set. Existing users which do not specify voltage constraints in their device trees will not be affected by this change, since no voltage setting is applied in those cases, and bit 6 is not enabled. Signed-off-by: Mikhail Rudenko Link: https://patch.msgid.link/20241017-rk809-dcdc3-v1-1-e3c3de92f39c@gmail.com Signed-off-by: Mark Brown --- drivers/regulator/rk808-regulator.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/regulator/rk808-regulator.c b/drivers/regulator/rk808-regulator.c index 14b60abd6afc..01a8d0487918 100644 --- a/drivers/regulator/rk808-regulator.c +++ b/drivers/regulator/rk808-regulator.c @@ -1379,6 +1379,8 @@ static const struct regulator_desc rk809_reg[] = { .n_linear_ranges = ARRAY_SIZE(rk817_buck1_voltage_ranges), .vsel_reg = RK817_BUCK3_ON_VSEL_REG, .vsel_mask = RK817_BUCK_VSEL_MASK, + .apply_reg = RK817_POWER_CONFIG, + .apply_bit = RK817_BUCK3_FB_RES_INTER, .enable_reg = RK817_POWER_EN_REG(0), .enable_mask = ENABLE_MASK(RK817_ID_DCDC3), .enable_val = ENABLE_MASK(RK817_ID_DCDC3), From 7ce3e6107103214d354a16729a472f588be60572 Mon Sep 17 00:00:00 2001 From: Johannes Thumshirn Date: Wed, 30 Oct 2024 12:02:53 +0100 Subject: [PATCH 038/235] scsi: sd_zbc: Use kvzalloc() to allocate REPORT ZONES buffer We have two reports of failed memory allocation in btrfs' code which is calling into report zones. Both of these reports have the following signature coming from __vmalloc_area_node(): kworker/u17:5: vmalloc error: size 0, failed to allocate pages, mode:0x10dc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 Further debugging showed these where allocations of one sector (512 bytes) and at least one of the reporter's systems where low on memory, so going through the overhead of allocating a vm area failed. Switching the allocation from __vmalloc() to kvzalloc() avoids the overhead of vmalloc() on small allocations and succeeds. Note: the buffer is already freed using kvfree() so there's no need to adjust the free path. Cc: Qu Wenru Cc: Naohiro Aota Link: https://github.com/kdave/btrfs-progs/issues/779 Link: https://github.com/kdave/btrfs-progs/issues/915 Fixes: 23a50861adda ("scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()") Signed-off-by: Johannes Thumshirn Link: https://lore.kernel.org/r/20241030110253.11718-1-jth@kernel.org Reviewed-by: Damien Le Moal Signed-off-by: Martin K. Petersen --- drivers/scsi/sd_zbc.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index c8b9654d30f0..a4d17f3da25d 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -188,8 +188,7 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp, bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT); while (bufsize >= SECTOR_SIZE) { - buf = __vmalloc(bufsize, - GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY); + buf = kvzalloc(bufsize, GFP_KERNEL | __GFP_NORETRY); if (buf) { *buflen = bufsize; return buf; From ef7134c7fc48e1441b398e55a862232868a6f0a7 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Sat, 2 Nov 2024 14:24:38 -0700 Subject: [PATCH 039/235] smb: client: Fix use-after-free of network namespace. Recently, we got a customer report that CIFS triggers oops while reconnecting to a server. 
[0] The workload runs on Kubernetes, and some pods mount CIFS servers in non-root network namespaces. The problem rarely happened, but it was always while the pod was dying. The root cause is wrong reference counting for network namespace. CIFS uses kernel sockets, which do not hold refcnt of the netns that the socket belongs to. That means CIFS must ensure the socket is always freed before its netns; otherwise, use-after-free happens. The repro steps are roughly: 1. mount CIFS in a non-root netns 2. drop packets from the netns 3. destroy the netns 4. unmount CIFS We can reproduce the issue quickly with the script [1] below and see the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled. When the socket is TCP, it is hard to guarantee the netns lifetime without holding refcnt due to async timers. Let's hold netns refcnt for each socket as done for SMC in commit 9744d2bf1976 ("smc: Fix use-after-free in tcp_write_timer_handler()."). Note that we need to move put_net() from cifs_put_tcp_session() to clean_demultiplex_info(); otherwise, __sock_create() still could touch a freed netns while cifsd tries to reconnect from cifs_demultiplex_thread(). Also, maybe_get_net() cannot be put just before __sock_create() because the code is not under RCU and there is a small chance that the same address happened to be reallocated to another netns. [0]: CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting... CIFS: Serverclose failed 4 times, giving up Unable to handle kernel paging request at virtual address 14de99e461f84a07 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [14de99e461f84a07] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1 Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fib_rules_lookup+0x44/0x238 lr : __fib_lookup+0x64/0xbc sp : ffff8000265db790 x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01 x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580 x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500 x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002 x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294 x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0 x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500 Call trace: fib_rules_lookup+0x44/0x238 __fib_lookup+0x64/0xbc ip_route_output_key_hash_rcu+0x2c4/0x398 
ip_route_output_key_hash+0x60/0x8c tcp_v4_connect+0x290/0x488 __inet_stream_connect+0x108/0x3d0 inet_stream_connect+0x50/0x78 kernel_connect+0x6c/0xac generic_ip_connect+0x10c/0x6c8 [cifs] __reconnect_target_unlocked+0xa0/0x214 [cifs] reconnect_dfs_server+0x144/0x460 [cifs] cifs_reconnect+0x88/0x148 [cifs] cifs_readv_from_socket+0x230/0x430 [cifs] cifs_read_from_socket+0x74/0xa8 [cifs] cifs_demultiplex_thread+0xf8/0x704 [cifs] kthread+0xd0/0xd4 ret_from_fork+0x10/0x20 Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264) [1]: CIFS_CRED="/root/cred.cifs" CIFS_USER="Administrator" CIFS_PASS="Password" CIFS_IP="X.X.X.X" CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST" CIFS_MNT="/mnt/smb" DEV="enp0s3" cat <<EOF > ${CIFS_CRED} username=${CIFS_USER} password=${CIFS_PASS} domain=EXAMPLE.COM EOF unshare -n bash -c " mkdir -p ${CIFS_MNT} ip netns attach root 1 ip link add eth0 type veth peer veth0 netns root ip link set eth0 up ip -n root link set veth0 up ip addr add 192.168.0.2/24 dev eth0 ip -n root addr add 192.168.0.1/24 dev veth0 ip route add default via 192.168.0.1 dev eth0 ip netns exec root sysctl net.ipv4.ip_forward=1 ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1 touch ${CIFS_MNT}/a.txt ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE " umount ${CIFS_MNT} [2]: ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1576) generic_ip_connect (fs/smb/client/connect.c:3075) cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798) cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366) dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285) cifs_mount (fs/smb/client/connect.c:3622) cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949) smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794) vfs_get_tree (fs/super.c:1800) path_mount (fs/namespace.c:3508 fs/namespace.c:3834) __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Signed-off-by: Kuniyuki Iwashima Acked-by: Tom Talpey Signed-off-by: Steve French --- fs/smb/client/connect.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c index 15d94ac4095e..0ce2d704b1f3 100644 --- a/fs/smb/client/connect.c +++ b/fs/smb/client/connect.c @@ -1037,6 +1037,7 @@ clean_demultiplex_info(struct TCP_Server_Info *server) */ } + put_net(cifs_net_ns(server)); kfree(server->leaf_fullpath); kfree(server); @@ -1635,8 +1636,6 @@ cifs_put_tcp_session(struct TCP_Server_Info *server, int from_reconnect) /* srv_count can never go negative */ WARN_ON(server->srv_count < 0); - put_net(cifs_net_ns(server)); - list_del_init(&server->tcp_ses_list); spin_unlock(&cifs_tcp_ses_lock); @@ -3070,13 +3069,22 @@ generic_ip_connect(struct TCP_Server_Info *server) if (server->ssocket) { socket = server->ssocket; } else { - rc = __sock_create(cifs_net_ns(server), 
sfamily, SOCK_STREAM, + struct net *net = cifs_net_ns(server); + struct sock *sk; + + rc = __sock_create(net, sfamily, SOCK_STREAM, IPPROTO_TCP, &server->ssocket, 1); if (rc < 0) { cifs_server_dbg(VFS, "Error %d creating socket\n", rc); return rc; } + sk = server->ssocket->sk; + __netns_tracker_free(net, &sk->ns_tracker, false); + sk->sk_net_refcnt = 1; + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); + sock_inuse_add(net, 1); + /* BB other socket options to set KEEPALIVE, NODELAY? */ cifs_dbg(FYI, "Socket created\n"); socket = server->ssocket; From b0ef514bc6bbdeb8cc7492c0f473e14cb06b14d4 Mon Sep 17 00:00:00 2001 From: Brendan King Date: Fri, 18 Oct 2024 15:41:36 +0000 Subject: [PATCH 040/235] drm/imagination: Add a per-file PVR context list This adds a linked list of VM contexts which is needed for the next patch to be able to correctly track VM contexts for destruction on file close. It is only safe for VM contexts to be removed from the list and destroyed when not in interrupt context. Signed-off-by: Brendan King Signed-off-by: Matt Coster Reviewed-by: Frank Binns Cc: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/e57128ea-f0ce-4e93-a9d4-3f033a8b06fa@imgtec.com --- drivers/gpu/drm/imagination/pvr_context.c | 14 ++++++++++++++ drivers/gpu/drm/imagination/pvr_context.h | 3 +++ drivers/gpu/drm/imagination/pvr_device.h | 10 ++++++++++ drivers/gpu/drm/imagination/pvr_drv.c | 3 +++ 4 files changed, 30 insertions(+) diff --git a/drivers/gpu/drm/imagination/pvr_context.c b/drivers/gpu/drm/imagination/pvr_context.c index eded5e955cc0..255c93582734 100644 --- a/drivers/gpu/drm/imagination/pvr_context.c +++ b/drivers/gpu/drm/imagination/pvr_context.c @@ -17,10 +17,14 @@ #include #include + +#include #include #include +#include #include #include +#include #include #include #include @@ -354,6 +358,10 @@ int pvr_context_create(struct pvr_file *pvr_file, struct drm_pvr_ioctl_create_co return err; } + spin_lock(&pvr_dev->ctx_list_lock); + list_add_tail(&ctx->file_link, &pvr_file->contexts); + spin_unlock(&pvr_dev->ctx_list_lock); + return 0; err_destroy_fw_obj: @@ -380,6 +388,11 @@ pvr_context_release(struct kref *ref_count) container_of(ref_count, struct pvr_context, ref_count); struct pvr_device *pvr_dev = ctx->pvr_dev; + WARN_ON(in_interrupt()); + spin_lock(&pvr_dev->ctx_list_lock); + list_del(&ctx->file_link); + spin_unlock(&pvr_dev->ctx_list_lock); + xa_erase(&pvr_dev->ctx_ids, ctx->ctx_id); pvr_context_destroy_queues(ctx); pvr_fw_object_destroy(ctx->fw_obj); @@ -451,6 +464,7 @@ void pvr_destroy_contexts_for_file(struct pvr_file *pvr_file) void pvr_context_device_init(struct pvr_device *pvr_dev) { xa_init_flags(&pvr_dev->ctx_ids, XA_FLAGS_ALLOC1); + spin_lock_init(&pvr_dev->ctx_list_lock); } /** diff --git a/drivers/gpu/drm/imagination/pvr_context.h b/drivers/gpu/drm/imagination/pvr_context.h index 0c7b97dfa6ba..a5b0a82a54a1 100644 --- a/drivers/gpu/drm/imagination/pvr_context.h +++ b/drivers/gpu/drm/imagination/pvr_context.h @@ -85,6 +85,9 @@ struct pvr_context { /** @compute: Transfer queue. */ struct pvr_queue *transfer; } queues; + + /** @file_link: pvr_file PVR context list link. 
*/ + struct list_head file_link; }; static __always_inline struct pvr_queue * diff --git a/drivers/gpu/drm/imagination/pvr_device.h b/drivers/gpu/drm/imagination/pvr_device.h index b574e23d484b..6d0dfacb677b 100644 --- a/drivers/gpu/drm/imagination/pvr_device.h +++ b/drivers/gpu/drm/imagination/pvr_device.h @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -293,6 +294,12 @@ struct pvr_device { /** @sched_wq: Workqueue for schedulers. */ struct workqueue_struct *sched_wq; + + /** + * @ctx_list_lock: Lock to be held when accessing the context list in + * struct pvr_file. + */ + spinlock_t ctx_list_lock; }; /** @@ -344,6 +351,9 @@ struct pvr_file { * This array is used to allocate handles returned to userspace. */ struct xarray vm_ctx_handles; + + /** @contexts: PVR context list. */ + struct list_head contexts; }; /** diff --git a/drivers/gpu/drm/imagination/pvr_drv.c b/drivers/gpu/drm/imagination/pvr_drv.c index 1a0cb7aa9cea..fb17196e05f4 100644 --- a/drivers/gpu/drm/imagination/pvr_drv.c +++ b/drivers/gpu/drm/imagination/pvr_drv.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -1326,6 +1327,8 @@ pvr_drm_driver_open(struct drm_device *drm_dev, struct drm_file *file) */ pvr_file->pvr_dev = pvr_dev; + INIT_LIST_HEAD(&pvr_file->contexts); + xa_init_flags(&pvr_file->ctx_handles, XA_FLAGS_ALLOC1); xa_init_flags(&pvr_file->free_list_handles, XA_FLAGS_ALLOC1); xa_init_flags(&pvr_file->hwrt_handles, XA_FLAGS_ALLOC1); From b04ce1e718bd55302b52d05d6873e233cb3ec7a1 Mon Sep 17 00:00:00 2001 From: Brendan King Date: Fri, 18 Oct 2024 15:41:40 +0000 Subject: [PATCH 041/235] drm/imagination: Break an object reference loop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When remaining resources are being cleaned up on driver close, outstanding VM mappings may result in resources being leaked, due to an object reference loop, as shown below, with each object (or set of objects) referencing the object below it: PVR GEM Object GPU scheduler "finished" fence GPU scheduler “scheduled” fence PVR driver “done” fence PVR Context PVR VM Context PVR VM Mappings PVR GEM Object The reference that the PVR VM Context has on the VM mappings is a soft one, in the sense that the freeing of outstanding VM mappings is done as part of VM context destruction; no reference counts are involved, as is the case for all the other references in the loop. To break the reference loop during cleanup, free the outstanding VM mappings before destroying the PVR Context associated with the VM context. 
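The cleanup path relies on taking a context reference only while the object is still alive, which is the standard kref idiom; a generic sketch (type and helper names hypothetical):

  #include <linux/kref.h>

  struct obj {
          struct kref ref_count;
  };

  /* Returns true and takes a reference only if the refcount has not
   * already dropped to zero (i.e. teardown has not begun). */
  static bool obj_get_if_referenced(struct obj *o)
  {
          return o && kref_get_unless_zero(&o->ref_count);
  }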
Signed-off-by: Brendan King Signed-off-by: Matt Coster Reviewed-by: Frank Binns Cc: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/8a25924f-1bb7-4d9a-a346-58e871dfb1d1@imgtec.com --- drivers/gpu/drm/imagination/pvr_context.c | 19 +++++++++++++++++++ drivers/gpu/drm/imagination/pvr_context.h | 18 ++++++++++++++++++ drivers/gpu/drm/imagination/pvr_vm.c | 22 ++++++++++++++++++---- drivers/gpu/drm/imagination/pvr_vm.h | 1 + 4 files changed, 56 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/imagination/pvr_context.c b/drivers/gpu/drm/imagination/pvr_context.c index 255c93582734..4cb3494c0bb2 100644 --- a/drivers/gpu/drm/imagination/pvr_context.c +++ b/drivers/gpu/drm/imagination/pvr_context.c @@ -450,11 +450,30 @@ pvr_context_destroy(struct pvr_file *pvr_file, u32 handle) */ void pvr_destroy_contexts_for_file(struct pvr_file *pvr_file) { + struct pvr_device *pvr_dev = pvr_file->pvr_dev; struct pvr_context *ctx; unsigned long handle; xa_for_each(&pvr_file->ctx_handles, handle, ctx) pvr_context_destroy(pvr_file, handle); + + spin_lock(&pvr_dev->ctx_list_lock); + ctx = list_first_entry(&pvr_file->contexts, struct pvr_context, file_link); + + while (!list_entry_is_head(ctx, &pvr_file->contexts, file_link)) { + list_del_init(&ctx->file_link); + + if (pvr_context_get_if_referenced(ctx)) { + spin_unlock(&pvr_dev->ctx_list_lock); + + pvr_vm_unmap_all(ctx->vm_ctx); + + pvr_context_put(ctx); + spin_lock(&pvr_dev->ctx_list_lock); + } + ctx = list_first_entry(&pvr_file->contexts, struct pvr_context, file_link); + } + spin_unlock(&pvr_dev->ctx_list_lock); } /** diff --git a/drivers/gpu/drm/imagination/pvr_context.h b/drivers/gpu/drm/imagination/pvr_context.h index a5b0a82a54a1..07afa179cdf4 100644 --- a/drivers/gpu/drm/imagination/pvr_context.h +++ b/drivers/gpu/drm/imagination/pvr_context.h @@ -126,6 +126,24 @@ pvr_context_get(struct pvr_context *ctx) return ctx; } +/** + * pvr_context_get_if_referenced() - Take an additional reference on a still + * referenced context. + * @ctx: Context pointer. + * + * Call pvr_context_put() to release. + * + * Returns: + * * True on success, or + * * false if no context pointer passed, or the context wasn't still + * * referenced. + */ +static __always_inline bool +pvr_context_get_if_referenced(struct pvr_context *ctx) +{ + return ctx != NULL && kref_get_unless_zero(&ctx->ref_count) != 0; +} + /** * pvr_context_lookup() - Lookup context pointer from handle and file. * @pvr_file: Pointer to pvr_file structure. diff --git a/drivers/gpu/drm/imagination/pvr_vm.c b/drivers/gpu/drm/imagination/pvr_vm.c index 97c0f772ed65..7bd6ba4c6e8a 100644 --- a/drivers/gpu/drm/imagination/pvr_vm.c +++ b/drivers/gpu/drm/imagination/pvr_vm.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -597,12 +598,26 @@ pvr_vm_create_context(struct pvr_device *pvr_dev, bool is_userspace_context) } /** - * pvr_vm_context_release() - Teardown a VM context. - * @ref_count: Pointer to reference counter of the VM context. + * pvr_vm_unmap_all() - Unmap all mappings associated with a VM context. + * @vm_ctx: Target VM context. * * This function ensures that no mappings are left dangling by unmapping them * all in order of ascending device-virtual address. */ +void +pvr_vm_unmap_all(struct pvr_vm_context *vm_ctx) +{ + WARN_ON(pvr_vm_unmap(vm_ctx, vm_ctx->gpuvm_mgr.mm_start, + vm_ctx->gpuvm_mgr.mm_range)); +} + +/** + * pvr_vm_context_release() - Teardown a VM context. + * @ref_count: Pointer to reference counter of the VM context. 
+ * + * This function also ensures that no mappings are left dangling by calling + * pvr_vm_unmap_all. + */ static void pvr_vm_context_release(struct kref *ref_count) { @@ -612,8 +627,7 @@ if (vm_ctx->fw_mem_ctx_obj) pvr_fw_object_destroy(vm_ctx->fw_mem_ctx_obj); - WARN_ON(pvr_vm_unmap(vm_ctx, vm_ctx->gpuvm_mgr.mm_start, - vm_ctx->gpuvm_mgr.mm_range)); + pvr_vm_unmap_all(vm_ctx); pvr_mmu_context_destroy(vm_ctx->mmu_ctx); drm_gem_private_object_fini(&vm_ctx->dummy_gem); diff --git a/drivers/gpu/drm/imagination/pvr_vm.h b/drivers/gpu/drm/imagination/pvr_vm.h index f2a6463f2b05..79406243617c 100644 --- a/drivers/gpu/drm/imagination/pvr_vm.h +++ b/drivers/gpu/drm/imagination/pvr_vm.h @@ -39,6 +39,7 @@ int pvr_vm_map(struct pvr_vm_context *vm_ctx, struct pvr_gem_object *pvr_obj, u64 pvr_obj_offset, u64 device_addr, u64 size); int pvr_vm_unmap(struct pvr_vm_context *vm_ctx, u64 device_addr, u64 size); +void pvr_vm_unmap_all(struct pvr_vm_context *vm_ctx); dma_addr_t pvr_vm_get_page_table_root_addr(struct pvr_vm_context *vm_ctx); struct dma_resv *pvr_vm_get_dma_resv(struct pvr_vm_context *vm_ctx); From c2d188e137e77294323132a760a4608321a36a70 Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Mon, 4 Nov 2024 11:07:34 +0100 Subject: [PATCH 042/235] ALSA: ump: Don't enumerate invalid groups for legacy rawmidi The legacy rawmidi tries to enumerate all possible UMP groups belonging to the UMP endpoint. But currently it shows all 16 ports when the UMP endpoint is configured with static blocks, although most of them may be unused. There was already a fix for the sequencer client side to ignore such groups in the commit 3bfd7c0ba184 ("ALSA: seq: ump: Skip useless ports for static blocks"), and this commit is a similar fix for UMP rawmidi devices; it simply adds a check for the validity of each group that has already been parsed. (Note that the group info was moved to snd_ump_endpoint.groups[] by the commit 0642a3c5cacc0321c755 ("ALSA: ump: Update substream name from assigned FB names")). Link: https://patch.msgid.link/20241104100735.16127-1-tiwai@suse.de Signed-off-by: Takashi Iwai --- sound/core/ump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/core/ump.c b/sound/core/ump.c index cf22a17e38dd..7d59a0a9b037 100644 --- a/sound/core/ump.c +++ b/sound/core/ump.c @@ -1233,7 +1233,7 @@ static int fill_legacy_mapping(struct snd_ump_endpoint *ump) num = 0; for (i = 0; i < SNDRV_UMP_MAX_GROUPS; i++) - if (group_maps & (1U << i)) + if ((group_maps & (1U << i)) && ump->groups[i].valid) ump->legacy_mapping[num++] = i; return num; From 8abbf1f01d6a2ef9f911f793e30f7382154b5a3a Mon Sep 17 00:00:00 2001 From: Murad Masimov Date: Fri, 1 Nov 2024 21:55:13 +0300 Subject: [PATCH 043/235] ALSA: firewire-lib: fix return value on fail in amdtp_tscm_init() If amdtp_stream_init() fails in amdtp_tscm_init(), the latter returns zero, though it is supposed to return an error code, which is checked inside init_stream() in tascam-stream.c. Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 47faeea25ef3 ("ALSA: firewire-tascam: add data block processing layer") Signed-off-by: Murad Masimov Reviewed-by: Takashi Sakamoto Signed-off-by: Takashi Iwai Link: https://patch.msgid.link/20241101185517.1819-1-m.masimov@maxima.ru --- sound/firewire/tascam/amdtp-tascam.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/firewire/tascam/amdtp-tascam.c b/sound/firewire/tascam/amdtp-tascam.c index 0b42d6559008..079afa4bd381 100644 --- a/sound/firewire/tascam/amdtp-tascam.c +++ b/sound/firewire/tascam/amdtp-tascam.c @@ -238,7 +238,7 @@ int amdtp_tscm_init(struct amdtp_stream *s, struct fw_unit *unit, err = amdtp_stream_init(s, unit, dir, flags, fmt, process_ctx_payloads, sizeof(struct amdtp_tscm)); if (err < 0) - return 0; + return err; if (dir == AMDTP_OUT_STREAM) { // Use fixed value for FDF field. From fe09de2db2365eed8b44b572cff7d421eaf1754a Mon Sep 17 00:00:00 2001 From: Shenghao Ding Date: Mon, 4 Nov 2024 18:00:55 +0800 Subject: [PATCH 044/235] ASoC: tas2781: Add new driver version for tas2563 & tas2781 qfn chip Add a new driver version to support the tas2563 & tas2781 qfn chip. Signed-off-by: Shenghao Ding Link: https://patch.msgid.link/20241104100055.48-1-shenghao-ding@ti.com Signed-off-by: Mark Brown --- sound/soc/codecs/tas2781-fmwlib.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/soc/codecs/tas2781-fmwlib.c b/sound/soc/codecs/tas2781-fmwlib.c index ae360c97fe1e..0aeb88abbf52 100644 --- a/sound/soc/codecs/tas2781-fmwlib.c +++ b/sound/soc/codecs/tas2781-fmwlib.c @@ -1992,6 +1992,7 @@ static int tasdevice_dspfw_ready(const struct firmware *fmw, break; case 0x202: case 0x400: + case 0x401: tas_priv->fw_parse_variable_header = fw_parse_variable_header_git; tas_priv->fw_parse_program_data = From ebdcba2126a817da4efc085c9d4dce0c51942eba Mon Sep 17 00:00:00 2001 From: Raju Rangoju Date: Mon, 4 Nov 2024 11:53:27 +0530 Subject: [PATCH 045/235] MAINTAINERS: update AMD SPI maintainer 'Sanjay R Mehta' is no longer with AMD; I will take over as the maintainer of the AMD SPI driver moving forward. I request to be added as the new maintainer. Signed-off-by: Raju Rangoju Link: https://patch.msgid.link/20241104062327.1228521-1-Raju.Rangoju@amd.com Signed-off-by: Mark Brown --- MAINTAINERS | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index a097afd76ded..3e251b357d54 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1181,8 +1181,9 @@ F: Documentation/hid/amd-sfh* F: drivers/hid/amd-sfh-hid/ AMD SPI DRIVER -M: Sanjay R Mehta -S: Maintained +M: Raju Rangoju +L: linux-spi@vger.kernel.org +S: Supported F: drivers/spi/spi-amd.c AMD XGBE DRIVER From f16beaaee248eaa37ad40b5905924fcf70ae02e3 Mon Sep 17 00:00:00 2001 From: Dmitry Baryshkov Date: Fri, 11 Oct 2024 08:48:39 +0300 Subject: [PATCH 046/235] thermal/drivers/qcom/lmh: Remove false lockdep backtrace Annotate LMH IRQs with lockdep classes so that lockdep doesn't report a possible recursive locking issue between LMH and GIC interrupts.
For the reference: CPU0 ---- lock(&irq_desc_lock_class); lock(&irq_desc_lock_class); *** DEADLOCK *** Call trace: dump_backtrace+0x98/0xf0 show_stack+0x18/0x24 dump_stack_lvl+0x90/0xd0 dump_stack+0x18/0x24 print_deadlock_bug+0x258/0x348 __lock_acquire+0x1078/0x1f44 lock_acquire+0x1fc/0x32c _raw_spin_lock_irqsave+0x60/0x88 __irq_get_desc_lock+0x58/0x98 enable_irq+0x38/0xa0 lmh_enable_interrupt+0x2c/0x38 irq_enable+0x40/0x8c __irq_startup+0x78/0xa4 irq_startup+0x78/0x168 __enable_irq+0x70/0x7c enable_irq+0x4c/0xa0 qcom_cpufreq_ready+0x20/0x2c cpufreq_online+0x2a8/0x988 cpufreq_add_dev+0x80/0x98 subsys_interface_register+0x104/0x134 cpufreq_register_driver+0x150/0x234 qcom_cpufreq_hw_driver_probe+0x2a8/0x388 platform_probe+0x68/0xc0 really_probe+0xbc/0x298 __driver_probe_device+0x78/0x12c driver_probe_device+0x3c/0x160 __device_attach_driver+0xb8/0x138 bus_for_each_drv+0x84/0xe0 __device_attach+0x9c/0x188 device_initial_probe+0x14/0x20 bus_probe_device+0xac/0xb0 deferred_probe_work_func+0x8c/0xc8 process_one_work+0x20c/0x62c worker_thread+0x1bc/0x36c kthread+0x120/0x124 ret_from_fork+0x10/0x20 Fixes: 53bca371cdf7 ("thermal/drivers/qcom: Add support for LMh driver") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20241011-lmh-lockdep-v1-1-495cbbe6fef1@linaro.org Signed-off-by: Daniel Lezcano --- drivers/thermal/qcom/lmh.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/thermal/qcom/lmh.c b/drivers/thermal/qcom/lmh.c index 5225b3621a56..d2d49264cf83 100644 --- a/drivers/thermal/qcom/lmh.c +++ b/drivers/thermal/qcom/lmh.c @@ -73,7 +73,14 @@ static struct irq_chip lmh_irq_chip = { static int lmh_irq_map(struct irq_domain *d, unsigned int irq, irq_hw_number_t hw) { struct lmh_hw_data *lmh_data = d->host_data; + static struct lock_class_key lmh_lock_key; + static struct lock_class_key lmh_request_key; + /* + * This lock class tells lockdep that GPIO irqs are in a different + * category than their parents, so it won't report false recursion. + */ + irq_set_lockdep_class(irq, &lmh_lock_key, &lmh_request_key); irq_set_chip_and_handler(irq, &lmh_irq_chip, handle_simple_irq); irq_set_chip_data(irq, lmh_data); From fcd54cf480c87b96313a97dbf898c644b7bb3a2e Mon Sep 17 00:00:00 2001 From: Emil Dahl Juhl Date: Tue, 15 Oct 2024 19:18:26 +0200 Subject: [PATCH 047/235] tools/lib/thermal: Fix sampling handler context ptr The sampling handler, provided by the user alongside a void* context, was invoked with an internal structure instead of the user context. Correct the invocation of the sampling handler to pass the user context pointer instead. Note that the approach taken is similar to that in events.c, and will reduce the chances of this mistake happening if additional sampling callbacks are added. 
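This bug class is easy to reintroduce whenever user callbacks are wrapped; a self-contained sketch of the corrected dispatch, with hypothetical types mirroring the description above:

  #include <stddef.h>

  typedef int (*sample_cb_t)(int tz_id, int temp, void *user_ctx);

  struct handler_param {
          sample_cb_t cb;
          void *user_ctx; /* the pointer the user registered */
  };

  /* Dispatch: hand back the user's own context, not the wrapper. */
  static int dispatch_sample(void *arg, int tz_id, int temp)
  {
          struct handler_param *param = arg;

          return param->cb(tz_id, temp, param->user_ctx); /* not 'param' */
  }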
Fixes: 47c4b0de080a ("tools/lib/thermal: Add a thermal library") Signed-off-by: Emil Dahl Juhl Link: https://lore.kernel.org/r/20241015171826.170154-1-emdj@bang-olufsen.dk Signed-off-by: Daniel Lezcano --- tools/lib/thermal/sampling.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/lib/thermal/sampling.c b/tools/lib/thermal/sampling.c index 70577423a9f0..f67c1f9ea1d7 100644 --- a/tools/lib/thermal/sampling.c +++ b/tools/lib/thermal/sampling.c @@ -16,6 +16,8 @@ static int handle_thermal_sample(struct nl_msg *n, void *arg) struct thermal_handler_param *thp = arg; struct thermal_handler *th = thp->th; + arg = thp->arg; + genlmsg_parse(nlh, 0, attrs, THERMAL_GENL_ATTR_MAX, NULL); switch (genlhdr->cmd) { From c5426dcc5a3a064bbd2de383e29035a14fe933e0 Mon Sep 17 00:00:00 2001 From: zhang jiao Date: Thu, 12 Sep 2024 12:50:31 +0800 Subject: [PATCH 048/235] tools/lib/thermal: Remove the thermal.h soft link when doing make clean Running "make -C tools thermal" creates a soft link for thermal.h in tools/include/uapi/linux. Remove it when doing make clean. Signed-off-by: zhang jiao Link: https://lore.kernel.org/r/20240912045031.18426-1-zhangjiao2@cmss.chinamobile.com Signed-off-by: Daniel Lezcano --- tools/lib/thermal/Makefile | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/lib/thermal/Makefile b/tools/lib/thermal/Makefile index 2d0d255fd0e1..8890fd57b110 100644 --- a/tools/lib/thermal/Makefile +++ b/tools/lib/thermal/Makefile @@ -121,7 +121,9 @@ all: fixdep clean: $(call QUIET_CLEAN, libthermal) $(RM) $(LIBTHERMAL_A) \ - *.o *~ *.a *.so *.so.$(VERSION) *.so.$(LIBTHERMAL_VERSION) .*.d .*.cmd LIBTHERMAL-CFLAGS $(LIBTHERMAL_PC) + *.o *~ *.a *.so *.so.$(VERSION) *.so.$(LIBTHERMAL_VERSION) \ + .*.d .*.cmd LIBTHERMAL-CFLAGS $(LIBTHERMAL_PC) \ + $(srctree)/tools/$(THERMAL_UAPI) $(LIBTHERMAL_PC): $(QUIET_GEN)sed -e "s|@PREFIX@|$(prefix)|" \ From 725f31f300e300a9d94976bd8f1db6e746f95f63 Mon Sep 17 00:00:00 2001 From: Icenowy Zheng Date: Fri, 18 Oct 2024 15:31:36 +0800 Subject: [PATCH 049/235] thermal/of: support thermal zones w/o trips subnode Although the current device tree binding of thermal zones requires the trips subnode, the binding in kernel v5.15 does not require it, and many device trees shipped with the kernel, for example, allwinner/sun50i-a64.dtsi and mediatek/mt8183-kukui.dtsi in ARM64, still comply with the old binding and contain no trips subnode. Allow the code to successfully register thermal zones w/o trips subnode for DT binding compatibility now. Further, the inconsistency between DTs and bindings should be resolved by either adding an empty trips subnode or dropping the trips subnode requirement. Fixes: d0c75fa2c17f ("thermal/of: Initialize trip points separately") Signed-off-by: Icenowy Zheng [wenst@chromium.org: Reworked logic and kernel log messages] Signed-off-by: Chen-Yu Tsai Reviewed-by: Rafael J. 
Wysocki Link: https://lore.kernel.org/r/20241018073139.1268995-1-wenst@chromium.org Signed-off-by: Daniel Lezcano --- drivers/thermal/thermal_of.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c index a4caf7899f8e..07e09897165f 100644 --- a/drivers/thermal/thermal_of.c +++ b/drivers/thermal/thermal_of.c @@ -99,18 +99,15 @@ static struct thermal_trip *thermal_of_trips_init(struct device_node *np, int *n struct device_node *trips; int ret, count; + *ntrips = 0; + trips = of_get_child_by_name(np, "trips"); - if (!trips) { - pr_err("Failed to find 'trips' node\n"); - return ERR_PTR(-EINVAL); - } + if (!trips) + return NULL; count = of_get_child_count(trips); - if (!count) { - pr_err("No trip point defined\n"); - ret = -EINVAL; - goto out_of_node_put; - } + if (!count) + return NULL; tt = kzalloc(sizeof(*tt) * count, GFP_KERNEL); if (!tt) { @@ -133,7 +130,6 @@ static struct thermal_trip *thermal_of_trips_init(struct device_node *np, int *n out_kfree: kfree(tt); - *ntrips = 0; out_of_node_put: of_node_put(trips); @@ -401,11 +397,14 @@ static struct thermal_zone_device *thermal_of_zone_register(struct device_node * trips = thermal_of_trips_init(np, &ntrips); if (IS_ERR(trips)) { - pr_err("Failed to find trip points for %pOFn id=%d\n", sensor, id); + pr_err("Failed to parse trip points for %pOFn id=%d\n", sensor, id); ret = PTR_ERR(trips); goto out_of_node_put; } + if (!trips) + pr_info("No trip points found for %pOFn id=%d\n", sensor, id); + ret = thermal_of_monitor_init(np, &delay, &pdelay); if (ret) { pr_err("Failed to initialize monitoring delays from %pOFn\n", np); From 7fd3fa006fa56c0ec299c61ecf5c572c723adad5 Mon Sep 17 00:00:00 2001 From: Balasubramani Vivekanandan Date: Tue, 8 Oct 2024 13:06:27 +0530 Subject: [PATCH 050/235] drm/xe: Set mask bits for CCS_MODE register CCS_MODE register requires setting mask bits from Xe2+ platforms. Set the mask bits unconditionally, as those bits are unused for older platforms. Signed-off-by: Balasubramani Vivekanandan Cc: stable@vger.kernel.org # v6.11+ Reviewed-by: Lucas De Marchi Link: https://patchwork.freedesktop.org/patch/msgid/20241008073628.377433-2-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi (cherry picked from commit 23ea2c7572d4735ef66beb1e4feb8ae510b78247) [ Fix conflict with mmio refactors ] Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/regs/xe_gt_regs.h | 2 +- drivers/gpu/drm/xe/xe_gt_ccs_mode.c | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_regs.h index 00ad34ed73a5..bd604b9f08e4 100644 --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h @@ -517,7 +517,7 @@ * [4-6] RSVD * [7] Disabled */ -#define CCS_MODE XE_REG(0x14804) +#define CCS_MODE XE_REG(0x14804, XE_REG_OPTION_MASKED) #define CCS_MODE_CSLICE_0_3_MASK REG_GENMASK(11, 0) /* 3 bits per cslice */ #define CCS_MODE_CSLICE_MASK 0x7 /* CCS0-3 + rsvd */ #define CCS_MODE_CSLICE_WIDTH ilog2(CCS_MODE_CSLICE_MASK + 1) diff --git a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c index d2e4dc3aaf61..b8d832c8f907 100644 --- a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c +++ b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c @@ -68,6 +68,12 @@ static void __xe_gt_apply_ccs_mode(struct xe_gt *gt, u32 num_engines) } } + /* + * Mask bits need to be set for the register. 
Though only Xe2+ + * platforms require setting of mask bits, it won't harm for older + * platforms as these bits are unused there. + */ + mode |= CCS_MODE_CSLICE_0_3_MASK << 16; xe_mmio_write32(gt, CCS_MODE, mode); xe_gt_dbg(gt, "CCS_MODE=%x config:%08x, num_engines:%d, num_slices:%d\n", From 4b468a92ddb2985da66823910a1643349fe6447d Mon Sep 17 00:00:00 2001 From: Balasubramani Vivekanandan Date: Tue, 8 Oct 2024 13:06:28 +0530 Subject: [PATCH 051/235] drm/xe: Use the filelist from drm for ccs_mode change Drop the exclusive client count tracking and use the filelist from the drm to track the active clients. This also ensures the clients created internally by the driver won't block changing the ccs mode. Fixes: ce8c161cbad4 ("drm/xe: Add ref counting for xe_file") Signed-off-by: Balasubramani Vivekanandan Reviewed-by: Lucas De Marchi Link: https://patchwork.freedesktop.org/patch/msgid/20241008073628.377433-3-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi (cherry picked from commit 1c35f1ed1fe3c649f8c16214d0d3dd828b5265d9) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_device.c | 10 ---------- drivers/gpu/drm/xe/xe_device_types.h | 9 --------- drivers/gpu/drm/xe/xe_gt_ccs_mode.c | 9 +++++---- 3 files changed, 5 insertions(+), 23 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index 10fd4601b9f2..a1987b554a8d 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -87,10 +87,6 @@ static int xe_file_open(struct drm_device *dev, struct drm_file *file) mutex_init(&xef->exec_queue.lock); xa_init_flags(&xef->exec_queue.xa, XA_FLAGS_ALLOC1); - spin_lock(&xe->clients.lock); - xe->clients.count++; - spin_unlock(&xe->clients.lock); - file->driver_priv = xef; kref_init(&xef->refcount); @@ -107,17 +103,12 @@ static int xe_file_open(struct drm_device *dev, struct drm_file *file) static void xe_file_destroy(struct kref *ref) { struct xe_file *xef = container_of(ref, struct xe_file, refcount); - struct xe_device *xe = xef->xe; xa_destroy(&xef->exec_queue.xa); mutex_destroy(&xef->exec_queue.lock); xa_destroy(&xef->vm.xa); mutex_destroy(&xef->vm.lock); - spin_lock(&xe->clients.lock); - xe->clients.count--; - spin_unlock(&xe->clients.lock); - xe_drm_client_put(xef->client); kfree(xef->process_name); kfree(xef); @@ -333,7 +324,6 @@ struct xe_device *xe_device_create(struct pci_dev *pdev, xe->info.force_execlist = xe_modparam.force_execlist; spin_lock_init(&xe->irq.lock); - spin_lock_init(&xe->clients.lock); init_waitqueue_head(&xe->ufence_wq); diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h index 09d731a9125c..687f3a9039bb 100644 --- a/drivers/gpu/drm/xe/xe_device_types.h +++ b/drivers/gpu/drm/xe/xe_device_types.h @@ -353,15 +353,6 @@ struct xe_device { struct workqueue_struct *wq; } sriov; - /** @clients: drm clients info */ - struct { - /** @clients.lock: Protects drm clients info */ - spinlock_t lock; - - /** @clients.count: number of drm clients */ - u64 count; - } clients; - /** @usm: unified memory state */ struct { /** @usm.asid: convert a ASID to VM */ diff --git a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c index b8d832c8f907..ffcbd05671fc 100644 --- a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c +++ b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c @@ -139,9 +139,10 @@ ccs_mode_store(struct device *kdev, struct device_attribute *attr, } /* CCS mode can only be updated when there are no drm clients */ - spin_lock(&xe->clients.lock); - if 
(xe->clients.count) {
-        spin_unlock(&xe->clients.lock);
+    mutex_lock(&xe->drm.filelist_mutex);
+    if (!list_empty(&xe->drm.filelist)) {
+        mutex_unlock(&xe->drm.filelist_mutex);
+        xe_gt_dbg(gt, "Rejecting compute mode change as there are active drm clients\n");
         return -EBUSY;
     }
@@ -152,7 +153,7 @@ ccs_mode_store(struct device *kdev, struct device_attribute *attr,
         xe_gt_reset_async(gt);
     }

-    spin_unlock(&xe->clients.lock);
+    mutex_unlock(&xe->drm.filelist_mutex);

     return count;
 }

From 55e8a3f37e54eb1c7b914d6d5565a37282ec1978 Mon Sep 17 00:00:00 2001
From: Nirmoy Das
Date: Tue, 29 Oct 2024 13:01:15 +0100
Subject: [PATCH 052/235] drm/xe: Move LNL scheduling WA to xe_device.h

Move the LNL scheduling WA to xe_device.h so it can be used in other
places without needing to duplicate the comment about the WA's future
removal. The WA, which flushes work or workqueues, is now wrapped in
macros and can be reused wherever needed.

Cc: Badal Nilawar
Cc: Matthew Auld
Cc: Matthew Brost
Cc: Himal Prasad Ghimiray
Cc: Lucas De Marchi
cc: stable@vger.kernel.org # v6.11+
Suggested-by: John Harrison
Signed-off-by: Nirmoy Das
Reviewed-by: Matthew Auld
Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-1-nirmoy.das@intel.com
Signed-off-by: Lucas De Marchi
(cherry picked from commit cbe006a6492c01a0058912ae15d473f4c149896c)
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
 drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 894f04770454..34620ef855c0 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -178,4 +178,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);

+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+    flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+    flush_work(wrk__)
+
 #endif

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 17986bfd8818..9c505d3517cd 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -897,17 +897,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,

     ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

-    /*
-     * Occasionally it is seen that the G2H worker starts running after a delay of more than
-     * a second even after being queued and activated by the Linux workqueue subsystem. This
-     * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
-     * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
-     * and this is beyond xe kmd.
-     *
-     * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
-     */
     if (!ret) {
-        flush_work(&ct->g2h_worker);
+        LNL_FLUSH_WORK(&ct->g2h_worker);
         if (g2h_fence.done) {
             xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
                    g2h_fence.seqno, action[0]);

From 7d1e2580ed166f36949b468373b468d188880cd3 Mon Sep 17 00:00:00 2001
From: Nirmoy Das
Date: Tue, 29 Oct 2024 13:01:16 +0100
Subject: [PATCH 053/235] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout

Flush the xe ordered_wq in case of a ufence timeout, which is observed
on LNL and points to a recent scheduling issue with E-cores. This is
similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush
g2h worker in case of g2h response timeout") and should be removed once
there is an E-core scheduling fix for LNL.

v2: Add platform check (Himal)
    s/__flush_workqueue/flush_workqueue (Jani)
v3: Remove gfx platform check as the issue is related to the CPU
    platform (John)
v4: Use the common macro (John) and print when the flush resolves the
    timeout (Matt B)

Cc: Badal Nilawar
Cc: Matthew Auld
Cc: John Harrison
Cc: Himal Prasad Ghimiray
Cc: Lucas De Marchi
Cc: stable@vger.kernel.org # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
Suggested-by: Matthew Brost
Signed-off-by: Nirmoy Das
Reviewed-by: Matthew Auld
Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-2-nirmoy.das@intel.com
Signed-off-by: Lucas De Marchi
(cherry picked from commit 38c4c8722bd74452280951edc44c23de47612001)
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_wait_user_fence.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..5b4264ea38bd 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -155,6 +155,13 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
         }

         if (!timeout) {
+            LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
+            err = do_compare(addr, args->value, args->mask,
+                     args->op);
+            if (err <= 0) {
+                drm_dbg(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
+                break;
+            }
             err = -ETIME;
             break;
         }

From 1491efb39acee3848b61fcb3e5cc4be8de304352 Mon Sep 17 00:00:00 2001
From: Nirmoy Das
Date: Tue, 29 Oct 2024 13:01:17 +0100
Subject: [PATCH 054/235] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout

Flush the g2h worker explicitly if a TLB timeout happens, which is
observed on LNL and points to the recent scheduling issue with E-cores
on LNL. This is similar to the recent fix: commit e51527233804
("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and
should be removed once there is an E-core scheduling fix.
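All three of these LNL workarounds share the same shape. A condensed,
illustrative sketch (wq, worker and done are stand-ins for the example,
not the exact driver symbols):

  ret = wait_event_timeout(wq, done, HZ);
  if (!ret) {
      /* Scheduling-latency WA: the worker may simply not have run
       * yet, so flush it and re-check before reporting a timeout.
       */
      LNL_FLUSH_WORK(&worker);
      if (done)
          return 0;   /* the work had merely been delayed */
      return -ETIME;  /* a genuine timeout */
  }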
v2: Add platform check (Himal)
v3: Remove gfx platform check as the issue is related to the CPU
    platform (John)
    Use the common WA macro (John) and print when the flush resolves
    the timeout (Matt B)
v4: Remove the resolves log and do the flush before taking
    pending_lock (Matt A)

Cc: Badal Nilawar
Cc: Matthew Brost
Cc: Matthew Auld
Cc: John Harrison
Cc: Himal Prasad Ghimiray
Cc: Lucas De Marchi
Cc: stable@vger.kernel.org # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
Signed-off-by: Nirmoy Das
Reviewed-by: Matthew Auld
Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-3-nirmoy.das@intel.com
Signed-off-by: Lucas De Marchi
(cherry picked from commit e1f6fa55664a0eeb0a641f497e1adfcf6672e995)
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index bbb9e411d21f..9d82ea30f4df 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
     struct xe_device *xe = gt_to_xe(gt);
     struct xe_gt_tlb_invalidation_fence *fence, *next;

+    LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);
+
     spin_lock_irq(&gt->tlb_invalidation.pending_lock);
     list_for_each_entry_safe(fence, next,
                  &gt->tlb_invalidation.pending_fences, link) {

From 4f26c95ffc21a91281429ed60180619bae19ae92 Mon Sep 17 00:00:00 2001
From: Tom Chung
Date: Wed, 9 Oct 2024 17:09:38 +0800
Subject: [PATCH 055/235] drm/amd/display: Fix brightness level not retained over reboot

[Why]
During boot up and resume, the DC layer will reset the panel brightness
to fix a flicker issue. This causes dm->actual_brightness to no longer
be the current panel brightness level (dm->brightness holds the correct
panel level).

[How]
Set the backlight level after doing the mode set.

Cc: Mario Limonciello
Cc: Alex Deucher
Fixes: d9e865826c20 ("drm/amd/display: Simplify brightness initialization")
Reported-by: Mark Herbert
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3655
Reviewed-by: Sun peng Li
Signed-off-by: Tom Chung
Signed-off-by: Zaeem Mohamed
Tested-by: Daniel Wheeler
Signed-off-by: Alex Deucher
(cherry picked from commit 7875afafba84817b791be6d2282b836695146060)
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 13421a58210d..07e9ce99694f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -9429,6 +9429,7 @@ static void amdgpu_dm_commit_streams(struct drm_atomic_state *state,
     bool mode_set_reset_required = false;
     u32 i;
     struct dc_commit_streams_params params = {dc_state->streams, dc_state->stream_count};
+    bool set_backlight_level = false;

     /* Disable writeback */
     for_each_old_connector_in_state(state, connector, old_con_state, i) {
@@ -9548,6 +9549,7 @@ static void amdgpu_dm_commit_streams(struct drm_atomic_state *state,
             acrtc->hw_mode = new_crtc_state->mode;
             crtc->hwmode = new_crtc_state->mode;
             mode_set_reset_required = true;
+            set_backlight_level = true;
         } else if (modereset_required(new_crtc_state)) {
             drm_dbg_atomic(dev,
                        "Atomic commit: RESET. crtc id %d:[%p]\n",
@@ -9599,6 +9601,19 @@ static void amdgpu_dm_commit_streams(struct drm_atomic_state *state,
             acrtc->otg_inst = status->primary_otg_inst;
         }
     }
+
+    /* During boot up and resume the DC layer will reset the panel brightness
+     * to fix a flicker issue.
+     * It will cause the dm->actual_brightness is not the current panel brightness
+     * level. (the dm->brightness is the correct panel level)
+     * So we set the backlight level with dm->brightness value after set mode
+     */
+    if (set_backlight_level) {
+        for (i = 0; i < dm->num_of_edps; i++) {
+            if (dm->backlight_dev[i])
+                amdgpu_dm_backlight_set_level(dm, i, dm->brightness[i]);
+        }
+    }
 }

 static void dm_set_writeback(struct amdgpu_display_manager *dm,

From 694c79769cb384bca8b1ec1d1e84156e726bd106 Mon Sep 17 00:00:00 2001
From: Aurabindo Pillai
Date: Fri, 18 Oct 2024 10:52:16 -0400
Subject: [PATCH 056/235] drm/amd/display: parse umc_info or vram_info based on ASIC

An upstream bug report suggests that there are production dGPUs that
are older than DCN401 but still have a umc_info in the VBIOS tables
with the same version as expected for a DCN401 product. Hence, reading
these tables should be guarded with a version check.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3678
Reviewed-by: Dillon Varone
Signed-off-by: Aurabindo Pillai
Signed-off-by: Zaeem Mohamed
Tested-by: Daniel Wheeler
Signed-off-by: Alex Deucher
(cherry picked from commit 2551b4a321a68134360b860113dd460133e856e5)
Fixes: 00c391102abc ("drm/amd/display: Add misc DC changes for DCN401")
Cc: stable@vger.kernel.org # 6.11.x
---
 drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c b/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
index 0d8498ab9b23..be8fbb04ad98 100644
--- a/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
+++ b/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
@@ -3127,7 +3127,9 @@ static enum bp_result bios_parser_get_vram_info(
     struct atom_data_revision revision;

     // vram info moved to umc_info for DCN4x
-    if (info && DATA_TABLES(umc_info)) {
+    if (dcb->ctx->dce_version >= DCN_VERSION_4_01 &&
+        dcb->ctx->dce_version < DCN_VERSION_MAX &&
+        info && DATA_TABLES(umc_info)) {
         header = GET_IMAGE(struct atom_common_table_header,
                    DATA_TABLES(umc_info));

From a6dd15981c03f2cdc9a351a278f09b5479d53d2e Mon Sep 17 00:00:00 2001
From: Antonio Quartulli
Date: Thu, 31 Oct 2024 16:28:48 +0100
Subject: [PATCH 057/235] drm/amdgpu: prevent NULL pointer dereference if ATIF is not supported

acpi_evaluate_object() may return AE_NOT_FOUND (failure), which would
result in dereferencing buffer.pointer (obj) while it is NULL.

Although this case may be unrealistic for the current code, it is
still better to protect against possible bugs.

Also bail out when the status is AE_NOT_FOUND.
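As a hedged sketch of the resulting call pattern (condensed from the
diff below; the arguments to acpi_evaluate_object() are approximated,
since the hunk does not show them in full):

  status = acpi_evaluate_object(handle, NULL, &params, &buffer);
  obj = (union acpi_object *)buffer.pointer;

  /* Every ACPI failure, including AE_NOT_FOUND, now bails out, so
   * obj is never dereferenced while buffer.pointer is NULL.
   */
  if (ACPI_FAILURE(status)) {
      DRM_DEBUG_DRIVER("failed to evaluate ATIF got %s\n",
               acpi_format_exception(status));
      kfree(obj);
      return NULL;
  }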
This fixes 1 FORWARD_NULL issue reported by Coverity Report: CID 1600951: Null pointer dereferences (FORWARD_NULL) Signed-off-by: Antonio Quartulli Fixes: c9b7c809b89f ("drm/amd: Guard against bad data for ATIF ACPI method") Reviewed-by: Mario Limonciello Link: https://lore.kernel.org/r/20241031152848.4716-1-antonio@mandelbit.com Signed-off-by: Mario Limonciello Signed-off-by: Alex Deucher (cherry picked from commit 91c9e221fe2553edf2db71627d8453f083de87a1) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c index 1f5a296f5ed2..7dd55ed57c1d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c @@ -172,8 +172,8 @@ static union acpi_object *amdgpu_atif_call(struct amdgpu_atif *atif, &buffer); obj = (union acpi_object *)buffer.pointer; - /* Fail if calling the method fails and ATIF is supported */ - if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) { + /* Fail if calling the method fails */ + if (ACPI_FAILURE(status)) { DRM_DEBUG_DRIVER("failed to evaluate ATIF got %s\n", acpi_format_exception(status)); kfree(obj); From 1356bfc54c8d4c8e7c9fb8553dc8c28e9714b07b Mon Sep 17 00:00:00 2001 From: Kenneth Feng Date: Fri, 1 Nov 2024 11:55:25 +0800 Subject: [PATCH 058/235] drm/amd/pm: always pick the pptable from IFWI always pick the pptable from IFWI on smu v14.0.2/3 Signed-off-by: Kenneth Feng Reviewed-by: Yang Wang Signed-off-by: Alex Deucher (cherry picked from commit 136ce12bd5907388cb4e9aa63ee5c9c8c441640b) Cc: stable@vger.kernel.org # 6.11.x --- .../drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c | 65 +------------------ 1 file changed, 1 insertion(+), 64 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c index e83ea2bc7f9c..1e16a281f2dc 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c @@ -367,54 +367,6 @@ static int smu_v14_0_2_store_powerplay_table(struct smu_context *smu) return 0; } -#ifndef atom_smc_dpm_info_table_14_0_0 -struct atom_smc_dpm_info_table_14_0_0 { - struct atom_common_table_header table_header; - BoardTable_t BoardTable; -}; -#endif - -static int smu_v14_0_2_append_powerplay_table(struct smu_context *smu) -{ - struct smu_table_context *table_context = &smu->smu_table; - PPTable_t *smc_pptable = table_context->driver_pptable; - struct atom_smc_dpm_info_table_14_0_0 *smc_dpm_table; - BoardTable_t *BoardTable = &smc_pptable->BoardTable; - int index, ret; - - index = get_index_into_master_table(atom_master_list_of_data_tables_v2_1, - smc_dpm_info); - - ret = amdgpu_atombios_get_data_table(smu->adev, index, NULL, NULL, NULL, - (uint8_t **)&smc_dpm_table); - if (ret) - return ret; - - memcpy(BoardTable, &smc_dpm_table->BoardTable, sizeof(BoardTable_t)); - - return 0; -} - -#if 0 -static int smu_v14_0_2_get_pptable_from_pmfw(struct smu_context *smu, - void **table, - uint32_t *size) -{ - struct smu_table_context *smu_table = &smu->smu_table; - void *combo_pptable = smu_table->combo_pptable; - int ret = 0; - - ret = smu_cmn_get_combo_pptable(smu); - if (ret) - return ret; - - *table = combo_pptable; - *size = sizeof(struct smu_14_0_powerplay_table); - - return 0; -} -#endif - static int smu_v14_0_2_get_pptable_from_pmfw(struct smu_context *smu, void **table, uint32_t *size) @@ -436,16 +388,12 @@ static int 
smu_v14_0_2_get_pptable_from_pmfw(struct smu_context *smu,
 static int smu_v14_0_2_setup_pptable(struct smu_context *smu)
 {
     struct smu_table_context *smu_table = &smu->smu_table;
-    struct amdgpu_device *adev = smu->adev;
     int ret = 0;

     if (amdgpu_sriov_vf(smu->adev))
         return 0;

-    if (!adev->scpm_enabled)
-        ret = smu_v14_0_setup_pptable(smu);
-    else
-        ret = smu_v14_0_2_get_pptable_from_pmfw(smu,
+    ret = smu_v14_0_2_get_pptable_from_pmfw(smu,
                         &smu_table->power_play_table,
                         &smu_table->power_play_table_size);
     if (ret)
@@ -455,16 +403,6 @@ static int smu_v14_0_2_setup_pptable(struct smu_context *smu)
     if (ret)
         return ret;

-    /*
-     * With SCPM enabled, the operation below will be handled
-     * by PSP. Driver involvment is unnecessary and useless.
-     */
-    if (!adev->scpm_enabled) {
-        ret = smu_v14_0_2_append_powerplay_table(smu);
-        if (ret)
-            return ret;
-    }
-
     ret = smu_v14_0_2_check_powerplay_table(smu);
     if (ret)
         return ret;
@@ -2799,7 +2737,6 @@ static const struct pptable_funcs smu_v14_0_2_ppt_funcs = {
     .check_fw_status = smu_v14_0_check_fw_status,
     .setup_pptable = smu_v14_0_2_setup_pptable,
     .check_fw_version = smu_v14_0_check_fw_version,
-    .write_pptable = smu_cmn_write_pptable,
     .set_driver_table_location = smu_v14_0_set_driver_table_location,
     .system_features_control = smu_v14_0_system_features_control,
     .set_allowed_mask = smu_v14_0_set_allowed_mask,

From 74e1006430a5377228e49310f6d915628609929e Mon Sep 17 00:00:00 2001
From: Kenneth Feng
Date: Wed, 30 Oct 2024 13:22:44 +0800
Subject: [PATCH 059/235] drm/amd/pm: correct the workload setting

Correct the workload setting in order not to mix the driver's setting
with the end user's. Update the workload mask accordingly.

v2: changes as below:
1. the end user cannot erase a workload from the driver, except for
   the default workload.
2. always show the real highest-priority workload to the end user.
3. the real workload mask is a combination of the driver workload mask
   and the end user workload mask (see the sketch after this changelog).
v3: apply this to the other ASICs as well.
v4: simplify the code
v5: refine the code based on the review comments.
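Condensed from the diffs below, the combined-mask rule this series
establishes is roughly:

  /* The mask actually sent to the SMU is the union of what the driver
   * needs and what the end user requested.
   */
  smu->workload_mask = smu->driver_workload_mask |
               smu->user_dpm_profile.user_workload_mask;

  /* The profile reported back is always the highest-priority bit that
   * is currently set (this is what smu_cmn_assign_power_profile() does).
   */
  index = fls(smu->workload_mask);
  index = index > 0 && index <= WORKLOAD_POLICY_MAX ? index - 1 : 0;
  smu->power_profile_mode = smu->workload_setting[index];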
Signed-off-by: Kenneth Feng Acked-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit 8cc438be5d49b8326b2fcade0bdb7e6a97df9e0b) Cc: stable@vger.kernel.org # 6.11.x --- drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 49 +++++++++++++------ drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 4 +- .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c | 5 +- .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 5 +- .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 5 +- .../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 4 +- .../gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c | 4 +- .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 20 ++++++-- .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c | 5 +- .../drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c | 9 ++-- drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++ drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 2 + 12 files changed, 84 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c index 80e60ea2d11e..ee1bcfaae3e3 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c @@ -1259,26 +1259,33 @@ static int smu_sw_init(void *handle) smu->watermarks_bitmap = 0; smu->power_profile_mode = PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT; smu->default_power_profile_mode = PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT; + smu->user_dpm_profile.user_workload_mask = 0; atomic_set(&smu->smu_power.power_gate.vcn_gated, 1); atomic_set(&smu->smu_power.power_gate.jpeg_gated, 1); atomic_set(&smu->smu_power.power_gate.vpe_gated, 1); atomic_set(&smu->smu_power.power_gate.umsch_mm_gated, 1); - smu->workload_prority[PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT] = 0; - smu->workload_prority[PP_SMC_POWER_PROFILE_FULLSCREEN3D] = 1; - smu->workload_prority[PP_SMC_POWER_PROFILE_POWERSAVING] = 2; - smu->workload_prority[PP_SMC_POWER_PROFILE_VIDEO] = 3; - smu->workload_prority[PP_SMC_POWER_PROFILE_VR] = 4; - smu->workload_prority[PP_SMC_POWER_PROFILE_COMPUTE] = 5; - smu->workload_prority[PP_SMC_POWER_PROFILE_CUSTOM] = 6; + smu->workload_priority[PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT] = 0; + smu->workload_priority[PP_SMC_POWER_PROFILE_FULLSCREEN3D] = 1; + smu->workload_priority[PP_SMC_POWER_PROFILE_POWERSAVING] = 2; + smu->workload_priority[PP_SMC_POWER_PROFILE_VIDEO] = 3; + smu->workload_priority[PP_SMC_POWER_PROFILE_VR] = 4; + smu->workload_priority[PP_SMC_POWER_PROFILE_COMPUTE] = 5; + smu->workload_priority[PP_SMC_POWER_PROFILE_CUSTOM] = 6; if (smu->is_apu || - !smu_is_workload_profile_available(smu, PP_SMC_POWER_PROFILE_FULLSCREEN3D)) - smu->workload_mask = 1 << smu->workload_prority[PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT]; - else - smu->workload_mask = 1 << smu->workload_prority[PP_SMC_POWER_PROFILE_FULLSCREEN3D]; + !smu_is_workload_profile_available(smu, PP_SMC_POWER_PROFILE_FULLSCREEN3D)) { + smu->driver_workload_mask = + 1 << smu->workload_priority[PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT]; + } else { + smu->driver_workload_mask = + 1 << smu->workload_priority[PP_SMC_POWER_PROFILE_FULLSCREEN3D]; + smu->default_power_profile_mode = PP_SMC_POWER_PROFILE_FULLSCREEN3D; + } + smu->workload_mask = smu->driver_workload_mask | + smu->user_dpm_profile.user_workload_mask; smu->workload_setting[0] = PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT; smu->workload_setting[1] = PP_SMC_POWER_PROFILE_FULLSCREEN3D; smu->workload_setting[2] = PP_SMC_POWER_PROFILE_POWERSAVING; @@ -2348,17 +2355,20 @@ static int smu_switch_power_profile(void *handle, return -EINVAL; if (!en) { - smu->workload_mask &= ~(1 << smu->workload_prority[type]); + 
smu->driver_workload_mask &= ~(1 << smu->workload_priority[type]); index = fls(smu->workload_mask); index = index > 0 && index <= WORKLOAD_POLICY_MAX ? index - 1 : 0; workload[0] = smu->workload_setting[index]; } else { - smu->workload_mask |= (1 << smu->workload_prority[type]); + smu->driver_workload_mask |= (1 << smu->workload_priority[type]); index = fls(smu->workload_mask); index = index <= WORKLOAD_POLICY_MAX ? index - 1 : 0; workload[0] = smu->workload_setting[index]; } + smu->workload_mask = smu->driver_workload_mask | + smu->user_dpm_profile.user_workload_mask; + if (smu_dpm_ctx->dpm_level != AMD_DPM_FORCED_LEVEL_MANUAL && smu_dpm_ctx->dpm_level != AMD_DPM_FORCED_LEVEL_PERF_DETERMINISM) smu_bump_power_profile_mode(smu, workload, 0); @@ -3049,12 +3059,23 @@ static int smu_set_power_profile_mode(void *handle, uint32_t param_size) { struct smu_context *smu = handle; + int ret; if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled || !smu->ppt_funcs->set_power_profile_mode) return -EOPNOTSUPP; - return smu_bump_power_profile_mode(smu, param, param_size); + if (smu->user_dpm_profile.user_workload_mask & + (1 << smu->workload_priority[param[param_size]])) + return 0; + + smu->user_dpm_profile.user_workload_mask = + (1 << smu->workload_priority[param[param_size]]); + smu->workload_mask = smu->user_dpm_profile.user_workload_mask | + smu->driver_workload_mask; + ret = smu_bump_power_profile_mode(smu, param, param_size); + + return ret; } static int smu_get_fan_control_mode(void *handle, u32 *fan_mode) diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h index b44a185d07e8..d60d9a12a47e 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h @@ -240,6 +240,7 @@ struct smu_user_dpm_profile { /* user clock state information */ uint32_t clk_mask[SMU_CLK_COUNT]; uint32_t clk_dependency; + uint32_t user_workload_mask; }; #define SMU_TABLE_INIT(tables, table_id, s, a, d) \ @@ -557,7 +558,8 @@ struct smu_context { bool disable_uclk_switch; uint32_t workload_mask; - uint32_t workload_prority[WORKLOAD_POLICY_MAX]; + uint32_t driver_workload_mask; + uint32_t workload_priority[WORKLOAD_POLICY_MAX]; uint32_t workload_setting[WORKLOAD_POLICY_MAX]; uint32_t power_profile_mode; uint32_t default_power_profile_mode; diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c index c0f6b59369b7..31fe512028f4 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c @@ -1455,7 +1455,6 @@ static int arcturus_set_power_profile_mode(struct smu_context *smu, return -EINVAL; } - if ((profile_mode == PP_SMC_POWER_PROFILE_CUSTOM) && (smu->smc_fw_version >= 0x360d00)) { if (size != 10) @@ -1523,14 +1522,14 @@ static int arcturus_set_power_profile_mode(struct smu_context *smu, ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, - 1 << workload_type, + smu->workload_mask, NULL); if (ret) { dev_err(smu->adev->dev, "Fail to set workload type %d\n", workload_type); return ret; } - smu->power_profile_mode = profile_mode; + smu_cmn_assign_power_profile(smu); return 0; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c index 16af1a329621..12223f507977 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c @@ -2081,10 +2081,13 @@ static int 
navi10_set_power_profile_mode(struct smu_context *smu, long *input, u smu->power_profile_mode); if (workload_type < 0) return -EINVAL; + ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, - 1 << workload_type, NULL); + smu->workload_mask, NULL); if (ret) dev_err(smu->adev->dev, "[%s] Failed to set work load mask!", __func__); + else + smu_cmn_assign_power_profile(smu); return ret; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c index 9c3c48297cba..3b7b2ec8319a 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c @@ -1786,10 +1786,13 @@ static int sienna_cichlid_set_power_profile_mode(struct smu_context *smu, long * smu->power_profile_mode); if (workload_type < 0) return -EINVAL; + ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, - 1 << workload_type, NULL); + smu->workload_mask, NULL); if (ret) dev_err(smu->adev->dev, "[%s] Failed to set work load mask!", __func__); + else + smu_cmn_assign_power_profile(smu); return ret; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c index 1fe020f1f4db..952ee22cbc90 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c @@ -1079,7 +1079,7 @@ static int vangogh_set_power_profile_mode(struct smu_context *smu, long *input, } ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_ActiveProcessNotify, - 1 << workload_type, + smu->workload_mask, NULL); if (ret) { dev_err_once(smu->adev->dev, "Fail to set workload type %d\n", @@ -1087,7 +1087,7 @@ static int vangogh_set_power_profile_mode(struct smu_context *smu, long *input, return ret; } - smu->power_profile_mode = profile_mode; + smu_cmn_assign_power_profile(smu); return 0; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c index cc0504b063fa..62316a6707ef 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c @@ -890,14 +890,14 @@ static int renoir_set_power_profile_mode(struct smu_context *smu, long *input, u } ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_ActiveProcessNotify, - 1 << workload_type, + smu->workload_mask, NULL); if (ret) { dev_err_once(smu->adev->dev, "Fail to set workload type %d\n", workload_type); return ret; } - smu->power_profile_mode = profile_mode; + smu_cmn_assign_power_profile(smu); return 0; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c index d53e162dcd8d..5dd7ceca64fe 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c @@ -2485,7 +2485,7 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, DpmActivityMonitorCoeffInt_t *activity_monitor = &(activity_monitor_external.DpmActivityMonitorCoeffInt); int workload_type, ret = 0; - u32 workload_mask, selected_workload_mask; + u32 workload_mask; smu->power_profile_mode = input[size]; @@ -2552,7 +2552,7 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, if (workload_type < 0) return -EINVAL; - selected_workload_mask = workload_mask = 1 << workload_type; + workload_mask = 1 << workload_type; /* Add optimizations for SMU13.0.0/10. 
Reuse the power saving profile */ if ((amdgpu_ip_version(smu->adev, MP1_HWIP, 0) == IP_VERSION(13, 0, 0) && @@ -2567,12 +2567,22 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, workload_mask |= 1 << workload_type; } + smu->workload_mask |= workload_mask; ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, - workload_mask, + smu->workload_mask, NULL); - if (!ret) - smu->workload_mask = selected_workload_mask; + if (!ret) { + smu_cmn_assign_power_profile(smu); + if (smu->power_profile_mode == PP_SMC_POWER_PROFILE_POWERSAVING) { + workload_type = smu_cmn_to_asic_specific_index(smu, + CMN2ASIC_MAPPING_WORKLOAD, + PP_SMC_POWER_PROFILE_FULLSCREEN3D); + smu->power_profile_mode = smu->workload_mask & (1 << workload_type) + ? PP_SMC_POWER_PROFILE_FULLSCREEN3D + : PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT; + } + } return ret; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c index b891a5e0a396..9d0b19419de0 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c @@ -2499,13 +2499,14 @@ static int smu_v13_0_7_set_power_profile_mode(struct smu_context *smu, long *inp smu->power_profile_mode); if (workload_type < 0) return -EINVAL; + ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, - 1 << workload_type, NULL); + smu->workload_mask, NULL); if (ret) dev_err(smu->adev->dev, "[%s] Failed to set work load mask!", __func__); else - smu->workload_mask = (1 << workload_type); + smu_cmn_assign_power_profile(smu); return ret; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c index 1e16a281f2dc..1aa13d32ceb2 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c @@ -1807,12 +1807,11 @@ static int smu_v14_0_2_set_power_profile_mode(struct smu_context *smu, if (workload_type < 0) return -EINVAL; - ret = smu_cmn_send_smc_msg_with_param(smu, - SMU_MSG_SetWorkloadMask, - 1 << workload_type, - NULL); + ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SetWorkloadMask, + smu->workload_mask, NULL); + if (!ret) - smu->workload_mask = 1 << workload_type; + smu_cmn_assign_power_profile(smu); return ret; } diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c index 91ad434bcdae..bdfc5e617333 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c @@ -1138,6 +1138,14 @@ int smu_cmn_set_mp1_state(struct smu_context *smu, return ret; } +void smu_cmn_assign_power_profile(struct smu_context *smu) +{ + uint32_t index; + index = fls(smu->workload_mask); + index = index > 0 && index <= WORKLOAD_POLICY_MAX ? index - 1 : 0; + smu->power_profile_mode = smu->workload_setting[index]; +} + bool smu_cmn_is_audio_func_enabled(struct amdgpu_device *adev) { struct pci_dev *p = NULL; diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h index 1de685defe85..8a801e389659 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h @@ -130,6 +130,8 @@ void smu_cmn_init_soft_gpu_metrics(void *table, uint8_t frev, uint8_t crev); int smu_cmn_set_mp1_state(struct smu_context *smu, enum pp_mp1_state mp1_state); +void smu_cmn_assign_power_profile(struct smu_context *smu); + /* * Helper function to make sysfs_emit_at() happy. 
Align buf to * the current page boundary and record the offset. From 6d1c69945ce63a9fba22a4abf646cf960d878782 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 4 Nov 2024 04:24:40 -0800 Subject: [PATCH 060/235] nvme/host: Fix RCU list traversal to use SRCU primitive The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled: drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!! While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning. Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list") Signed-off-by: Breno Leitao Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch --- drivers/nvme/host/core.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index ce20b916301a..4a83fd120064 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3795,7 +3795,8 @@ struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) { + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) { if (ns->head->ns_id == nsid) { if (!nvme_get_ns(ns)) continue; @@ -4879,7 +4880,8 @@ void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) blk_mark_disk_dead(ns->disk); srcu_read_unlock(&ctrl->srcu, srcu_idx); } @@ -4891,7 +4893,8 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) blk_mq_unfreeze_queue(ns->queue); srcu_read_unlock(&ctrl->srcu, srcu_idx); clear_bit(NVME_CTRL_FROZEN, &ctrl->flags); @@ -4904,7 +4907,8 @@ int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) { + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) { timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout); if (timeout <= 0) break; @@ -4920,7 +4924,8 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) blk_mq_freeze_queue_wait(ns->queue); srcu_read_unlock(&ctrl->srcu, srcu_idx); } @@ -4933,7 +4938,8 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl) set_bit(NVME_CTRL_FROZEN, &ctrl->flags); srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) blk_freeze_queue_start(ns->queue); srcu_read_unlock(&ctrl->srcu, srcu_idx); } @@ -4981,7 +4987,8 @@ void nvme_sync_io_queues(struct nvme_ctrl *ctrl) int srcu_idx; srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) + 
list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) blk_sync_queue(ns->queue); srcu_read_unlock(&ctrl->srcu, srcu_idx); } From a97e293e077a3e8f41e8972e593b34d0052b9e25 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 4 Nov 2024 19:51:28 +0100 Subject: [PATCH 061/235] cpufreq: intel_pstate: Clear hybrid_max_perf_cpu before driver registration Modify intel_pstate_register_driver() to clear hybrid_max_perf_cpu before calling cpufreq_register_driver(), so that asymmetric CPU capacity scaling is not updated until hybrid_init_cpu_capacity_scaling() runs down the road. This is done in preparation for a subsequent change adding asymmetric CPU capacity computation to the CPU init path to handle CPUs that are initially offline. The information on whether or not hybrid_max_perf_cpu was NULL before it has been cleared is passed to hybrid_init_cpu_capacity_scaling(), so full initialization of CPU capacity scaling can be skipped if it has been carried out already. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Link: https://patch.msgid.link/4616631.LvFx2qVVIh@rjwysocki.net --- drivers/cpufreq/intel_pstate.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index b0018f371ea3..4e816f0857f2 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -1034,7 +1034,7 @@ static void __hybrid_init_cpu_capacity_scaling(void) hybrid_update_cpu_capacity_scaling(); } -static void hybrid_init_cpu_capacity_scaling(void) +static void hybrid_init_cpu_capacity_scaling(bool refresh) { bool disable_itmt = false; @@ -1045,7 +1045,7 @@ static void hybrid_init_cpu_capacity_scaling(void) * scaling has been enabled already and the driver is just changing the * operation mode. */ - if (hybrid_max_perf_cpu) { + if (refresh) { __hybrid_init_cpu_capacity_scaling(); goto unlock; } @@ -1071,6 +1071,18 @@ static void hybrid_init_cpu_capacity_scaling(void) sched_clear_itmt_support(); } +static bool hybrid_clear_max_perf_cpu(void) +{ + bool ret; + + guard(mutex)(&hybrid_capacity_lock); + + ret = !!hybrid_max_perf_cpu; + hybrid_max_perf_cpu = NULL; + + return ret; +} + static void __intel_pstate_get_hwp_cap(struct cpudata *cpu) { u64 cap; @@ -3352,6 +3364,7 @@ static void intel_pstate_driver_cleanup(void) static int intel_pstate_register_driver(struct cpufreq_driver *driver) { + bool refresh_cpu_cap_scaling; int ret; if (driver == &intel_pstate) @@ -3364,6 +3377,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver) arch_set_max_freq_ratio(global.turbo_disabled); + refresh_cpu_cap_scaling = hybrid_clear_max_perf_cpu(); + intel_pstate_driver = driver; ret = cpufreq_register_driver(intel_pstate_driver); if (ret) { @@ -3373,7 +3388,7 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver) global.min_perf_pct = min_perf_pct_min(); - hybrid_init_cpu_capacity_scaling(); + hybrid_init_cpu_capacity_scaling(refresh_cpu_cap_scaling); return 0; } From 92447aa5f6e7fbad9427a3fd1bb9e0679c403206 Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Mon, 4 Nov 2024 19:53:53 +0100 Subject: [PATCH 062/235] cpufreq: intel_pstate: Update asym capacity for CPUs that were offline initially Commit 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") overlooked a corner case in which some CPUs may be offline to start with and brought back online later, after the intel_pstate driver has been registered, so their asymmetric capacity will not be set. Address this by calling hybrid_update_capacity() in the CPU initialization path that is executed instead of the online path for those CPUs. Note that this asymmetric capacity update will be skipped during driver initialization and mode switches because hybrid_max_perf_cpu is NULL in those cases. Fixes: 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") Signed-off-by: Rafael J. Wysocki Link: https://patch.msgid.link/1913414.tdWV9SEqCh@rjwysocki.net --- drivers/cpufreq/intel_pstate.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 4e816f0857f2..cd2ac1ba53d2 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -2275,6 +2275,11 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu) } else { cpu->pstate.scaling = perf_ctl_scaling; } + /* + * If the CPU is going online for the first time and it was + * offline initially, asym capacity scaling needs to be updated. + */ + hybrid_update_capacity(cpu); } else { cpu->pstate.scaling = perf_ctl_scaling; cpu->pstate.max_pstate = pstate_funcs.get_max(cpu->cpu); From 0a77715db22611df50b178374c51e2ba0d58866e Mon Sep 17 00:00:00 2001 From: Namjae Jeon Date: Sat, 2 Nov 2024 18:46:38 +0900 Subject: [PATCH 063/235] ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create There is a race condition between ksmbd_smb2_session_create and ksmbd_expire_session. This patch add missing sessions_table_lock while adding/deleting session from global session table. 
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei
Tested-by: Norbert Szetei
Signed-off-by: Namjae Jeon
Signed-off-by: Steve French
---
 fs/smb/server/mgmt/user_session.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/smb/server/mgmt/user_session.c b/fs/smb/server/mgmt/user_session.c
index 9756a4bbfe54..ad02fe555fda 100644
--- a/fs/smb/server/mgmt/user_session.c
+++ b/fs/smb/server/mgmt/user_session.c
@@ -178,6 +178,7 @@ static void ksmbd_expire_session(struct ksmbd_conn *conn)
     unsigned long id;
     struct ksmbd_session *sess;

+    down_write(&sessions_table_lock);
     down_write(&conn->session_lock);
     xa_for_each(&conn->sessions, id, sess) {
         if (atomic_read(&sess->refcnt) == 0 &&
@@ -191,6 +192,7 @@ static void ksmbd_expire_session(struct ksmbd_conn *conn)
         }
     }
     up_write(&conn->session_lock);
+    up_write(&sessions_table_lock);
 }

 int ksmbd_session_register(struct ksmbd_conn *conn,
@@ -232,7 +234,6 @@ void ksmbd_sessions_deregister(struct ksmbd_conn *conn)
             }
         }
     }
-    up_write(&sessions_table_lock);

     down_write(&conn->session_lock);
     xa_for_each(&conn->sessions, id, sess) {
@@ -252,6 +253,7 @@ void ksmbd_sessions_deregister(struct ksmbd_conn *conn)
         }
     }
     up_write(&conn->session_lock);
+    up_write(&sessions_table_lock);
 }

 struct ksmbd_session *ksmbd_session_lookup(struct ksmbd_conn *conn,

From b8fc56fbca7482c1e5c0e3351c6ae78982e25ada Mon Sep 17 00:00:00 2001
From: Namjae Jeon
Date: Mon, 4 Nov 2024 13:40:41 +0900
Subject: [PATCH 064/235] ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp

ksmbd_user_session_put() should be called after smb3_preauth_hash_rsp().
This avoids freeing the session before smb3_preauth_hash_rsp() runs.

Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei
Tested-by: Norbert Szetei
Signed-off-by: Namjae Jeon
Signed-off-by: Steve French
---
 fs/smb/server/server.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/smb/server/server.c b/fs/smb/server/server.c
index 9670c97f14b3..e7f14f8df943 100644
--- a/fs/smb/server/server.c
+++ b/fs/smb/server/server.c
@@ -238,11 +238,11 @@ static void __handle_ksmbd_work(struct ksmbd_work *work,
     } while (is_chained == true);

 send:
-    if (work->sess)
-        ksmbd_user_session_put(work->sess);
     if (work->tcon)
         ksmbd_tree_connect_put(work->tcon);
     smb3_preauth_hash_rsp(work);
+    if (work->sess)
+        ksmbd_user_session_put(work->sess);
     if (work->sess && work->sess->enc && work->encrypted &&
         conn->ops->encrypt_resp) {
         rc = conn->ops->encrypt_resp(work);

From 0a77d947f599b1f39065015bec99390d0c0022ee Mon Sep 17 00:00:00 2001
From: Namjae Jeon
Date: Mon, 4 Nov 2024 13:43:06 +0900
Subject: [PATCH 065/235] ksmbd: check outstanding simultaneous SMB operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If a client sends simultaneous SMB operations to ksmbd, it exhausts too
much memory through the "ksmbd_work_cache", which will cause an OOM
issue. ksmbd has a credit mechanism, but it can't handle this problem.
This patch adds a check against the maximum number of credits to
prevent this, assuming that one SMB request consumes at least one
credit.
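Condensed from the diff below, the admission check has roughly this
shape:

  /* Admit a request only while the number of in-flight requests stays
   * below the connection's credit limit; one SMB request is assumed to
   * consume at least one credit.
   */
  if (atomic_inc_return(&conn->mux_smb_requests) >= conn->vals->max_credits) {
      atomic_dec_return(&conn->mux_smb_requests);
      return -ENOSPC;
  }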
Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei Tested-by: Norbert Szetei Signed-off-by: Namjae Jeon Signed-off-by: Steve French --- fs/smb/server/connection.c | 1 + fs/smb/server/connection.h | 1 + fs/smb/server/server.c | 16 ++++++++++------ fs/smb/server/smb_common.c | 10 +++++++--- fs/smb/server/smb_common.h | 2 +- 5 files changed, 20 insertions(+), 10 deletions(-) diff --git a/fs/smb/server/connection.c b/fs/smb/server/connection.c index aa2a37a7ce84..e6a72f75ab94 100644 --- a/fs/smb/server/connection.c +++ b/fs/smb/server/connection.c @@ -70,6 +70,7 @@ struct ksmbd_conn *ksmbd_conn_alloc(void) atomic_set(&conn->req_running, 0); atomic_set(&conn->r_count, 0); atomic_set(&conn->refcnt, 1); + atomic_set(&conn->mux_smb_requests, 0); conn->total_credits = 1; conn->outstanding_credits = 0; diff --git a/fs/smb/server/connection.h b/fs/smb/server/connection.h index b379ae4fdcdf..8ddd5a3c7baf 100644 --- a/fs/smb/server/connection.h +++ b/fs/smb/server/connection.h @@ -107,6 +107,7 @@ struct ksmbd_conn { __le16 signing_algorithm; bool binding; atomic_t refcnt; + atomic_t mux_smb_requests; }; struct ksmbd_conn_ops { diff --git a/fs/smb/server/server.c b/fs/smb/server/server.c index e7f14f8df943..e6cfedba9992 100644 --- a/fs/smb/server/server.c +++ b/fs/smb/server/server.c @@ -270,6 +270,7 @@ static void handle_ksmbd_work(struct work_struct *wk) ksmbd_conn_try_dequeue_request(work); ksmbd_free_work_struct(work); + atomic_dec(&conn->mux_smb_requests); /* * Checking waitqueue to dropping pending requests on * disconnection. waitqueue_active is safe because it @@ -291,6 +292,15 @@ static int queue_ksmbd_work(struct ksmbd_conn *conn) struct ksmbd_work *work; int err; + err = ksmbd_init_smb_server(conn); + if (err) + return 0; + + if (atomic_inc_return(&conn->mux_smb_requests) >= conn->vals->max_credits) { + atomic_dec_return(&conn->mux_smb_requests); + return -ENOSPC; + } + work = ksmbd_alloc_work_struct(); if (!work) { pr_err("allocation for work failed\n"); @@ -301,12 +311,6 @@ static int queue_ksmbd_work(struct ksmbd_conn *conn) work->request_buf = conn->request_buf; conn->request_buf = NULL; - err = ksmbd_init_smb_server(work); - if (err) { - ksmbd_free_work_struct(work); - return 0; - } - ksmbd_conn_enqueue_request(work); atomic_inc(&conn->r_count); /* update activity on connection */ diff --git a/fs/smb/server/smb_common.c b/fs/smb/server/smb_common.c index a2ebbe604c8c..75b4eb856d32 100644 --- a/fs/smb/server/smb_common.c +++ b/fs/smb/server/smb_common.c @@ -388,6 +388,10 @@ static struct smb_version_ops smb1_server_ops = { .set_rsp_status = set_smb1_rsp_status, }; +static struct smb_version_values smb1_server_values = { + .max_credits = SMB2_MAX_CREDITS, +}; + static int smb1_negotiate(struct ksmbd_work *work) { return ksmbd_smb_negotiate_common(work, SMB_COM_NEGOTIATE); @@ -399,18 +403,18 @@ static struct smb_version_cmds smb1_server_cmds[1] = { static int init_smb1_server(struct ksmbd_conn *conn) { + conn->vals = &smb1_server_values; conn->ops = &smb1_server_ops; conn->cmds = smb1_server_cmds; conn->max_cmds = ARRAY_SIZE(smb1_server_cmds); return 0; } -int ksmbd_init_smb_server(struct ksmbd_work *work) +int ksmbd_init_smb_server(struct ksmbd_conn *conn) { - struct ksmbd_conn *conn = work->conn; __le32 proto; - proto = *(__le32 *)((struct smb_hdr *)work->request_buf)->Protocol; + proto = *(__le32 *)((struct smb_hdr *)conn->request_buf)->Protocol; if (conn->need_neg == false) { if (proto == SMB1_PROTO_NUMBER) return -EINVAL; diff --git a/fs/smb/server/smb_common.h 
b/fs/smb/server/smb_common.h
index cc1d6dfe29d5..a3d8a905b07e 100644
--- a/fs/smb/server/smb_common.h
+++ b/fs/smb/server/smb_common.h
@@ -427,7 +427,7 @@ bool ksmbd_smb_request(struct ksmbd_conn *conn);

 int ksmbd_lookup_dialect_by_id(__le16 *cli_dialects, __le16 dialects_count);

-int ksmbd_init_smb_server(struct ksmbd_work *work);
+int ksmbd_init_smb_server(struct ksmbd_conn *conn);

 struct ksmbd_kstat;
 int ksmbd_populate_dot_dotdot_entries(struct ksmbd_work *work,

From 54c814c8b23bc7617be3d46abdb896937695dbfa Mon Sep 17 00:00:00 2001
From: Bart Van Assche
Date: Thu, 31 Oct 2024 14:26:24 -0700
Subject: [PATCH 066/235] scsi: ufs: core: Start the RTC update work later

The RTC update work involves runtime resuming the UFS controller.
Hence, only start the RTC update work after runtime power management in
the UFS driver has been fully initialized. This patch fixes the
following kernel crash:

Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
Workqueue: events ufshcd_rtc_work
Call trace:
 _raw_spin_lock_irqsave+0x34/0x8c (P)
 pm_runtime_get_if_active+0x24/0x9c (L)
 pm_runtime_get_if_active+0x24/0x9c
 ufshcd_rtc_work+0x138/0x1b4
 process_one_work+0x148/0x288
 worker_thread+0x2cc/0x3d4
 kthread+0x110/0x114
 ret_from_fork+0x10/0x20

Reported-by: Neil Armstrong
Closes: https://lore.kernel.org/linux-scsi/0c0bc528-fdc2-4106-bc99-f23ae377f6f5@linaro.org/
Fixes: 6bf999e0eb41 ("scsi: ufs: core: Add UFS RTC support")
Cc: Bean Huo
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche
Link: https://lore.kernel.org/r/20241031212632.2799127-1-bvanassche@acm.org
Reviewed-by: Peter Wang
Reviewed-by: Bean Huo
Tested-by: Neil Armstrong # on SM8650-HDK
Signed-off-by: Martin K. Petersen
---
 drivers/ufs/core/ufshcd.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 0e22bbb78239..3b32803b7a68 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -8636,6 +8636,14 @@ static int ufshcd_add_lus(struct ufs_hba *hba)
         ufshcd_init_clk_scaling_sysfs(hba);
     }

+    /*
+     * The RTC update code accesses the hba->ufs_device_wlun->sdev_gendev
+     * pointer and hence must only be started after the WLUN pointer has
+     * been initialized by ufshcd_scsi_add_wlus().
+     */
+    schedule_delayed_work(&hba->ufs_rtc_update_work,
+                  msecs_to_jiffies(UFS_RTC_UPDATE_INTERVAL_MS));
+
     ufs_bsg_probe(hba);
     scsi_scan_host(hba->host);

@@ -8795,8 +8803,6 @@ static int ufshcd_device_init(struct ufs_hba *hba, bool init_dev_params)
     ufshcd_force_reset_auto_bkops(hba);

     ufshcd_set_timestamp_attr(hba);
-    schedule_delayed_work(&hba->ufs_rtc_update_work,
-                  msecs_to_jiffies(UFS_RTC_UPDATE_INTERVAL_MS));

     /* Gear up to HS gear if supported */
     if (hba->max_pwr_info.is_valid) {

From 2d0f2a648147d6bbf0655e03500586a6712a7281 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky
Date: Fri, 4 Oct 2024 18:01:53 -0400
Subject: [PATCH 067/235] KVM: selftests: memslot_perf_test: increase guest sync timeout

When memslot_perf_test is run nested, the first iteration of the
test_memslot_rw_loop testcase sometimes takes more than 2 seconds due
to the construction of shadow page tables. The following iterations
are fast.

To be on the safe side, bump the timeout to 10 seconds.
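Condensed from the diff below, the host/guest sync uses SIGALRM as a
watchdog around a busy-wait on a shared flag (the final alarm(0) cancel
is assumed here, following the usual pattern; it is not shown in the
hunk):

  alarm(10);  /* was alarm(2); nested runs need more headroom */
  atomic_store_explicit(&sync->sync_flag, true, memory_order_release);
  while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire))
      ;       /* spin until the guest acknowledges */
  alarm(0);   /* assumed: cancel the watchdog once synced */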
Signed-off-by: Maxim Levitsky Tested-by: Liam Merwick Reviewed-by: Liam Merwick Link: https://lore.kernel.org/r/20241004220153.287459-1-mlevitsk@redhat.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/memslot_perf_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/memslot_perf_test.c b/tools/testing/selftests/kvm/memslot_perf_test.c index 989ffe0d047f..e3711beff7f3 100644 --- a/tools/testing/selftests/kvm/memslot_perf_test.c +++ b/tools/testing/selftests/kvm/memslot_perf_test.c @@ -417,7 +417,7 @@ static bool _guest_should_exit(void) */ static noinline void host_perform_sync(struct sync_area *sync) { - alarm(2); + alarm(10); atomic_store_explicit(&sync->sync_flag, true, memory_order_release); while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire)) From 945bdae20be5a13f1fcdcb14ec356dcbeee35839 Mon Sep 17 00:00:00 2001 From: Patrick Roy Date: Thu, 24 Oct 2024 10:59:53 +0100 Subject: [PATCH 068/235] KVM: selftests: fix unintentional noop test in guest_memfd_test.c The loop in test_create_guest_memfd_invalid() that is supposed to test that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was initializing `flag` as 0 instead of BIT(0). This caused the loop to immediately exit instead of iterating over BIT(0), BIT(1), ... . Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()") Signed-off-by: Patrick Roy Reviewed-by: James Gowans Reviewed-by: Muhammad Usama Anjum Link: https://lore.kernel.org/r/20241024095956.3668818-1-roypat@amazon.co.uk Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/guest_memfd_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index ba0c8e996035..ce687f8d248f 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm) size); } - for (flag = 0; flag; flag <<= 1) { + for (flag = BIT(0); flag; flag <<= 1) { fd = __vm_create_guest_memfd(vm, page_size, flag); TEST_ASSERT(fd == -1 && errno == EINVAL, "guest_memfd() with flag '0x%lx' should fail with EINVAL", From 5b188cc4866aaf712e896f92ac42c7802135e507 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 9 Oct 2024 08:49:41 -0700 Subject: [PATCH 069/235] KVM: selftests: Disable strict aliasing Disable strict aliasing, as has been done in the kernel proper for decades (literally since before git history) to fix issues where gcc will optimize away loads in code that looks 100% correct, but is _technically_ undefined behavior, and thus can be thrown away by the compiler. E.g. arm64's vPMU counter access test casts a uint64_t (unsigned long) pointer to a u64 (unsigned long long) pointer when setting PMCR.N via u64p_replace_bits(), which gcc-13 detects and optimizes away, i.e. ignores the result and uses the original PMCR. The issue is most easily observed by making set_pmcr_n() noinline and wrapping the call with printf(), e.g. 
sans comments, for this code: printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); set_pmcr_n(&pmcr, pmcr_n); printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); gcc-13 generates: 0000000000401c90 : 401c90: f9400002 ldr x2, [x0] 401c94: b3751022 bfi x2, x1, #11, #5 401c98: f9000002 str x2, [x0] 401c9c: d65f03c0 ret 0000000000402660 : 402724: aa1403e3 mov x3, x20 402728: aa1503e2 mov x2, x21 40272c: aa1603e0 mov x0, x22 402730: aa1503e1 mov x1, x21 402734: 940060ff bl 41ab30 <_IO_printf> 402738: aa1403e1 mov x1, x20 40273c: 910183e0 add x0, sp, #0x60 402740: 97fffd54 bl 401c90 402744: aa1403e3 mov x3, x20 402748: aa1503e2 mov x2, x21 40274c: aa1503e1 mov x1, x21 402750: aa1603e0 mov x0, x22 402754: 940060f7 bl 41ab30 <_IO_printf> with the value stored in [sp + 0x60] ignored by both printf() above and in the test proper, resulting in a false failure due to vcpu_set_reg() simply storing the original value, not the intended value. $ ./vpmu_counter_access Random seed: 0x6b8b4567 orig = 3040, next = 3040, want = 0 orig = 3040, next = 3040, want = 0 ==== Test Assertion Failure ==== aarch64/vpmu_counter_access.c:505: pmcr_n == get_pmcr_n(pmcr) pid=71578 tid=71578 errno=9 - Bad file descriptor 1 0x400673: run_access_test at vpmu_counter_access.c:522 2 (inlined by) main at vpmu_counter_access.c:643 3 0x4132d7: __libc_start_call_main at libc-start.o:0 4 0x413653: __libc_start_main at ??:0 5 0x40106f: _start at ??:0 Failed to update PMCR.N to 0 (received: 6) Somewhat bizarrely, gcc-11 also exhibits the same behavior, but only if set_pmcr_n() is marked noinline, whereas gcc-13 fails even if set_pmcr_n() is inlined in its sole caller. Cc: stable@vger.kernel.org Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116912 Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 156fbfae940f..1896ef383cae 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -241,10 +241,10 @@ CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT \ -fno-builtin-memcmp -fno-builtin-memcpy \ -fno-builtin-memset -fno-builtin-strnlen \ - -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ - -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ - -I$( Date: Wed, 30 Oct 2024 21:53:33 -0700 Subject: [PATCH 070/235] KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported Force -march=x86-64-v2 to avoid SSE/AVX instructions if and only if the uarch definition is supported by the compiler, e.g. gcc 7.5 only supports x86-64. 
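The Makefile hunk below encodes exactly this probe; the same check can be run by hand from a shell to see whether a given compiler accepts the uarch flag:

$ echo "void foo(void) { }" | gcc -march=x86-64-v2 -x c - -c -o /dev/null; echo $?

An exit status of 0 means -march=x86-64-v2 is supported and the flag gets added to CFLAGS; anything else (e.g. gcc 7.5) leaves CFLAGS untouched.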
Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") Cc: Vitaly Kuznetsov Reviewed-and-tested-by: Vitaly Kuznetsov Link: https://lore.kernel.org/r/20241031045333.1209195-1-seanjc@google.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 1896ef383cae..48645a2e29da 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -249,8 +249,10 @@ ifeq ($(ARCH),s390) CFLAGS += -march=z10 endif ifeq ($(ARCH),x86) +ifeq ($(shell echo "void foo(void) { }" | $(CC) -march=x86-64-v2 -x c - -c -o /dev/null 2>/dev/null; echo "$$?"),0) CFLAGS += -march=x86-64-v2 endif +endif ifeq ($(ARCH),arm64) tools_dir := $(top_srcdir)/tools arm64_tools_dir := $(tools_dir)/arch/arm64/tools/ From 2657b82a78f18528bef56dc1b017158490970873 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 31 Oct 2024 13:20:11 -0700 Subject: [PATCH 071/235] KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled When getting the current VPID, e.g. to emulate a guest TLB flush, return vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled in vmcs12. Architecturally, if VPID is disabled, then the guest and host effectively share VPID=0. KVM emulates this behavior by using vpid01 when running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so KVM must also treat vpid01 as the current VPID while L2 is active. Unconditionally treating vpid02 as the current VPID when L2 is active causes KVM to flush TLB entries for vpid02 instead of vpid01, which results in TLB entries from L1 being incorrectly preserved across nested VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after nested VM-Exit flushes vpid01). The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM incorrectly retains TLB entries for the APIC-access page across a nested VM-Enter. Opportunisticaly add comments at various touchpoints to explain the architectural requirements, and also why KVM uses vpid01 instead of vpid02. All credit goes to Chao, who root caused the issue and identified the fix. Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com Fixes: 2b4a5a5d5688 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST") Cc: stable@vger.kernel.org Cc: Like Xu Debugged-by: Chao Gao Reviewed-by: Chao Gao Tested-by: Chao Gao Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/nested.c | 30 +++++++++++++++++++++++++----- arch/x86/kvm/vmx/vmx.c | 2 +- 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index a8e7bc04d9bf..931a7361c30f 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -1197,11 +1197,14 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu, kvm_hv_nested_transtion_tlb_flush(vcpu, enable_ept); /* - * If vmcs12 doesn't use VPID, L1 expects linear and combined mappings - * for *all* contexts to be flushed on VM-Enter/VM-Exit, i.e. it's a - * full TLB flush from the guest's perspective. This is required even - * if VPID is disabled in the host as KVM may need to synchronize the - * MMU in response to the guest TLB flush. + * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. 
the + * same VPID as the host, and so architecturally, linear and combined + * mappings for VPID=0 must be flushed at VM-Enter and VM-Exit. KVM + * emulates L2 sharing L1's VPID=0 by using vpid01 while running L2, + * and so KVM must also emulate TLB flush of VPID=0, i.e. vpid01. This + * is required if VPID is disabled in KVM, as a TLB flush (there are no + * VPIDs) still occurs from L1's perspective, and KVM may need to + * synchronize the MMU in response to the guest TLB flush. * * Note, using TLB_FLUSH_GUEST is correct even if nested EPT is in use. * EPT is a special snowflake, as guest-physical mappings aren't @@ -2315,6 +2318,17 @@ static void prepare_vmcs02_early_rare(struct vcpu_vmx *vmx, vmcs_write64(VMCS_LINK_POINTER, INVALID_GPA); + /* + * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. the + * same VPID as the host. Emulate this behavior by using vpid01 for L2 + * if VPID is disabled in vmcs12. Note, if VPID is disabled, VM-Enter + * and VM-Exit are architecturally required to flush VPID=0, but *only* + * VPID=0. I.e. using vpid02 would be ok (so long as KVM emulates the + * required flushes), but doing so would cause KVM to over-flush. E.g. + * if L1 runs L2 X with VPID12=1, then runs L2 Y with VPID12 disabled, + * and then runs L2 X again, then KVM can and should retain TLB entries + * for VPID12=1. + */ if (enable_vpid) { if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02); @@ -5950,6 +5964,12 @@ static int handle_invvpid(struct kvm_vcpu *vcpu) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); + /* + * Always flush the effective vpid02, i.e. never flush the current VPID + * and never explicitly flush vpid01. INVVPID targets a VPID, not a + * VMCS, and so whether or not the current vmcs12 has VPID enabled is + * irrelevant (and there may not be a loaded vmcs12). + */ vpid02 = nested_get_vpid02(vcpu); switch (type) { case VMX_VPID_EXTENT_INDIVIDUAL_ADDR: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 81ed596e4454..9886d67d9512 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3216,7 +3216,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu) static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu) && nested_cpu_has_vpid(get_vmcs12(vcpu))) return nested_get_vpid02(vcpu); return to_vmx(vcpu)->vpid; } From e5d253c60e9627a22940e00a05a6115d722f07ed Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 31 Oct 2024 13:32:14 -0700 Subject: [PATCH 072/235] KVM: SVM: Propagate error from snp_guest_req_init() to userspace If snp_guest_req_init() fails, return the provided error code up the stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2 failed, as opposed to some random operation later in VM setup failing because SNP wasn't actually enabled for the VM. Note, KVM itself doesn't consult the return value from __sev_guest_init(), i.e. the fallout is purely that userspace may be confused. 
Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 0b851ef937f2..03c7db3b4cc3 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -450,8 +450,11 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp, goto e_free; /* This needs to happen after SEV/SNP firmware initialization. */ - if (vm_type == KVM_X86_SNP_VM && snp_guest_req_init(kvm)) - goto e_free; + if (vm_type == KVM_X86_SNP_VM) { + ret = snp_guest_req_init(kvm); + if (ret) + goto e_free; + } INIT_LIST_HEAD(&sev->regions_list); INIT_LIST_HEAD(&sev->mirror_vms); From dabc44c28f118910dea96244d903f0c270225669 Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Tue, 5 Nov 2024 13:02:17 +0100 Subject: [PATCH 073/235] ALSA: usb-audio: Add quirk for HP 320 FHD Webcam HP 320 FHD Webcam (03f0:654a) seems to have flaky firmware like other webcam devices that don't like the frequency inquiries. Also, Mic Capture Volume has an invalid resolution, hence fix it to be 16 (as a blind shot). Link: https://bugzilla.suse.com/show_bug.cgi?id=1232768 Cc: Link: https://patch.msgid.link/20241105120220.5740-1-tiwai@suse.de Signed-off-by: Takashi Iwai --- sound/usb/mixer.c | 1 + sound/usb/quirks.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/sound/usb/mixer.c b/sound/usb/mixer.c index 9945ae55b0d0..bd67027c7677 100644 --- a/sound/usb/mixer.c +++ b/sound/usb/mixer.c @@ -1205,6 +1205,7 @@ static void volume_control_quirks(struct usb_mixer_elem_info *cval, } break; case USB_ID(0x1bcf, 0x2283): /* NexiGo N930AF FHD Webcam */ + case USB_ID(0x03f0, 0x654a): /* HP 320 FHD Webcam */ if (!strcmp(kctl->id.name, "Mic Capture Volume")) { usb_audio_info(chip, "set resolution quirk: cval->res = 16\n"); diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c index e6278a245795..c5fd180357d1 100644 --- a/sound/usb/quirks.c +++ b/sound/usb/quirks.c @@ -2114,6 +2114,8 @@ struct usb_audio_quirk_flags_table { static const struct usb_audio_quirk_flags_table quirk_flags_table[] = { /* Device matches */ + DEVICE_FLG(0x03f0, 0x654a, /* HP 320 FHD Webcam */ + QUIRK_FLAG_GET_SAMPLE_RATE), DEVICE_FLG(0x041e, 0x3000, /* Creative SB Extigy */ QUIRK_FLAG_IGNORE_CTL_ERROR), DEVICE_FLG(0x041e, 0x4080, /* Creative Live Cam VF0610 */ From 498dbd9aea205db9da674994b74c7bf8e18448bd Mon Sep 17 00:00:00 2001 From: Zijun Hu Date: Tue, 29 Oct 2024 23:13:38 +0800 Subject: [PATCH 074/235] usb: musb: sunxi: Fix accessing an released usb phy Commit 6ed05c68cbca ("usb: musb: sunxi: Explicitly release USB PHY on exit") will cause that usb phy @glue->xceiv is accessed after released. 1) register platform driver @sunxi_musb_driver // get the usb phy @glue->xceiv sunxi_musb_probe() -> devm_usb_get_phy(). 2) register and unregister platform driver @musb_driver musb_probe() -> sunxi_musb_init() use the phy here //the phy is released here musb_remove() -> sunxi_musb_exit() -> devm_usb_put_phy() 3) register @musb_driver again musb_probe() -> sunxi_musb_init() use the phy here but the phy has been released at 2). ... Fixed by reverting the commit, namely, removing devm_usb_put_phy() from sunxi_musb_exit(). 
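The underlying rule is devm lifetime: the phy is a managed resource of the glue platform device, which stays bound while musb_driver alone is rebound. A simplified sketch (not the actual driver code):

	/* sunxi_musb_probe(): the phy belongs to the glue device, and
	 * devm will release it when *that* device unbinds: */
	glue->xceiv = devm_usb_get_phy(&pdev->dev, USB_PHY_TYPE_USB2);

	/* sunxi_musb_exit() must therefore not devm_usb_put_phy() it:
	 * the glue device is still bound, so the next musb_probe() ->
	 * sunxi_musb_init() would use the already-released phy. */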
Fixes: 6ed05c68cbca ("usb: musb: sunxi: Explicitly release USB PHY on exit") Cc: stable@vger.kernel.org Signed-off-by: Zijun Hu Link: https://lore.kernel.org/r/20241029-sunxi_fix-v1-1-9431ed2ab826@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/musb/sunxi.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/usb/musb/sunxi.c b/drivers/usb/musb/sunxi.c index d54283fd026b..05b6e7e52e02 100644 --- a/drivers/usb/musb/sunxi.c +++ b/drivers/usb/musb/sunxi.c @@ -293,8 +293,6 @@ static int sunxi_musb_exit(struct musb *musb) if (test_bit(SUNXI_MUSB_FL_HAS_SRAM, &glue->flags)) sunxi_sram_release(musb->controller->parent); - devm_usb_put_phy(glue->dev, glue->xceiv); - return 0; } From 029778a4fd2c90c2e76a902b797c2348a722f1b8 Mon Sep 17 00:00:00 2001 From: Rex Nie Date: Wed, 30 Oct 2024 21:36:32 +0800 Subject: [PATCH 075/235] usb: typec: qcom-pmic: init value of hdr_len/txbuf_len earlier If the read of USB_PDPHY_RX_ACKNOWLEDGE_REG failed, then hdr_len and txbuf_len are uninitialized. This commit stops to print uninitialized value and misleading/false data. Cc: stable@vger.kernel.org Fixes: a4422ff22142 (" usb: typec: qcom: Add Qualcomm PMIC Type-C driver") Signed-off-by: Rex Nie Reviewed-by: Heikki Krogerus Reviewed-by: Bjorn Andersson Acked-by: Bryan O'Donoghue Link: https://lore.kernel.org/r/20241030133632.2116-1-rex.nie@jaguarmicro.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/tcpm/qcom/qcom_pmic_typec_pdphy.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec_pdphy.c b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec_pdphy.c index 5b7f52b74a40..726423684bae 100644 --- a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec_pdphy.c +++ b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec_pdphy.c @@ -227,6 +227,10 @@ qcom_pmic_typec_pdphy_pd_transmit_payload(struct pmic_typec_pdphy *pmic_typec_pd spin_lock_irqsave(&pmic_typec_pdphy->lock, flags); + hdr_len = sizeof(msg->header); + txbuf_len = pd_header_cnt_le(msg->header) * 4; + txsize_len = hdr_len + txbuf_len - 1; + ret = regmap_read(pmic_typec_pdphy->regmap, pmic_typec_pdphy->base + USB_PDPHY_RX_ACKNOWLEDGE_REG, &val); @@ -244,10 +248,6 @@ qcom_pmic_typec_pdphy_pd_transmit_payload(struct pmic_typec_pdphy *pmic_typec_pd if (ret) goto done; - hdr_len = sizeof(msg->header); - txbuf_len = pd_header_cnt_le(msg->header) * 4; - txsize_len = hdr_len + txbuf_len - 1; - /* Write message header sizeof(u16) to USB_PDPHY_TX_BUFFER_HDR_REG */ ret = regmap_bulk_write(pmic_typec_pdphy->regmap, pmic_typec_pdphy->base + USB_PDPHY_TX_BUFFER_HDR_REG, From 9cfb31e4c89d200d8ab7cb1e0bb9e6e8d621ca0b Mon Sep 17 00:00:00 2001 From: Roger Quadros Date: Mon, 4 Nov 2024 16:00:11 +0200 Subject: [PATCH 076/235] usb: dwc3: fix fault at system suspend if device was already runtime suspended If the device was already runtime suspended then during system suspend we cannot access the device registers else it will crash. Also we cannot access any registers after dwc3_core_exit() on some platforms so move the dwc3_enable_susphy() call to the top. 
Cc: stable@vger.kernel.org # v5.15+ Reported-by: William McVicker Closes: https://lore.kernel.org/all/ZyVfcUuPq56R2m1Y@google.com Fixes: 705e3ce37bcc ("usb: dwc3: core: Fix system suspend on TI AM62 platforms") Signed-off-by: Roger Quadros Acked-by: Thinh Nguyen Tested-by: Will McVicker Link: https://lore.kernel.org/r/20241104-am62-lpm-usb-fix-v1-1-e93df73a4f0d@kernel.org Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc3/core.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c index 427e5660f87c..98114c2827c0 100644 --- a/drivers/usb/dwc3/core.c +++ b/drivers/usb/dwc3/core.c @@ -2342,10 +2342,18 @@ static int dwc3_suspend_common(struct dwc3 *dwc, pm_message_t msg) u32 reg; int i; - dwc->susphy_state = (dwc3_readl(dwc->regs, DWC3_GUSB2PHYCFG(0)) & - DWC3_GUSB2PHYCFG_SUSPHY) || - (dwc3_readl(dwc->regs, DWC3_GUSB3PIPECTL(0)) & - DWC3_GUSB3PIPECTL_SUSPHY); + if (!pm_runtime_suspended(dwc->dev) && !PMSG_IS_AUTO(msg)) { + dwc->susphy_state = (dwc3_readl(dwc->regs, DWC3_GUSB2PHYCFG(0)) & + DWC3_GUSB2PHYCFG_SUSPHY) || + (dwc3_readl(dwc->regs, DWC3_GUSB3PIPECTL(0)) & + DWC3_GUSB3PIPECTL_SUSPHY); + /* + * TI AM62 platform requires SUSPHY to be + * enabled for system suspend to work. + */ + if (!dwc->susphy_state) + dwc3_enable_susphy(dwc, true); + } switch (dwc->current_dr_role) { case DWC3_GCTL_PRTCAP_DEVICE: @@ -2398,15 +2406,6 @@ static int dwc3_suspend_common(struct dwc3 *dwc, pm_message_t msg) break; } - if (!PMSG_IS_AUTO(msg)) { - /* - * TI AM62 platform requires SUSPHY to be - * enabled for system suspend to work. - */ - if (!dwc->susphy_state) - dwc3_enable_susphy(dwc, true); - } - return 0; } From 7dd08a0b4193087976db6b3ee7807de7e8316f96 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Mon, 4 Nov 2024 20:16:42 +0300 Subject: [PATCH 077/235] usb: typec: fix potential out of bounds in ucsi_ccg_update_set_new_cam_cmd() The "*cmd" variable can be controlled by the user via debugfs. That means "new_cam" can be as high as 255 while the size of the uc->updated[] array is UCSI_MAX_ALTMODES (30). The call tree is: ucsi_cmd() // val comes from simple_attr_write_xsigned() -> ucsi_send_command() -> ucsi_send_command_common() -> ucsi_run_command() // calls ucsi->ops->sync_control() -> ucsi_ccg_sync_control() Fixes: 170a6726d0e2 ("usb: typec: ucsi: add support for separate DP altmode devices") Cc: stable Signed-off-by: Dan Carpenter Reviewed-by: Heikki Krogerus Link: https://lore.kernel.org/r/325102b3-eaa8-4918-a947-22aca1146586@stanley.mountain Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/ucsi/ucsi_ccg.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/usb/typec/ucsi/ucsi_ccg.c b/drivers/usb/typec/ucsi/ucsi_ccg.c index ba58d11907bc..bccfc03b5986 100644 --- a/drivers/usb/typec/ucsi/ucsi_ccg.c +++ b/drivers/usb/typec/ucsi/ucsi_ccg.c @@ -482,6 +482,8 @@ static void ucsi_ccg_update_set_new_cam_cmd(struct ucsi_ccg *uc, port = uc->orig; new_cam = UCSI_SET_NEW_CAM_GET_AM(*cmd); + if (new_cam >= ARRAY_SIZE(uc->updated)) + return; new_port = &uc->updated[new_cam]; cam = new_port->linked_idx; enter_new_mode = UCSI_SET_NEW_CAM_ENTER(*cmd); From 08a3b241adfd90361c16c3e92f5275b816a73f04 Mon Sep 17 00:00:00 2001 From: Kuninori Morimoto Date: Tue, 5 Nov 2024 01:00:00 +0000 Subject: [PATCH 078/235] MAINTAINERS: Generic Sound Card section ALSA SoC Sound has Generic Sound Card (Simple-Card, Audio-Graph-Card, Audio-Graph-Card2). Adds its Maintainer. 
Signed-off-by: Kuninori Morimoto Link: https://patch.msgid.link/87ikt2a41c.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown --- MAINTAINERS | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 9d6272c00fbd..388626c3df1f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -21697,6 +21697,15 @@ S: Supported W: https://github.com/thesofproject/linux/ F: sound/soc/sof/ +SOUND - GENERIC SOUND CARD (Simple-Audio-Card, Audio-Graph-Card) +M: Kuninori Morimoto +S: Supported +L: linux-sound@vger.kernel.org +F: sound/soc/generic/ +F: include/sound/simple_card* +F: Documentation/devicetree/bindings/sound/simple-card.yaml +F: Documentation/devicetree/bindings/sound/audio-graph*.yaml + SOUNDWIRE SUBSYSTEM M: Vinod Koul M: Bard Liao From ab2e5c8ff253ff612f7c6ef9441d2ff6558e5449 Mon Sep 17 00:00:00 2001 From: Yang Yingliang Date: Sat, 26 Oct 2024 11:09:42 +0800 Subject: [PATCH 079/235] i2c: muxes: Fix return value check in mule_i2c_mux_probe() If dev_get_regmap() fails, it returns NULL pointer not ERR_PTR(), replace IS_ERR() with NULL pointer check, and return -ENODEV. Fixes: d0f8e97866bf ("i2c: muxes: add support for tsd,mule-i2c multiplexer") Signed-off-by: Yang Yingliang Signed-off-by: Andi Shyti --- drivers/i2c/muxes/i2c-mux-mule.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/i2c/muxes/i2c-mux-mule.c b/drivers/i2c/muxes/i2c-mux-mule.c index 8e942470b35f..284ff4afeeac 100644 --- a/drivers/i2c/muxes/i2c-mux-mule.c +++ b/drivers/i2c/muxes/i2c-mux-mule.c @@ -66,8 +66,8 @@ static int mule_i2c_mux_probe(struct platform_device *pdev) priv = i2c_mux_priv(muxc); priv->regmap = dev_get_regmap(mux_dev->parent, NULL); - if (IS_ERR(priv->regmap)) - return dev_err_probe(mux_dev, PTR_ERR(priv->regmap), + if (!priv->regmap) + return dev_err_probe(mux_dev, -ENODEV, "No parent i2c register map\n"); platform_set_drvdata(pdev, muxc); From bd646c768a934d28e574ee940d6759c7954a024d Mon Sep 17 00:00:00 2001 From: Mika Westerberg Date: Tue, 5 Nov 2024 09:19:02 +0200 Subject: [PATCH 080/235] thunderbolt: Fix connection issue with Pluggable UD-4VPD dock Rick reported that his Pluggable USB4 dock does not work anymore after upgrading to v6.10 kernel. It looks like commit c6ca1ac9f472 ("thunderbolt: Increase sideband access polling delay") makes the device router enumeration happen later than what might be expected by the dock (although there is no such limit in the USB4 spec) which probably makes it assume there is something wrong with the high-speed link and reset it. After the link is reset the same issue happens again and again. For this reason lower the sideband access delay from 5ms to 1ms. This seems to work fine according to Rick's testing. 
Reported-by: Rick Lahaye Closes: https://lore.kernel.org/linux-usb/000f01db247b$d10e1520$732a3f60$@581238.xyz/ Tested-by: Rick Lahaye Fixes: c6ca1ac9f472 ("thunderbolt: Increase sideband access polling delay") Cc: stable@vger.kernel.org Acked-by: Greg Kroah-Hartman Reviewed-by: Mario Limonciello Signed-off-by: Mika Westerberg --- drivers/thunderbolt/usb4.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/thunderbolt/usb4.c b/drivers/thunderbolt/usb4.c index 0a9b4aeb3fa1..402fdf8b1cde 100644 --- a/drivers/thunderbolt/usb4.c +++ b/drivers/thunderbolt/usb4.c @@ -48,7 +48,7 @@ enum usb4_ba_index { /* Delays in us used with usb4_port_wait_for_bit() */ #define USB4_PORT_DELAY 50 -#define USB4_PORT_SB_DELAY 5000 +#define USB4_PORT_SB_DELAY 1000 static int usb4_native_switch_op(struct tb_switch *sw, u16 opcode, u32 *metadata, u8 *status, From 3ce3f85787352fa48fc02ef6cbd7a5e5aba93347 Mon Sep 17 00:00:00 2001 From: Lijo Lazar Date: Mon, 4 Nov 2024 10:36:13 +0530 Subject: [PATCH 081/235] drm/amdgpu: Fix DPX valid mode check on GC 9.4.3 For DPX mode, the number of memory partitions supported should be less than or equal to 2. Fixes: 1589c82a1085 ("drm/amdgpu: Check memory ranges for valid xcp mode") Signed-off-by: Lijo Lazar Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher (cherry picked from commit 990c4f580742de7bb78fa57420ffd182fc3ab4cd) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c index 5e8833e4fed2..ccfd2a4b4acc 100644 --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c @@ -482,7 +482,7 @@ static bool __aqua_vanjaram_is_valid_mode(struct amdgpu_xcp_mgr *xcp_mgr, case AMDGPU_SPX_PARTITION_MODE: return adev->gmc.num_mem_partitions == 1 && num_xcc > 0; case AMDGPU_DPX_PARTITION_MODE: - return adev->gmc.num_mem_partitions != 8 && (num_xcc % 4) == 0; + return adev->gmc.num_mem_partitions <= 2 && (num_xcc % 4) == 0; case AMDGPU_TPX_PARTITION_MODE: return (adev->gmc.num_mem_partitions == 1 || adev->gmc.num_mem_partitions == 3) && From b46dadf7e3cfe26d0b109c9c3d81b278d6c75361 Mon Sep 17 00:00:00 2001 From: Alex Deucher Date: Wed, 23 Oct 2024 16:37:52 -0400 Subject: [PATCH 082/235] drm/amdgpu: Adjust debugfs register access permissions Regular users shouldn't have read access. Reviewed-by: Yang Wang Signed-off-by: Alex Deucher (cherry picked from commit c0cfd2e652553d607b910be47d0cc5a7f3a78641) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index cbef720de779..29b45e370885 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -1648,7 +1648,7 @@ int amdgpu_debugfs_regs_init(struct amdgpu_device *adev) for (i = 0; i < ARRAY_SIZE(debugfs_regs); i++) { ent = debugfs_create_file(debugfs_regs_names[i], - S_IFREG | 0444, root, + S_IFREG | 0400, root, adev, debugfs_regs[i]); if (!i && !IS_ERR_OR_NULL(ent)) i_size_write(ent->d_inode, adev->rmmio_size); From f790a2c494c4ef587eeeb9fca20124de76a1646f Mon Sep 17 00:00:00 2001 From: Alex Deucher Date: Wed, 23 Oct 2024 16:39:36 -0400 Subject: [PATCH 083/235] drm/amdgpu: Adjust debugfs eviction and IB access permissions Users should not be able to run these. 
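For reference, reading the octal modes (standard permission-bit arithmetic, not something stated in the patch): 0444 is r--r--r--, readable by everyone, while 0400 is r--------, readable only by the file owner (root, for debugfs entries). Because merely reading amdgpu_evict_vram, amdgpu_evict_gtt or amdgpu_test_ib triggers the corresponding operation, dropping the group/other read bits is what stops regular users from running them.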
Reviewed-by: Yang Wang Signed-off-by: Alex Deucher (cherry picked from commit 7ba9395430f611cfc101b1c2687732baafa239d5) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 29b45e370885..eac92130c0be 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -2100,11 +2100,11 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev) amdgpu_securedisplay_debugfs_init(adev); amdgpu_fw_attestation_debugfs_init(adev); - debugfs_create_file("amdgpu_evict_vram", 0444, root, adev, + debugfs_create_file("amdgpu_evict_vram", 0400, root, adev, &amdgpu_evict_vram_fops); - debugfs_create_file("amdgpu_evict_gtt", 0444, root, adev, + debugfs_create_file("amdgpu_evict_gtt", 0400, root, adev, &amdgpu_evict_gtt_fops); - debugfs_create_file("amdgpu_test_ib", 0444, root, adev, + debugfs_create_file("amdgpu_test_ib", 0400, root, adev, &amdgpu_debugfs_test_ib_fops); debugfs_create_file("amdgpu_vm_info", 0444, root, adev, &amdgpu_debugfs_vm_info_fops); From 4d75b9468021c73108b4439794d69e892b1d24e3 Mon Sep 17 00:00:00 2001 From: Alex Deucher Date: Wed, 23 Oct 2024 16:52:08 -0400 Subject: [PATCH 084/235] drm/amdgpu: add missing size check in amdgpu_debugfs_gprwave_read() Avoid a possible buffer overflow if size is larger than 4K. Reviewed-by: Yang Wang Signed-off-by: Alex Deucher (cherry picked from commit f5d873f5825b40d886d03bd2aede91d4cf002434) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index eac92130c0be..9da4414de617 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -402,7 +402,7 @@ static ssize_t amdgpu_debugfs_gprwave_read(struct file *f, char __user *buf, siz int r; uint32_t *data, x; - if (size & 0x3 || *pos & 0x3) + if (size > 4096 || size & 0x3 || *pos & 0x3) return -EINVAL; r = pm_runtime_get_sync(adev_to_drm(adev)->dev); From 9bb4af400c386374ab1047df44c508512c08c31f Mon Sep 17 00:00:00 2001 From: Amelie Delaunay Date: Tue, 5 Nov 2024 15:02:42 +0100 Subject: [PATCH 085/235] ASoC: stm32: spdifrx: fix dma channel release in stm32_spdifrx_remove In case of error when requesting ctrl_chan DMA channel, ctrl_chan is not null. So the release of the dma channel leads to the following issue: [ 4.879000] st,stm32-spdifrx 500d0000.audio-controller: dma_request_slave_channel error -19 [ 4.888975] Unable to handle kernel NULL pointer dereference at virtual address 000000000000003d [...] [ 5.096577] Call trace: [ 5.099099] dma_release_channel+0x24/0x100 [ 5.103235] stm32_spdifrx_remove+0x24/0x60 [snd_soc_stm32_spdifrx] [ 5.109494] stm32_spdifrx_probe+0x320/0x4c4 [snd_soc_stm32_spdifrx] To avoid this issue, release channel only if the pointer is valid. 
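The subtlety is the kernel's error-pointer convention: a failed channel request leaves ERR_PTR(-errno) behind, which is non-NULL, so a NULL test lets the bogus pointer through to dma_release_channel(). A minimal sketch of the convention (channel name illustrative, not copied from the driver):

	spdifrx->ctrl_chan = dma_request_chan(dev, "rx-ctrl");
	/* on failure this holds e.g. ERR_PTR(-ENODEV), not NULL, so the
	 * matching validity check at teardown is IS_ERR(): */
	if (!IS_ERR(spdifrx->ctrl_chan))
		dma_release_channel(spdifrx->ctrl_chan);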
Fixes: 794df9448edb ("ASoC: stm32: spdifrx: manage rebind issue") Signed-off-by: Amelie Delaunay Signed-off-by: Olivier Moysan Link: https://patch.msgid.link/20241105140242.527279-1-olivier.moysan@foss.st.com Signed-off-by: Mark Brown --- sound/soc/stm/stm32_spdifrx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/soc/stm/stm32_spdifrx.c b/sound/soc/stm/stm32_spdifrx.c index d1b32ba1e1a2..9e30852de93c 100644 --- a/sound/soc/stm/stm32_spdifrx.c +++ b/sound/soc/stm/stm32_spdifrx.c @@ -939,7 +939,7 @@ static void stm32_spdifrx_remove(struct platform_device *pdev) { struct stm32_spdifrx_data *spdifrx = platform_get_drvdata(pdev); - if (spdifrx->ctrl_chan) + if (!IS_ERR(spdifrx->ctrl_chan)) dma_release_channel(spdifrx->ctrl_chan); if (spdifrx->dmab) From 9c9201afebea1efc7ea4b8f721ee18a05bb8aca1 Mon Sep 17 00:00:00 2001 From: Koichiro Den Date: Tue, 5 Nov 2024 11:27:47 +0900 Subject: [PATCH 086/235] mm/slab: fix warning caused by duplicate kmem_cache creation in kmem_buckets_create Commit b035f5a6d852 ("mm: slab: reduce the kmalloc() minimum alignment if DMA bouncing possible") reduced ARCH_KMALLOC_MINALIGN to 8 on arm64. However, with KASAN_HW_TAGS enabled, arch_slab_minalign() becomes 16. This causes kmalloc_caches[*][8] to be aliased to kmalloc_caches[*][16], resulting in kmem_buckets_create() attempting to create a kmem_cache for size 16 twice. This duplication triggers warnings on boot: [ 2.325108] ------------[ cut here ]------------ [ 2.325135] kmem_cache of name 'memdup_user-16' already exists [ 2.325783] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:107 __kmem_cache_create_args+0xb8/0x3b0 [ 2.327957] Modules linked in: [ 2.328550] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc5mm-unstable-arm64+ #12 [ 2.328683] Hardware name: QEMU QEMU Virtual Machine, BIOS 2024.02-2 03/11/2024 [ 2.328790] pstate: 61000009 (nZCv daif -PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 2.328911] pc : __kmem_cache_create_args+0xb8/0x3b0 [ 2.328930] lr : __kmem_cache_create_args+0xb8/0x3b0 [ 2.328942] sp : ffff800083d6fc50 [ 2.328961] x29: ffff800083d6fc50 x28: f2ff0000c1674410 x27: ffff8000820b0598 [ 2.329061] x26: 000000007fffffff x25: 0000000000000010 x24: 0000000000002000 [ 2.329101] x23: ffff800083d6fce8 x22: ffff8000832222e8 x21: ffff800083222388 [ 2.329118] x20: f2ff0000c1674410 x19: f5ff0000c16364c0 x18: ffff800083d80030 [ 2.329135] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 2.329152] x14: 0000000000000000 x13: 0a73747369786520 x12: 79646165726c6120 [ 2.329169] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000 [ 2.329194] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 2.329210] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.329226] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 2.329291] Call trace: [ 2.329407] __kmem_cache_create_args+0xb8/0x3b0 [ 2.329499] kmem_buckets_create+0xfc/0x320 [ 2.329526] init_user_buckets+0x34/0x78 [ 2.329540] do_one_initcall+0x64/0x3c8 [ 2.329550] kernel_init_freeable+0x26c/0x578 [ 2.329562] kernel_init+0x3c/0x258 [ 2.329574] ret_from_fork+0x10/0x20 [ 2.329698] ---[ end trace 0000000000000000 ]--- [ 2.403704] ------------[ cut here ]------------ [ 2.404716] kmem_cache of name 'msg_msg-16' already exists [ 2.404801] WARNING: CPU: 2 PID: 1 at mm/slab_common.c:107 __kmem_cache_create_args+0xb8/0x3b0 [ 2.404842] Modules linked in: [ 2.404971] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.12.0-rc5mm-unstable-arm64+ #12 [ 2.405026] Tainted: 
[W]=WARN [ 2.405043] Hardware name: QEMU QEMU Virtual Machine, BIOS 2024.02-2 03/11/2024 [ 2.405057] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.405079] pc : __kmem_cache_create_args+0xb8/0x3b0 [ 2.405100] lr : __kmem_cache_create_args+0xb8/0x3b0 [ 2.405111] sp : ffff800083d6fc50 [ 2.405115] x29: ffff800083d6fc50 x28: fbff0000c1674410 x27: ffff8000820b0598 [ 2.405135] x26: 000000000000ffd0 x25: 0000000000000010 x24: 0000000000006000 [ 2.405153] x23: ffff800083d6fce8 x22: ffff8000832222e8 x21: ffff800083222388 [ 2.405169] x20: fbff0000c1674410 x19: fdff0000c163d6c0 x18: ffff800083d80030 [ 2.405185] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 2.405201] x14: 0000000000000000 x13: 0a73747369786520 x12: 79646165726c6120 [ 2.405217] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000 [ 2.405233] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 2.405248] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.405271] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 2.405287] Call trace: [ 2.405293] __kmem_cache_create_args+0xb8/0x3b0 [ 2.405305] kmem_buckets_create+0xfc/0x320 [ 2.405315] init_msg_buckets+0x34/0x78 [ 2.405326] do_one_initcall+0x64/0x3c8 [ 2.405337] kernel_init_freeable+0x26c/0x578 [ 2.405348] kernel_init+0x3c/0x258 [ 2.405360] ret_from_fork+0x10/0x20 [ 2.405370] ---[ end trace 0000000000000000 ]--- To address this, alias kmem_cache for sizes smaller than min alignment to the aligned sized kmem_cache, as done with the default system kmalloc bucket. Fixes: b32801d1255b ("mm/slab: Introduce kmem_buckets_create() and family") Cc: # v6.11+ Signed-off-by: Koichiro Den Reviewed-by: Catalin Marinas Tested-by: Catalin Marinas Signed-off-by: Vlastimil Babka --- mm/slab_common.c | 31 ++++++++++++++++++++----------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 552b92dfdac7..893d32059915 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -380,8 +380,11 @@ kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, unsigned int usersize, void (*ctor)(void *)) { + unsigned long mask = 0; + unsigned int idx; kmem_buckets *b; - int idx; + + BUILD_BUG_ON(ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]) > BITS_PER_LONG); /* * When the separate buckets API is not built in, just return @@ -403,7 +406,7 @@ kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) { char *short_size, *cache_name; unsigned int cache_useroffset, cache_usersize; - unsigned int size; + unsigned int size, aligned_idx; if (!kmalloc_caches[KMALLOC_NORMAL][idx]) continue; @@ -416,10 +419,6 @@ kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, if (WARN_ON(!short_size)) goto fail; - cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1); - if (WARN_ON(!cache_name)) - goto fail; - if (useroffset >= size) { cache_useroffset = 0; cache_usersize = 0; @@ -427,18 +426,28 @@ kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, cache_useroffset = useroffset; cache_usersize = min(size - cache_useroffset, usersize); } - (*b)[idx] = kmem_cache_create_usercopy(cache_name, size, + + aligned_idx = __kmalloc_index(size, false); + if (!(*b)[aligned_idx]) { + cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1); + if (WARN_ON(!cache_name)) + goto fail; + (*b)[aligned_idx] = kmem_cache_create_usercopy(cache_name, size, 
0, flags, cache_useroffset, cache_usersize, ctor); - kfree(cache_name); - if (WARN_ON(!(*b)[idx])) - goto fail; + kfree(cache_name); + if (WARN_ON(!(*b)[aligned_idx])) + goto fail; + set_bit(aligned_idx, &mask); + } + if (idx != aligned_idx) + (*b)[idx] = (*b)[aligned_idx]; } return b; fail: - for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) + for_each_set_bit(idx, &mask, ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL])) kmem_cache_destroy((*b)[idx]); kmem_cache_free(kmem_buckets_cache, b); From f7d1b585e1533e26801c13569b96b84b2ad2d3c1 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 5 Nov 2024 11:45:24 -1000 Subject: [PATCH 087/235] sched_ext: Add a missing newline at the end of an error message Signed-off-by: Tejun Heo --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 74344a43ccf1..3bdb08fc2056 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4974,7 +4974,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN), cpu_possible_mask)) { - pr_err("sched_ext: Not compatible with \"isolcpus=\" domain isolation"); + pr_err("sched_ext: Not compatible with \"isolcpus=\" domain isolation\n"); return -EINVAL; } From a759bf0dfc4db3cb6556fc79c7c98da3a46b2b80 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 5 Nov 2024 11:45:27 -1000 Subject: [PATCH 088/235] sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") converted scx_ops_bypass_depth from an atomic to an int. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") --- tools/sched_ext/scx_show_state.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/sched_ext/scx_show_state.py b/tools/sched_ext/scx_show_state.py index 8bc626ede1c4..c4b3fdda9a0b 100644 --- a/tools/sched_ext/scx_show_state.py +++ b/tools/sched_ext/scx_show_state.py @@ -35,6 +35,6 @@ print(f'enabled : {read_static_key("__scx_ops_enabled")}') print(f'switching_all : {read_int("scx_switching_all")}') print(f'switched_all : {read_static_key("__scx_switched_all")}') print(f'enable_state : {ops_state_str(enable_state)} ({enable_state})') -print(f'bypass_depth : {read_atomic("scx_ops_bypass_depth")}') +print(f'bypass_depth : {prog["scx_ops_bypass_depth"].value_()}') print(f'nr_rejected : {read_atomic("scx_nr_rejected")}') print(f'enable_seq : {read_atomic("scx_enable_seq")}') From 6801cf7890f2ed8fcc14859b47501f8ee7a58ec7 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Tue, 5 Nov 2024 12:30:57 +0800 Subject: [PATCH 089/235] selftests/bpf: Use -4095 as the bad address for bits iterator As reported by Byeonguk, the bad_words test in verifier_bits_iter.c occasionally fails on s390 host. Quoting Ilya's explanation: s390 kernel runs in a completely separate address space, there is no user/kernel split at TASK_SIZE. The same address may be valid in both the kernel and the user address spaces, there is no way to tell by looking at it. The config option related to this property is ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. Also, unfortunately, 0 is a valid address in the s390 kernel address space. Fix the issue by using -4095 as the bad address for bits iterator, as suggested by Ilya. Verify that bpf_iter_bits_new() returns -EINVAL for NULL address and -EFAULT for bad address. 
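Why -4095 in particular: it is the most negative value of the kernel's reserved errno-pointer range (MAX_ERRNO is 4095), and that top slice of the address space is not expected to be backed by a valid mapping on any architecture, including s390's non-overlapping kernel address space. A one-line illustration (background fact, not taken from the patch):

	void *bad_addr = (void *)-4095;	/* same bits as ERR_PTR(-MAX_ERRNO): reliably unmapped */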
Fixes: ebafc1e535db ("selftests/bpf: Add three test cases for bits_iter") Reported-by: Byeonguk Jeong Closes: https://lore.kernel.org/bpf/ZycSXwjH4UTvx-Cn@ub22/ Signed-off-by: Hou Tao Acked-by: Ilya Leoshkevich Link: https://lore.kernel.org/r/20241105043057.3371482-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- .../selftests/bpf/progs/verifier_bits_iter.c | 32 ++++++++++++++++--- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c index 156cc278e2fc..7c881bca9af5 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c +++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c @@ -57,9 +57,15 @@ __description("null pointer") __success __retval(0) int null_pointer(void) { - int nr = 0; + struct bpf_iter_bits iter; + int err, nr = 0; int *bit; + err = bpf_iter_bits_new(&iter, NULL, 1); + bpf_iter_bits_destroy(&iter); + if (err != -EINVAL) + return 1; + bpf_for_each(bits, bit, NULL, 1) nr++; return nr; @@ -194,15 +200,33 @@ __description("bad words") __success __retval(0) int bad_words(void) { - void *bad_addr = (void *)(3UL << 30); - int nr = 0; + void *bad_addr = (void *)-4095; + struct bpf_iter_bits iter; + volatile int nr; int *bit; + int err; + err = bpf_iter_bits_new(&iter, bad_addr, 1); + bpf_iter_bits_destroy(&iter); + if (err != -EFAULT) + return 1; + + nr = 0; bpf_for_each(bits, bit, bad_addr, 1) nr++; + if (nr != 0) + return 2; + err = bpf_iter_bits_new(&iter, bad_addr, 4); + bpf_iter_bits_destroy(&iter); + if (err != -EFAULT) + return 3; + + nr = 0; bpf_for_each(bits, bit, bad_addr, 4) nr++; + if (nr != 0) + return 4; - return nr; + return 0; } From af797b831d8975cb4610f396dcb7f03f4b9908e7 Mon Sep 17 00:00:00 2001 From: Matthew Brost Date: Mon, 4 Nov 2024 20:35:23 -0800 Subject: [PATCH 090/235] drm/xe: Fix possible exec queue leak in exec IOCTL In a couple of places after an exec queue is looked up the exec IOCTL returns on input errors without dropping the exec queue ref. Fix this ensuring the exec queue ref is dropped on input error. 
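The fix follows the usual lookup/put discipline, sketched here in isolation (structure only; "invalid_input" stands in for the real checks, which the diff below spells out):

	q = xe_exec_queue_lookup(xef, args->exec_queue_id);
	if (XE_IOCTL_DBG(xe, !q))
		return -ENOENT;		/* lookup failed: nothing to put */

	if (invalid_input) {		/* any validation after the lookup */
		err = -EINVAL;
		goto err_exec_queue;	/* must drop the lookup reference */
	}
	/* ... */
	err_exec_queue:
		xe_exec_queue_put(q);
		return err;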
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Signed-off-by: Matthew Brost Reviewed-by: Tejas Upadhyay Reviewed-by: Rodrigo Vivi Link: https://patchwork.freedesktop.org/patch/msgid/20241105043524.4062774-2-matthew.brost@intel.com (cherry picked from commit 07064a200b40ac2195cb6b7b779897d9377e5e6f) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_exec.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c index f23ac1e2ed88..6de12f91b865 100644 --- a/drivers/gpu/drm/xe/xe_exec.c +++ b/drivers/gpu/drm/xe/xe_exec.c @@ -132,12 +132,16 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file) if (XE_IOCTL_DBG(xe, !q)) return -ENOENT; - if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_VM)) - return -EINVAL; + if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_VM)) { + err = -EINVAL; + goto err_exec_queue; + } if (XE_IOCTL_DBG(xe, args->num_batch_buffer && - q->width != args->num_batch_buffer)) - return -EINVAL; + q->width != args->num_batch_buffer)) { + err = -EINVAL; + goto err_exec_queue; + } if (XE_IOCTL_DBG(xe, q->ops->reset_status(q))) { err = -ECANCELED; From 64a2b6ed4bfd890a0e91955dd8ef8422a3944ed9 Mon Sep 17 00:00:00 2001 From: Matthew Brost Date: Mon, 4 Nov 2024 20:35:24 -0800 Subject: [PATCH 091/235] drm/xe: Drop VM dma-resv lock on xe_sync_in_fence_get failure in exec IOCTL Upon failure all locks need to be dropped before returning to the user. Fixes: 58480c1c912f ("drm/xe: Skip VMAs pin when requesting signal to the last XE_EXEC") Cc: Signed-off-by: Matthew Brost Reviewed-by: Tejas Upadhyay Reviewed-by: Rodrigo Vivi Link: https://patchwork.freedesktop.org/patch/msgid/20241105043524.4062774-3-matthew.brost@intel.com (cherry picked from commit 7d1a4258e602ffdce529f56686925034c1b3b095) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_exec.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c index 6de12f91b865..756b492f13b0 100644 --- a/drivers/gpu/drm/xe/xe_exec.c +++ b/drivers/gpu/drm/xe/xe_exec.c @@ -224,6 +224,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file) fence = xe_sync_in_fence_get(syncs, num_syncs, q, vm); if (IS_ERR(fence)) { err = PTR_ERR(fence); + xe_vm_unlock(vm); goto err_unlock_list; } for (i = 0; i < num_syncs; i++) From a353c78459f4d116216393cc29032ef5fe1472d2 Mon Sep 17 00:00:00 2001 From: Michal Wajdeczko Date: Mon, 4 Nov 2024 15:49:01 +0100 Subject: [PATCH 092/235] drm/xe/pf: Fix potential GGTT allocation leak In unlikely event that we fail during sending the new VF GGTT configuration to the GuC, we will free only the GGTT node data struct but will miss to release the actual GGTT allocation. This will later lead to list corruption, GGTT space leak and finally risking crash when unloading the driver: [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-EIO) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] list_add corruption. next->prev should be prev (ffff88813cfcd628), but was 0000000000000000. (next=ffff88813cfe2028). [ ] RIP: 0010:__list_add_valid_or_report+0x6b/0xb0 [ ] Call Trace: [ ] drm_mm_insert_node_in_range+0x2c0/0x4e0 [ ] xe_ggtt_node_insert+0x46/0x70 [xe] [ ] pf_provision_vf_ggtt+0x7f5/0xa70 [xe] [ ] xe_gt_sriov_pf_config_set_ggtt+0x5e/0x770 [xe] [ ] ggtt_set+0x4b/0x70 [xe] [ ] simple_attr_write_xsigned.constprop.0.isra.0+0xb0/0x110 [ ] ... 
[drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-ENOSPC) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP NOPTI [ ] RIP: 0010:drm_mm_remove_node+0x1b7/0x390 [ ] Call Trace: [ ] [ ] ? die_addr+0x2e/0x80 [ ] ? exc_general_protection+0x1a1/0x3e0 [ ] ? asm_exc_general_protection+0x22/0x30 [ ] ? drm_mm_remove_node+0x1b7/0x390 [ ] ggtt_node_remove+0xa5/0xf0 [xe] [ ] xe_ggtt_node_remove+0x35/0x70 [xe] [ ] xe_ttm_bo_destroy+0x123/0x220 [xe] [ ] intel_user_framebuffer_destroy+0x44/0x70 [xe] [ ] intel_plane_destroy_state+0x3b/0xc0 [xe] [ ] drm_atomic_state_default_clear+0x1cd/0x2f0 [ ] intel_atomic_state_clear+0x9/0x20 [xe] [ ] __drm_atomic_state_free+0x1d/0xb0 Fix that by using pf_release_ggtt() on the error path, which now works regardless if the node has GGTT allocation or not. Fixes: 34e804220f69 ("drm/xe: Make xe_ggtt_node struct independent") Signed-off-by: Michal Wajdeczko Cc: Rodrigo Vivi Cc: Matthew Brost Cc: Matthew Auld Reviewed-by: Matthew Brost Link: https://patchwork.freedesktop.org/patch/msgid/20241104144901.1903-1-michal.wajdeczko@intel.com (cherry picked from commit 43b1dd2b550f0861ce80fbfffd5881b1b26272b1) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c b/drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c index 8250ef71e685..afdb477ecf83 100644 --- a/drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c +++ b/drivers/gpu/drm/xe/xe_gt_sriov_pf_config.c @@ -387,6 +387,8 @@ static void pf_release_ggtt(struct xe_tile *tile, struct xe_ggtt_node *node) * the xe_ggtt_clear() called by below xe_ggtt_remove_node(). */ xe_ggtt_node_remove(node, false); + } else { + xe_ggtt_node_fini(node); } } @@ -442,7 +444,7 @@ static int pf_provision_vf_ggtt(struct xe_gt *gt, unsigned int vfid, u64 size) config->ggtt_region = node; return 0; err: - xe_ggtt_node_fini(node); + pf_release_ggtt(tile, node); return err; } From 514447a1219021298329ce586536598c3b4b2dc0 Mon Sep 17 00:00:00 2001 From: Lucas De Marchi Date: Mon, 4 Nov 2024 06:38:12 -0800 Subject: [PATCH 093/235] drm/xe: Stop accumulating LRC timestamp on job_free The exec queue timestamp is only really useful when it's being queried through the fdinfo. There's no need to update it so often, on every job_free. Tracing a simple app like vkcube running shows an update rate of ~ 120Hz. In case of discrete, the BO is on vram, creating a lot of pcie transactions. The update on job_free() is used to cover a gap: if exec queue is created and destroyed rapidly, before a new query, the timestamp still needs to be accumulated and accounted for in the xef. Initial implementation in commit 6109f24f87d7 ("drm/xe: Add helper to accumulate exec queue runtime") couldn't do it on the exec_queue_fini since the xef could be gone at that point. However since commit ce8c161cbad4 ("drm/xe: Add ref counting for xe_file") the xef is refcounted and the exec queue always holds a reference, making this safe now. Improve the fix in commit 2149ded63079 ("drm/xe: Fix use after free when client stats are captured") by reducing the frequency in which the update is needed. 
Fixes: 2149ded63079 ("drm/xe: Fix use after free when client stats are captured") Reviewed-by: Nirmoy Das Reviewed-by: Jonathan Cavitt Reviewed-by: Umesh Nerlige Ramappa Link: https://patchwork.freedesktop.org/patch/msgid/20241104143815.2112272-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi (cherry picked from commit 83db047d9425d9a649f01573797558eff0f632e1) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_exec_queue.c | 6 ++++++ drivers/gpu/drm/xe/xe_guc_submit.c | 2 -- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c index d098d2dd1b2d..fd0f3b3c9101 100644 --- a/drivers/gpu/drm/xe/xe_exec_queue.c +++ b/drivers/gpu/drm/xe/xe_exec_queue.c @@ -260,8 +260,14 @@ void xe_exec_queue_fini(struct xe_exec_queue *q) { int i; + /* + * Before releasing our ref to lrc and xef, accumulate our run ticks + */ + xe_exec_queue_update_run_ticks(q); + for (i = 0; i < q->width; ++i) xe_lrc_put(q->lrc[i]); + __xe_exec_queue_free(q); } diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c index f903b0772722..4f5d00aea716 100644 --- a/drivers/gpu/drm/xe/xe_guc_submit.c +++ b/drivers/gpu/drm/xe/xe_guc_submit.c @@ -745,8 +745,6 @@ static void guc_exec_queue_free_job(struct drm_sched_job *drm_job) { struct xe_sched_job *job = to_xe_sched_job(drm_job); - xe_exec_queue_update_run_ticks(job->q); - trace_xe_sched_job_free(job); xe_sched_job_put(job); } From e66f3185fa04ccb807c6fbf0ea066574f4308831 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 27 Oct 2024 12:59:34 -0700 Subject: [PATCH 094/235] mm/thp: fix deferred split queue not partially_mapped Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. The new unlocked list_del_init() in deferred_split_scan() is buggy. I gave bad advice, it looks plausible since that's a local on-stack list, but the fact is that it can race with a third party freeing or migrating the preceding folio (properly unqueueing it with refcount 0 while holding split_queue_lock), thereby corrupting the list linkage. The obvious answer would be to take split_queue_lock there: but it has a long history of contention, so I'm reluctant to add to that. Instead, make sure that there is always one safe (raised refcount) folio before, by delaying its folio_put(). (And of course I was wrong to suggest updating split_queue_len without the lock: leave that until the splice.) And remove two over-eager partially_mapped checks, restoring those tests to how they were before: if uncharge_folio() or free_tail_page_prepare() finds _deferred_list non-empty, it's in trouble whether or not that folio is partially_mapped (and the flag was already cleared in the latter case). Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins Acked-by: Usama Arif Reviewed-by: David Hildenbrand Reviewed-by: Baolin Wang Acked-by: Zi Yan Cc: Barry Song Cc: Chris Li Cc: Johannes Weiner Cc: Kefeng Wang Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Cc: Ryan Roberts Cc: Shakeel Butt Cc: Wei Yang Cc: Yang Shi Signed-off-by: Andrew Morton --- mm/huge_memory.c | 21 +++++++++++++++++---- mm/memcontrol.c | 3 +-- mm/page_alloc.c | 5 ++--- 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2fb328880b50..a1d345f1680c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, struct deferred_split *ds_queue = &pgdata->deferred_split_queue; unsigned long flags; LIST_HEAD(list); - struct folio *folio, *next; - int split = 0; + struct folio *folio, *next, *prev = NULL; + int split = 0, removed = 0; #ifdef CONFIG_MEMCG if (sc->memcg) @@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, */ if (!did_split && !folio_test_partially_mapped(folio)) { list_del_init(&folio->_deferred_list); - ds_queue->split_queue_len--; + removed++; + } else { + /* + * That unlocked list_del_init() above would be unsafe, + * unless its folio is separated from any earlier folios + * left on the list (which may be concurrently unqueued) + * by one safe folio with refcount still raised. + */ + swap(folio, prev); } - folio_put(folio); + if (folio) + folio_put(folio); } spin_lock_irqsave(&ds_queue->split_queue_lock, flags); list_splice_tail(&list, &ds_queue->split_queue); + ds_queue->split_queue_len -= removed; spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + if (prev) + folio_put(prev); + /* * Stop shrinker if we didn't split any page, but the queue is empty. * This can happen if pages were freed under us. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7845c64a2c57..2703227cce88 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON_FOLIO(folio_order(folio) > 1 && !folio_test_hugetlb(folio) && - !list_empty(&folio->_deferred_list) && - folio_test_partially_mapped(folio), folio); + !list_empty(&folio->_deferred_list), folio); /* * Nobody should be changing or seriously looking at diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 94a2ffe28008..5e108ae755cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) break; case 2: /* the second tail page: deferred_list overlaps ->mapping */ - if (unlikely(!list_empty(&folio->_deferred_list) && - folio_test_partially_mapped(folio))) { - bad_page(page, "partially mapped folio on deferred list"); + if (unlikely(!list_empty(&folio->_deferred_list))) { + bad_page(page, "on deferred list"); goto out; } break; From f8f931bba0f92052cf842b7e30917b1afcc77d5a Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 27 Oct 2024 13:02:13 -0700 Subject: [PATCH 095/235] mm/thp: fix deferred split unqueue naming and locking Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. 
Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. 
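To make the refcount rule concrete, the safe caller-side pattern (a sketch distilled from the __folio_migrate_mapping() hunk below, not new code) is:

	if (!folio_ref_freeze(folio, expected_count))
		return -EAGAIN;
	/* Refcount is now zero: unqueueing cannot race with the shrinker. */
	folio_unqueue_deferred_split(folio);
	folio_ref_unfreeze(folio, expected_count);

Callers that cannot freeze the refcount, like the memcg-v1 move path above, must split or skip the folio instead.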
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins Acked-by: David Hildenbrand Reviewed-by: Yang Shi Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Johannes Weiner Cc: Kefeng Wang Cc: Kirill A. Shutemov Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Cc: Ryan Roberts Cc: Shakeel Butt Cc: Usama Arif Cc: Wei Yang Cc: Zi Yan Cc: Signed-off-by: Andrew Morton --- mm/huge_memory.c | 35 ++++++++++++++++++++++++++--------- mm/internal.h | 10 +++++----- mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++ mm/memcontrol.c | 8 +++++--- mm/migrate.c | 4 ++-- mm/page_alloc.c | 1 - mm/swap.c | 4 ++-- mm/vmscan.c | 4 ++-- 8 files changed, 67 insertions(+), 24 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a1d345f1680c..03fd4bc39ea1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list) return split_huge_page_to_list_to_order(&folio->page, list, ret); } -void __folio_undo_large_rmappable(struct folio *folio) +/* + * __folio_unqueue_deferred_split() is not to be called directly: + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h + * limits its calls to those folios which may have a _deferred_list for + * queueing THP splits, and that list is (racily observed to be) non-empty. + * + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is + * zero: because even when split_queue_lock is held, a non-empty _deferred_list + * might be in use on deferred_split_scan()'s unlocked on-stack list. + * + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is + * therefore important to unqueue deferred split before changing folio memcg. + */ +bool __folio_unqueue_deferred_split(struct folio *folio) { struct deferred_split *ds_queue; unsigned long flags; + bool unqueued = false; + + WARN_ON_ONCE(folio_ref_count(folio)); + WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio)); ds_queue = get_deferred_split_queue(folio); spin_lock_irqsave(&ds_queue->split_queue_lock, flags); @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio) MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); } list_del_init(&folio->_deferred_list); + unqueued = true; } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + + return unqueued; /* useful for debug warnings */ } /* partially_mapped=false won't clear PG_partially_mapped folio flag */ @@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped) return; /* - * The try_to_unmap() in page reclaim path might reach here too, - * this may cause a race condition to corrupt deferred split queue. - * And, if page reclaim is already handling the same folio, it is - * unnecessary to handle it again in shrinker. - * - * Check the swapcache flag to determine if the folio is being - * handled by page reclaim since THP swap would add the folio into - * swap cache before calling try_to_unmap(). + * Exclude swapcache: originally to avoid a corrupt deferred split + * queue. Nowadays that is fully prevented by mem_cgroup_swapout(); + * but if page reclaim is already handling the same folio, it is + * unnecessary to handle it again in the shrinker, so excluding + * swapcache here may still be a useful optimization. 
*/ if (folio_test_swapcache(folio)) return; diff --git a/mm/internal.h b/mm/internal.h index 93083bbeeefa..16c1f3cd599e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) #endif } -void __folio_undo_large_rmappable(struct folio *folio); -static inline void folio_undo_large_rmappable(struct folio *folio) +bool __folio_unqueue_deferred_split(struct folio *folio); +static inline bool folio_unqueue_deferred_split(struct folio *folio) { if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio)) - return; + return false; /* * At this point, there is no one trying to add the folio to @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio) * to check without acquiring the split_queue_lock. */ if (data_race(list_empty(&folio->_deferred_list))) - return; + return false; - __folio_undo_large_rmappable(folio); + return __folio_unqueue_deferred_split(folio); } static inline struct folio *page_rmappable_folio(struct page *page) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 81d8819f13cd..f8744f5630bb 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio, css_get(&to->css); css_put(&from->css); + /* Warning should never happen, so don't worry about refcount non-0 */ + WARN_ON_ONCE(folio_unqueue_deferred_split(folio)); folio->memcg_data = (unsigned long)to; __folio_memcg_unlock(from); @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, enum mc_target_type target_type; union mc_target target; struct folio *folio; + bool tried_split_before = false; +retry_pmd: ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { if (mc.precharge < HPAGE_PMD_NR) { @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); if (target_type == MC_TARGET_PAGE) { folio = target.folio; + /* + * Deferred split queue locking depends on memcg, + * and unqueue is unsafe unless folio refcount is 0: + * split or skip if on the queue? first try to split. + */ + if (!list_empty(&folio->_deferred_list)) { + spin_unlock(ptl); + if (!tried_split_before) + split_folio(folio); + folio_unlock(folio); + folio_put(folio); + if (tried_split_before) + return 0; + tried_split_before = true; + goto retry_pmd; + } + /* + * So long as that pmd lock is held, the folio cannot + * be racily added to the _deferred_list, because + * __folio_remove_rmap() will find !partially_mapped. 
+ */ if (folio_isolate_lru(folio)) { if (!mem_cgroup_move_account(folio, true, mc.from, mc.to)) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2703227cce88..06df2af97415 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) struct obj_cgroup *objcg; VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - VM_BUG_ON_FOLIO(folio_order(folio) > 1 && - !folio_test_hugetlb(folio) && - !list_empty(&folio->_deferred_list), folio); /* * Nobody should be changing or seriously looking at @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) ug->nr_memory += nr_pages; ug->pgpgout++; + WARN_ON_ONCE(folio_unqueue_deferred_split(folio)); folio->memcg_data = 0; } @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) /* Transfer the charge and the css ref */ commit_charge(new, memcg); + + /* Warning should never happen, so don't worry about refcount non-0 */ + WARN_ON_ONCE(folio_unqueue_deferred_split(old)); old->memcg_data = 0; } @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) VM_BUG_ON_FOLIO(oldid, folio); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); + folio_unqueue_deferred_split(folio); folio->memcg_data = 0; if (!mem_cgroup_is_root(memcg)) diff --git a/mm/migrate.c b/mm/migrate.c index fab84a776088..dfa24e41e8f9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -490,7 +490,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, folio_test_large_rmappable(folio)) { if (!folio_ref_freeze(folio, expected_count)) return -EAGAIN; - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); folio_ref_unfreeze(folio, expected_count); } @@ -515,7 +515,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, } /* Take off deferred split queue while frozen and memcg set */ - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); /* * Now we know that no one else is looking at the folio: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5e108ae755cc..8ad38cd5e574 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2681,7 +2681,6 @@ void free_unref_folios(struct folio_batch *folios) unsigned long pfn = folio_pfn(folio); unsigned int order = folio_order(folio); - folio_undo_large_rmappable(folio); if (!free_pages_prepare(&folio->page, order)) continue; /* diff --git a/mm/swap.c b/mm/swap.c index 835bdf324b76..b8e3259ea2c4 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -121,7 +121,7 @@ void __folio_put(struct folio *folio) } page_cache_release(folio); - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); mem_cgroup_uncharge(folio); free_unref_page(&folio->page, folio_order(folio)); } @@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) free_huge_folio(folio); continue; } - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); __page_cache_release(folio, &lruvec, &flags); if (j != i) diff --git a/mm/vmscan.c b/mm/vmscan.c index ddaaff67642e..28ba2b06fc7d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1476,7 +1476,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, */ nr_reclaimed += nr_pages; - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); if (folio_batch_add(&free_folios, folio) == 0) { mem_cgroup_uncharge_folios(&free_folios); try_to_unmap_flush(); @@ -1864,7 +1864,7 @@ static unsigned int move_folios_to_lru(struct 
lruvec *lruvec, if (unlikely(folio_put_testzero(folio))) { __folio_clear_lru_flags(folio); - folio_undo_large_rmappable(folio); + folio_unqueue_deferred_split(folio); if (folio_batch_add(&free_folios, folio) == 0) { spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_folios(&free_folios); From 3dd6ed34ce1f2356a77fb88edafb5ec96784e3cf Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 29 Oct 2024 18:11:44 +0000 Subject: [PATCH 096/235] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Patch series "fix error handling in mmap_region() and refactor (hotfixes)", v4. mmap_region() is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. The patches in this series comprise the minimal changes required to resolve existing issues in mmap_region() error handling, in order that they can be hotfixed and backported. There is additionally a follow up series which goes further, separated out from the v1 series and sent and updated separately. This patch (of 5): After an attempted mmap() fails, we are no longer in a situation where we can safely interact with VMA hooks. This is currently not enforced, meaning that we need complicated handling to ensure we do not incorrectly call these hooks. We can avoid the whole issue by treating the VMA as suspect the moment that the file->f_ops->mmap() function reports an error by replacing whatever VMA operations were installed with a dummy empty set of VMA operations. We do so through a new helper function internal to mm - mmap_file() - which is both more logically named than the existing call_mmap() function and correctly isolates handling of the vm_op reassignment to mm. All the existing invocations of call_mmap() outside of mm are ultimately nested within the call_mmap() from mm, which we now replace. It is therefore safe to leave call_mmap() in place as a convenience function (and to avoid churn). The invokers are: ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap() coda_file_operations -> mmap -> coda_file_mmap() shm_file_operations -> shm_mmap() shm_file_operations_huge -> shm_mmap() dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops -> i915_gem_dmabuf_mmap() None of these callers interact with vm_ops or mappings in a problematic way on error, quickly exiting out. Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Jann Horn Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Helge Deller Cc: James E.J. 
Bottomley Cc: Linus Torvalds Cc: Mark Brown Cc: Peter Xu Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- mm/internal.h | 27 +++++++++++++++++++++++++++ mm/mmap.c | 6 +++--- mm/nommu.c | 4 ++-- 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 16c1f3cd599e..4eab2961e69c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(const struct folio *folio) return (void *)(mapping & ~PAGE_MAPPING_FLAGS); } +/* + * This is a file-backed mapping, and is about to be memory mapped - invoke its + * mmap hook and safely handle error conditions. On error, VMA hooks will be + * mutated. + * + * @file: File which backs the mapping. + * @vma: VMA which we are mapping. + * + * Returns: 0 on success, error otherwise. + */ +static inline int mmap_file(struct file *file, struct vm_area_struct *vma) +{ + int err = call_mmap(file, vma); + + if (likely(!err)) + return 0; + + /* + * OK, we tried to call the file hook for mmap(), but an error + * arose. The mapping is in an inconsistent state and we must not invoke + * any further hooks on it. + */ + vma->vm_ops = &vma_dummy_vm_ops; + + return err; +} + #ifdef CONFIG_MMU /* Flags for folio_pte_batch(). */ diff --git a/mm/mmap.c b/mm/mmap.c index 9841b41e3c76..6e3b25f7728f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1422,7 +1422,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* * clear PTEs while the vma is still in the tree so that rmap * cannot race with the freeing later in the truncate scenario. - * This is also needed for call_mmap(), which is why vm_ops + * This is also needed for mmap_file(), which is why vm_ops * close function is called. */ vms_clean_up_area(&vms, &mas_detach); @@ -1447,7 +1447,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (file) { vma->vm_file = get_file(file); - error = call_mmap(file, vma); + error = mmap_file(file, vma); if (error) goto unmap_and_free_vma; @@ -1470,7 +1470,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma_iter_config(&vmi, addr, end); /* - * If vm_flags changed after call_mmap(), we should try merge + * If vm_flags changed after mmap_file(), we should try merge * vma again as we may succeed this time. */ if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) { diff --git a/mm/nommu.c b/mm/nommu.c index 385b0c15add8..f9ccc02458ec 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm_area_struct *vma) { int ret; - ret = call_mmap(vma->vm_file, vma); + ret = mmap_file(vma->vm_file, vma); if (ret == 0) { vma->vm_region->vm_top = vma->vm_region->vm_end; return 0; @@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_area_struct *vma, * happy. */ if (capabilities & NOMMU_MAP_DIRECT) { - ret = call_mmap(vma->vm_file, vma); + ret = mmap_file(vma->vm_file, vma); /* shouldn't return success if we're not sharing */ if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags))) ret = -ENOSYS; From 4080ef1579b2413435413988d14ac8c68e4d42c8 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 29 Oct 2024 18:11:45 +0000 Subject: [PATCH 097/235] mm: unconditionally close VMAs on error Incorrect invocation of VMA callbacks when the VMA is no longer in a consistent state is bug prone and risky to perform. With regards to the important vm_ops->close() callback we have gone to great lengths to try to track whether or not we ought to close VMAs.
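The tracking in question looks like this on mmap_region()'s error path (quoted from a hunk removed below):

	close_and_free_vma:
		if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
			vma->vm_ops->close(vma);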
Rather than doing so and risking making a mistake somewhere, instead unconditionally close and reset vma->vm_ops to an empty dummy operations set with a NULL .close operator. We introduce a new function to do so - vma_close() - and simplify existing vms logic which tracked whether we needed to close or not. This simplifies the logic, avoids incorrect double-calling of the .close() callback and allows us to update error paths to simply call vma_close() unconditionally - making VMA closure idempotent. Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Reviewed-by: Jann Horn Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Helge Deller Cc: James E.J. Bottomley Cc: Linus Torvalds Cc: Mark Brown Cc: Peter Xu Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- mm/internal.h | 18 ++++++++++++++++++ mm/mmap.c | 5 ++--- mm/nommu.c | 3 +-- mm/vma.c | 14 +++++--------- mm/vma.h | 4 +--- 5 files changed, 27 insertions(+), 17 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 4eab2961e69c..64c2eb0b160e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -135,6 +135,24 @@ static inline int mmap_file(struct file *file, struct vm_area_struct *vma) return err; } +/* + * If the VMA has a close hook then close it, and since closing it might leave + * it in an inconsistent state which makes the use of any hooks suspect, clear + * them down by installing dummy empty hooks. + */ +static inline void vma_close(struct vm_area_struct *vma) +{ + if (vma->vm_ops && vma->vm_ops->close) { + vma->vm_ops->close(vma); + + /* + * The mapping is in an inconsistent state, and no further hooks + * may be invoked upon it. + */ + vma->vm_ops = &vma_dummy_vm_ops; + } +} + #ifdef CONFIG_MMU /* Flags for folio_pte_batch(). */ diff --git a/mm/mmap.c b/mm/mmap.c index 6e3b25f7728f..ac0604f146f6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1573,8 +1573,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, return addr; close_and_free_vma: - if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); + vma_close(vma); if (file || vma->vm_file) { unmap_and_free_vma: @@ -1934,7 +1933,7 @@ void exit_mmap(struct mm_struct *mm) do { if (vma->vm_flags & VM_ACCOUNT) nr_accounted += vma_pages(vma); - remove_vma(vma, /* unreachable = */ true, /* closed = */ false); + remove_vma(vma, /* unreachable = */ true); count++; cond_resched(); vma = vma_next(&vmi); diff --git a/mm/nommu.c b/mm/nommu.c index f9ccc02458ec..635d028d647b 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma) */ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) { - if (vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); + vma_close(vma); if (vma->vm_file) fput(vma->vm_file); put_nommu_region(vma->vm_region); diff --git a/mm/vma.c b/mm/vma.c index b21ffec33f8e..7621384d64cf 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg, /* * Close a vm structure and free it. 
*/ -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed) +void remove_vma(struct vm_area_struct *vma, bool unreachable) { might_sleep(); - if (!closed && vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); + vma_close(vma); if (vma->vm_file) fput(vma->vm_file); mpol_put(vma_policy(vma)); @@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms, vms_clear_ptes(vms, mas_detach, true); mas_set(mas_detach, 0); mas_for_each(mas_detach, vma, ULONG_MAX) - if (vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); - vms->closed_vm_ops = true; + vma_close(vma); } /* @@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, /* Remove and clean up vmas */ mas_set(mas_detach, 0); mas_for_each(mas_detach, vma, ULONG_MAX) - remove_vma(vma, /* = */ false, vms->closed_vm_ops); + remove_vma(vma, /* unreachable = */ false); vm_unacct_memory(vms->nr_accounted); validate_mm(mm); @@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma; out_vma_link: - if (new_vma->vm_ops && new_vma->vm_ops->close) - new_vma->vm_ops->close(new_vma); + vma_close(new_vma); if (new_vma->vm_file) fput(new_vma->vm_file); diff --git a/mm/vma.h b/mm/vma.h index 55457cb68200..75558b5e9c8c 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -42,7 +42,6 @@ struct vma_munmap_struct { int vma_count; /* Number of vmas that will be removed */ bool unlock; /* Unlock after the munmap */ bool clear_ptes; /* If there are outstanding PTE to be cleared */ - bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */ /* 1 byte hole */ unsigned long nr_pages; /* Number of pages being removed */ unsigned long locked_vm; /* Number of locked pages */ @@ -198,7 +197,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, vms->unmap_start = FIRST_USER_ADDRESS; vms->unmap_end = USER_PGTABLES_CEILING; vms->clear_ptes = false; - vms->closed_vm_ops = false; } #endif @@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool unlock); -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed); +void remove_vma(struct vm_area_struct *vma, bool unreachable); void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, struct vm_area_struct *prev, struct vm_area_struct *next); From 0fb4a7ad270b3b209e510eb9dc5b07bf02b7edaf Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 29 Oct 2024 18:11:46 +0000 Subject: [PATCH 098/235] mm: refactor map_deny_write_exec() Refactor the map_deny_write_exec() to not unnecessarily require a VMA parameter but rather to accept VMA flags parameters, which allows us to use this function early in mmap_region() in a subsequent commit. While we're here, we refactor the function to be more readable and add some additional documentation. Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Jann Horn Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Helge Deller Cc: James E.J. 
Bottomley Cc: Linus Torvalds Cc: Mark Brown Cc: Peter Xu Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- include/linux/mman.h | 21 ++++++++++++++++++--- mm/mmap.c | 2 +- mm/mprotect.c | 2 +- mm/vma.h | 2 +- 4 files changed, 21 insertions(+), 6 deletions(-) diff --git a/include/linux/mman.h b/include/linux/mman.h index bcb201ab7a41..8ddca62d6460 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -188,16 +188,31 @@ static inline bool arch_memory_deny_write_exec_supported(void) * * d) mmap(PROT_READ | PROT_EXEC) * mmap(PROT_READ | PROT_EXEC | PROT_BTI) + * + * This is only applicable if the user has set the Memory-Deny-Write-Execute + * (MDWE) protection mask for the current process. + * + * @old specifies the VMA flags the VMA originally possessed, and @new the ones + * we propose to set. + * + * Return: false if proposed change is OK, true if not ok and should be denied. */ -static inline bool map_deny_write_exec(struct vm_area_struct *vma, unsigned long vm_flags) +static inline bool map_deny_write_exec(unsigned long old, unsigned long new) { + /* If MDWE is disabled, we have nothing to deny. */ if (!test_bit(MMF_HAS_MDWE, ¤t->mm->flags)) return false; - if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE)) + /* If the new VMA is not executable, we have nothing to deny. */ + if (!(new & VM_EXEC)) + return false; + + /* Under MDWE we do not accept newly writably executable VMAs... */ + if (new & VM_WRITE) return true; - if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC)) + /* ...nor previously non-executable VMAs becoming executable. */ + if (!(old & VM_EXEC)) return true; return false; diff --git a/mm/mmap.c b/mm/mmap.c index ac0604f146f6..ab71d4c3464c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1505,7 +1505,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma_set_anonymous(vma); } - if (map_deny_write_exec(vma, vma->vm_flags)) { + if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) { error = -EACCES; goto close_and_free_vma; } diff --git a/mm/mprotect.c b/mm/mprotect.c index 0c5d6d06107d..6f450af3252e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, break; } - if (map_deny_write_exec(vma, newflags)) { + if (map_deny_write_exec(vma->vm_flags, newflags)) { error = -EACCES; break; } diff --git a/mm/vma.h b/mm/vma.h index 75558b5e9c8c..d58068c0ff2e 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -42,7 +42,7 @@ struct vma_munmap_struct { int vma_count; /* Number of vmas that will be removed */ bool unlock; /* Unlock after the munmap */ bool clear_ptes; /* If there are outstanding PTE to be cleared */ - /* 1 byte hole */ + /* 2 byte hole */ unsigned long nr_pages; /* Number of pages being removed */ unsigned long locked_vm; /* Number of locked pages */ unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */ From 5baf8b037debf4ec60108ccfeccb8636d1dbad81 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 29 Oct 2024 18:11:47 +0000 Subject: [PATCH 099/235] mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling Currently MTE is permitted in two circumstances (desiring to use MTE having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is specified, as checked by arch_calc_vm_flag_bits() and actualised by setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap hook is activated in mmap_region(). 
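Concretely, the shmem case has so far been handled by shmem_mmap() itself, via the hunk this patch removes from mm/shmem.c:

	/* arm64 - allow memory tagging on RAM-based files */
	vm_flags_set(vma, VM_MTE_ALLOWED);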
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also set is the arm64 implementation of arch_validate_flags(). Unfortunately, we intend to refactor mmap_region() to perform this check earlier, meaning that in the case of a shmem backing we will not have invoked shmem_mmap() yet, causing the mapping to fail spuriously. It is inappropriate to set this architecture-specific flag in general mm code anyway, so a sensible resolution of this issue is to instead move the check somewhere else. We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via the arch_calc_vm_flag_bits() call. This is an appropriate place to do this as we already check for the MAP_ANONYMOUS case here, and the shmem file case is simply a variant of the same idea - we permit RAM-backed memory. This requires a modification to the arch_calc_vm_flag_bits() signature to pass in a pointer to the struct file associated with the mapping, however this is not too egregious as this is only used by two architectures anyway - arm64 and parisc. So this patch performs this adjustment and removes the unnecessary assignment of VM_MTE_ALLOWED in shmem_mmap(). [akpm@linux-foundation.org: fix whitespace, per Catalin] Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes Suggested-by: Catalin Marinas Reported-by: Jann Horn Reviewed-by: Catalin Marinas Reviewed-by: Vlastimil Babka Cc: Andreas Larsson Cc: David S. Miller Cc: Helge Deller Cc: James E.J. Bottomley Cc: Liam R. Howlett Cc: Linus Torvalds Cc: Mark Brown Cc: Peter Xu Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- arch/arm64/include/asm/mman.h | 10 +++++++--- arch/parisc/include/asm/mman.h | 5 +++-- include/linux/mman.h | 7 ++++--- mm/mmap.c | 2 +- mm/nommu.c | 2 +- mm/shmem.c | 3 --- 6 files changed, 16 insertions(+), 13 deletions(-) diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h index 9e39217b4afb..798d965760d4 100644 --- a/arch/arm64/include/asm/mman.h +++ b/arch/arm64/include/asm/mman.h @@ -6,6 +6,8 @@ #ifndef BUILD_VDSO #include +#include +#include #include static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, @@ -31,19 +33,21 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, } #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey) -static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) +static inline unsigned long arch_calc_vm_flag_bits(struct file *file, + unsigned long flags) { /* * Only allow MTE on anonymous mappings as these are guaranteed to be * backed by tags-capable memory. The vm_flags may be overridden by a * filesystem supporting MTE (RAM-based). 
*/ - if (system_supports_mte() && (flags & MAP_ANONYMOUS)) + if (system_supports_mte() && + ((flags & MAP_ANONYMOUS) || shmem_file(file))) return VM_MTE_ALLOWED; return 0; } -#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags) +#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags) static inline bool arch_validate_prot(unsigned long prot, unsigned long addr __always_unused) diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h index 89b6beeda0b8..663f587dc789 100644 --- a/arch/parisc/include/asm/mman.h +++ b/arch/parisc/include/asm/mman.h @@ -2,6 +2,7 @@ #ifndef __ASM_MMAN_H__ #define __ASM_MMAN_H__ +#include #include /* PARISC cannot allow mdwe as it needs writable stacks */ @@ -11,7 +12,7 @@ static inline bool arch_memory_deny_write_exec_supported(void) } #define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported -static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) +static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags) { /* * The stack on parisc grows upwards, so if userspace requests memory @@ -23,6 +24,6 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) return 0; } -#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags) +#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags) #endif /* __ASM_MMAN_H__ */ diff --git a/include/linux/mman.h b/include/linux/mman.h index 8ddca62d6460..a842783ffa62 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -2,6 +2,7 @@ #ifndef _LINUX_MMAN_H #define _LINUX_MMAN_H +#include #include #include @@ -94,7 +95,7 @@ static inline void vm_unacct_memory(long pages) #endif #ifndef arch_calc_vm_flag_bits -#define arch_calc_vm_flag_bits(flags) 0 +#define arch_calc_vm_flag_bits(file, flags) 0 #endif #ifndef arch_validate_prot @@ -151,13 +152,13 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey) * Combine the mmap "flags" argument into "vm_flags" used internally. */ static inline unsigned long -calc_vm_flag_bits(unsigned long flags) +calc_vm_flag_bits(struct file *file, unsigned long flags) { return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | - arch_calc_vm_flag_bits(flags); + arch_calc_vm_flag_bits(file, flags); } unsigned long vm_commit_limit(void); diff --git a/mm/mmap.c b/mm/mmap.c index ab71d4c3464c..aee5fa08ae5d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -344,7 +344,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, * to. we assume access permissions have been handled by the open * of the memory object, so we don't do any here. */ - vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | + vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; /* Obtain the address to map to. 
we verify (or select) it and ensure diff --git a/mm/nommu.c b/mm/nommu.c index 635d028d647b..e9b5f527ab5b 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -842,7 +842,7 @@ static unsigned long determine_vm_flags(struct file *file, { unsigned long vm_flags; - vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags); + vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(file, flags); if (!file) { /* diff --git a/mm/shmem.c b/mm/shmem.c index 4ba1d00fabda..e87f5d6799a7 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; - /* arm64 - allow memory tagging on RAM-based files */ - vm_flags_set(vma, VM_MTE_ALLOWED); - file_accessed(file); /* This is anonymous shared memory if it is unlinked at the time of mmap */ if (inode->i_nlink) From 5de195060b2e251a835f622759550e6202167641 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 29 Oct 2024 18:11:48 +0000 Subject: [PATCH 100/235] mm: resolve faulty mmap_region() error path behaviour The mmap_region() function is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. Taking advantage of previous patches in this series we move a number of checks earlier in the code, simplifying things by moving the core of the logic into a static internal function __mmap_region(). Doing this allows us to perform a number of checks up front before we do any real work, and allows us to unwind the writable unmap check unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE validation unconditionally also. We move a number of things here: 1. We preallocate memory for the iterator before we call the file-backed memory hook, allowing us to exit early and avoid having to perform complicated and error-prone close/free logic. We carefully free iterator state on both success and error paths. 2. The enclosing mmap_region() function handles the mapping_map_writable() logic early. Previously the logic had the mapping_map_writable() at the point of mapping a newly allocated file-backed VMA, and a matching mapping_unmap_writable() on success and error paths. We now do this unconditionally if this is a file-backed, shared writable mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however doing so does not invalidate the seal check we just performed, and we in any case always decrement the counter in the wrapper. We perform a debug assert to ensure a driver does not attempt to do the opposite. 3. We also move arch_validate_flags() up into the mmap_region() function. This is only relevant on arm64 and sparc64, and the check is only meaningful for SPARC with ADI enabled. We explicitly add a warning for this arch if a driver invalidates this check, though the code ought eventually to be fixed to eliminate the need for this. With all of these measures in place, we no longer need to explicitly close the VMA on error paths, as we place all checks which might fail prior to a call to any driver mmap hook. This eliminates an entire class of errors, makes the code easier to reason about and more robust. 
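In outline, the resulting wrapper looks as follows (a condensed sketch of the mmap_region() code added below, with details such as validate_mm() elided):

	unsigned long mmap_region(struct file *file, unsigned long addr,
			unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
			struct list_head *uf)
	{
		unsigned long ret;
		bool writable_file_mapping = false;

		/* Every check that can fail now runs before any driver hook. */
		if (map_deny_write_exec(vm_flags, vm_flags))
			return -EACCES;
		if (!arch_validate_flags(vm_flags))
			return -EINVAL;

		/* Map writable and ensure this isn't a sealed memfd. */
		if (file && is_shared_maywrite(vm_flags)) {
			int error = mapping_map_writable(file->f_mapping);

			if (error)
				return error;
			writable_file_mapping = true;
		}

		ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);

		/* Balanced unconditionally, even if a driver cleared VM_MAYWRITE. */
		if (writable_file_mapping)
			mapping_unmap_writable(file->f_mapping);
		return ret;
	}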
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Tested-by: Mark Brown Cc: Andreas Larsson Cc: Catalin Marinas Cc: David S. Miller Cc: Helge Deller Cc: James E.J. Bottomley Cc: Linus Torvalds Cc: Peter Xu Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- mm/mmap.c | 119 +++++++++++++++++++++++++++++------------------------- 1 file changed, 65 insertions(+), 54 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index aee5fa08ae5d..79d541f1502b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1358,20 +1358,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, return do_vmi_munmap(&vmi, mm, start, len, uf, false); } -unsigned long mmap_region(struct file *file, unsigned long addr, +static unsigned long __mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; pgoff_t pglen = PHYS_PFN(len); - struct vm_area_struct *merge; unsigned long charged = 0; struct vma_munmap_struct vms; struct ma_state mas_detach; struct maple_tree mt_detach; unsigned long end = addr + len; - bool writable_file_mapping = false; int error; VMA_ITERATOR(vmi, mm, addr); VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff); @@ -1445,28 +1443,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vm_flags_init(vma, vm_flags); vma->vm_page_prot = vm_get_page_prot(vm_flags); + if (vma_iter_prealloc(&vmi, vma)) { + error = -ENOMEM; + goto free_vma; + } + if (file) { vma->vm_file = get_file(file); error = mmap_file(file, vma); if (error) - goto unmap_and_free_vma; - - if (vma_is_shared_maywrite(vma)) { - error = mapping_map_writable(file->f_mapping); - if (error) - goto close_and_free_vma; - - writable_file_mapping = true; - } + goto unmap_and_free_file_vma; + /* Drivers cannot alter the address of the VMA. */ + WARN_ON_ONCE(addr != vma->vm_start); /* - * Expansion is handled above, merging is handled below. - * Drivers should not alter the address of the VMA. + * Drivers should not permit writability when previously it was + * disallowed. */ - if (WARN_ON((addr != vma->vm_start))) { - error = -EINVAL; - goto close_and_free_vma; - } + VM_WARN_ON_ONCE(vm_flags != vma->vm_flags && + !(vm_flags & VM_MAYWRITE) && + (vma->vm_flags & VM_MAYWRITE)); vma_iter_config(&vmi, addr, end); /* @@ -1474,6 +1470,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * vma again as we may succeed this time. */ if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) { + struct vm_area_struct *merge; + vmg.flags = vma->vm_flags; /* If this fails, state is reset ready for a reattempt. */ merge = vma_merge_new_range(&vmg); @@ -1491,7 +1489,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma = merge; /* Update vm_flags to pick up the change. 
*/ vm_flags = vma->vm_flags; - goto unmap_writable; + goto file_expanded; } vma_iter_config(&vmi, addr, end); } @@ -1500,26 +1498,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr, } else if (vm_flags & VM_SHARED) { error = shmem_zero_setup(vma); if (error) - goto free_vma; + goto free_iter_vma; } else { vma_set_anonymous(vma); } - if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) { - error = -EACCES; - goto close_and_free_vma; - } - - /* Allow architectures to sanity-check the vm_flags */ - if (!arch_validate_flags(vma->vm_flags)) { - error = -EINVAL; - goto close_and_free_vma; - } - - if (vma_iter_prealloc(&vmi, vma)) { - error = -ENOMEM; - goto close_and_free_vma; - } +#ifdef CONFIG_SPARC64 + /* TODO: Fix SPARC ADI! */ + WARN_ON_ONCE(!arch_validate_flags(vm_flags)); +#endif /* Lock the VMA since it is modified after insertion into VMA tree */ vma_start_write(vma); @@ -1533,10 +1520,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, */ khugepaged_enter_vma(vma, vma->vm_flags); - /* Once vma denies write, undo our temporary denial count */ -unmap_writable: - if (writable_file_mapping) - mapping_unmap_writable(file->f_mapping); +file_expanded: file = vma->vm_file; ksm_add_vma(vma); expanded: @@ -1569,23 +1553,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma_set_page_prot(vma); - validate_mm(mm); return addr; -close_and_free_vma: - vma_close(vma); +unmap_and_free_file_vma: + fput(vma->vm_file); + vma->vm_file = NULL; - if (file || vma->vm_file) { -unmap_and_free_vma: - fput(vma->vm_file); - vma->vm_file = NULL; - - vma_iter_set(&vmi, vma->vm_end); - /* Undo any partial mapping done by a device driver. */ - unmap_region(&vmi.mas, vma, vmg.prev, vmg.next); - } - if (writable_file_mapping) - mapping_unmap_writable(file->f_mapping); + vma_iter_set(&vmi, vma->vm_end); + /* Undo any partial mapping done by a device driver. */ + unmap_region(&vmi.mas, vma, vmg.prev, vmg.next); +free_iter_vma: + vma_iter_free(&vmi); free_vma: vm_area_free(vma); unacct_error: @@ -1595,10 +1573,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr, abort_munmap: vms_abort_munmap_vmas(&vms, &mas_detach); gather_failed: - validate_mm(mm); return error; } +unsigned long mmap_region(struct file *file, unsigned long addr, + unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, + struct list_head *uf) +{ + unsigned long ret; + bool writable_file_mapping = false; + + /* Check to see if MDWE is applicable. */ + if (map_deny_write_exec(vm_flags, vm_flags)) + return -EACCES; + + /* Allow architectures to sanity-check the vm_flags. */ + if (!arch_validate_flags(vm_flags)) + return -EINVAL; + + /* Map writable and ensure this isn't a sealed memfd. */ + if (file && is_shared_maywrite(vm_flags)) { + int error = mapping_map_writable(file->f_mapping); + + if (error) + return error; + writable_file_mapping = true; + } + + ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf); + + /* Clear our write mapping regardless of error. 
*/ + if (writable_file_mapping) + mapping_unmap_writable(file->f_mapping); + + validate_mm(current->mm); + return ret; +} + static int __vm_munmap(unsigned long start, size_t len, bool unlock) { int ret; From a373830f96db288a3eb43a8692b6bcd0bd88dfe1 Mon Sep 17 00:00:00 2001 From: Gautam Menghani Date: Mon, 28 Oct 2024 14:34:09 +0530 Subject: [PATCH 101/235] KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running it to avoid spurious interrupts Running an L2 vCPU (see [1] for terminology) with LPCR_MER bit set and no pending interrupts results in that L2 vCPU getting an infinite flood of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets the LPCR_MER bit if there are pending interrupts. The spurious flood problem can be observed in 2 cases: 1. Crashing the guest while an interrupt-heavy workload is running a. Start an L2 guest and run an interrupt-heavy workload (e.g. ipistorm) b. While the workload is running, crash the guest (make sure kdump is configured) c. Any one of the vCPUs of the guest will start getting an infinite flood of spurious interrupts. 2. Running LTP stress tests in multiple guests at the same time a. Start 4 L2 guests. b. Start running LTP stress tests on all 4 guests at the same time. c. After some time, one or more of the vCPUs of any of the guests will start getting an infinite flood of spurious interrupts. The root cause of both the above issues is the same: 1. An NMI is sent to a running vCPU that has LPCR_MER bit set. 2. In the NMI path, all registers are refreshed, i.e., H_GUEST_GET_STATE is called for all the registers. 3. When H_GUEST_GET_STATE is called for LPCR, the vcpu->arch.vcore->lpcr of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this new value is always used whenever that vCPU runs, regardless of whether there was a pending interrupt. 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external interrupt handler, and this cycle never ends. Fix the spurious flood by masking off the LPCR_MER bit before running an L2 vCPU to ensure that it is not set if there are no pending interrupts. [1] Terminology: 1. L0 : PAPR hypervisor running in HV mode 2. L1 : Linux guest (logical partition) running on top of L0 3. L2 : KVM guest running on top of L1 Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Gautam Menghani Signed-off-by: Madhavan Srinivasan --- arch/powerpc/kvm/book3s_hv.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index ba0492f9de65..ad8dc4ccdaab 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4898,6 +4898,18 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, BOOK3S_INTERRUPT_EXTERNAL, 0); else lpcr |= LPCR_MER; + } else { + /* + * L1's copy of L2's LPCR (vcpu->arch.vcore->lpcr) can get its MER bit + * unexpectedly set - e.g. during NMI handling when all register + * states are synchronized from L0 to L1. L1 needs to inform L0 about + * MER=1 only when there are pending external interrupts. + * In the above if check, MER bit is set if there are pending + * external interrupts. Hence, explicitly mask off MER bit + * here as otherwise it may generate spurious interrupts in L2 KVM + * causing an endless loop, which results in L2 guest getting hung.
+ */ + lpcr &= ~LPCR_MER; } } else if (vcpu->arch.pending_exceptions || vcpu->arch.doorbell_request || From 6ca575374dd9a507cdd16dfa0e78c2e9e20bd05f Mon Sep 17 00:00:00 2001 From: Hyunwoo Kim Date: Tue, 22 Oct 2024 09:32:56 +0200 Subject: [PATCH 102/235] vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans During loopback communication, a dangling pointer can be created in vsk->trans, potentially leading to a Use-After-Free condition. This issue is resolved by initializing vsk->trans to NULL. Cc: stable Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Hyunwoo Kim Signed-off-by: Wongi Lee Signed-off-by: Greg Kroah-Hartman Message-Id: <2024102245-strive-crib-c8d3@gregkh> Signed-off-by: Michael S. Tsirkin --- net/vmw_vsock/virtio_transport_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index ccbd2bc0d210..fc5666c8298f 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1109,6 +1109,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk) struct virtio_vsock_sock *vvs = vsk->trans; kfree(vvs); + vsk->trans = NULL; } EXPORT_SYMBOL_GPL(virtio_transport_destruct); From 0b364cf53b20204e92bac7c6ebd1ee7d3ec62931 Mon Sep 17 00:00:00 2001 From: Philipp Stanner Date: Mon, 28 Oct 2024 08:43:59 +0100 Subject: [PATCH 103/235] vdpa: solidrun: Fix UB bug with devres In psnet_open_pf_bar() and snet_open_vf_bar() a string later passed to pcim_iomap_regions() is placed on the stack. Neither pcim_iomap_regions() nor the functions it calls copy that string. Should the string later ever be used, this, consequently, causes undefined behavior since the stack frame will by then have disappeared. Fix the bug by allocating the strings on the heap through devm_kasprintf(). Cc: stable@vger.kernel.org # v6.3 Fixes: 51a8f9d7f587 ("virtio: vdpa: new SolidNET DPU driver.") Reported-by: Christophe JAILLET Closes: https://lore.kernel.org/all/74e9109a-ac59-49e2-9b1d-d825c9c9f891@wanadoo.fr/ Suggested-by: Andy Shevchenko Signed-off-by: Philipp Stanner Reviewed-by: Stefano Garzarella Message-Id: <20241028074357.9104-3-pstanner@redhat.com> Signed-off-by: Michael S. Tsirkin --- drivers/vdpa/solidrun/snet_main.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/vdpa/solidrun/snet_main.c b/drivers/vdpa/solidrun/snet_main.c index 99428a04068d..c8b74980dbd1 100644 --- a/drivers/vdpa/solidrun/snet_main.c +++ b/drivers/vdpa/solidrun/snet_main.c @@ -555,7 +555,7 @@ static const struct vdpa_config_ops snet_config_ops = { static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet) { - char name[50]; + char *name; int ret, i, mask = 0; /* We don't know which BAR will be used to communicate.. * We will map every bar with len > 0. 
@@ -573,7 +573,10 @@ static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet) return -ENODEV; } - snprintf(name, sizeof(name), "psnet[%s]-bars", pci_name(pdev)); + name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "psnet[%s]-bars", pci_name(pdev)); + if (!name) + return -ENOMEM; + ret = pcim_iomap_regions(pdev, mask, name); if (ret) { SNET_ERR(pdev, "Failed to request and map PCI BARs\n"); @@ -590,10 +593,13 @@ static int psnet_open_pf_bar(struct pci_dev *pdev, struct psnet *psnet) static int snet_open_vf_bar(struct pci_dev *pdev, struct snet *snet) { - char name[50]; + char *name; int ret; - snprintf(name, sizeof(name), "snet[%s]-bar", pci_name(pdev)); + name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "snet[%s]-bars", pci_name(pdev)); + if (!name) + return -ENOMEM; + /* Request and map BAR */ ret = pcim_iomap_regions(pdev, BIT(snet->psnet->cfg.vf_bar), name); if (ret) { From 03a942f793ca33653f3fa4bdb377f5d2376e74f6 Mon Sep 17 00:00:00 2001 From: Shivam Chaudhary Date: Tue, 8 Oct 2024 20:22:04 +0530 Subject: [PATCH 104/235] Fix typo in vringh_test.c Corrected minor typo in tools/virtio/vringh_test.c: - Fixed "retreives" to "retrieves" Signed-off-by: Shivam Chaudhary Message-Id: <20241008145204.478749-1-cvam0000@gmail.com> Signed-off-by: Michael S. Tsirkin --- tools/virtio/vringh_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c index 43d3a6aa1dcf..b9591223437a 100644 --- a/tools/virtio/vringh_test.c +++ b/tools/virtio/vringh_test.c @@ -519,7 +519,7 @@ int main(int argc, char *argv[]) errx(1, "virtqueue_add_sgs: %i", err); __kmalloc_fake = NULL; - /* Host retreives it. */ + /* Host retrieves it. */ vringh_iov_init(&riov, host_riov, ARRAY_SIZE(host_riov)); vringh_iov_init(&wiov, host_wiov, ARRAY_SIZE(host_wiov)); From 7f8825b2a78ac392d3fbb3a2e65e56d9e39d75e9 Mon Sep 17 00:00:00 2001 From: Yuan Can Date: Thu, 17 Oct 2024 09:38:12 +0800 Subject: [PATCH 105/235] vDPA/ifcvf: Fix pci_read_config_byte() return code handling ifcvf_init_hw() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. Convert the error check to plain non-zero check. Fixes: 5a2414bc454e ("virtio: Intel IFC VF driver for VDPA") Signed-off-by: Yuan Can Message-Id: <20241017013812.129952-1-yuancan@huawei.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang Acked-by: Zhu Lingshan --- drivers/vdpa/ifcvf/ifcvf_base.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_base.c b/drivers/vdpa/ifcvf/ifcvf_base.c index 472daa588a9d..d5507b63b6cd 100644 --- a/drivers/vdpa/ifcvf/ifcvf_base.c +++ b/drivers/vdpa/ifcvf/ifcvf_base.c @@ -108,7 +108,7 @@ int ifcvf_init_hw(struct ifcvf_hw *hw, struct pci_dev *pdev) u32 i; ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos); - if (ret < 0) { + if (ret) { IFCVF_ERR(pdev, "Failed to read PCI capability list\n"); return -EIO; } From 97ee04feb682c906a1fa973ebe586fe91567d165 Mon Sep 17 00:00:00 2001 From: Feng Liu Date: Thu, 24 Oct 2024 09:54:06 -0400 Subject: [PATCH 106/235] virtio_pci: Fix admin vq cleanup by using correct info pointer vp_modern_avq_cleanup() and vp_del_vqs() clean up admin vq resources by virtio_pci_vq_info pointer. The info pointer of admin vq is stored in vp_dev->admin_vq.info instead of vp_dev->vqs[]. Using the info pointer from vp_dev->vqs[] for admin vq causes a kernel NULL pointer dereference bug. 
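That is, every cleanup path must select the info pointer like this (a sketch of the lookup introduced below):

	struct virtio_pci_vq_info *info;

	info = vp_is_avq(vdev, vq->index) ? vp_dev->admin_vq.info
					  : vp_dev->vqs[vq->index];
	vp_del_vq(vq, info);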
In vp_modern_avq_cleanup() and vp_del_vqs(), get the info pointer from vp_dev->admin_vq.info for admin vq to clean up the resources. Also make info ptr as argument of vp_del_vq() to be symmetric with vp_setup_vq(). vp_reset calls vp_modern_avq_cleanup, and causes the Call Trace: ================================================================== BUG: kernel NULL pointer dereference, address:0000000000000000 ... CPU: 49 UID: 0 PID: 4439 Comm: modprobe Not tainted 6.11.0-rc5 #1 RIP: 0010:vp_reset+0x57/0x90 [virtio_pci] Call Trace: ... ? vp_reset+0x57/0x90 [virtio_pci] ? vp_reset+0x38/0x90 [virtio_pci] virtio_reset_device+0x1d/0x30 remove_vq_common+0x1c/0x1a0 [virtio_net] virtnet_remove+0xa1/0xc0 [virtio_net] virtio_dev_remove+0x46/0xa0 ... virtio_pci_driver_exit+0x14/0x810 [virtio_pci] ================================================================== Fixes: 4c3b54af907e ("virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result") Signed-off-by: Feng Liu Signed-off-by: Jiri Pirko Reviewed-by: Parav Pandit Message-Id: <20241024135406.81388-1-feliu@nvidia.com> Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_pci_common.c | 24 ++++++++++++++++++------ drivers/virtio/virtio_pci_common.h | 1 + drivers/virtio/virtio_pci_modern.c | 12 +----------- 3 files changed, 20 insertions(+), 17 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index c44d8ba00c02..88074451dd61 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -24,6 +24,16 @@ MODULE_PARM_DESC(force_legacy, "Force legacy mode for transitional virtio 1 devices"); #endif +bool vp_is_avq(struct virtio_device *vdev, unsigned int index) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + + if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ)) + return false; + + return index == vp_dev->admin_vq.vq_index; +} + /* wait for pending irq handlers */ void vp_synchronize_vectors(struct virtio_device *vdev) { @@ -234,10 +244,9 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned int in return vq; } -static void vp_del_vq(struct virtqueue *vq) +static void vp_del_vq(struct virtqueue *vq, struct virtio_pci_vq_info *info) { struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); - struct virtio_pci_vq_info *info = vp_dev->vqs[vq->index]; unsigned long flags; /* @@ -258,13 +267,16 @@ static void vp_del_vq(struct virtqueue *vq) void vp_del_vqs(struct virtio_device *vdev) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); + struct virtio_pci_vq_info *info; struct virtqueue *vq, *n; int i; list_for_each_entry_safe(vq, n, &vdev->vqs, list) { - if (vp_dev->per_vq_vectors) { - int v = vp_dev->vqs[vq->index]->msix_vector; + info = vp_is_avq(vdev, vq->index) ? 
vp_dev->admin_vq.info : + vp_dev->vqs[vq->index]; + if (vp_dev->per_vq_vectors) { + int v = info->msix_vector; if (v != VIRTIO_MSI_NO_VECTOR && !vp_is_slow_path_vector(v)) { int irq = pci_irq_vector(vp_dev->pci_dev, v); @@ -273,7 +285,7 @@ void vp_del_vqs(struct virtio_device *vdev) free_irq(irq, vq); } } - vp_del_vq(vq); + vp_del_vq(vq, info); } vp_dev->per_vq_vectors = false; @@ -354,7 +366,7 @@ vp_find_one_vq_msix(struct virtio_device *vdev, int queue_idx, vring_interrupt, 0, vp_dev->msix_names[msix_vec], vq); if (err) { - vp_del_vq(vq); + vp_del_vq(vq, *p_info); return ERR_PTR(err); } diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index 1d9c49947f52..8beecf23ec85 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -178,6 +178,7 @@ struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev); #define VIRTIO_ADMIN_CMD_BITMAP 0 #endif +bool vp_is_avq(struct virtio_device *vdev, unsigned int index); void vp_modern_avq_done(struct virtqueue *vq); int vp_modern_admin_cmd_exec(struct virtio_device *vdev, struct virtio_admin_cmd *cmd); diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 9193c30d640a..4fbcbc7a9ae1 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -43,16 +43,6 @@ static int vp_avq_index(struct virtio_device *vdev, u16 *index, u16 *num) return 0; } -static bool vp_is_avq(struct virtio_device *vdev, unsigned int index) -{ - struct virtio_pci_device *vp_dev = to_vp_device(vdev); - - if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ)) - return false; - - return index == vp_dev->admin_vq.vq_index; -} - void vp_modern_avq_done(struct virtqueue *vq) { struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); @@ -245,7 +235,7 @@ static void vp_modern_avq_cleanup(struct virtio_device *vdev) if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ)) return; - vq = vp_dev->vqs[vp_dev->admin_vq.vq_index]->vq; + vq = vp_dev->admin_vq.info->vq; if (!vq) return; From 4e39ecadf1d2a08187139619f1f314b64ba7d947 Mon Sep 17 00:00:00 2001 From: Xiaoguang Wang Date: Tue, 5 Nov 2024 21:35:18 +0800 Subject: [PATCH 107/235] vp_vdpa: fix id_table array not null terminated error Allocate one extra virtio_device_id as null terminator, otherwise vdpa_mgmtdev_get_classes() may iterate multiple times and visit undefined memory. Fixes: ffbda8e9df10 ("vdpa/vp_vdpa : add vdpa tool support in vp_vdpa") Cc: stable@vger.kernel.org Suggested-by: Parav Pandit Signed-off-by: Angus Chen Signed-off-by: Xiaoguang Wang Message-Id: <20241105133518.1494-1-lege.wang@jaguarmicro.com> Signed-off-by: Michael S. Tsirkin Reviewed-by: Parav Pandit Acked-by: Jason Wang --- drivers/vdpa/virtio_pci/vp_vdpa.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c index ac4ab22f7d8b..16380764275e 100644 --- a/drivers/vdpa/virtio_pci/vp_vdpa.c +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c @@ -612,7 +612,11 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id) goto mdev_err; } - mdev_id = kzalloc(sizeof(struct virtio_device_id), GFP_KERNEL); + /* + * id_table should be a null terminated array, so allocate one additional + * entry here, see vdpa_mgmtdev_get_classes(). 
+ */ + mdev_id = kcalloc(2, sizeof(struct virtio_device_id), GFP_KERNEL); if (!mdev_id) { err = -ENOMEM; goto mdev_id_err; @@ -632,8 +636,8 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id) goto probe_err; } - mdev_id->device = mdev->id.device; - mdev_id->vendor = mdev->id.vendor; + mdev_id[0].device = mdev->id.device; + mdev_id[0].vendor = mdev->id.vendor; mgtdev->id_table = mdev_id; mgtdev->max_supported_vqs = vp_modern_get_num_queues(mdev); mgtdev->supported_features = vp_modern_get_features(mdev); From 751ecf6afd6568adc98f2a6052315552c0483d18 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Wed, 30 Oct 2024 20:23:50 +0000 Subject: [PATCH 108/235] arm64/sve: Discard stale CPU state when handling SVE traps The logic for handling SVE traps manipulates saved FPSIMD/SVE state incorrectly, and a race with preemption can result in a task having TIF_SVE set and TIF_FOREIGN_FPSTATE clear even though the live CPU state is stale (e.g. with SVE traps enabled). This has been observed to result in warnings from do_sve_acc() where SVE traps are not expected while TIF_SVE is set: | if (test_and_set_thread_flag(TIF_SVE)) | WARN_ON(1); /* SVE access shouldn't have trapped */ Warnings of this form have been reported intermittently, e.g. https://lore.kernel.org/linux-arm-kernel/CA+G9fYtEGe_DhY2Ms7+L7NKsLYUomGsgqpdBj+QwDLeSg=JhGg@mail.gmail.com/ https://lore.kernel.org/linux-arm-kernel/000000000000511e9a060ce5a45c@google.com/ The race can occur when the SVE trap handler is preempted before and after manipulating the saved FPSIMD/SVE state, starting and ending on the same CPU, e.g. | void do_sve_acc(unsigned long esr, struct pt_regs *regs) | { | // Trap on CPU 0 with TIF_SVE clear, SVE traps enabled | // task->fpsimd_cpu is 0. | // per_cpu_ptr(&fpsimd_last_state, 0) is task. | | ... | | // Preempted; migrated from CPU 0 to CPU 1. | // TIF_FOREIGN_FPSTATE is set. | | get_cpu_fpsimd_context(); | | if (test_and_set_thread_flag(TIF_SVE)) | WARN_ON(1); /* SVE access shouldn't have trapped */ | | sve_init_regs() { | if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) { | ... | } else { | fpsimd_to_sve(current); | current->thread.fp_type = FP_STATE_SVE; | } | } | | put_cpu_fpsimd_context(); | | // Preempted; migrated from CPU 1 to CPU 0. | // task->fpsimd_cpu is still 0 | // If per_cpu_ptr(&fpsimd_last_state, 0) is still task then: | // - Stale HW state is reused (with SVE traps enabled) | // - TIF_FOREIGN_FPSTATE is cleared | // - A return to userspace skips HW state restore | } Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set by calling fpsimd_flush_task_state() to detach from the saved CPU state. This ensures that a subsequent context switch will not reuse the stale CPU state, and will instead set TIF_FOREIGN_FPSTATE, forcing the new state to be reloaded from memory prior to a return to userspace. 
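In code, the fixed else-branch of sve_init_regs() becomes (a sketch assembled from the one-line diff below):

| } else {
| 	fpsimd_to_sve(current);
| 	current->thread.fp_type = FP_STATE_SVE;
| 	/* Detach from the stale per-CPU state so the next return
| 	 * to userspace reloads from memory. */
| 	fpsimd_flush_task_state(current);
| }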
Fixes: cccb78ce89c4 ("arm64/sve: Rework SVE access trap to convert state in registers") Reported-by: Mark Rutland Signed-off-by: Mark Brown Cc: stable@vger.kernel.org Reviewed-by: Mark Rutland Link: https://lore.kernel.org/r/20241030-arm64-fpsimd-foreign-flush-v1-1-bd7bd66905a2@kernel.org Signed-off-by: Will Deacon --- arch/arm64/kernel/fpsimd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 77006df20a75..6d21971ae559 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1367,6 +1367,7 @@ static void sve_init_regs(void) } else { fpsimd_to_sve(current); current->thread.fp_type = FP_STATE_SVE; + fpsimd_flush_task_state(current); } } From 25eb47eed52979c2f5eee3f37e6c67714e02c49c Mon Sep 17 00:00:00 2001 From: Jack Wu Date: Wed, 6 Nov 2024 18:50:29 +0800 Subject: [PATCH 109/235] USB: serial: qcserial: add support for Sierra Wireless EM86xx Add support for Sierra Wireless EM86xx with USB-id 0x1199:0x90e5 and 0x1199:0x90e4. 0x1199:0x90e5 T: Bus=03 Lev=01 Prnt=01 Port=05 Cnt=01 Dev#= 14 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1199 ProdID=90e5 Rev= 5.15 S: Manufacturer=Sierra Wireless, Incorporated S: Product=Semtech EM8695 Mobile Broadband Adapter S: SerialNumber=004403161882339 C:* #Ifs= 6 Cfg#= 1 Atr=a0 MxPwr=500mA A: FirstIf#=12 IfCount= 2 Cls=02(comm.) Sub=0e Prot=00 I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=qcserial E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=usbfs E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=qcserial E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 4 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none) E: Ad=85(I) Atr=03(Int.) MxPS= 64 Ivl=32ms I:* If#=12 Alt= 0 #EPs= 1 Cls=02(comm.) Sub=0e Prot=00 Driver=cdc_mbim E: Ad=87(I) Atr=03(Int.) MxPS= 64 Ivl=32ms I: If#=13 Alt= 0 #EPs= 0 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim I:* If#=13 Alt= 1 #EPs= 2 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim E: Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms 0x1199:0x90e4 T: Bus=03 Lev=01 Prnt=01 Port=05 Cnt=01 Dev#= 16 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=1199 ProdID=90e4 Rev= 0.00 S: Manufacturer=Sierra Wireless, Incorporated S: SerialNumber=004403161882339 C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 2mA I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) 
Sub=ff Prot=10 Driver=qcserial E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms Signed-off-by: Jack Wu Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold --- drivers/usb/serial/qcserial.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/usb/serial/qcserial.c b/drivers/usb/serial/qcserial.c index c7de9585feb2..13c664317a05 100644 --- a/drivers/usb/serial/qcserial.c +++ b/drivers/usb/serial/qcserial.c @@ -166,6 +166,8 @@ static const struct usb_device_id id_table[] = { {DEVICE_SWI(0x1199, 0x9090)}, /* Sierra Wireless EM7565 QDL */ {DEVICE_SWI(0x1199, 0x9091)}, /* Sierra Wireless EM7565 */ {DEVICE_SWI(0x1199, 0x90d2)}, /* Sierra Wireless EM9191 QDL */ + {DEVICE_SWI(0x1199, 0x90e4)}, /* Sierra Wireless EM86xx QDL*/ + {DEVICE_SWI(0x1199, 0x90e5)}, /* Sierra Wireless EM86xx */ {DEVICE_SWI(0x1199, 0xc080)}, /* Sierra Wireless EM7590 QDL */ {DEVICE_SWI(0x1199, 0xc081)}, /* Sierra Wireless EM7590 */ {DEVICE_SWI(0x413c, 0x81a2)}, /* Dell Wireless 5806 Gobi(TM) 4G LTE Mobile Broadband Card */ From de156f3cf70e17dc6ff4c3c364bb97a6db961ffd Mon Sep 17 00:00:00 2001 From: Mingcong Bai Date: Wed, 6 Nov 2024 10:40:50 +0800 Subject: [PATCH 110/235] ASoC: amd: yc: fix internal mic on Xiaomi Book Pro 14 2022 Xiaomi Book Pro 14 2022 (MIA2210-AD) requires a quirk entry for its internal microphone to be enabled. This is likely due to similar reasons as seen previously on Redmi Book 14/15 Pro 2022 models (since they likely came with similar firmware): - commit dcff8b7ca92d ("ASoC: amd: yc: Add Xiaomi Redmi Book Pro 15 2022 into DMI table") - commit c1dd6bf61997 ("ASoC: amd: yc: Add Xiaomi Redmi Book Pro 14 2022 into DMI table") A quirk would likely be needed for Xiaomi Book Pro 15 2022 models, too. However, I do not have such device on hand so I will leave it for now. Signed-off-by: Mingcong Bai Link: https://patch.msgid.link/20241106024052.15748-1-jeffbai@aosc.io Signed-off-by: Mark Brown --- sound/soc/amd/yc/acp6x-mach.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sound/soc/amd/yc/acp6x-mach.c b/sound/soc/amd/yc/acp6x-mach.c index 438865d5e376..dc476bfb6da4 100644 --- a/sound/soc/amd/yc/acp6x-mach.c +++ b/sound/soc/amd/yc/acp6x-mach.c @@ -395,6 +395,13 @@ static const struct dmi_system_id yc_acp_quirk_table[] = { DMI_MATCH(DMI_PRODUCT_NAME, "Redmi Book Pro 15 2022"), } }, + { + .driver_data = &acp6x_card, + .matches = { + DMI_MATCH(DMI_BOARD_VENDOR, "TIMI"), + DMI_MATCH(DMI_PRODUCT_NAME, "Xiaomi Book Pro 14 2022"), + } + }, { .driver_data = &acp6x_card, .matches = { From 44d0469f79bd3d0b3433732877358df7dc6b17b1 Mon Sep 17 00:00:00 2001 From: Zijian Zhang Date: Wed, 6 Nov 2024 00:37:42 +0000 Subject: [PATCH 111/235] bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx As the introduction of the support for vsock and unix sockets in sockmap, tls_sw_has_ctx_tx/rx cannot presume the socket passed in must be IS_ICSK. vsock and af_unix sockets have vsock_sock and unix_sock instead of inet_connection_sock. For these sockets, tls_get_ctx may return an invalid pointer and cause page fault in function tls_sw_ctx_rx. BUG: unable to handle page fault for address: 0000000000040030 Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:sk_psock_strp_data_ready+0x23/0x60 Call Trace: ? __die+0x81/0xc3 ? no_context+0x194/0x350 ? do_page_fault+0x30/0x110 ? async_page_fault+0x3e/0x50 ? sk_psock_strp_data_ready+0x23/0x60 virtio_transport_recv_pkt+0x750/0x800 ? 
update_load_avg+0x7e/0x620 vsock_loopback_work+0xd0/0x100 process_one_work+0x1a7/0x360 worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 kthread+0x112/0x130 ? __kthread_cancel_work+0x40/0x40 ret_from_fork+0x1f/0x40 v2: - Add IS_ICSK check v3: - Update the commits in Fixes Fixes: 634f1a7110b4 ("vsock: support sockmap") Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: Zijian Zhang Acked-by: Stanislav Fomichev Acked-by: Jakub Kicinski Reviewed-by: Cong Wang Acked-by: Stefano Garzarella Link: https://lore.kernel.org/r/20241106003742.399240-1-zijianzhang@bytedance.com Signed-off-by: Martin KaFai Lau --- include/net/tls.h | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 3a33924db2bc..61fef2880114 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -390,8 +390,12 @@ tls_offload_ctx_tx(const struct tls_context *tls_ctx) static inline bool tls_sw_has_ctx_tx(const struct sock *sk) { - struct tls_context *ctx = tls_get_ctx(sk); + struct tls_context *ctx; + if (!sk_is_inet(sk) || !inet_test_bit(IS_ICSK, sk)) + return false; + + ctx = tls_get_ctx(sk); if (!ctx) return false; return !!tls_sw_ctx_tx(ctx); @@ -399,8 +403,12 @@ static inline bool tls_sw_has_ctx_tx(const struct sock *sk) static inline bool tls_sw_has_ctx_rx(const struct sock *sk) { - struct tls_context *ctx = tls_get_ctx(sk); + struct tls_context *ctx; + if (!sk_is_inet(sk) || !inet_test_bit(IS_ICSK, sk)) + return false; + + ctx = tls_get_ctx(sk); if (!ctx) return false; return !!tls_sw_ctx_rx(ctx); From b79276dcac9124a79c8cf7cc8fbdd3d4c3c9a7c7 Mon Sep 17 00:00:00 2001 From: Mario Limonciello Date: Mon, 4 Nov 2024 16:28:55 -0600 Subject: [PATCH 112/235] ACPI: processor: Move arch_init_invariance_cppc() call later arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko Signed-off-by: Mario Limonciello Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. 
Wysocki --- arch/arm64/include/asm/topology.h | 4 ---- arch/x86/include/asm/topology.h | 5 ----- arch/x86/kernel/acpi/cppc.c | 7 ++++++- drivers/acpi/cppc_acpi.c | 6 ------ drivers/acpi/processor_driver.c | 9 +++++++++ drivers/base/arch_topology.c | 6 +++++- include/acpi/processor.h | 2 ++ include/linux/arch_topology.h | 4 ---- 8 files changed, 22 insertions(+), 21 deletions(-) diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index 5fc3af9f8f29..341174bf9106 100644 --- a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h @@ -26,10 +26,6 @@ void update_freq_counters_refs(void); #define arch_scale_freq_invariant topology_scale_freq_invariant #define arch_scale_freq_ref topology_get_freq_ref -#ifdef CONFIG_ACPI_CPPC_LIB -#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc -#endif - /* Replace task scheduler's default cpu-invariant accounting */ #define arch_scale_cpu_capacity topology_get_cpu_scale diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index aef70336d624..92f3664dd933 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -305,9 +305,4 @@ static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled extern void arch_scale_freq_tick(void); #define arch_scale_freq_tick arch_scale_freq_tick -#ifdef CONFIG_ACPI_CPPC_LIB -void init_freq_invariance_cppc(void); -#define arch_init_invariance_cppc init_freq_invariance_cppc -#endif - #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/acpi/cppc.c b/arch/x86/kernel/acpi/cppc.c index 956984054bf3..aab9d0570841 100644 --- a/arch/x86/kernel/acpi/cppc.c +++ b/arch/x86/kernel/acpi/cppc.c @@ -110,7 +110,7 @@ static void amd_set_max_freq_ratio(void) static DEFINE_MUTEX(freq_invariance_lock); -void init_freq_invariance_cppc(void) +static inline void init_freq_invariance_cppc(void) { static bool init_done; @@ -127,6 +127,11 @@ void init_freq_invariance_cppc(void) mutex_unlock(&freq_invariance_lock); } +void acpi_processor_init_invariance_cppc(void) +{ + init_freq_invariance_cppc(); +} + /* * Get the highest performance register value. * @cpu: CPU from which to get highest performance. diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c index 1a40f0514eaa..5c0cc7aae872 100644 --- a/drivers/acpi/cppc_acpi.c +++ b/drivers/acpi/cppc_acpi.c @@ -671,10 +671,6 @@ static int pcc_data_alloc(int pcc_ss_id) * ) */ -#ifndef arch_init_invariance_cppc -static inline void arch_init_invariance_cppc(void) { } -#endif - /** * acpi_cppc_processor_probe - Search for per CPU _CPC objects. * @pr: Ptr to acpi_processor containing this CPU's logical ID. @@ -905,8 +901,6 @@ int acpi_cppc_processor_probe(struct acpi_processor *pr) goto out_free; } - arch_init_invariance_cppc(); - kfree(output.pointer); return 0; diff --git a/drivers/acpi/processor_driver.c b/drivers/acpi/processor_driver.c index cb52dd000b95..3b281bc1e73c 100644 --- a/drivers/acpi/processor_driver.c +++ b/drivers/acpi/processor_driver.c @@ -237,6 +237,9 @@ static struct notifier_block acpi_processor_notifier_block = { .notifier_call = acpi_processor_notifier, }; +void __weak acpi_processor_init_invariance_cppc(void) +{ } + /* * We keep the driver loaded even when ACPI is not running. 
* This is needed for the powernow-k8 driver, that works even without @@ -270,6 +273,12 @@ static int __init acpi_processor_driver_init(void) NULL, acpi_soft_cpu_dead); acpi_processor_throttling_init(); + + /* + * Frequency invariance calculations on AMD platforms can't be run until + * after acpi_cppc_processor_probe() has been called for all online CPUs + */ + acpi_processor_init_invariance_cppc(); return 0; err: driver_unregister(&acpi_processor_driver); diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 75fcb75d5515..3ebe77566788 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -366,7 +366,7 @@ void __weak freq_inv_set_max_ratio(int cpu, u64 max_rate) #ifdef CONFIG_ACPI_CPPC_LIB #include -void topology_init_cpu_capacity_cppc(void) +static inline void topology_init_cpu_capacity_cppc(void) { u64 capacity, capacity_scale = 0; struct cppc_perf_caps perf_caps; @@ -417,6 +417,10 @@ void topology_init_cpu_capacity_cppc(void) exit: free_raw_capacity(); } +void acpi_processor_init_invariance_cppc(void) +{ + topology_init_cpu_capacity_cppc(); +} #endif #ifdef CONFIG_CPU_FREQ diff --git a/include/acpi/processor.h b/include/acpi/processor.h index e6f6074eadbf..a17e97e634a6 100644 --- a/include/acpi/processor.h +++ b/include/acpi/processor.h @@ -465,4 +465,6 @@ extern int acpi_processor_ffh_lpi_probe(unsigned int cpu); extern int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi); #endif +void acpi_processor_init_invariance_cppc(void); + #endif diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index b721f360d759..4a952c4885ed 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h @@ -11,10 +11,6 @@ void topology_normalize_cpu_scale(void); int topology_update_cpu_topology(void); -#ifdef CONFIG_ACPI_CPPC_LIB -void topology_init_cpu_capacity_cppc(void); -#endif - struct device_node; bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu); From 94debe5eaa0adaa24a6de4a8e5f138be7381eb9e Mon Sep 17 00:00:00 2001 From: Venkata Prasad Potturu Date: Wed, 6 Nov 2024 19:56:57 +0530 Subject: [PATCH 113/235] ASoC: SOF: amd: Fix for incorrect DMA ch status register offset The DMA channel status register offset changed on the ACP 7.0 platform. Checking the wrong register offset leads to firmware boot failure: [ 14.432497] snd_sof_amd_acp70 0000:c4:00.5: ------------[ DSP dump start ]------------ [ 14.432533] snd_sof_amd_acp70 0000:c4:00.5: Firmware boot failure due to timeout [ 14.432549] snd_sof_amd_acp70 0000:c4:00.5: fw_state: SOF_FW_BOOT_IN_PROGRESS (3) [ 14.432610] snd_sof_amd_acp70 0000:c4:00.5: invalid header size 0x71c41000. FW oops is bogus [ 14.432626] snd_sof_amd_acp70 0000:c4:00.5: unexpected fault 0x71c40000 trace 0x71c40000 [ 14.432642] snd_sof_amd_acp70 0000:c4:00.5: ------------[ DSP dump end ]------------ [ 14.432657] snd_sof_amd_acp70 0000:c4:00.5: error: failed to boot DSP firmware -5 [ 14.432672] snd_sof_amd_acp70 0000:c4:00.5: fw_state change: 3 -> 4 [ 14.433260] dmic-codec dmic-codec: ASoC: Unregistered DAI 'dmic-hifi' [ 14.433319] snd_sof_amd_acp70 0000:c4:00.5: fw_state change: 4 -> 0 [ 14.433358] snd_sof_amd_acp70 0000:c4:00.5: error: sof_probe_work failed err: -5 Use the correct DMA channel status register offset.
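In outline, the selection the fix introduces, condensed from the diff below (register names as defined by the driver):

	unsigned int acp_dma_ch_sts;

	switch (adata->pci_rev) {
	case ACP70_PCI_ID:
		/* ACP 7.0 uses a different status register offset. */
		acp_dma_ch_sts = ACP70_DMA_CH_STS;
		break;
	default:
		acp_dma_ch_sts = ACP_DMA_CH_STS;
	}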
Fixes: 490be7ba2a01 ("ASoC: SOF: amd: add support for acp7.0 based platform") Signed-off-by: Venkata Prasad Potturu Link: https://patch.msgid.link/20241106142658.1240929-1-venkataprasad.potturu@amd.com Signed-off-by: Mark Brown --- sound/soc/sof/amd/acp.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/sound/soc/sof/amd/acp.c b/sound/soc/sof/amd/acp.c index de3001f5b9bb..95d4762c9d93 100644 --- a/sound/soc/sof/amd/acp.c +++ b/sound/soc/sof/amd/acp.c @@ -342,11 +342,19 @@ int acp_dma_status(struct acp_dev_data *adata, unsigned char ch) { struct snd_sof_dev *sdev = adata->dev; unsigned int val; + unsigned int acp_dma_ch_sts; int ret = 0; + switch (adata->pci_rev) { + case ACP70_PCI_ID: + acp_dma_ch_sts = ACP70_DMA_CH_STS; + break; + default: + acp_dma_ch_sts = ACP_DMA_CH_STS; + } val = snd_sof_dsp_read(sdev, ACP_DSP_BAR, ACP_DMA_CNTL_0 + ch * sizeof(u32)); if (val & ACP_DMA_CH_RUN) { - ret = snd_sof_dsp_read_poll_timeout(sdev, ACP_DSP_BAR, ACP_DMA_CH_STS, val, !val, + ret = snd_sof_dsp_read_poll_timeout(sdev, ACP_DSP_BAR, acp_dma_ch_sts, val, !val, ACP_REG_POLL_INTERVAL, ACP_DMA_COMPLETE_TIMEOUT_US); if (ret < 0) From a4aebaf6e6efff548b01a3dc49b4b9074751c15b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 6 Nov 2024 21:50:55 +0100 Subject: [PATCH 114/235] media: dvbdev: fix the logic when DVB_DYNAMIC_MINORS is not set When CONFIG_DVB_DYNAMIC_MINORS, ret is not initialized, and a semaphore is left at the wrong state, in case of errors. Make the code simpler and avoid mistakes by having just one error check logic used weather DVB_DYNAMIC_MINORS is used or not. Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202410201717.ULWWdJv8-lkp@intel.com/ Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/9e067488d8935b8cf00959764a1fa5de85d65725.1730926254.git.mchehab+huawei@kernel.org --- drivers/media/dvb-core/dvbdev.c | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/drivers/media/dvb-core/dvbdev.c b/drivers/media/dvb-core/dvbdev.c index 14f323fbada7..9df7c213716a 100644 --- a/drivers/media/dvb-core/dvbdev.c +++ b/drivers/media/dvb-core/dvbdev.c @@ -530,6 +530,9 @@ int dvb_register_device(struct dvb_adapter *adap, struct dvb_device **pdvbdev, for (minor = 0; minor < MAX_DVB_MINORS; minor++) if (!dvb_minors[minor]) break; +#else + minor = nums2minor(adap->num, type, id); +#endif if (minor >= MAX_DVB_MINORS) { if (new_node) { list_del(&new_node->list_head); @@ -543,17 +546,7 @@ int dvb_register_device(struct dvb_adapter *adap, struct dvb_device **pdvbdev, mutex_unlock(&dvbdev_register_lock); return -EINVAL; } -#else - minor = nums2minor(adap->num, type, id); - if (minor >= MAX_DVB_MINORS) { - dvb_media_device_free(dvbdev); - list_del(&dvbdev->list_head); - kfree(dvbdev); - *pdvbdev = NULL; - mutex_unlock(&dvbdev_register_lock); - return ret; - } -#endif + dvbdev->minor = minor; dvb_minors[minor] = dvb_device_get(dvbdev); up_write(&minor_rwsem); From 464cb98f1c07298c4c10e714ae0c36338d18d316 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 6 Nov 2024 08:44:18 +0000 Subject: [PATCH 115/235] irqchip/gic-v3: Force propagation of the active state with a read-back Christoffer reports that on some implementations, writing to GICR_ISACTIVER0 (and similar GICD registers) can race badly with a guest issuing a deactivation of that interrupt via the system register interface. 
There are multiple reasons for this: - this uses an early write-acknowledgement memory type (nGnRE), meaning that the write may only have made it as far as some interconnect by the time the store is considered "done" - the GIC itself is allowed to buffer the write until it decides to take it into account (as long as it is in finite time) The effects are that the activation may not have taken effect by the time the kernel enters the guest, forcing an immediate exit, or that a guest deactivation occurs before the interrupt is active, doing nothing. In order to guarantee that the write to the ISACTIVER register has taken effect, read back from it, forcing the interconnect to propagate the write, and the GIC to process the write before returning the read. Reported-by: Christoffer Dall Signed-off-by: Marc Zyngier Signed-off-by: Thomas Gleixner Acked-by: Christoffer Dall Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241106084418.3794612-1-maz@kernel.org --- drivers/irqchip/irq-gic-v3.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index ce87205e3e82..8b6159f4cdaf 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -524,6 +524,13 @@ static int gic_irq_set_irqchip_state(struct irq_data *d, } gic_poke_irq(d, reg); + + /* + * Force read-back to guarantee that the active state has taken + * effect, and won't race with a guest-driven deactivation. + */ + if (reg == GICD_ISACTIVER) + gic_peek_irq(d, reg); return 0; } From cda7163d4e3d99db93aa38f0e825b8433c7a8452 Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Wed, 30 Oct 2024 11:25:47 +1030 Subject: [PATCH 116/235] btrfs: fix per-subvolume RO/RW flags with new mount API [BUG] With util-linux 2.40.2, the 'mount' utility is already utilizing the new mount API. e.g: # strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/ ... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 But this leads to a new problem, that per-subvolume RO/RW mount no longer works, if the initial mount is RO: # mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test # mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch # mount | grep mnt /dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1) /dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2) # touch /mnt/scratch/foobar touch: cannot touch '/mnt/scratch/foobar': Read-only file system This is a common use case on distros. [CAUSE] We have a workaround for remount to handle the RO->RW change, but if the mount is using the new mount API, we do not do that, and rely on the mount tool NOT to set the ro flag.
But that's not how the mount tool behaves with the new API: fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 This means we will set the super block RO at the first mount. A later RW mount will not try to reconfigure the fs to RW because the mount tool is already using the new API. This totally breaks the per-subvolume RO/RW mount behavior. [FIX] Do not skip the reconfiguration even if using the new API. The old comments are just expecting any mount tool to properly skip the RO flag set even if we specify "ro", which is not the reality. Update the comments regarding the backward compatibility on the kernel level so it works with old and new mount utilities. CC: stable@vger.kernel.org # 6.8+ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/super.c | 25 +++++-------------------- 1 file changed, 5 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 926d7a9ed99d..c64d07134122 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1979,25 +1979,10 @@ static int btrfs_get_tree_super(struct fs_context *fc) * fsconfig(FSCONFIG_SET_FLAG, "ro"). This option is seen by the filesystem * in fc->sb_flags. * - * This disambiguation has rather positive consequences. Mounting a subvolume - * ro will not also turn the superblock ro. Only the mount for the subvolume - * will become ro. - * - * So, if the superblock creation request comes from the new mount API the - * caller must have explicitly done: - * - * fsconfig(FSCONFIG_SET_FLAG, "ro") - * fsmount/mount_setattr(MOUNT_ATTR_RDONLY) - * - * IOW, at some point the caller must have explicitly turned the whole - * superblock ro and we shouldn't just undo it like we did for the old mount - * API. In any case, it lets us avoid the hack in the new mount API. - * - * Consequently, the remounting hack must only be used for requests originating - * from the old mount API and should be marked for full deprecation so it can be - * turned off in a couple of years. - * - * The new mount API has no reason to support this hack. + * But, currently the util-linux mount command already utilizes the new mount + * API and is still setting fsconfig(FSCONFIG_SET_FLAG, "ro") no matter if it's + * btrfs or not, setting the whole super block RO. To make per-subvolume mounting + * work with different options work we need to keep backward compatibility. */ static struct vfsmount *btrfs_reconfigure_for_mount(struct fs_context *fc) { @@ -2019,7 +2004,7 @@ static struct vfsmount *btrfs_reconfigure_for_mount(struct fs_context *fc) if (IS_ERR(mnt)) return mnt; - if (!fc->oldapi || !ro2rw) + if (!ro2rw) return mnt; /* We need to convert to rw, call reconfigure.
*/ From c9a75ec45f1111ef530ab186c2a7684d0a0c9245 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 4 Nov 2024 12:11:15 +0000 Subject: [PATCH 117/235] btrfs: reinitialize delayed ref list after deleting it from the list At insert_delayed_ref() if we need to update the action of an existing ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's ref_add_list using list_del(), which leaves the ref's add_list member not reinitialized, as list_del() sets the next and prev members of the list to LIST_POISON1 and LIST_POISON2, respectively. If later we end up calling drop_delayed_ref() against the ref, which can happen during merging or when destroying delayed refs due to a transaction abort, we can trigger a crash since at drop_delayed_ref() we call list_empty() against the ref's add_list, which returns false since the list was not reinitialized after the list_del() and as a consequence we call list_del() again at drop_delayed_ref(). This results in an invalid list access since the next and prev members are set to poison pointers, resulting in a splat if CONFIG_LIST_HARDENED and CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences otherwise. So fix this by deleting from the list with list_del_init() instead. Fixes: 1d57ee941692 ("btrfs: improve delayed refs iterations") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/delayed-ref.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 115b90d29b1d..65d841d7142c 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -649,7 +649,7 @@ static bool insert_delayed_ref(struct btrfs_trans_handle *trans, &href->ref_add_list); else if (ref->action == BTRFS_DROP_DELAYED_REF) { ASSERT(!list_empty(&exist->add_list)); - list_del(&exist->add_list); + list_del_init(&exist->add_list); } else { ASSERT(0); } From 2b084d8205949dd804e279df8e68531da78be1e8 Mon Sep 17 00:00:00 2001 From: Haisu Wang Date: Fri, 25 Oct 2024 14:54:40 +0800 Subject: [PATCH 118/235] btrfs: fix the length of reserved qgroup to free The dealloc flag may be cleared and the extent won't reach the disk in the error paths of cow_file_range(). The reserved qgroup space is freed in commit 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range"). However, the length of untouched region to free needs to be adjusted with the correct remaining region size.
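As a worked example (illustrative numbers, not taken from a report): if the remaining untouched region runs from start = 131072 to end = 262143, its length is end - start + 1 = 131072 bytes; passing cur_alloc_size instead frees the wrong amount whenever the two differ.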
Fixes: 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo Reviewed-by: Boris Burkov Signed-off-by: Haisu Wang Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c6d257fb40af..a97dfc72a80a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1618,7 +1618,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, clear_bits |= EXTENT_CLEAR_DATA_RESV; extent_clear_unlock_delalloc(inode, start, end, locked_folio, &cached, clear_bits, page_ops); - btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size, NULL); + btrfs_qgroup_free_data(inode, NULL, start, end - start + 1, NULL); } return ret; } From 8c462d56487e3abdbf8a61cedfe7c795a54f4a78 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Wed, 6 Nov 2024 16:04:48 +0000 Subject: [PATCH 119/235] arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hint SMCCCv1.3 added a hint bit which callers can set in an SMCCC function ID (AKA "FID") to indicate that it is acceptable for the SMCCC implementation to discard SVE and/or SME state over a specific SMCCC call. The kernel support for using this hint is broken and SMCCC calls may clobber the SVE and/or SME state of arbitrary tasks, though FPSIMD state is unaffected. The kernel support is intended to use the hint when there is no SVE or SME state to save, and to do this it checks whether TIF_FOREIGN_FPSTATE is set or TIF_SVE is clear in assembly code: | ldr , [, #TSK_TI_FLAGS] | tbnz , #TIF_FOREIGN_FPSTATE, 1f // Any live FP state? | tbnz , #TIF_SVE, 2f // Does that state include SVE? | | 1: orr , , ARM_SMCCC_1_3_SVE_HINT | 2: | << SMCCC call using FID >> This is not safe as-is: (1) SMCCC calls can be made in a preemptible context and preemption can result in TIF_FOREIGN_FPSTATE being set or cleared at arbitrary points in time. Thus checking for TIF_FOREIGN_FPSTATE provides no guarantee. (2) TIF_FOREIGN_FPSTATE only indicates that the live FP/SVE/SME state in the CPU does not belong to the current task, and does not indicate that clobbering this state is acceptable. When the live CPU state is clobbered it is necessary to update fpsimd_last_state.st to ensure that a subsequent context switch will reload FP/SVE/SME state from memory rather than consuming the clobbered state. This and the SMCCC call itself must happen in a critical section with preemption disabled to avoid races. (3) Live SVE/SME state can exist with TIF_SVE clear (e.g. with only TIF_SME set), and checking TIF_SVE alone is insufficient. Remove the broken support for the SMCCCv1.3 SVE saving hint. This is effectively a revert of commits: * cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") * a7c3acca5380 ("arm64: smccc: Save lr before calling __arm_smccc_sve_check()") ... leaving behind the ARM_SMCCC_VERSION_1_3 and ARM_SMCCC_1_3_SVE_HINT definitions, since these are simply definitions from the SMCCC specification, and the latter is used in KVM via ARM_SMCCC_CALL_HINTS. If we want to bring this back in future, we'll probably want to handle this logic in C where we can use all the usual FPSIMD/SVE/SME helper functions, and that'll likely require some rework of the SMCCC code and/or its callers. 
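Purely as a sketch of that possible future direction, nothing below is part of this patch and the state check is a hypothetical placeholder:

	/* Hypothetical C-side handling: keep the check and the call
	 * in one preemption-disabled critical section. */
	get_cpu_fpsimd_context();
	if (!task_has_live_sve_sme_state(current)) {	/* invented helper */
		fid |= ARM_SMCCC_1_3_SVE_HINT;
		fpsimd_flush_cpu_state();	/* never reuse clobbered state */
	}
	/* ... issue the SMCCC call with fid ... */
	put_cpu_fpsimd_context();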
Fixes: cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") Signed-off-by: Mark Rutland Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Marc Zyngier Cc: Mark Brown Cc: Will Deacon Cc: stable@vger.kernel.org Reviewed-by: Mark Brown Link: https://lore.kernel.org/r/20241106160448.2712997-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/kernel/smccc-call.S | 35 +++------------------------------- drivers/firmware/smccc/smccc.c | 4 ---- include/linux/arm-smccc.h | 32 +++---------------------------- 3 files changed, 6 insertions(+), 65 deletions(-) diff --git a/arch/arm64/kernel/smccc-call.S b/arch/arm64/kernel/smccc-call.S index 487381164ff6..2def9d0dd3dd 100644 --- a/arch/arm64/kernel/smccc-call.S +++ b/arch/arm64/kernel/smccc-call.S @@ -7,48 +7,19 @@ #include #include -#include - -/* - * If we have SMCCC v1.3 and (as is likely) no SVE state in - * the registers then set the SMCCC hint bit to say there's no - * need to preserve it. Do this by directly adjusting the SMCCC - * function value which is already stored in x0 ready to be called. - */ -SYM_FUNC_START(__arm_smccc_sve_check) - - ldr_l x16, smccc_has_sve_hint - cbz x16, 2f - - get_current_task x16 - ldr x16, [x16, #TSK_TI_FLAGS] - tbnz x16, #TIF_FOREIGN_FPSTATE, 1f // Any live FP state? - tbnz x16, #TIF_SVE, 2f // Does that state include SVE? - -1: orr x0, x0, ARM_SMCCC_1_3_SVE_HINT - -2: ret -SYM_FUNC_END(__arm_smccc_sve_check) -EXPORT_SYMBOL(__arm_smccc_sve_check) .macro SMCCC instr - stp x29, x30, [sp, #-16]! - mov x29, sp -alternative_if ARM64_SVE - bl __arm_smccc_sve_check -alternative_else_nop_endif \instr #0 - ldr x4, [sp, #16] + ldr x4, [sp] stp x0, x1, [x4, #ARM_SMCCC_RES_X0_OFFS] stp x2, x3, [x4, #ARM_SMCCC_RES_X2_OFFS] - ldr x4, [sp, #24] + ldr x4, [sp, #8] cbz x4, 1f /* no quirk structure */ ldr x9, [x4, #ARM_SMCCC_QUIRK_ID_OFFS] cmp x9, #ARM_SMCCC_QUIRK_QCOM_A6 b.ne 1f str x6, [x4, ARM_SMCCC_QUIRK_STATE_OFFS] -1: ldp x29, x30, [sp], #16 - ret +1: ret .endm /* diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c index d670635914ec..a74600d9f2d7 100644 --- a/drivers/firmware/smccc/smccc.c +++ b/drivers/firmware/smccc/smccc.c @@ -16,7 +16,6 @@ static u32 smccc_version = ARM_SMCCC_VERSION_1_0; static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE; bool __ro_after_init smccc_trng_available = false; -u64 __ro_after_init smccc_has_sve_hint = false; s32 __ro_after_init smccc_soc_id_version = SMCCC_RET_NOT_SUPPORTED; s32 __ro_after_init smccc_soc_id_revision = SMCCC_RET_NOT_SUPPORTED; @@ -28,9 +27,6 @@ void __init arm_smccc_version_init(u32 version, enum arm_smccc_conduit conduit) smccc_conduit = conduit; smccc_trng_available = smccc_probe_trng(); - if (IS_ENABLED(CONFIG_ARM64_SVE) && - smccc_version >= ARM_SMCCC_VERSION_1_3) - smccc_has_sve_hint = true; if ((smccc_version >= ARM_SMCCC_VERSION_1_2) && (smccc_conduit != SMCCC_CONDUIT_NONE)) { diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h index f59099a213d0..67f6fdf2e7cd 100644 --- a/include/linux/arm-smccc.h +++ b/include/linux/arm-smccc.h @@ -315,8 +315,6 @@ u32 arm_smccc_get_version(void); void __init arm_smccc_version_init(u32 version, enum arm_smccc_conduit conduit); -extern u64 smccc_has_sve_hint; - /** * arm_smccc_get_soc_id_version() * @@ -414,15 +412,6 @@ struct arm_smccc_quirk { } state; }; -/** - * __arm_smccc_sve_check() - Set the SVE hint bit when doing SMC calls - * - * Sets the SMCCC hint bit to indicate if there is live state in the SVE - * registers, this modifies x0 in place and 
should never be called from C - * code. - */ -asmlinkage unsigned long __arm_smccc_sve_check(unsigned long x0); - /** * __arm_smccc_smc() - make SMC calls * @a0-a7: arguments passed in registers 0 to 7 @@ -490,20 +479,6 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1, #endif -/* nVHE hypervisor doesn't have a current thread so needs separate checks */ -#if defined(CONFIG_ARM64_SVE) && !defined(__KVM_NVHE_HYPERVISOR__) - -#define SMCCC_SVE_CHECK ALTERNATIVE("nop \n", "bl __arm_smccc_sve_check \n", \ - ARM64_SVE) -#define smccc_sve_clobbers "x16", "x30", "cc", - -#else - -#define SMCCC_SVE_CHECK -#define smccc_sve_clobbers - -#endif - #define __constraint_read_2 "r" (arg0) #define __constraint_read_3 __constraint_read_2, "r" (arg1) #define __constraint_read_4 __constraint_read_3, "r" (arg2) @@ -574,12 +549,11 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1, register unsigned long r3 asm("r3"); \ CONCATENATE(__declare_arg_, \ COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__); \ - asm volatile(SMCCC_SVE_CHECK \ - inst "\n" : \ + asm volatile(inst "\n" : \ "=r" (r0), "=r" (r1), "=r" (r2), "=r" (r3) \ : CONCATENATE(__constraint_read_, \ COUNT_ARGS(__VA_ARGS__)) \ - : smccc_sve_clobbers "memory"); \ + : "memory"); \ if (___res) \ *___res = (typeof(*___res)){r0, r1, r2, r3}; \ } while (0) @@ -628,7 +602,7 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1, asm ("" : \ : CONCATENATE(__constraint_read_, \ COUNT_ARGS(__VA_ARGS__)) \ - : smccc_sve_clobbers "memory"); \ + : "memory"); \ if (___res) \ ___res->a0 = SMCCC_RET_NOT_SUPPORTED; \ } while (0) From 81235ae0c846e1fb46a2c6fe9283fe2b2b24f7dc Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Wed, 6 Nov 2024 16:42:20 +0000 Subject: [PATCH 120/235] arm64: Kconfig: Make SME depend on BROKEN for now Although support for SME was merged in v5.19, we've since uncovered a number of issues with the implementation, including issues which might corrupt the FPSIMD/SVE/SME state of arbitrary tasks. While there are patches to address some of these issues, ongoing review has highlighted additional functional problems, and more time is necessary to analyse and fix these. For now, mark SME as BROKEN in the hope that we can fix things properly in the near future. As SME is an OPTIONAL part of ARMv9.2+, and there is very little extant hardware, this should not adversely affect the vast majority of users. Signed-off-by: Mark Rutland Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Marc Zyngier Cc: Mark Brown Cc: Will Deacon Cc: stable@vger.kernel.org # 5.19 Acked-by: Catalin Marinas Link: https://lore.kernel.org/r/20241106164220.2789279-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fd9df6dcc593..70d7f4f20225 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2214,6 +2214,7 @@ config ARM64_SME bool "ARM Scalable Matrix Extension support" default y depends on ARM64_SVE + depends on BROKEN help The Scalable Matrix Extension (SME) is an extension to the AArch64 execution state which utilises a substantial subset of the SVE From 702a47ce6dde72f6e247b3c3c00a0fc521f9b1c6 Mon Sep 17 00:00:00 2001 From: Tudor Ambarus Date: Wed, 6 Nov 2024 12:18:02 +0000 Subject: [PATCH 121/235] media: videobuf2-core: copy vb planes unconditionally Copy the relevant data from userspace to the vb->planes unconditionally as it's possible some of the fields may have changed after the buffer has been validated. 
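The resulting shape of __prepare_dmabuf() (a sketch condensed from the diff below):

	if (reacquired) {
		/* ... map the newly acquired planes ... */
	} else {
		for (plane = 0; plane < vb->num_planes; ++plane)
			dma_buf_put(planes[plane].dbuf);
	}

	/* Copy the userspace-provided fields unconditionally. */
	for (plane = 0; plane < vb->num_planes; ++plane) {
		vb->planes[plane].bytesused = planes[plane].bytesused;
		vb->planes[plane].length = planes[plane].length;
		vb->planes[plane].m.fd = planes[plane].m.fd;
		vb->planes[plane].data_offset = planes[plane].data_offset;
	}

	if (reacquired) {
		/* ... driver-specific buf_init, then buf_prepare ... */
	}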
Keep the dma_buf_put(planes[plane].dbuf) calls in the first `if (!reacquired)` case, in order to be close to the plane validation code where the buffers were got in the first place. Cc: stable@vger.kernel.org Fixes: 95af7c00f35b ("media: videobuf2-core: release all planes first in __prepare_dmabuf()") Signed-off-by: Tudor Ambarus Tested-by: Will McVicker Acked-by: Tomasz Figa Signed-off-by: Hans Verkuil --- .../media/common/videobuf2/videobuf2-core.c | 28 ++++++++++--------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/drivers/media/common/videobuf2/videobuf2-core.c b/drivers/media/common/videobuf2/videobuf2-core.c index 29a8d876e6c2..b0523fc23506 100644 --- a/drivers/media/common/videobuf2/videobuf2-core.c +++ b/drivers/media/common/videobuf2/videobuf2-core.c @@ -1482,18 +1482,23 @@ static int __prepare_dmabuf(struct vb2_buffer *vb) } vb->planes[plane].dbuf_mapped = 1; } + } else { + for (plane = 0; plane < vb->num_planes; ++plane) + dma_buf_put(planes[plane].dbuf); + } - /* - * Now that everything is in order, copy relevant information - * provided by userspace. - */ - for (plane = 0; plane < vb->num_planes; ++plane) { - vb->planes[plane].bytesused = planes[plane].bytesused; - vb->planes[plane].length = planes[plane].length; - vb->planes[plane].m.fd = planes[plane].m.fd; - vb->planes[plane].data_offset = planes[plane].data_offset; - } + /* + * Now that everything is in order, copy relevant information + * provided by userspace. + */ + for (plane = 0; plane < vb->num_planes; ++plane) { + vb->planes[plane].bytesused = planes[plane].bytesused; + vb->planes[plane].length = planes[plane].length; + vb->planes[plane].m.fd = planes[plane].m.fd; + vb->planes[plane].data_offset = planes[plane].data_offset; + } + if (reacquired) { /* * Call driver-specific initialization on the newly acquired buffer, * if provided. @@ -1503,9 +1508,6 @@ static int __prepare_dmabuf(struct vb2_buffer *vb) dprintk(q, 1, "buffer initialization failed\n"); goto err_put_vb2_buf; } - } else { - for (plane = 0; plane < vb->num_planes; ++plane) - dma_buf_put(planes[plane].dbuf); } ret = call_vb_qop(vb, buf_prepare, vb); From 8c21e40e1e481f7fef6e570089e317068b972c45 Mon Sep 17 00:00:00 2001 From: Markus Petri Date: Thu, 7 Nov 2024 10:40:20 +0100 Subject: [PATCH 122/235] ASoC: amd: yc: Support dmic on another model of Lenovo Thinkpad E14 Gen 6 Another model of Thinkpad E14 Gen 6 (21M4) needs a quirk entry for the dmic to be detected. Signed-off-by: Markus Petri Link: https://patch.msgid.link/20241107094020.1050935-1-mp@localhost Signed-off-by: Mark Brown --- sound/soc/amd/yc/acp6x-mach.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sound/soc/amd/yc/acp6x-mach.c b/sound/soc/amd/yc/acp6x-mach.c index dc476bfb6da4..2436e8deb2be 100644 --- a/sound/soc/amd/yc/acp6x-mach.c +++ b/sound/soc/amd/yc/acp6x-mach.c @@ -227,6 +227,13 @@ static const struct dmi_system_id yc_acp_quirk_table[] = { DMI_MATCH(DMI_PRODUCT_NAME, "21M3"), } }, + { + .driver_data = &acp6x_card, + .matches = { + DMI_MATCH(DMI_BOARD_VENDOR, "LENOVO"), + DMI_MATCH(DMI_PRODUCT_NAME, "21M4"), + } + }, { .driver_data = &acp6x_card, .matches = { From 63c1c87993e0e5bb11bced3d8224446a2bc62338 Mon Sep 17 00:00:00 2001 From: Luo Yifan Date: Wed, 6 Nov 2024 09:46:54 +0800 Subject: [PATCH 123/235] ASoC: stm: Prevent potential division by zero in stm32_sai_mclk_round_rate() This patch checks if div is less than or equal to zero (div <= 0). 
If div is zero or negative, the function returns -EINVAL, ensuring the division operation (*prate / div) is safe to perform. Signed-off-by: Luo Yifan Link: https://patch.msgid.link/20241106014654.206860-1-luoyifan@cmss.chinamobile.com Signed-off-by: Mark Brown --- sound/soc/stm/stm32_sai_sub.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sound/soc/stm/stm32_sai_sub.c b/sound/soc/stm/stm32_sai_sub.c index 7bc4a96b7503..2570daa3e225 100644 --- a/sound/soc/stm/stm32_sai_sub.c +++ b/sound/soc/stm/stm32_sai_sub.c @@ -378,8 +378,8 @@ static long stm32_sai_mclk_round_rate(struct clk_hw *hw, unsigned long rate, int div; div = stm32_sai_get_clk_div(sai, *prate, rate); - if (div < 0) - return div; + if (div <= 0) + return -EINVAL; mclk->freq = *prate / div; From 23569c8b314925bdb70dd1a7b63cfe6100868315 Mon Sep 17 00:00:00 2001 From: Luo Yifan Date: Thu, 7 Nov 2024 09:59:36 +0800 Subject: [PATCH 124/235] ASoC: stm: Prevent potential division by zero in stm32_sai_get_clk_div() This patch checks if div is less than or equal to zero (div <= 0). If div is zero or negative, the function returns -EINVAL, ensuring the division operation is safe to perform. Signed-off-by: Luo Yifan Reviewed-by: Olivier Moysan Link: https://patch.msgid.link/20241107015936.211902-1-luoyifan@cmss.chinamobile.com Signed-off-by: Mark Brown --- sound/soc/stm/stm32_sai_sub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/soc/stm/stm32_sai_sub.c b/sound/soc/stm/stm32_sai_sub.c index 2570daa3e225..5828f9dd866e 100644 --- a/sound/soc/stm/stm32_sai_sub.c +++ b/sound/soc/stm/stm32_sai_sub.c @@ -317,7 +317,7 @@ static int stm32_sai_get_clk_div(struct stm32_sai_sub_data *sai, int div; div = DIV_ROUND_CLOSEST(input_rate, output_rate); - if (div > SAI_XCR1_MCKDIV_MAX(version)) { + if (div > SAI_XCR1_MCKDIV_MAX(version) || div <= 0) { dev_err(&sai->pdev->dev, "Divider %d out of range\n", div); return -EINVAL; } From bb1fb40f8beb45a3733118780a3da24fb071a2e9 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Wed, 6 Nov 2024 16:55:05 -0500 Subject: [PATCH 125/235] NFSD: Fix READDIR on NFSv3 mounts of ext4 exports I noticed that recently, simple operations like "make" started failing on NFSv3 mounts of ext4 exports. Network capture shows that READDIRPLUS operated correctly but READDIR failed with NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is passed a non-zero starting directory cookie. I bisected to commit c689bdd3bffa ("nfsd: further centralize protocol version checks."). Turns out that nfsd3_proc_readdir() does not call fh_verify() before it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when the directory is opened. For ext4, this causes the wrong "max file size" value to be used when sanity checking the incoming directory cookie (which is a seek offset value). The fhp->fh_64bit_cookies boolean is /always/ properly initialized after nfsd_open() returns. There doesn't seem to be a reason for the generic NFSD open helper to handle the f_mode fix-up for directories, so just move that to the one caller that tries to open an S_IFDIR with NFSD_MAY_64BIT_COOKIE. 
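After the move, nfsd_readdir() applies the hash width itself once the open has succeeded, so fh_verify() has already initialized fhp->fh_64bit_cookies (sketch matching the diff below):

	err = nfsd_open(rqstp, fhp, S_IFDIR, NFSD_MAY_READ, &file);
	if (err)
		goto out;

	if (fhp->fh_64bit_cookies)
		file->f_mode |= FMODE_64BITHASH;
	else
		file->f_mode |= FMODE_32BITHASH;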
Suggested-by: NeilBrown Fixes: c689bdd3bffa ("nfsd: further centralize protocol version checks.") Reviewed-by: NeilBrown Signed-off-by: Chuck Lever --- fs/nfsd/vfs.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 22325b590e17..d6d4f2a0e898 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -903,11 +903,6 @@ __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type, goto out; } - if (may_flags & NFSD_MAY_64BIT_COOKIE) - file->f_mode |= FMODE_64BITHASH; - else - file->f_mode |= FMODE_32BITHASH; - *filp = file; out: return host_err; @@ -2174,13 +2169,15 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, loff_t offset = *offsetp; int may_flags = NFSD_MAY_READ; - if (fhp->fh_64bit_cookies) - may_flags |= NFSD_MAY_64BIT_COOKIE; - err = nfsd_open(rqstp, fhp, S_IFDIR, may_flags, &file); if (err) goto out; + if (fhp->fh_64bit_cookies) + file->f_mode |= FMODE_64BITHASH; + else + file->f_mode |= FMODE_32BITHASH; + offset = vfs_llseek(file, offset, SEEK_SET); if (offset < 0) { err = nfserrno((int)offset); From 052ef642bd6c108a24f375f9ad174b97b425a50b Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sun, 25 Aug 2024 15:21:31 +0200 Subject: [PATCH 126/235] drm: panel-orientation-quirks: Make Lenovo Yoga Tab 3 X90F DMI match less strict There are 2G and 4G RAM versions of the Lenovo Yoga Tab 3 X90F and it turns out that the 2G version has a DMI product name of "CHERRYVIEW D1 PLATFORM" where as the 4G version has "CHERRYVIEW C0 PLATFORM". The sys-vendor + product-version check are unique enough that the product-name check is not necessary. Drop the product-name check so that the existing DMI match for the 4G RAM version also matches the 2G RAM version. Signed-off-by: Hans de Goede Acked-by: Jani Nikula Link: https://patchwork.freedesktop.org/patch/msgid/20240825132131.6643-1-hdegoede@redhat.com --- drivers/gpu/drm/drm_panel_orientation_quirks.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/drm_panel_orientation_quirks.c b/drivers/gpu/drm/drm_panel_orientation_quirks.c index 0830cae9a4d0..2d84d7ea1ab7 100644 --- a/drivers/gpu/drm/drm_panel_orientation_quirks.c +++ b/drivers/gpu/drm/drm_panel_orientation_quirks.c @@ -403,7 +403,6 @@ static const struct dmi_system_id orientation_data[] = { }, { /* Lenovo Yoga Tab 3 X90F */ .matches = { DMI_MATCH(DMI_SYS_VENDOR, "Intel Corporation"), - DMI_MATCH(DMI_PRODUCT_NAME, "CHERRYVIEW D1 PLATFORM"), DMI_MATCH(DMI_PRODUCT_VERSION, "Blade3-10A-001"), }, .driver_data = (void *)&lcd1600x2560_rightside_up, From 444fa5b100e5c90550d6bccfe4476efb0391b3ca Mon Sep 17 00:00:00 2001 From: Liviu Dudau Date: Wed, 6 Nov 2024 18:58:06 +0000 Subject: [PATCH 127/235] drm/panthor: Lock XArray when getting entries for the VM Similar to commit cac075706f29 ("drm/panthor: Fix race when converting group handle to group object") we need to use the XArray's internal locking when retrieving a vm pointer from there. v2: Removed part of the patch that was trying to protect fetching the heap pointer from XArray, as that operation is protected by the @pool->lock. 
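The resulting lookup, matching the diff below, holds the XArray lock across both the load and the reference grab:

	xa_lock(&pool->xa);
	vm = panthor_vm_get(xa_load(&pool->xa, handle));
	xa_unlock(&pool->xa);

Taking the reference inside the locked section is what prevents the VM from being freed between the xa_load() and the get.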
Fixes: 647810ec2476 ("drm/panthor: Add the MMU/VM logical block") Reported-by: Jann Horn Cc: stable@vger.kernel.org Signed-off-by: Liviu Dudau Reviewed-by: Boris Brezillon Reviewed-by: Steven Price Signed-off-by: Steven Price Link: https://patchwork.freedesktop.org/patch/msgid/20241106185806.389089-1-liviu.dudau@arm.com --- drivers/gpu/drm/panthor/panthor_mmu.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c index 5d5e25b1be95..7db2edb3374c 100644 --- a/drivers/gpu/drm/panthor/panthor_mmu.c +++ b/drivers/gpu/drm/panthor/panthor_mmu.c @@ -1580,7 +1580,9 @@ panthor_vm_pool_get_vm(struct panthor_vm_pool *pool, u32 handle) { struct panthor_vm *vm; + xa_lock(&pool->xa); vm = panthor_vm_get(xa_load(&pool->xa, handle)); + xa_unlock(&pool->xa); return vm; } From 48b86532c10128cf50c854a90c2d5b1410f4012d Mon Sep 17 00:00:00 2001 From: Jyri Sarha Date: Thu, 7 Nov 2024 15:28:40 +0200 Subject: [PATCH 128/235] ASoC: SOF: sof-client-probes-ipc4: Set param_size extension bits Write the size of the optional payload of SOF_IPC4_MOD_INIT_INSTANCE message to extension param_size-bits. The previous IPC4 version does not set these bits that should indicate the size of the optional payload (struct sof_ipc4_probe_cfg). The old firmware side component code works well without these bits, but when the probes are converted to use the generic module API, this does not work anymore. Fixes: f5623593060f ("ASoC: SOF: IPC4: probes: Implement IPC4 ops for probes client device") Signed-off-by: Jyri Sarha Reviewed-by: Ranjani Sridharan Reviewed-by: Liam Girdwood Reviewed-by: Bard Liao Signed-off-by: Peter Ujfalusi Link: https://patch.msgid.link/20241107132840.17386-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown --- sound/soc/sof/sof-client-probes-ipc4.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/soc/sof/sof-client-probes-ipc4.c b/sound/soc/sof/sof-client-probes-ipc4.c index 796eac0a2e74..603aed222480 100644 --- a/sound/soc/sof/sof-client-probes-ipc4.c +++ b/sound/soc/sof/sof-client-probes-ipc4.c @@ -125,6 +125,7 @@ static int ipc4_probes_init(struct sof_client_dev *cdev, u32 stream_tag, msg.primary |= SOF_IPC4_MSG_TARGET(SOF_IPC4_MODULE_MSG); msg.extension = SOF_IPC4_MOD_EXT_DST_MOD_INSTANCE(INVALID_PIPELINE_ID); msg.extension |= SOF_IPC4_MOD_EXT_CORE_ID(0); + msg.extension |= SOF_IPC4_MOD_EXT_PARAM_SIZE(sizeof(cfg) / sizeof(uint32_t)); msg.data_size = sizeof(cfg); msg.data_ptr = &cfg; From f432a1621f049bb207e78363d9d0e3c6fa2da5db Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Tue, 5 Nov 2024 00:17:13 +0100 Subject: [PATCH 129/235] drm/panthor: Be stricter about IO mapping flags The current panthor_device_mmap_io() implementation has two issues: 1. For mapping DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET, panthor_device_mmap_io() bails if VM_WRITE is set, but does not clear VM_MAYWRITE. That means userspace can use mprotect() to make the mapping writable later on. This is a classic Linux driver gotcha. I don't think this actually has any impact in practice: When the GPU is powered, writes to the FLUSH_ID seem to be ignored; and when the GPU is not powered, the dummy_latest_flush page provided by the driver is deliberately designed to not do any flushes, so the only thing writing to the dummy_latest_flush could achieve would be to make *more* flushes happen. 2. panthor_device_mmap_io() does not block MAP_PRIVATE mappings (which are mappings without the VM_SHARED flag). 
MAP_PRIVATE in combination with VM_MAYWRITE indicates that the VMA has copy-on-write semantics, which for VM_PFNMAP are semi-supported but fairly cursed. In particular, in such a mapping, the driver can only install PTEs during mmap() by calling remap_pfn_range() (because remap_pfn_range() wants to **store the physical address of the mapped physical memory into the vm_pgoff of the VMA**); installing PTEs later on with a fault handler (as panthor does) is not supported in private mappings, and so if you try to fault in such a mapping, vmf_insert_pfn_prot() splats when it hits a BUG() check. Fix it by clearing the VM_MAYWRITE flag (userspace writing to the FLUSH_ID doesn't make sense) and requiring VM_SHARED (copy-on-write semantics for the FLUSH_ID don't make sense). Reproducers for both scenarios are in the notes of my patch on the mailing list; I tested that these bugs exist on a Rock 5B machine. Note that I only compile-tested the patch, I haven't tested it; I don't have a working kernel build setup for the test machine yet. Please test it before applying it. Cc: stable@vger.kernel.org Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block") Signed-off-by: Jann Horn Reviewed-by: Boris Brezillon Reviewed-by: Liviu Dudau Reviewed-by: Steven Price Signed-off-by: Steven Price Link: https://patchwork.freedesktop.org/patch/msgid/20241105-panthor-flush-page-fixes-v1-1-829aaf37db93@google.com --- drivers/gpu/drm/panthor/panthor_device.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c index 4082c8f2951d..6fbff516c1c1 100644 --- a/drivers/gpu/drm/panthor/panthor_device.c +++ b/drivers/gpu/drm/panthor/panthor_device.c @@ -390,11 +390,15 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct * { u64 offset = (u64)vma->vm_pgoff << PAGE_SHIFT; + if ((vma->vm_flags & VM_SHARED) == 0) + return -EINVAL; + switch (offset) { case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET: if (vma->vm_end - vma->vm_start != PAGE_SIZE || (vma->vm_flags & (VM_WRITE | VM_EXEC))) return -EINVAL; + vm_flags_clear(vma, VM_MAYWRITE); break; From 1904fb9ebf911441f90a68e96b22aa73e4410505 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 5 Nov 2024 17:52:34 -0800 Subject: [PATCH 130/235] netlink: terminate outstanding dump on socket close Netlink supports iterative dumping of data. It provides the families the following ops: - start - (optional) kicks off the dumping process - dump - actual dump helper, keeps getting called until it returns 0 - done - (optional) pairs with .start, can be used for cleanup The whole process is asynchronous and the repeated calls to .dump don't actually happen in a tight loop, but rather are triggered in response to recvmsg() on the socket. This gives the user full control over the dump, but also means that the user can close the socket without getting to the end of the dump. To make sure .start is always paired with .done we check if there is an ongoing dump before freeing the socket, and if so call .done. The complication is that sockets can get freed from BH and .done is allowed to sleep. So we use a workqueue to defer the call, when needed. Unfortunately this does not work correctly. What we defer is not the cleanup but rather releasing a reference on the socket. We have no guarantee that we own the last reference, if someone else holds the socket they may release it in BH and we're back to square one. The whole dance, however, appears to be unnecessary. 
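To make the dump lifecycle concrete, here is a hedged userspace sketch (not part of the patch) of how a dump is driven: each recv() on the socket triggers further .dump callbacks in the kernel until NLMSG_DONE arrives, and closing the socket anywhere in this loop is exactly the case the fix has to handle.

    #include <linux/netlink.h>
    #include <sys/socket.h>

    /* Assumes a request with NLM_F_DUMP set was already sent on nlsock. */
    static void drain_dump(int nlsock)
    {
            char buf[8192];
            ssize_t n;

            for (;;) {
                    /* Each recv() prompts more .dump calls in the kernel. */
                    n = recv(nlsock, buf, sizeof(buf), 0);
                    if (n <= 0)
                            break;
                    for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
                         NLMSG_OK(nlh, n); nlh = NLMSG_NEXT(nlh, n))
                            if (nlh->nlmsg_type == NLMSG_DONE)
                                    return; /* dump complete */
            }
    }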
Only the user can interact with dumps, so we can clean up when socket is closed. And close always happens in process context. Some async code may still access the socket after close, queue notification skbs to it etc. but no dumps can start, end or otherwise make progress. Delete the workqueue and flush the dump state directly from the release handler. Note that further cleanup is possible in -next, for instance we now always call .done before releasing the main module reference, so dump doesn't have to take a reference of its own. Reported-by: syzkaller Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct") Reviewed-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241106015235.2458807-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- net/netlink/af_netlink.c | 31 ++++++++----------------------- net/netlink/af_netlink.h | 2 -- 2 files changed, 8 insertions(+), 25 deletions(-) diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 0a9287fadb47..f84aad420d44 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -393,15 +393,6 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk) static void netlink_sock_destruct(struct sock *sk) { - struct netlink_sock *nlk = nlk_sk(sk); - - if (nlk->cb_running) { - if (nlk->cb.done) - nlk->cb.done(&nlk->cb); - module_put(nlk->cb.module); - kfree_skb(nlk->cb.skb); - } - skb_queue_purge(&sk->sk_receive_queue); if (!sock_flag(sk, SOCK_DEAD)) { @@ -414,14 +405,6 @@ static void netlink_sock_destruct(struct sock *sk) WARN_ON(nlk_sk(sk)->groups); } -static void netlink_sock_destruct_work(struct work_struct *work) -{ - struct netlink_sock *nlk = container_of(work, struct netlink_sock, - work); - - sk_free(&nlk->sk); -} - /* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on * SMP. Look, when several writers sleep and reader wakes them up, all but one * immediately hit write lock and grab all the cpus. Exclusive sleep solves @@ -731,12 +714,6 @@ static void deferred_put_nlk_sk(struct rcu_head *head) if (!refcount_dec_and_test(&sk->sk_refcnt)) return; - if (nlk->cb_running && nlk->cb.done) { - INIT_WORK(&nlk->work, netlink_sock_destruct_work); - schedule_work(&nlk->work); - return; - } - sk_free(sk); } @@ -788,6 +765,14 @@ static int netlink_release(struct socket *sock) NETLINK_URELEASE, &n); } + /* Terminate any outstanding dump */ + if (nlk->cb_running) { + if (nlk->cb.done) + nlk->cb.done(&nlk->cb); + module_put(nlk->cb.module); + kfree_skb(nlk->cb.skb); + } + module_put(nlk->module); if (netlink_is_kernel(sk)) { diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h index 5b0e4e62ab8b..778a3809361f 100644 --- a/net/netlink/af_netlink.h +++ b/net/netlink/af_netlink.h @@ -4,7 +4,6 @@ #include <linux/rhashtable.h> #include <linux/atomic.h> -#include <linux/workqueue.h> #include <net/sock.h> /* flags */ @@ -50,7 +49,6 @@ struct netlink_sock { struct rhash_head node; struct rcu_head rcu; - struct work_struct work; }; static inline struct netlink_sock *nlk_sk(struct sock *sk) From 55d42a0c3f9ccd07c199e0ddbe1ba87572d30074 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 5 Nov 2024 17:52:35 -0800 Subject: [PATCH 131/235] selftests: net: add a test for closing a netlink socket with dump in progress Close a socket with dump in progress. We need a dump which generates enough info not to fit into a single skb. Policy dump fits the bill. Use the trick discovered by syzbot for keeping a ref on the socket longer than just close, with mqueue. TAP version 13 1..3 # Starting 3 tests from 1 test cases.
# RUN global.test_sanity ... # OK global.test_sanity ok 1 global.test_sanity # RUN global.close_in_progress ... # OK global.close_in_progress ok 2 global.close_in_progress # RUN global.close_with_ref ... # OK global.close_with_ref ok 3 global.close_with_ref # PASSED: 3 / 3 tests passed. # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 Note that this test is not expected to fail but rather crash the kernel if we get the cleanup wrong. Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20241106015235.2458807-2-kuba@kernel.org Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/netlink-dumps.c | 110 ++++++++++++++++++++ 2 files changed, 111 insertions(+) create mode 100644 tools/testing/selftests/net/netlink-dumps.c diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 649f1fe0dc46..5e86f7a51b43 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -78,6 +78,7 @@ TEST_PROGS += test_vxlan_vnifiltering.sh TEST_GEN_FILES += io_uring_zerocopy_tx TEST_PROGS += io_uring_zerocopy_tx.sh TEST_GEN_FILES += bind_bhash +TEST_GEN_PROGS += netlink-dumps TEST_GEN_PROGS += sk_bind_sendto_listen TEST_GEN_PROGS += sk_connect_zero_addr TEST_GEN_PROGS += sk_so_peek_off diff --git a/tools/testing/selftests/net/netlink-dumps.c b/tools/testing/selftests/net/netlink-dumps.c new file mode 100644 index 000000000000..7ee6dcd334df --- /dev/null +++ b/tools/testing/selftests/net/netlink-dumps.c @@ -0,0 +1,110 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define _GNU_SOURCE + +#include <fcntl.h> +#include <signal.h> +#include <stdio.h> +#include <string.h> +#include <sys/socket.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <unistd.h> + +#include <linux/genetlink.h> +#include <linux/mqueue.h> +#include <linux/netlink.h> + +#include "../kselftest_harness.h" + +static const struct { + struct nlmsghdr nlhdr; + struct genlmsghdr genlhdr; + struct nlattr ahdr; + __u16 val; + __u16 pad; +} dump_policies = { + .nlhdr = { + .nlmsg_len = sizeof(dump_policies), + .nlmsg_type = GENL_ID_CTRL, + .nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP, + .nlmsg_seq = 1, + }, + .genlhdr = { + .cmd = CTRL_CMD_GETPOLICY, + .version = 2, + }, + .ahdr = { + .nla_len = 6, + .nla_type = CTRL_ATTR_FAMILY_ID, + }, + .val = GENL_ID_CTRL, + .pad = 0, +}; + +// Sanity check for the test itself, make sure the dump doesn't fit in one msg +TEST(test_sanity) +{ + int netlink_sock; + char buf[8192]; + ssize_t n; + + netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); + ASSERT_GE(netlink_sock, 0); + + n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0); + ASSERT_EQ(n, sizeof(dump_policies)); + + n = recv(netlink_sock, buf, sizeof(buf), MSG_DONTWAIT); + ASSERT_GE(n, sizeof(struct nlmsghdr)); + + n = recv(netlink_sock, buf, sizeof(buf), MSG_DONTWAIT); + ASSERT_GE(n, sizeof(struct nlmsghdr)); + + close(netlink_sock); +} + +TEST(close_in_progress) +{ + int netlink_sock; + ssize_t n; + + netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); + ASSERT_GE(netlink_sock, 0); + + n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0); + ASSERT_EQ(n, sizeof(dump_policies)); + + close(netlink_sock); +} + +TEST(close_with_ref) +{ + char cookie[NOTIFY_COOKIE_LEN] = {}; + int netlink_sock, mq_fd; + struct sigevent sigev; + ssize_t n; + + netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); + ASSERT_GE(netlink_sock, 0); + + n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0); + ASSERT_EQ(n, sizeof(dump_policies)); + + mq_fd = syscall(__NR_mq_open, "sed", O_CREAT | O_WRONLY, 0600, 0); + 
ASSERT_GE(mq_fd, 0); + + memset(&sigev, 0, sizeof(sigev)); + sigev.sigev_notify = SIGEV_THREAD; + sigev.sigev_value.sival_ptr = cookie; + sigev.sigev_signo = netlink_sock; + + syscall(__NR_mq_notify, mq_fd, &sigev); + + close(netlink_sock); + + // give mqueue time to fire + usleep(100 * 1000); +} + +TEST_HARNESS_MAIN From fd00045f383f51b66a7a46084a0e92b8de563157 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 27 Oct 2024 20:40:20 -0400 Subject: [PATCH 132/235] bcachefs: Fix null ptr deref in bucket_gen_get() bucket_gen() checks if we're looking up a valid bucket and returns NULL otherwise, but bucket_gen_get() was failing to check; other callers were correct. Also do a bit of cleanup on callers. Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_node_scan.c | 2 +- fs/bcachefs/buckets.h | 19 +++++++++++-------- fs/bcachefs/io_read.c | 7 +++---- fs/bcachefs/io_write.c | 7 ++----- 4 files changed, 17 insertions(+), 18 deletions(-) diff --git a/fs/bcachefs/btree_node_scan.c b/fs/bcachefs/btree_node_scan.c index a7aedb134e9f..30131c3bdd97 100644 --- a/fs/bcachefs/btree_node_scan.c +++ b/fs/bcachefs/btree_node_scan.c @@ -186,7 +186,7 @@ static void try_read_btree_node(struct find_btree_nodes *f, struct bch_dev *ca, .ptrs[0].type = 1 << BCH_EXTENT_ENTRY_ptr, .ptrs[0].offset = offset, .ptrs[0].dev = ca->dev_idx, - .ptrs[0].gen = *bucket_gen(ca, sector_to_bucket(ca, offset)), + .ptrs[0].gen = bucket_gen_get(ca, sector_to_bucket(ca, offset)), }; rcu_read_unlock(); diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h index fd5e6ccad45e..ccc78bfe2fd4 100644 --- a/fs/bcachefs/buckets.h +++ b/fs/bcachefs/buckets.h @@ -103,12 +103,18 @@ static inline u8 *bucket_gen(struct bch_dev *ca, size_t b) return gens->b + b; } -static inline u8 bucket_gen_get(struct bch_dev *ca, size_t b) +static inline int bucket_gen_get_rcu(struct bch_dev *ca, size_t b) +{ + u8 *gen = bucket_gen(ca, b); + return gen ? *gen : -1; +} + +static inline int bucket_gen_get(struct bch_dev *ca, size_t b) { rcu_read_lock(); - u8 gen = *bucket_gen(ca, b); + int ret = bucket_gen_get_rcu(ca, b); rcu_read_unlock(); - return gen; + return ret; } static inline size_t PTR_BUCKET_NR(const struct bch_dev *ca, @@ -169,10 +175,8 @@ static inline int gen_after(u8 a, u8 b) static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr) { - u8 *gen = bucket_gen(ca, PTR_BUCKET_NR(ca, ptr)); - if (!gen) - return -1; - return gen_after(*gen, ptr->gen); + int gen = bucket_gen_get_rcu(ca, PTR_BUCKET_NR(ca, ptr)); + return gen < 0 ?
gen : gen_after(gen, ptr->gen); } /** @@ -184,7 +188,6 @@ static inline int dev_ptr_stale(struct bch_dev *ca, const struct bch_extent_ptr rcu_read_lock(); int ret = dev_ptr_stale_rcu(ca, ptr); rcu_read_unlock(); - return ret; } diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c index fc246f342820..ac6a6fcc2bb8 100644 --- a/fs/bcachefs/io_read.c +++ b/fs/bcachefs/io_read.c @@ -802,16 +802,15 @@ static noinline void read_from_stale_dirty_pointer(struct btree_trans *trans, PTR_BUCKET_POS(ca, &ptr), BTREE_ITER_cached); - u8 *gen = bucket_gen(ca, iter.pos.offset); - if (gen) { - + int gen = bucket_gen_get(ca, iter.pos.offset); + if (gen >= 0) { prt_printf(&buf, "Attempting to read from stale dirty pointer:\n"); printbuf_indent_add(&buf, 2); bch2_bkey_val_to_text(&buf, c, k); prt_newline(&buf); - prt_printf(&buf, "memory gen: %u", *gen); + prt_printf(&buf, "memory gen: %u", gen); ret = lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_slot(&iter))); if (!ret) { diff --git a/fs/bcachefs/io_write.c b/fs/bcachefs/io_write.c index 8609e25e450f..96720adcfee0 100644 --- a/fs/bcachefs/io_write.c +++ b/fs/bcachefs/io_write.c @@ -1300,11 +1300,8 @@ static void bch2_nocow_write(struct bch_write_op *op) bucket_to_u64(i->b), BUCKET_NOCOW_LOCK_UPDATE); - rcu_read_lock(); - u8 *gen = bucket_gen(ca, i->b.offset); - stale = !gen ? -1 : gen_after(*gen, i->gen); - rcu_read_unlock(); - + int gen = bucket_gen_get(ca, i->b.offset); + stale = gen < 0 ? gen : gen_after(gen, i->gen); if (unlikely(stale)) { stale_at = i; goto err_bucket_stale; From 72acab3a7c5aee76451fa6054e9608026476a971 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 27 Oct 2024 18:25:30 -0400 Subject: [PATCH 133/235] bcachefs: Fix error handling in bch2_btree_node_prefetch() Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_cache.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c index 6e4afb2b5441..6f6225703eef 100644 --- a/fs/bcachefs/btree_cache.c +++ b/fs/bcachefs/btree_cache.c @@ -1312,9 +1312,12 @@ int bch2_btree_node_prefetch(struct btree_trans *trans, b = bch2_btree_node_fill(trans, path, k, btree_id, level, SIX_LOCK_read, false); - if (!IS_ERR_OR_NULL(b)) + int ret = PTR_ERR_OR_ZERO(b); + if (ret) + return ret; + if (b) six_unlock_read(&b->c.lock); - return bch2_trans_relock(trans) ?: PTR_ERR_OR_ZERO(b); + return 0; } void bch2_btree_node_evict(struct btree_trans *trans, const struct bkey_i *k) From d335bb3fd3a4102f325ef8a353efc3d2fb523f55 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 31 Oct 2024 02:36:21 -0400 Subject: [PATCH 134/235] bcachefs: Ancient versions with bad bkey_formats are no longer supported Syzbot found an assertion pop, by generating an ancient filesystem version with an invalid bkey_format (with fields that can overflow) as well as packed keys that aren't representable unpacked. This breaks key comparisons in all sorts of painful ways. Filesystems have been automatically rewriting nodes with such invalid formats for years; we can safely drop support for them. 
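As a concrete illustration of the class of format the validation now always rejects, here is a sketch built on the formula visible in the diff below (the field_offset member and all numbers are illustrative assumptions, not taken from a real filesystem):

    /* illustrative values for one bkey format field */
    unsigned unpacked_bits = 32;
    u64 unpacked_max = ~((~0ULL << 1) << (unpacked_bits - 1)); /* 0xffffffff */
    u64 field_offset = 16;                 /* assumed format offset */
    u64 packed_max = 0xffffffffULL;        /* 32 packed bits */

    /* 16 + 0xffffffff = 0x10000000f does not fit in the unpacked field,
     * so comparisons on packed keys stop agreeing with comparisons on
     * their unpacked equivalents.
     */
    bool overflows = field_offset + packed_max > unpacked_max; /* true */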
Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/bkey.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/bcachefs/bkey.c b/fs/bcachefs/bkey.c index 587d7318a2e8..995ba32e9b6e 100644 --- a/fs/bcachefs/bkey.c +++ b/fs/bcachefs/bkey.c @@ -643,7 +643,7 @@ int bch2_bkey_format_invalid(struct bch_fs *c, enum bch_validate_flags flags, struct printbuf *err) { - unsigned i, bits = KEY_PACKED_BITS_START; + unsigned bits = KEY_PACKED_BITS_START; if (f->nr_fields != BKEY_NR_FIELDS) { prt_printf(err, "incorrect number of fields: got %u, should be %u", @@ -655,9 +655,8 @@ int bch2_bkey_format_invalid(struct bch_fs *c, * Verify that the packed format can't represent fields larger than the * unpacked format: */ - for (i = 0; i < f->nr_fields; i++) { - if ((!c || c->sb.version_min >= bcachefs_metadata_version_snapshot) && - bch2_bkey_format_field_overflows(f, i)) { + for (unsigned i = 0; i < f->nr_fields; i++) { + if (bch2_bkey_format_field_overflows(f, i)) { unsigned unpacked_bits = bch2_bkey_format_current.bits_per_field[i]; u64 unpacked_max = ~((~0ULL << 1) << (unpacked_bits - 1)); unsigned packed_bits = min(64, f->bits_per_field[i]); From cec136d348e037ea5b6a463164454d6d0174d92f Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 31 Oct 2024 02:50:55 -0400 Subject: [PATCH 135/235] bcachefs: Fix topology errors on split after merge If a btree split picks a pivot that's being deleted by a btree node merge, we're going to have problems. Fix this by checking if the pivot is being deleted, the same as we check for deletions in journal replay keys. Found by single_device.ktest small_nodes. Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_update_interior.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c index 64f0928e1137..c12d7b51de0c 100644 --- a/fs/bcachefs/btree_update_interior.c +++ b/fs/bcachefs/btree_update_interior.c @@ -1434,6 +1434,15 @@ bch2_btree_insert_keys_interior(struct btree_update *as, } } +static bool key_deleted_in_insert(struct keylist *insert_keys, struct bpos pos) +{ + if (insert_keys) + for_each_keylist_key(insert_keys, k) + if (bkey_deleted(&k->k) && bpos_eq(k->k.p, pos)) + return true; + return false; +} + /* * Move keys from n1 (original replacement node, now lower node) to n2 (higher * node) @@ -1441,7 +1450,8 @@ bch2_btree_insert_keys_interior(struct btree_update *as, static void __btree_split_node(struct btree_update *as, struct btree_trans *trans, struct btree *b, - struct btree *n[2]) + struct btree *n[2], + struct keylist *insert_keys) { struct bkey_packed *k; struct bpos n1_pos = POS_MIN; @@ -1476,7 +1486,8 @@ static void __btree_split_node(struct btree_update *as, if (b->c.level && u64s < n1_u64s && u64s + k->u64s >= n1_u64s && - bch2_key_deleted_in_journal(trans, b->c.btree_id, b->c.level, uk.p)) + (bch2_key_deleted_in_journal(trans, b->c.btree_id, b->c.level, uk.p) || + key_deleted_in_insert(insert_keys, uk.p))) n1_u64s += k->u64s; i = u64s >= n1_u64s; @@ -1603,7 +1614,7 @@ static int btree_split(struct btree_update *as, struct btree_trans *trans, n[0] = n1 = bch2_btree_node_alloc(as, trans, b->c.level); n[1] = n2 = bch2_btree_node_alloc(as, trans, b->c.level); - __btree_split_node(as, trans, b, n); + __btree_split_node(as, trans, b, n, keys); if (keys) { btree_split_insert_keys(as, trans, path, n1, keys); From ef4f6c322bf4ca8e6d050cd0667a9447b8cbe212 Mon Sep 17
00:00:00 2001 From: Kent Overstreet Date: Thu, 31 Oct 2024 03:33:36 -0400 Subject: [PATCH 136/235] bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery If BCH_FS_may_go_rw is not yet set, it indicates to the transaction commit path that updates should be done via the list of journal replay keys. This must be set before multithreaded use commences. Signed-off-by: Kent Overstreet --- fs/bcachefs/recovery.c | 7 +++++++ fs/bcachefs/recovery_passes.c | 6 ++++++ 2 files changed, 13 insertions(+) diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c index 32d15aacc069..3c7f941dde39 100644 --- a/fs/bcachefs/recovery.c +++ b/fs/bcachefs/recovery.c @@ -862,6 +862,13 @@ int bch2_fs_recovery(struct bch_fs *c) if (ret) goto err; + /* + * Normally set by the appropriate recovery pass: when cleared, this + * indicates we're in early recovery and btree updates should be done by + * being applied to the journal replay keys. _Must_ be set before + * multithreaded use: + */ + set_bit(BCH_FS_may_go_rw, &c->flags); clear_bit(BCH_FS_fsck_running, &c->flags); /* in case we don't run journal replay, i.e. norecovery mode */ diff --git a/fs/bcachefs/recovery_passes.c b/fs/bcachefs/recovery_passes.c index 735b8adc8f9d..4bbeac9e0526 100644 --- a/fs/bcachefs/recovery_passes.c +++ b/fs/bcachefs/recovery_passes.c @@ -221,6 +221,12 @@ int bch2_run_recovery_passes(struct bch_fs *c) { int ret = 0; + /* + * We can't allow set_may_go_rw to be excluded; that would cause us to + * use the journal replay keys for updates where it's not expected. + */ + c->opts.recovery_passes_exclude &= ~BCH_RECOVERY_PASS_set_may_go_rw; + while (c->curr_recovery_pass < ARRAY_SIZE(recovery_pass_fns)) { if (c->opts.recovery_pass_last && c->curr_recovery_pass > c->opts.recovery_pass_last) From 93d53f1caf2cf861d0f28d096792d3b92efae178 Mon Sep 17 00:00:00 2001 From: Pei Xiao Date: Wed, 30 Oct 2024 15:48:01 +0800 Subject: [PATCH 137/235] bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket bio_kmalloc() may return NULL, which will cause a NULL pointer dereference. Add a check for a NULL return from bio_kmalloc() in journal_read_bucket().
Signed-off-by: Pei Xiao Fixes: ac10a9611d87 ("bcachefs: Some fixes for building in userspace") Signed-off-by: Kent Overstreet --- fs/bcachefs/errcode.h | 1 + fs/bcachefs/journal_io.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h index a1bc6c7a8ba0..9c4fe5cdbfb7 100644 --- a/fs/bcachefs/errcode.h +++ b/fs/bcachefs/errcode.h @@ -84,6 +84,7 @@ x(ENOMEM, ENOMEM_dev_alloc) \ x(ENOMEM, ENOMEM_disk_accounting) \ x(ENOMEM, ENOMEM_stripe_head_alloc) \ + x(ENOMEM, ENOMEM_journal_read_bucket) \ x(ENOSPC, ENOSPC_disk_reservation) \ x(ENOSPC, ENOSPC_bucket_alloc) \ x(ENOSPC, ENOSPC_disk_label_add) \ diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c index 954f6a96e0f4..ccaafa90f4f4 100644 --- a/fs/bcachefs/journal_io.c +++ b/fs/bcachefs/journal_io.c @@ -1012,6 +1012,8 @@ static int journal_read_bucket(struct bch_dev *ca, nr_bvecs = buf_pages(buf->data, sectors_read << 9); bio = bio_kmalloc(nr_bvecs, GFP_KERNEL); + if (!bio) + return -BCH_ERR_ENOMEM_journal_read_bucket; bio_init(bio, ca->disk_sb.bdev, bio->bi_inline_vecs, nr_bvecs, REQ_OP_READ); bio->bi_iter.bi_sector = offset; From 9bb33852f5cc145b17d96f3792ff69148a37e1fd Mon Sep 17 00:00:00 2001 From: Hongbo Li Date: Tue, 29 Oct 2024 20:53:29 +0800 Subject: [PATCH 138/235] bcachefs: check the invalid parameter for perf test The perf test does not check whether the number of iterations or threads is zero. If nr_threads is 0, the perf test will keep waiting for a wakeup. If the number of iterations is 0, it will cause a division-by-zero exception. This can be reproduced by: echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test or echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit") Signed-off-by: Hongbo Li Signed-off-by: Kent Overstreet --- fs/bcachefs/tests.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/bcachefs/tests.c b/fs/bcachefs/tests.c index 315038a0a92d..fb5c1543e52f 100644 --- a/fs/bcachefs/tests.c +++ b/fs/bcachefs/tests.c @@ -809,6 +809,11 @@ int bch2_btree_perf_test(struct bch_fs *c, const char *testname, unsigned i; u64 time; + if (nr == 0 || nr_threads == 0) { + pr_err("nr of iterations or threads is not allowed to be 0"); + return -EINVAL; + } + atomic_set(&j.ready, nr_threads); init_waitqueue_head(&j.ready_wait); From baefd3f849ed956d4c1aee80889093cf0d9c6a94 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 31 Oct 2024 01:17:54 -0400 Subject: [PATCH 139/235] bcachefs: btree_cache.freeable list fixes When allocating new btree nodes, we were leaving them on the freeable list - unlocked - allowing them to be reclaimed: ouch. Additionally, bch2_btree_node_free_never_used() -> bch2_btree_node_hash_remove was putting it on the freelist, while bch2_btree_node_free_never_used() was putting it back on the btree update reserve list - ouch. Originally, the code was written to always keep btree nodes on a list - live or freeable - and this worked when new nodes were kept locked. But now with the cycle detector, we can't keep nodes locked that aren't tracked by the cycle detector; and this is fine as long as they're not reachable. We also have better and more robust leak detection now, with memory allocation profiling, so the original justification no longer applies.
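The invariant the patch enforces can be sketched with the standard <linux/list.h> API (generic, hypothetical struct names; this is not the patch's code): a node must be on exactly one list at a time, and it must be taken off its current list before being handed to a freelist helper.

    #include <linux/list.h>
    #include <linux/bug.h>

    /* Caller guarantees @b is not on any list; the BUG_ON documents it. */
    static void node_to_freelist(struct cache *bc, struct node *b)
    {
            BUG_ON(!list_empty(&b->list));
            bc->nr_freeable++;
            list_add(&b->list, &bc->freeable);
    }

    /*
     * Removal side: list_del_init() re-initializes b->list so that
     * list_empty() is true again and the BUG_ON above can actually
     * verify the invariant; plain list_del() would leave the node
     * looking like a list member.
     */

This is why the diff below converts list_move() calls (which silently assume the node is already on some list) into explicit list_del_init() plus list_add() pairs guarded by BUG_ON(!list_empty()).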
Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_cache.c | 100 +++++++++++++++++----------- fs/bcachefs/btree_cache.h | 2 + fs/bcachefs/btree_update_interior.c | 13 ++-- 3 files changed, 67 insertions(+), 48 deletions(-) diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c index 6f6225703eef..7123019ab3bc 100644 --- a/fs/bcachefs/btree_cache.c +++ b/fs/bcachefs/btree_cache.c @@ -59,16 +59,38 @@ static inline size_t btree_cache_can_free(struct btree_cache_list *list) static void btree_node_to_freedlist(struct btree_cache *bc, struct btree *b) { + BUG_ON(!list_empty(&b->list)); + if (b->c.lock.readers) - list_move(&b->list, &bc->freed_pcpu); + list_add(&b->list, &bc->freed_pcpu); else - list_move(&b->list, &bc->freed_nonpcpu); + list_add(&b->list, &bc->freed_nonpcpu); } -static void btree_node_data_free(struct bch_fs *c, struct btree *b) +static void __bch2_btree_node_to_freelist(struct btree_cache *bc, struct btree *b) +{ + BUG_ON(!list_empty(&b->list)); + BUG_ON(!b->data); + + bc->nr_freeable++; + list_add(&b->list, &bc->freeable); +} + +void bch2_btree_node_to_freelist(struct bch_fs *c, struct btree *b) { struct btree_cache *bc = &c->btree_cache; + mutex_lock(&bc->lock); + __bch2_btree_node_to_freelist(bc, b); + mutex_unlock(&bc->lock); + + six_unlock_write(&b->c.lock); + six_unlock_intent(&b->c.lock); +} + +static void __btree_node_data_free(struct btree_cache *bc, struct btree *b) +{ + BUG_ON(!list_empty(&b->list)); BUG_ON(btree_node_hashed(b)); /* @@ -94,11 +116,17 @@ static void btree_node_data_free(struct bch_fs *c, struct btree *b) #endif b->aux_data = NULL; - bc->nr_freeable--; - btree_node_to_freedlist(bc, b); } +static void btree_node_data_free(struct btree_cache *bc, struct btree *b) +{ + BUG_ON(list_empty(&b->list)); + list_del_init(&b->list); + --bc->nr_freeable; + __btree_node_data_free(bc, b); +} + static int bch2_btree_cache_cmp_fn(struct rhashtable_compare_arg *arg, const void *obj) { @@ -174,21 +202,10 @@ struct btree *__bch2_btree_node_mem_alloc(struct bch_fs *c) bch2_btree_lock_init(&b->c, 0); - bc->nr_freeable++; - list_add(&b->list, &bc->freeable); + __bch2_btree_node_to_freelist(bc, b); return b; } -void bch2_btree_node_to_freelist(struct bch_fs *c, struct btree *b) -{ - mutex_lock(&c->btree_cache.lock); - list_move(&b->list, &c->btree_cache.freeable); - mutex_unlock(&c->btree_cache.lock); - - six_unlock_write(&b->c.lock); - six_unlock_intent(&b->c.lock); -} - static inline bool __btree_node_pinned(struct btree_cache *bc, struct btree *b) { struct bbpos pos = BBPOS(b->c.btree_id, b->key.k.p); @@ -236,11 +253,11 @@ void bch2_btree_cache_unpin(struct bch_fs *c) /* Btree in memory cache - hash table */ -void bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b) +void __bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b) { lockdep_assert_held(&bc->lock); - int ret = rhashtable_remove_fast(&bc->table, &b->hash, bch_btree_cache_params); + int ret = rhashtable_remove_fast(&bc->table, &b->hash, bch_btree_cache_params); BUG_ON(ret); /* Cause future lookups for this node to fail: */ @@ -248,17 +265,22 @@ void bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b) if (b->c.btree_id < BTREE_ID_NR) --bc->nr_by_btree[b->c.btree_id]; + --bc->live[btree_node_pinned(b)].nr; + list_del_init(&b->list); +} - bc->live[btree_node_pinned(b)].nr--; - bc->nr_freeable++; - list_move(&b->list, &bc->freeable); +void bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b) +{ + __bch2_btree_node_hash_remove(bc, b); + 
__bch2_btree_node_to_freelist(bc, b); } int __bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b) { + BUG_ON(!list_empty(&b->list)); BUG_ON(b->hash_val); - b->hash_val = btree_ptr_hash_val(&b->key); + b->hash_val = btree_ptr_hash_val(&b->key); int ret = rhashtable_lookup_insert_fast(&bc->table, &b->hash, bch_btree_cache_params); if (ret) @@ -270,10 +292,8 @@ int __bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b) bool p = __btree_node_pinned(bc, b); mod_bit(BTREE_NODE_pinned, &b->flags, p); - list_move_tail(&b->list, &bc->live[p].list); + list_add_tail(&b->list, &bc->live[p].list); bc->live[p].nr++; - - bc->nr_freeable--; return 0; } @@ -485,7 +505,7 @@ static unsigned long bch2_btree_cache_scan(struct shrinker *shrink, goto out; if (!btree_node_reclaim(c, b, true)) { - btree_node_data_free(c, b); + btree_node_data_free(bc, b); six_unlock_write(&b->c.lock); six_unlock_intent(&b->c.lock); freed++; @@ -501,10 +521,10 @@ static unsigned long bch2_btree_cache_scan(struct shrinker *shrink, bc->not_freed[BCH_BTREE_CACHE_NOT_FREED_access_bit]++; --touched;; } else if (!btree_node_reclaim(c, b, true)) { - bch2_btree_node_hash_remove(bc, b); + __bch2_btree_node_hash_remove(bc, b); + __btree_node_data_free(bc, b); freed++; - btree_node_data_free(c, b); bc->nr_freed++; six_unlock_write(&b->c.lock); @@ -587,7 +607,7 @@ void bch2_fs_btree_cache_exit(struct bch_fs *c) BUG_ON(btree_node_read_in_flight(b) || btree_node_write_in_flight(b)); - btree_node_data_free(c, b); + btree_node_data_free(bc, b); } BUG_ON(!bch2_journal_error(&c->journal) && @@ -786,8 +806,8 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea BUG_ON(!six_trylock_intent(&b->c.lock)); BUG_ON(!six_trylock_write(&b->c.lock)); -got_node: +got_node: /* * btree_free() doesn't free memory; it sticks the node on the end of * the list. 
Check if there's any freed nodes there: @@ -796,7 +816,12 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea if (!btree_node_reclaim(c, b2, false)) { swap(b->data, b2->data); swap(b->aux_data, b2->aux_data); + + list_del_init(&b2->list); + --bc->nr_freeable; btree_node_to_freedlist(bc, b2); + mutex_unlock(&bc->lock); + six_unlock_write(&b2->c.lock); six_unlock_intent(&b2->c.lock); goto got_mem; @@ -810,11 +835,8 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea goto err; } - mutex_lock(&bc->lock); - bc->nr_freeable++; got_mem: - mutex_unlock(&bc->lock); - + BUG_ON(!list_empty(&b->list)); BUG_ON(btree_node_hashed(b)); BUG_ON(btree_node_dirty(b)); BUG_ON(btree_node_write_in_flight(b)); @@ -845,7 +867,7 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea if (bc->alloc_lock == current) { b2 = btree_node_cannibalize(c); clear_btree_node_just_written(b2); - bch2_btree_node_hash_remove(bc, b2); + __bch2_btree_node_hash_remove(bc, b2); if (b) { swap(b->data, b2->data); @@ -855,9 +877,9 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea six_unlock_intent(&b2->c.lock); } else { b = b2; - list_del_init(&b->list); } + BUG_ON(!list_empty(&b->list)); mutex_unlock(&bc->lock); trace_and_count(c, btree_cache_cannibalize, trans); @@ -936,7 +958,7 @@ static noinline struct btree *bch2_btree_node_fill(struct btree_trans *trans, b->hash_val = 0; mutex_lock(&bc->lock); - list_add(&b->list, &bc->freeable); + __bch2_btree_node_to_freelist(bc, b); mutex_unlock(&bc->lock); six_unlock_write(&b->c.lock); @@ -1356,7 +1378,7 @@ void bch2_btree_node_evict(struct btree_trans *trans, const struct bkey_i *k) mutex_lock(&bc->lock); bch2_btree_node_hash_remove(bc, b); - btree_node_data_free(c, b); + btree_node_data_free(bc, b); mutex_unlock(&bc->lock); out: six_unlock_write(&b->c.lock); diff --git a/fs/bcachefs/btree_cache.h b/fs/bcachefs/btree_cache.h index 367acd217c6a..66e86d1a178d 100644 --- a/fs/bcachefs/btree_cache.h +++ b/fs/bcachefs/btree_cache.h @@ -14,7 +14,9 @@ void bch2_recalc_btree_reserve(struct bch_fs *); void bch2_btree_node_to_freelist(struct bch_fs *, struct btree *); +void __bch2_btree_node_hash_remove(struct btree_cache *, struct btree *); void bch2_btree_node_hash_remove(struct btree_cache *, struct btree *); + int __bch2_btree_node_hash_insert(struct btree_cache *, struct btree *); int bch2_btree_node_hash_insert(struct btree_cache *, struct btree *, unsigned, enum btree_id); diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c index c12d7b51de0c..22740b605f0a 100644 --- a/fs/bcachefs/btree_update_interior.c +++ b/fs/bcachefs/btree_update_interior.c @@ -237,10 +237,6 @@ static void __btree_node_free(struct btree_trans *trans, struct btree *b) BUG_ON(b->will_make_reachable); clear_btree_node_noevict(b); - - mutex_lock(&c->btree_cache.lock); - list_move(&b->list, &c->btree_cache.freeable); - mutex_unlock(&c->btree_cache.lock); } static void bch2_btree_node_free_inmem(struct btree_trans *trans, @@ -252,12 +248,12 @@ static void bch2_btree_node_free_inmem(struct btree_trans *trans, bch2_btree_node_lock_write_nofail(trans, path, &b->c); + __btree_node_free(trans, b); + mutex_lock(&c->btree_cache.lock); bch2_btree_node_hash_remove(&c->btree_cache, b); mutex_unlock(&c->btree_cache.lock); - __btree_node_free(trans, b); - six_unlock_write(&b->c.lock); mark_btree_node_locked_noreset(path, level, BTREE_NODE_INTENT_LOCKED); @@ -289,7 +285,7 @@ static 
void bch2_btree_node_free_never_used(struct btree_update *as, clear_btree_node_need_write(b); mutex_lock(&c->btree_cache.lock); - bch2_btree_node_hash_remove(&c->btree_cache, b); + __bch2_btree_node_hash_remove(&c->btree_cache, b); mutex_unlock(&c->btree_cache.lock); BUG_ON(p->nr >= ARRAY_SIZE(p->b)); @@ -521,8 +517,7 @@ static void bch2_btree_reserve_put(struct btree_update *as, struct btree_trans * btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_intent); btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_write); __btree_node_free(trans, b); - six_unlock_write(&b->c.lock); - six_unlock_intent(&b->c.lock); + bch2_btree_node_to_freelist(c, b); } } } From f9f0a5390dcef1f96cc506a2cf7d50c8e348fa3d Mon Sep 17 00:00:00 2001 From: Piotr Zalewski Date: Wed, 6 Nov 2024 19:46:30 +0000 Subject: [PATCH 140/235] bcachefs: Change OPT_STR max to be 1 less than the size of choices array Change the OPT_STR max value to be 1 less than the ARRAY_SIZE of the "_choices" array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text. The "_choices" array is a null-terminated array, so computing the maximum using ARRAY_SIZE without subtracting 1 yields an incorrect result. Since bch2_opt_validate() doesn't subtract 1, as bch2_opt_to_text() does, values bigger than the actual maximum would pass through option validation. Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6 Fixes: 63c4b2545382 ("bcachefs: Better superblock opt validation") Suggested-by: Kent Overstreet Signed-off-by: Piotr Zalewski Signed-off-by: Kent Overstreet --- fs/bcachefs/opts.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/bcachefs/opts.c b/fs/bcachefs/opts.c index 6673cbd8bdb9..0e2ee262fbd4 100644 --- a/fs/bcachefs/opts.c +++ b/fs/bcachefs/opts.c @@ -226,7 +226,7 @@ const struct bch_option bch2_opt_table[] = { #define OPT_UINT(_min, _max) .type = BCH_OPT_UINT, \ .min = _min, .max = _max #define OPT_STR(_choices) .type = BCH_OPT_STR, \ - .min = 0, .max = ARRAY_SIZE(_choices), \ + .min = 0, .max = ARRAY_SIZE(_choices) - 1, \ .choices = _choices #define OPT_STR_NOLIMIT(_choices) .type = BCH_OPT_STR, \ .min = 0, .max = U64_MAX, \ @@ -428,7 +428,7 @@ void bch2_opt_to_text(struct printbuf *out, prt_printf(out, "%lli", v); break; case BCH_OPT_STR: - if (v < opt->min || v >= opt->max - 1) + if (v < opt->min || v >= opt->max) prt_printf(out, "(invalid option %lli)", v); else if (flags & OPT_SHOW_FULL_LIST) prt_string_option(out, opt->choices, v); From 8440da933127fc5330c3d1090cdd612fddbc40eb Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Wed, 6 Nov 2024 16:40:08 -0500 Subject: [PATCH 141/235] bcachefs: Fix UAF in __promote_alloc() error path If we error in data_update_init() after adding to the rhashtable of outstanding promotes, kfree_rcu() is required.
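The underlying rule, sketched below with hypothetical names (not the patch's code): once an object has been published in an RCU-protected structure such as an rhashtable, concurrent readers may still hold a pointer to it, so it must be freed with kfree_rcu() rather than kfree(), even on an error path that immediately removes it again.

    #include <linux/rhashtable.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct promote_entry {
            struct rhash_head hash; /* linked into the rhashtable */
            struct rcu_head rcu;    /* used by kfree_rcu() */
    };

    static void remove_entry(struct rhashtable *ht, struct promote_entry *e,
                             const struct rhashtable_params params)
    {
            rhashtable_remove_fast(ht, &e->hash, params);
            /* readers under rcu_read_lock() may still see @e */
            kfree_rcu(e, rcu);
    }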
Reported-by: Reed Riley Signed-off-by: Kent Overstreet --- fs/bcachefs/io_read.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c index ac6a6fcc2bb8..b3b934a87c6d 100644 --- a/fs/bcachefs/io_read.c +++ b/fs/bcachefs/io_read.c @@ -262,7 +262,8 @@ static struct promote_op *__promote_alloc(struct btree_trans *trans, bio_free_pages(&(*rbio)->bio); kfree(*rbio); *rbio = NULL; - kfree(op); + /* We may have added to the rhashtable and thus need rcu freeing: */ + kfree_rcu(op, rcu); bch2_write_ref_put(c, BCH_WRITE_REF_promote); return ERR_PTR(ret); } From 83e445e64f48bdae3f25013e788fcf592f142576 Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Tue, 5 Nov 2024 20:51:02 +0200 Subject: [PATCH 142/235] vdpa/mlx5: Fix error path during device add MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the error recovery path of mlx5_vdpa_dev_add(), the cleanup is executed and at the end put_device() is called which ends up calling mlx5_vdpa_free(). This function will execute the same cleanup all over again. Most resources support being cleaned up twice, but the recent mlx5_vdpa_destroy_mr_resources() doesn't. This change drops the explicit cleanup from within the mlx5_vdpa_dev_add() and lets mlx5_vdpa_free() do its work. This issue was discovered while trying to add 2 vdpa devices with the same name: $> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.2 $> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.3 ... yields the following dump: BUG: kernel NULL pointer dereference, address: 00000000000000b8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP CPU: 4 UID: 0 PID: 2811 Comm: vdpa Not tainted 6.12.0-rc6 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:destroy_workqueue+0xe/0x2a0 Code: ... RSP: 0018:ffff88814920b9a8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff888105c10000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff888100400168 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888100120c00 R09: ffffffff828578c0 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888131fd99a0 R14: 0000000000000000 R15: ffff888105c10580 FS: 00007fdfa6b4f740(0000) GS:ffff88852ca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000b8 CR3: 000000018db09006 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __die+0x20/0x60 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x130 ? asm_exc_page_fault+0x22/0x30 ? destroy_workqueue+0xe/0x2a0 mlx5_vdpa_destroy_mr_resources+0x2b/0x40 [mlx5_vdpa] mlx5_vdpa_free+0x45/0x150 [mlx5_vdpa] vdpa_release_dev+0x1e/0x50 [vdpa] device_release+0x31/0x90 kobject_put+0x8d/0x230 mlx5_vdpa_dev_add+0x328/0x8b0 [mlx5_vdpa] vdpa_nl_cmd_dev_add_set_doit+0x2b8/0x4c0 [vdpa] genl_family_rcv_msg_doit+0xd0/0x120 genl_rcv_msg+0x180/0x2b0 ? __vdpa_alloc_device+0x1b0/0x1b0 [vdpa] ? genl_family_rcv_msg_dumpit+0xf0/0xf0 netlink_rcv_skb+0x54/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x1fc/0x2d0 netlink_sendmsg+0x1e4/0x410 __sock_sendmsg+0x38/0x60 ? sockfd_lookup_light+0x12/0x60 __sys_sendto+0x105/0x160 ? __count_memcg_events+0x53/0xe0 ? handle_mm_fault+0x100/0x220 ? 
do_user_addr_fault+0x40d/0x620 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fdfa6c66b57 Code: ... RSP: 002b:00007ffeace22998 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055a498608350 RCX: 00007fdfa6c66b57 RDX: 000000000000006c RSI: 000055a498608350 RDI: 0000000000000003 RBP: 00007ffeace229c0 R08: 00007fdfa6d35200 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000202 R12: 000055a4986082a0 R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeace233f3 Modules linked in: ... CR2: 00000000000000b8 Fixes: 62111654481d ("vdpa/mlx5: Postpone MR deletion") Signed-off-by: Dragos Tatulea Message-Id: <20241105185101.1323272-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang Acked-by: Eugenio Pérez --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 21 +++++---------------- 1 file changed, 5 insertions(+), 16 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index dee019977716..5f581e71e201 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3963,28 +3963,28 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, mvdev->vdev.dma_dev = &mdev->pdev->dev; err = mlx5_vdpa_alloc_resources(&ndev->mvdev); if (err) - goto err_mpfs; + goto err_alloc; err = mlx5_vdpa_init_mr_resources(mvdev); if (err) - goto err_res; + goto err_alloc; if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { err = mlx5_vdpa_create_dma_mr(mvdev); if (err) - goto err_mr_res; + goto err_alloc; } err = alloc_fixed_resources(ndev); if (err) - goto err_mr; + goto err_alloc; ndev->cvq_ent.mvdev = mvdev; INIT_WORK(&ndev->cvq_ent.work, mlx5_cvq_kick_handler); mvdev->wq = create_singlethread_workqueue("mlx5_vdpa_wq"); if (!mvdev->wq) { err = -ENOMEM; - goto err_res2; + goto err_alloc; } mvdev->vdev.mdev = &mgtdev->mgtdev; @@ -4010,17 +4010,6 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, _vdpa_unregister_device(&mvdev->vdev); err_reg: destroy_workqueue(mvdev->wq); -err_res2: - free_fixed_resources(ndev); -err_mr: - mlx5_vdpa_clean_mrs(mvdev); -err_mr_res: - mlx5_vdpa_destroy_mr_resources(mvdev); -err_res: - mlx5_vdpa_free_resources(&ndev->mvdev); -err_mpfs: - if (!is_zero_ether_addr(config->mac)) - mlx5_mpfs_del_mac(pfmdev, config->mac); err_alloc: put_device(&mvdev->vdev.dev); return err; From c928807f6f6b6d595a7e199591ae297c81de3aeb Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Mon, 28 Oct 2024 12:26:53 -0600 Subject: [PATCH 143/235] mm/page_alloc: keep track of free highatomic OOM kills due to vastly overestimated free highatomic reserves were observed: ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ... Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ... Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB The second line above shows that the OOM kill was due to the following condition: free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB) And the third line shows there were no free pages in any MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'. Therefore __zone_watermark_unusable_free() underestimated the usable free memory by over 1GB, which resulted in the unnecessary OOM kill above. 
The comment in __zone_watermark_unusable_free() warns about the potential risk, i.e., if the caller does not have rights to reserves below the min watermark then subtract the high-atomic reserves. This will over-estimate the size of the atomic reserve but it avoids a search. However, it is possible to keep track of free pages in reserved highatomic pageblocks with a new per-zone counter nr_free_highatomic protected by the zone lock, to avoid a search when calculating the usable free memory. And the cost would be minimal, i.e., simple arithmetic in the highatomic alloc/free/move paths. Note that since nr_free_highatomic can be relatively small, using a per-cpu counter might cause too much drift and defeat its purpose, in addition to the extra memory overhead. Depends on e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") - see [1] [akpm@linux-foundation.org: s/if/else if/, per Johannes, stealth whitespace tweak] Link: https://lkml.kernel.org/r/20241028182653.3420139-1-yuzhao@google.com Link: https://lkml.kernel.org/r/0d0ddb33-fcdc-43e2-801f-0c1df2031afb@suse.cz [1] Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Yu Zhao Reported-by: Link Lin Acked-by: David Rientjes Acked-by: Vlastimil Babka Acked-by: Johannes Weiner Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 10 +++++++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5b1c984daf45..80bc5640bb60 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -823,6 +823,7 @@ struct zone { unsigned long watermark_boost; unsigned long nr_reserved_highatomic; + unsigned long nr_free_highatomic; /* * We don't know if the memory that we're going to allocate will be diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8ad38cd5e574..c6c7bb3ea71b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -635,6 +635,8 @@ compaction_capture(struct capture_control *capc, struct page *page, static inline void account_freepages(struct zone *zone, int nr_pages, int migratetype) { + lockdep_assert_held(&zone->lock); + if (is_migrate_isolate(migratetype)) return; @@ -642,6 +644,9 @@ static inline void account_freepages(struct zone *zone, int nr_pages, if (is_migrate_cma(migratetype)) __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); + else if (is_migrate_highatomic(migratetype)) + WRITE_ONCE(zone->nr_free_highatomic, + zone->nr_free_highatomic + nr_pages); } /* Used for pages not on another list */ @@ -3079,11 +3084,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z, /* * If the caller does not have rights to reserves below the min - * watermark then subtract the high-atomic reserves. This will - * over-estimate the size of the atomic reserve but it avoids a search. + * watermark then subtract the free pages reserved for highatomic. */ if (likely(!(alloc_flags & ALLOC_RESERVES))) - unusable_free += z->nr_reserved_highatomic; + unusable_free += READ_ONCE(z->nr_free_highatomic); #ifdef CONFIG_CMA /* If allocation can't use CMA areas don't use free CMA pages */ From cb6fcef8b4b6c655b6a25cc3a415cd9eb81b3da8 Mon Sep 17 00:00:00 2001 From: "Masami Hiramatsu (Google)" Date: Mon, 28 Oct 2024 12:26:27 +0900 Subject: [PATCH 144/235] objpool: fix to make percpu slot allocation more robust Since gfp & GFP_ATOMIC == GFP_ATOMIC is true for GFP_KERNEL | __GFP_HIGH, it will use kmalloc if user specifies that combination.
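Spelling out the flag arithmetic before the explanation continues (a sketch from the standard gfp.h definitions, not part of the patch):

    /*
     * GFP_ATOMIC = __GFP_HIGH | __GFP_KSWAPD_RECLAIM
     * GFP_KERNEL = __GFP_RECLAIM | __GFP_IO | __GFP_FS,
     *              and __GFP_RECLAIM includes __GFP_KSWAPD_RECLAIM.
     *
     * Old test: (gfp & GFP_ATOMIC) == GFP_ATOMIC
     *   gfp = GFP_KERNEL | __GFP_HIGH sets both __GFP_HIGH and
     *   __GFP_KSWAPD_RECLAIM, so the old test was true and kmalloc was
     *   chosen even though this caller may sleep.
     *
     * New test: (gfp & (GFP_ATOMIC | GFP_KERNEL)) != GFP_ATOMIC
     *   gfp = GFP_KERNEL | __GFP_HIGH keeps the GFP_KERNEL bits, so the
     *   masked value differs from GFP_ATOMIC and vmalloc is tried first;
     *   plain GFP_ATOMIC still falls through to kmalloc.
     */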
The reason for combining __vmalloc_node() and kmalloc_node() is that vmalloc does not support all GFP flags, especially GFP_ATOMIC. So we should first check gfp & (GFP_ATOMIC | GFP_KERNEL) != GFP_ATOMIC before trying vmalloc; this ensures the caller can sleep. And for robustness, even if vmalloc fails, it should retry the allocation with kmalloc. Link: https://lkml.kernel.org/r/173008598713.1262174.2959179484209897252.stgit@mhiramat.roam.corp.google.com Fixes: aff1871bfc81 ("objpool: fix choosing allocation for percpu slots") Signed-off-by: Masami Hiramatsu (Google) Reported-by: Linus Torvalds Closes: https://lore.kernel.org/all/CAHk-=whO+vSH+XVRio8byJU8idAWES0SPGVZ7KAVdc4qrV0VUA@mail.gmail.com/ Cc: Leo Yan Cc: Linus Torvalds Cc: Matt Wu Cc: Mikel Rychliski Cc: Steven Rostedt (Google) Cc: Viktor Malik Signed-off-by: Andrew Morton --- lib/objpool.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/lib/objpool.c b/lib/objpool.c index fd108fe0d095..b998b720c732 100644 --- a/lib/objpool.c +++ b/lib/objpool.c @@ -74,15 +74,21 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, * warm caches and TLB hits. in default vmalloc is used to * reduce the pressure of kernel slab system. as we know, * mimimal size of vmalloc is one page since vmalloc would - * always align the requested size to page size + * always align the requested size to page size. + * but if vmalloc fails or it is not available (e.g. GFP_ATOMIC) + * allocate percpu slot with kmalloc. */ - if ((pool->gfp & GFP_ATOMIC) == GFP_ATOMIC) - slot = kmalloc_node(size, pool->gfp, cpu_to_node(i)); - else + slot = NULL; + + if ((pool->gfp & (GFP_ATOMIC | GFP_KERNEL)) != GFP_ATOMIC) slot = __vmalloc_node(size, sizeof(void *), pool->gfp, cpu_to_node(i), __builtin_return_address(0)); - if (!slot) - return -ENOMEM; + + if (!slot) { + slot = kmalloc_node(size, pool->gfp, cpu_to_node(i)); + if (!slot) + return -ENOMEM; + } memset(slot, 0, size); pool->cpu_slots[i] = slot; From faa242b1d2a97143150bdc50d5b61fd70fcd17cd Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Sun, 27 Oct 2024 12:33:21 +0000 Subject: [PATCH 145/235] mm/mlock: set the correct prev on failure After commit 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al."), if vma_modify_flags() returns an error, the vma is set to an error code. This will lead to an invalid prev being returned. Generally this shouldn't matter, as the caller should treat an error as indicating state is now invalidated; however, unfortunately apply_mlockall_flags() does not check for errors and assumes that mlock_fixup() correctly maintains prev even if an error were to occur. This patch fixes that assumption. [lorenzo.stoakes@oracle.com: provide a better fix and rephrase the log] Link: https://lkml.kernel.org/r/20241027123321.19511-1-richard.weiyang@gmail.com Fixes: 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.") Signed-off-by: Wei Yang Reviewed-by: Lorenzo Stoakes Reviewed-by: Liam R.
Howlett Cc: Vlastimil Babka Cc: Jann Horn Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton --- mm/mlock.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index e3e3dc2b2956..cde076fa7d5e 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -725,14 +725,17 @@ static int apply_mlockall_flags(int flags) } for_each_vma(vmi, vma) { + int error; vm_flags_t newflags; newflags = vma->vm_flags & ~VM_LOCKED_MASK; newflags |= to_add; - /* Ignore errors */ - mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end, - newflags); + error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end, + newflags); + /* Ignore errors, but prev needs fixing up. */ + if (error) + prev = vma; cond_resched(); } out: From 3488af0970445ff5532c7e8dc5e6456b877aee5e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 31 Oct 2024 11:37:56 -0700 Subject: [PATCH 146/235] mm/damon/core: handle zero {aggregation,ops_update} intervals Patch series "mm/damon/core: fix handling of zero non-sampling intervals". DAMON's internal intervals accounting logic does not correctly handle non-sampling intervals of zero value, due to a wrong assumption. This could cause unexpected monitoring behavior, and even result in an infinite hang of DAMON sysfs interface user threads in the case of a zero aggregation interval. Fix those by updating the intervals accounting logic. For details of the root cause and the solutions, please refer to the commit messages of the fixes. This patch (of 2): DAMON's logic to determine whether it is time to do an aggregation or an ops update assumes next_{aggregation,ops_update}_sis are always set larger than the current passed_sample_intervals. It therefore further assumes that continuously incrementing passed_sample_intervals every sampling interval will make it reach next_{aggregation,ops_update}_sis in the future. The logic hence takes the action and updates next_{aggregation,ops_update}_sis only if passed_sample_intervals is equal to those counts, respectively. If the aggregation interval or the ops update interval is zero, however, next_aggregation_sis or next_ops_update_sis is set equal to the current passed_sample_intervals, respectively. And passed_sample_intervals is incremented before doing the next_{aggregation,ops_update}_sis check. Hence, passed_sample_intervals becomes larger than next_{aggregation,ops_update}_sis, and the logic says it is not the time to do the action and update next_{aggregation,ops_update}_sis forever, until an overflow happens. In other words, DAMON effectively stops doing aggregations or ops updates forever, and users cannot get monitoring results. Based on the documentation and common sense, a reasonable behavior for such inputs is to do an aggregation and an ops update for every sampling interval. Handle the case by removing the assumption. Note that this could cause a real issue for DAMON sysfs interface users in the case of a zero aggregation interval. When a user starts DAMON with a zero aggregation interval and asks for online DAMON parameter tuning via the DAMON sysfs interface, the request is handled by the aggregation callback. Until the callback finishes the work, the user who requested the online tuning just waits. Hence, the user will be stuck until passed_sample_intervals overflows.
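A tiny simulation of the off-by-comparison (illustrative C, not kernel code; the same reasoning applies to the schemes-apply fix in the next patch):

    unsigned long passed = 0, next_aggr = 0; /* zero aggregation interval */

    for (int i = 0; i < 3; i++) {
            passed++;                 /* incremented before the check */
            if (passed == next_aggr)  /* old check: never true, passed is
                                         already past next_aggr */
                    ;
            if (passed >= next_aggr)  /* new check: fires on every sample */
                    next_aggr = passed + 0 /* aggr / sample interval */;
    }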
Link: https://lkml.kernel.org/r/20241031183757.49610-1-sj@kernel.org Link: https://lkml.kernel.org/r/20241031183757.49610-2-sj@kernel.org Fixes: 4472edf63d66 ("mm/damon/core: use number of passed access sampling as a timer") Signed-off-by: SeongJae Park Cc: stable@vger.kernel.org [6.7.x] Signed-off-by: Andrew Morton --- mm/damon/core.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index a83f3b736d51..3131a07569e4 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2000,7 +2000,7 @@ static int kdamond_fn(void *data) if (ctx->ops.check_accesses) max_nr_accesses = ctx->ops.check_accesses(ctx); - if (ctx->passed_sample_intervals == next_aggregation_sis) { + if (ctx->passed_sample_intervals >= next_aggregation_sis) { kdamond_merge_regions(ctx, max_nr_accesses / 10, sz_limit); @@ -2018,7 +2018,7 @@ static int kdamond_fn(void *data) sample_interval = ctx->attrs.sample_interval ? ctx->attrs.sample_interval : 1; - if (ctx->passed_sample_intervals == next_aggregation_sis) { + if (ctx->passed_sample_intervals >= next_aggregation_sis) { ctx->next_aggregation_sis = next_aggregation_sis + ctx->attrs.aggr_interval / sample_interval; @@ -2028,7 +2028,7 @@ static int kdamond_fn(void *data) ctx->ops.reset_aggregated(ctx); } - if (ctx->passed_sample_intervals == next_ops_update_sis) { + if (ctx->passed_sample_intervals >= next_ops_update_sis) { ctx->next_ops_update_sis = next_ops_update_sis + ctx->attrs.ops_update_interval / sample_interval; From 8e7bde615f634a82a44b1f3d293c049fd3ef9ca9 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 31 Oct 2024 11:37:57 -0700 Subject: [PATCH 147/235] mm/damon/core: handle zero schemes apply interval DAMON's logic to determine whether it is time to apply damos schemes assumes next_apply_sis is always set larger than the current passed_sample_intervals. It therefore assumes that continuously incrementing passed_sample_intervals will make it reach next_apply_sis in the future. The logic hence applies the scheme and updates next_apply_sis only if passed_sample_intervals is equal to next_apply_sis. If the schemes apply interval is set to zero, however, next_apply_sis is set equal to the current passed_sample_intervals. And passed_sample_intervals is incremented before doing the next_apply_sis check. Hence, passed_sample_intervals becomes larger than next_apply_sis, and the logic says it is not the time to apply schemes and update next_apply_sis. In other words, DAMON stops applying schemes until passed_sample_intervals overflows. Based on the documentation and common sense, a reasonable behavior for such inputs would be applying the schemes for every sampling interval. Handle the case by removing the assumption.
Link: https://lkml.kernel.org/r/20241031183757.49610-3-sj@kernel.org Fixes: 42f994b71404 ("mm/damon/core: implement scheme-specific apply interval") Signed-off-by: SeongJae Park Cc: stable@vger.kernel.org [6.7.x] Signed-off-by: Andrew Morton --- mm/damon/core.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index 3131a07569e4..ce700e694b63 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1412,7 +1412,7 @@ static void damon_do_apply_schemes(struct damon_ctx *c, damon_for_each_scheme(s, c) { struct damos_quota *quota = &s->quota; - if (c->passed_sample_intervals != s->next_apply_sis) + if (c->passed_sample_intervals < s->next_apply_sis) continue; if (!s->wmarks.activated) @@ -1622,7 +1622,7 @@ static void kdamond_apply_schemes(struct damon_ctx *c) bool has_schemes_to_apply = false; damon_for_each_scheme(s, c) { - if (c->passed_sample_intervals != s->next_apply_sis) + if (c->passed_sample_intervals < s->next_apply_sis) continue; if (!s->wmarks.activated) @@ -1642,9 +1642,9 @@ static void kdamond_apply_schemes(struct damon_ctx *c) } damon_for_each_scheme(s, c) { - if (c->passed_sample_intervals != s->next_apply_sis) + if (c->passed_sample_intervals < s->next_apply_sis) continue; - s->next_apply_sis += + s->next_apply_sis = c->passed_sample_intervals + (s->apply_interval_us ? s->apply_interval_us : c->attrs.aggr_interval) / sample_interval; } From 4401e9d10ab0281a520b9f8c220f30f60b5c248f Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 31 Oct 2024 09:12:03 -0700 Subject: [PATCH 148/235] mm/damon/core: avoid overflow in damon_feed_loop_next_input() damon_feed_loop_next_input() is inefficient and fragile to overflows. Specifically, the 'score_goal_diff_bp' calculation can overflow when 'score' is high. The calculation is actually unnecessary because 'goal' is a constant with value 10,000. The calculation of 'compensation' is again fragile to overflow, and the final calculation of the return value for the under-achieving case is again fragile to overflow when the current score is under-achieving the target. Add handling of two corner cases at the beginning of the function to make the body easier to read, and rewrite the body of the function to avoid overflows and the unnecessary bp value calculation.
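A worked example of the overflow the rewrite avoids (a sketch in 64-bit unsigned arithmetic; the concrete numbers are illustrative):

    /*
     * Old path, with goal = 10,000:
     *   score   = 5,000,000,000  (badly over-achieving)
     *   diff    = score - goal   ~= 5e9
     *   diff_bp = diff * 10,000  ~= 5e13   (still fits in 64 bits)
     *   compensation = last_input * diff_bp / 10,000
     *     -> the multiply overflows ULONG_MAX for any
     *        last_input > ~368,934 (ULONG_MAX / 5e13).
     *
     * New path:
     *   score >= 2 * goal  ->  return min_input, no multiply at all.
     *   Otherwise diff < goal = 10,000, and the multiply is guarded:
     *     if (last_input < ULONG_MAX / diff)
     *             compensation = last_input * diff / goal;
     *     else
     *             compensation = last_input / goal * diff;
     */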
Link: https://lkml.kernel.org/r/20241031161203.47751-1-sj@kernel.org Fixes: 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") Signed-off-by: SeongJae Park Reported-by: Guenter Roeck Closes: https://lore.kernel.org/944f3d5b-9177-48e7-8ec9-7f1331a3fea3@roeck-us.net Tested-by: Guenter Roeck Cc: [6.8.x] Signed-off-by: Andrew Morton --- mm/damon/core.c | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index ce700e694b63..511c3f61ab44 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1456,17 +1456,31 @@ static unsigned long damon_feed_loop_next_input(unsigned long last_input, unsigned long score) { const unsigned long goal = 10000; - unsigned long score_goal_diff = max(goal, score) - min(goal, score); - unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal; - unsigned long compensation = last_input * score_goal_diff_bp / 10000; /* Set minimum input as 10000 to avoid compensation be zero */ const unsigned long min_input = 10000; + unsigned long score_goal_diff, compensation; + bool over_achieving = score > goal; - if (goal > score) + if (score == goal) + return last_input; + if (score >= goal * 2) + return min_input; + + if (over_achieving) + score_goal_diff = score - goal; + else + score_goal_diff = goal - score; + + if (last_input < ULONG_MAX / score_goal_diff) + compensation = last_input * score_goal_diff / goal; + else + compensation = last_input / goal * score_goal_diff; + + if (over_achieving) + return max(last_input - compensation, min_input); + if (last_input < ULONG_MAX - compensation) return last_input + compensation; - if (last_input > compensation + min_input) - return last_input - compensation; - return min_input; + return ULONG_MAX; } #ifdef CONFIG_PSI From 652e1a51465f2e8e75590bc3dd1e3a3b61020568 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C3=ADra=20Canal?= Date: Fri, 1 Nov 2024 13:54:05 -0300 Subject: [PATCH 149/235] mm: fix docs for the kernel parameter ``thp_anon=`` MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If we add ``thp_anon=32,64K:always`` to the kernel command line, we will see the following error: [ 0.000000] huge_memory: thp_anon=32,64K:always: error parsing string, ignoring setting This happens because the correct format isn't ``thp_anon=<size>,<size>[KMG]:<state>``, as [KMG] must follow each number to specify its unit. So, the correct format is ``thp_anon=<size>[KMG],<size>[KMG]:<state>``. Therefore, adjust the documentation to reflect the correct format of the parameter ``thp_anon=``. Link: https://lkml.kernel.org/r/20241101165719.1074234-3-mcanal@igalia.com Fixes: dd4d30d1cdbe ("mm: override mTHP "enabled" defaults at kernel cmdline") Signed-off-by: Maíra Canal Acked-by: Barry Song Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Hugh Dickins Cc: Jonathan Corbet Cc: Lance Yang Cc: Ryan Roberts Signed-off-by: Andrew Morton --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/mm/transhuge.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1518343bbe22..1666576acc0e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6688,7 +6688,7 @@ 0: no polling (default) thp_anon= [KNL] - Format: <size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state> + Format: <size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state> <state> is one of "always", "madvise", "never" or "inherit".
Control the default behavior of the system with respect to anonymous transparent hugepages. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index cfdd16a52e39..a1bb495eab59 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -303,7 +303,7 @@ control by passing the parameter ``transparent_hugepage=always`` or kernel command line. Alternatively, each supported anonymous THP size can be controlled by -passing ``thp_anon=<size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state>``, +passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``, where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, ``never`` or ``inherit``. From 0268d4579901821ff17259213c2d8c9679995d48 Mon Sep 17 00:00:00 2001 From: Muhammad Usama Anjum Date: Fri, 1 Nov 2024 19:15:57 +0500 Subject: [PATCH 150/235] selftests: hugetlb_dio: check for initial conditions to skip in the start The test should be skipped if the initial conditions aren't fulfilled at the start, instead of failing and outputting non-compliant TAP logs. This kind of failure pollutes the results. The initial conditions are: - The test should only execute if a /tmp file can be allocated. - The test should only execute if huge pages are free. Before: TAP version 13 1..4 Bail out! Error opening file : Read-only file system (30) # Planned tests != run tests (4 != 0) # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0 After: TAP version 13 1..0 # SKIP Unable to allocate file: Read-only file system Link: https://lkml.kernel.org/r/20241101141557.3159432-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum Fixes: 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked during __bio_release_pages()") Cc: Muhammad Usama Anjum Cc: Shuah Khan Cc: Donet Tom Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/hugetlb_dio.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c index f9ac20c657ec..60001c142ce9 100644 --- a/tools/testing/selftests/mm/hugetlb_dio.c +++ b/tools/testing/selftests/mm/hugetlb_dio.c @@ -44,13 +44,6 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) if (fd < 0) ksft_exit_fail_perror("Error opening file\n"); - /* Get the free huge pages before allocation */ - free_hpage_b = get_free_hugepages(); - if (free_hpage_b == 0) { - close(fd); - ksft_exit_skip("No free hugepage, exiting!\n"); - } - /* Allocate a hugetlb page */ orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0); if (orig_buffer == MAP_FAILED) { @@ -94,8 +87,20 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) int main(void) { size_t pagesize = 0; + int fd; ksft_print_header(); + + /* Open the file to DIO */ + fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); + if (fd < 0) + ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno)); + close(fd); + + /* Check if huge pages are free */ + if (!get_free_hugepages()) + ksft_exit_skip("No free hugepage, exiting\n"); + ksft_set_plan(4); /* Get base page size */ From 432dc0654c612457285a5dcf9bb13968ac6f0804 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Fri, 1 Nov 2024 19:19:40 +0000 Subject: [PATCH 151/235] ucounts: fix counter leak in inc_rlimit_get_ucounts() inc_rlimit_get_ucounts() increments the specified rlimit counter and then checks its limit.
If the value exceeds the limit, the function returns an error without decrementing the counter. Link: https://lkml.kernel.org/r/20241101191940.3211128-1-roman.gushchin@linux.dev Fixes: 15bc01effefe ("ucounts: Fix signal ucount refcounting") Signed-off-by: Andrei Vagin Co-developed-by: Roman Gushchin Signed-off-by: Roman Gushchin Tested-by: Roman Gushchin Acked-by: Alexey Gladkov Cc: Kees Cook Cc: Andrei Vagin Cc: "Eric W. Biederman" Cc: Alexey Gladkov Cc: Oleg Nesterov Cc: Signed-off-by: Andrew Morton --- kernel/ucount.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/ucount.c b/kernel/ucount.c index 8c07714ff27d..9469102c5ac0 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -317,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]); if (new < 0 || new > max) - goto unwind; + goto dec_unwind; if (iter == ucounts) ret = new; max = get_userns_rlimit_max(iter->ns, type); @@ -334,7 +334,6 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) dec_unwind: dec = atomic_long_sub_return(1, &iter->rlimit[type]); WARN_ON_ONCE(dec < 0); -unwind: do_dec_rlimit_put_ucounts(ucounts, iter, type); return 0; } From b8ee299855f08539e04d6c1a6acb3dc9e5423c00 Mon Sep 17 00:00:00 2001 From: Qi Xi Date: Fri, 1 Nov 2024 11:48:03 +0800 Subject: [PATCH 152/235] fs/proc: fix compile warning about variable 'vmcore_mmap_ops' When built with !CONFIG_MMU, the variable 'vmcore_mmap_ops' is defined but not used: >> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops' 458 | static const struct vm_operations_struct vmcore_mmap_ops = { Fix this by only defining it when CONFIG_MMU is enabled. Link: https://lkml.kernel.org/r/20241101034803.9298-1-xiqi2@huawei.com Fixes: 9cb218131de1 ("vmcore: introduce remap_oldmem_pfn_range()") Signed-off-by: Qi Xi Reported-by: kernel test robot Closes: https://lore.kernel.org/lkml/202410301936.GcE8yUos-lkp@intel.com/ Cc: Baoquan He Cc: Dave Young Cc: Michael Holzheu Cc: Vivek Goyal Cc: Wang ShaoBo Signed-off-by: Andrew Morton --- fs/proc/vmcore.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index b52d85f8ad59..b4521b096058 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -457,10 +457,6 @@ static vm_fault_t mmap_vmcore_fault(struct vm_fault *vmf) #endif } -static const struct vm_operations_struct vmcore_mmap_ops = { - .fault = mmap_vmcore_fault, -}; - /** * vmcore_alloc_buf - allocate buffer in vmalloc memory * @size: size of buffer @@ -488,6 +484,11 @@ static inline char *vmcore_alloc_buf(size_t size) * virtually contiguous user-space in ELF layout. */ #ifdef CONFIG_MMU + +static const struct vm_operations_struct vmcore_mmap_ops = { + .fault = mmap_vmcore_fault, +}; + /* * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages * reported as not being ram with the zero page. From 9e05e5c7ee8758141d2db7e8fea2cab34500c6ed Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 4 Nov 2024 19:54:19 +0000 Subject: [PATCH 153/235] signal: restore the override_rlimit logic Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") the UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However, it is now enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with Java applications. Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skipping the comparison to max there if override_rlimit is set. This effectively restores the old behavior. Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin Co-developed-by: Andrei Vagin Signed-off-by: Andrei Vagin Acked-by: Oleg Nesterov Acked-by: Alexey Gladkov Cc: Kees Cook Cc: "Eric W. Biederman" Cc: Signed-off-by: Andrew Morton --- include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 6 ++++-- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, + bool override_rlimit); void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max); diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t); - sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, + override_rlimit); rcu_read_unlock(); if (!sigpending) return NULL; diff --git a/kernel/ucount.c b/kernel/ucount.c index 9469102c5ac0..696406939be5 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); } -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, + bool override_rlimit) { /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -320,7 +321,8 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) goto dec_unwind; if (iter == ucounts) ret = new; - max = get_userns_rlimit_max(iter->ns, type); + if (!override_rlimit) + max = get_userns_rlimit_max(iter->ns, type); /* * Grab an extra ucount reference for the caller when * the rlimit count was previously 0.
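A tiny user-space sketch of the pattern the two ucounts patches restore (illustrative only; try_charge() and these names are made up, not kernel API): a caller that must not fail passes an override flag, which exempts it from the cap while still charging, and a failed charge is always unwound.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

static long counter;		/* the rlimit-style counter  */
static const long cap = 4;	/* the configured hard limit */

/* Charge one unit; return the new count, or 0 if over the cap. */
static long try_charge(bool override_rlimit)
{
	long max = override_rlimit ? LONG_MAX : cap;
	long new = ++counter;

	if (new > max) {
		counter--;	/* unwind the charge on failure */
		return 0;
	}
	return new;
}

int main(void)
{
	for (int i = 0; i < 5; i++)
		printf("normal charge -> %ld\n", try_charge(false));
	/* A forced, must-not-fail charge still succeeds past the cap. */
	printf("override charge -> %ld\n", try_charge(true));
	return 0;
}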
From 0b63c0e01fba40e3992bc627272ec7b618ccaef7 Mon Sep 17 00:00:00 2001 From: Andrew Kanner Date: Sun, 3 Nov 2024 20:38:45 +0100 Subject: [PATCH 154/235] ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() Syzkaller is able to provoke a null-ptr-dereference in ocfs2_xa_remove(): [ 57.319872] (a.out,1161,7):ocfs2_xa_remove:2028 ERROR: status = -12 [ 57.320420] (a.out,1161,7):ocfs2_xa_cleanup_value_truncate:1999 ERROR: Partial truncate while removing xattr overlay.upper. Leaking 1 clusters and removing the entry [ 57.321727] BUG: kernel NULL pointer dereference, address: 0000000000000004 [...] [ 57.325727] RIP: 0010:ocfs2_xa_block_wipe_namevalue+0x2a/0xc0 [...] [ 57.331328] Call Trace: [ 57.331477] [...] [ 57.333511] ? do_user_addr_fault+0x3e5/0x740 [ 57.333778] ? exc_page_fault+0x70/0x170 [ 57.334016] ? asm_exc_page_fault+0x2b/0x30 [ 57.334263] ? __pfx_ocfs2_xa_block_wipe_namevalue+0x10/0x10 [ 57.334596] ? ocfs2_xa_block_wipe_namevalue+0x2a/0xc0 [ 57.334913] ocfs2_xa_remove_entry+0x23/0xc0 [ 57.335164] ocfs2_xa_set+0x704/0xcf0 [ 57.335381] ? _raw_spin_unlock+0x1a/0x40 [ 57.335620] ? ocfs2_inode_cache_unlock+0x16/0x20 [ 57.335915] ? trace_preempt_on+0x1e/0x70 [ 57.336153] ? start_this_handle+0x16c/0x500 [ 57.336410] ? preempt_count_sub+0x50/0x80 [ 57.336656] ? _raw_read_unlock+0x20/0x40 [ 57.336906] ? start_this_handle+0x16c/0x500 [ 57.337162] ocfs2_xattr_block_set+0xa6/0x1e0 [ 57.337424] __ocfs2_xattr_set_handle+0x1fd/0x5d0 [ 57.337706] ? ocfs2_start_trans+0x13d/0x290 [ 57.337971] ocfs2_xattr_set+0xb13/0xfb0 [ 57.338207] ? dput+0x46/0x1c0 [ 57.338393] ocfs2_xattr_trusted_set+0x28/0x30 [ 57.338665] ? ocfs2_xattr_trusted_set+0x28/0x30 [ 57.338948] __vfs_removexattr+0x92/0xc0 [ 57.339182] __vfs_removexattr_locked+0xd5/0x190 [ 57.339456] ? preempt_count_sub+0x50/0x80 [ 57.339705] vfs_removexattr+0x5f/0x100 [...] The reproducer uses the fault injection facility to fail ocfs2_xa_remove() -> ocfs2_xa_value_truncate() with -ENOMEM. In this case the comment mentions that we can return 0 if ocfs2_xa_cleanup_value_truncate() is going to wipe the entry anyway. But the following 'rc' check is wrong, and the execution flow does 'ocfs2_xa_remove_entry(loc);' twice: * 1st: in ocfs2_xa_cleanup_value_truncate(); * 2nd: after returning to ocfs2_xa_remove(), instead of going to 'out'. Fix this by skipping the 2nd removal of the same entry, making the syzkaller repro happy.
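A stripped-down user-space model of the fixed control flow may help (illustrative only, not ocfs2 code): in the buggy version, "if (rc) goto out;" was tested immediately after rc had been zeroed, so the goto was dead code and control fell through to a second removal of the already-wiped entry.

#include <stdio.h>
#include <stdlib.h>

static int *entry;

static void remove_entry(void)
{
	printf("removing entry %d\n", *entry);
	free(entry);
	entry = NULL;	/* a second call would dereference NULL */
}

int main(void)
{
	int rc = -12;			/* -ENOMEM from the value truncate */

	entry = malloc(sizeof(*entry));
	if (!entry)
		return 1;
	*entry = 42;

	if (rc) {
		rc = 0;			/* cleanup wipes the entry anyway */
		remove_entry();		/* first (and only) removal */
		goto out;		/* the fix: skip the second removal */
	}
	remove_entry();			/* normal single-removal path */
out:
	return rc;
}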
Link: https://lkml.kernel.org/r/20241103193845.2940988-1-andrew.kanner@gmail.com Fixes: 399ff3a748cf ("ocfs2: Handle errors while setting external xattr values.") Signed-off-by: Andrew Kanner Reported-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/671e13ab.050a0220.2b8c0f.01d0.GAE@google.com/T/ Tested-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/xattr.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c index dd0a05365e79..73a6f6fd8a8e 100644 --- a/fs/ocfs2/xattr.c +++ b/fs/ocfs2/xattr.c @@ -2036,8 +2036,7 @@ static int ocfs2_xa_remove(struct ocfs2_xa_loc *loc, rc = 0; ocfs2_xa_cleanup_value_truncate(loc, "removing", orig_clusters); - if (rc) - goto out; + goto out; } } From c289f4de8e479251b64988839fd0e87f246e03a2 Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Mon, 4 Nov 2024 00:44:09 +0100 Subject: [PATCH 155/235] mailmap: add entry for Thorsten Blum Map my previously used email address to my @linux.dev address. Link: https://lkml.kernel.org/r/20241103234411.2522-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum Cc: Alex Elder Cc: David S. Miller Cc: Geliang Tang Cc: Kees Cook Cc: Mathieu Othacehe Cc: Matthieu Baerts (NGI0) Cc: Matt Ranostay Cc: Naoya Horiguchi Cc: Neeraj Upadhyay Cc: Quentin Monnet Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 5378f04b2566..5e829da09e7f 100644 --- a/.mailmap +++ b/.mailmap @@ -665,6 +665,7 @@ Tomeu Vizoso Thomas Graf Thomas Körper Thomas Pedersen +Thorsten Blum Tiezhu Yang Tingwei Zhang Tirupathi Reddy From ca43f73cd1720e3b0b9c49deec1a13c89c0ca1e8 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 7 Nov 2024 21:48:33 -0500 Subject: [PATCH 156/235] bcachefs: bch2_btree_write_buffer_flush_going_ro() The write buffer needs to be specifically flushed when going RO: keys in the journal that haven't yet been moved to the write buffer don't have a journal pin yet. This fixes numerous syzbot bugs, all with symptoms of still doing writes after we've got RO. 
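For context on the diff that follows: the read-only path drains every source of pending work in a loop until a full pass finds nothing to do. A stripped-down user-space model of that "clean passes" idiom (illustrative only; none of these names are bcachefs API):

#include <stdbool.h>
#include <stdio.h>

static int pending_flushes = 3;

static bool flush_one(void)
{
	if (!pending_flushes)
		return false;	/* nothing to do this pass */
	pending_flushes--;
	return true;		/* did work; state may have changed */
}

int main(void)
{
	unsigned int clean_passes = 0;

	do {
		clean_passes++;
		if (flush_one())
			clean_passes = 0;	/* work done: start over */
	} while (clean_passes < 2);

	printf("quiesced after draining all pending work\n");
	return 0;
}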
Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_write_buffer.c | 30 +++++++++++++++++++++++++++--- fs/bcachefs/btree_write_buffer.h | 1 + fs/bcachefs/super.c | 1 + 3 files changed, 29 insertions(+), 3 deletions(-) diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c index 3f56b584f8ec..1639c60dffa0 100644 --- a/fs/bcachefs/btree_write_buffer.c +++ b/fs/bcachefs/btree_write_buffer.c @@ -277,6 +277,10 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans) bool accounting_replay_done = test_bit(BCH_FS_accounting_replay_done, &c->flags); int ret = 0; + ret = bch2_journal_error(&c->journal); + if (ret) + return ret; + bch2_trans_unlock(trans); bch2_trans_begin(trans); @@ -491,7 +495,8 @@ static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 seq) return ret; } -static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 seq) +static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 seq, + bool *did_work) { struct bch_fs *c = trans->c; struct btree_write_buffer *wb = &c->btree_write_buffer; @@ -502,6 +507,8 @@ static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 seq) fetch_from_journal_err = fetch_wb_keys_from_journal(c, seq); + *did_work |= wb->inc.keys.nr || wb->flushing.keys.nr; + /* * On memory allocation failure, bch2_btree_write_buffer_flush_locked() * is not guaranteed to empty wb->inc: @@ -521,17 +528,34 @@ static int bch2_btree_write_buffer_journal_flush(struct journal *j, struct journal_entry_pin *_pin, u64 seq) { struct bch_fs *c = container_of(j, struct bch_fs, journal); + bool did_work = false; - return bch2_trans_run(c, btree_write_buffer_flush_seq(trans, seq)); + return bch2_trans_run(c, btree_write_buffer_flush_seq(trans, seq, &did_work)); } int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans) { struct bch_fs *c = trans->c; + bool did_work = false; trace_and_count(c, write_buffer_flush_sync, trans, _RET_IP_); - return btree_write_buffer_flush_seq(trans, journal_cur_seq(&c->journal)); + return btree_write_buffer_flush_seq(trans, journal_cur_seq(&c->journal), &did_work); +} + +/* + * The write buffer requires flushing when going RO: keys in the journal for the + * write buffer don't have a journal pin yet + */ +bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *c) +{ + if (bch2_journal_error(&c->journal)) + return false; + + bool did_work = false; + bch2_trans_run(c, btree_write_buffer_flush_seq(trans, + journal_cur_seq(&c->journal), &did_work)); + return did_work; } int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *trans) diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h index 725e79654216..d535cea28bde 100644 --- a/fs/bcachefs/btree_write_buffer.h +++ b/fs/bcachefs/btree_write_buffer.h @@ -21,6 +21,7 @@ static inline bool bch2_btree_write_buffer_must_wait(struct bch_fs *c) struct btree_trans; int bch2_btree_write_buffer_flush_sync(struct btree_trans *); +bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *); int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *); int bch2_btree_write_buffer_tryflush(struct btree_trans *); diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c index 657fd3759e7b..a6ed9a0bf1c7 100644 --- a/fs/bcachefs/super.c +++ b/fs/bcachefs/super.c @@ -272,6 +272,7 @@ static void __bch2_fs_read_only(struct bch_fs *c) clean_passes++; if (bch2_btree_interior_updates_flush(c) || + bch2_btree_write_buffer_flush_going_ro(c) || bch2_journal_flush_all_pins(&c->journal) || 
bch2_btree_flush_all_writes(c) || seq != atomic64_read(&c->journal.seq)) { From 27a036a0c3e7046f508143af96a54f657c3584b8 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 7 Nov 2024 23:24:22 -0500 Subject: [PATCH 157/235] bcachefs: Fix bch_member.btree_bitmap_shift validation Needs to match the assert later when we resize... Reported-by: syzbot+e8eff054face85d7ea41@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/sb-members.c | 4 ++-- fs/bcachefs/sb-members_format.h | 6 ++++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/bcachefs/sb-members.c b/fs/bcachefs/sb-members.c index fb08dd680dac..116131f95815 100644 --- a/fs/bcachefs/sb-members.c +++ b/fs/bcachefs/sb-members.c @@ -163,7 +163,7 @@ static int validate_member(struct printbuf *err, return -BCH_ERR_invalid_sb_members; } - if (m.btree_bitmap_shift >= 64) { + if (m.btree_bitmap_shift >= BCH_MI_BTREE_BITMAP_SHIFT_MAX) { prt_printf(err, "device %u: invalid btree_bitmap_shift %u", i, m.btree_bitmap_shift); return -BCH_ERR_invalid_sb_members; } @@ -450,7 +450,7 @@ static void __bch2_dev_btree_bitmap_mark(struct bch_sb_field_members_v2 *mi, uns m->btree_bitmap_shift += resize; } - BUG_ON(m->btree_bitmap_shift > 57); + BUG_ON(m->btree_bitmap_shift >= BCH_MI_BTREE_BITMAP_SHIFT_MAX); BUG_ON(end > 64ULL << m->btree_bitmap_shift); for (unsigned bit = start >> m->btree_bitmap_shift; diff --git a/fs/bcachefs/sb-members_format.h b/fs/bcachefs/sb-members_format.h index d727d2dfda08..2adf1221a440 100644 --- a/fs/bcachefs/sb-members_format.h +++ b/fs/bcachefs/sb-members_format.h @@ -65,6 +65,12 @@ struct bch_member { __le32 last_journal_bucket_offset; }; +/* + * btree_allocated_bitmap can represent sector addresses of a u64: it itself has + * 64 elements, so 64 - ilog2(64) + */ +#define BCH_MI_BTREE_BITMAP_SHIFT_MAX 58 + /* * This limit comes from the bucket_gens array - it's a single allocation, and * kernel allocation are limited to INT_MAX From f8f1dde6868139f2294786365c56d7ff5cc3f4e7 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 7 Nov 2024 22:18:02 -0500 Subject: [PATCH 158/235] bcachefs: Fix missing validation for bch_backpointer.level This fixes an assertion pop where we try to navigate to the target of the backpointer, and the path level isn't what we expect. 
Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/backpointers.c | 7 ++++++- fs/bcachefs/sb-errors_format.h | 6 +++++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/fs/bcachefs/backpointers.c b/fs/bcachefs/backpointers.c index 47455a85c909..ed344400c8bd 100644 --- a/fs/bcachefs/backpointers.c +++ b/fs/bcachefs/backpointers.c @@ -52,6 +52,12 @@ int bch2_backpointer_validate(struct bch_fs *c, struct bkey_s_c k, enum bch_validate_flags flags) { struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k); + int ret = 0; + + bkey_fsck_err_on(bp.v->level > BTREE_MAX_DEPTH, + c, backpointer_level_bad, + "backpointer level bad: %u >= %u", + bp.v->level, BTREE_MAX_DEPTH); rcu_read_lock(); struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp.k->p.inode); @@ -64,7 +70,6 @@ int bch2_backpointer_validate(struct bch_fs *c, struct bkey_s_c k, struct bpos bucket = bp_pos_to_bucket(ca, bp.k->p); struct bpos bp_pos = bucket_pos_to_bp_noerror(ca, bucket, bp.v->bucket_offset); rcu_read_unlock(); - int ret = 0; bkey_fsck_err_on((bp.v->bucket_offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT) >= ca->mi.bucket_size || !bpos_eq(bp.k->p, bp_pos), diff --git a/fs/bcachefs/sb-errors_format.h b/fs/bcachefs/sb-errors_format.h index 937275d061fe..9feb6739f77a 100644 --- a/fs/bcachefs/sb-errors_format.h +++ b/fs/bcachefs/sb-errors_format.h @@ -136,7 +136,9 @@ enum bch_fsck_flags { x(bucket_gens_nonzero_for_invalid_buckets, 122, FSCK_AUTOFIX) \ x(need_discard_freespace_key_to_invalid_dev_bucket, 123, 0) \ x(need_discard_freespace_key_bad, 124, 0) \ + x(discarding_bucket_not_in_need_discard_btree, 291, 0) \ x(backpointer_bucket_offset_wrong, 125, 0) \ + x(backpointer_level_bad, 294, 0) \ x(backpointer_to_missing_device, 126, 0) \ x(backpointer_to_missing_alloc, 127, 0) \ x(backpointer_to_missing_ptr, 128, 0) \ @@ -177,7 +179,9 @@ enum bch_fsck_flags { x(ptr_stripe_redundant, 163, 0) \ x(reservation_key_nr_replicas_invalid, 164, 0) \ x(reflink_v_refcount_wrong, 165, 0) \ + x(reflink_v_pos_bad, 292, 0) \ x(reflink_p_to_missing_reflink_v, 166, 0) \ + x(reflink_refcount_underflow, 293, 0) \ x(stripe_pos_bad, 167, 0) \ x(stripe_val_size_bad, 168, 0) \ x(stripe_csum_granularity_bad, 290, 0) \ @@ -302,7 +306,7 @@ enum bch_fsck_flags { x(accounting_key_replicas_devs_unsorted, 280, FSCK_AUTOFIX) \ x(accounting_key_version_0, 282, FSCK_AUTOFIX) \ x(logged_op_but_clean, 283, FSCK_AUTOFIX) \ - x(MAX, 291, 0) + x(MAX, 295, 0) enum bch_sb_error_id { #define x(t, n, ...) BCH_FSCK_ERR_##t = n, From 10299cdde869abab7a42fb5ab905a47a4e2cd24e Mon Sep 17 00:00:00 2001 From: John Sperbeck Date: Tue, 5 Nov 2024 19:40:31 -0800 Subject: [PATCH 159/235] KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB In 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources"), VMX_BASIC_MEM_TYPE_WB was removed. Use X86_MEMTYPE_WB instead. 
Fixes: 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources") Signed-off-by: John Sperbeck Message-ID: <20241106034031.503291-1-jsperbeck@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index 089b8925b6b2..d7ac122820bf 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -200,7 +200,7 @@ static inline void init_vmcs_control_fields(struct vmx_pages *vmx) if (vmx->eptp_gpa) { uint64_t ept_paddr; struct eptPageTablePointer eptp = { - .memory_type = VMX_BASIC_MEM_TYPE_WB, + .memory_type = X86_MEMTYPE_WB, .page_walk_length = 3, /* + 1 */ .ad_enabled = ept_vpid_cap_supported(VMX_EPT_VPID_CAP_AD_BITS), .address = vmx->eptp_gpa >> PAGE_SHIFT_4K, From e3a7792d96765ff435f3000e94619fcef2f6bfec Mon Sep 17 00:00:00 2001 From: Dionna Glaze Date: Tue, 5 Nov 2024 01:05:48 +0000 Subject: [PATCH 160/235] kvm: svm: Fix gctx page leak on invalid inputs Ensure that the SNP gctx page allocated during snp_launch_start() is not leaked when the command fails on invalid inputs. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson CC: Paolo Bonzini CC: Thomas Gleixner CC: Ingo Molnar CC: Borislav Petkov CC: Dave Hansen CC: Ashish Kalra CC: Tom Lendacky CC: John Allen CC: Herbert Xu CC: "David S. Miller" CC: Michael Roth CC: Luis Chamberlain CC: Russ Weight CC: Danilo Krummrich CC: Greg Kroah-Hartman CC: "Rafael J. Wysocki" CC: Tianfei zhang CC: Alexey Kardashevskiy Signed-off-by: Dionna Glaze Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 03c7db3b4cc3..fb854cf20ac3 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2215,10 +2215,6 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) if (sev->snp_context) return -EINVAL; - sev->snp_context = snp_context_create(kvm, argp); - if (!sev->snp_context) - return -ENOTTY; - if (params.flags) return -EINVAL; @@ -2233,6 +2229,10 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) return -EINVAL; + sev->snp_context = snp_context_create(kvm, argp); + if (!sev->snp_context) + return -ENOTTY; + start.gctx_paddr = __psp_pa(sev->snp_context); start.policy = params.policy; memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw)); From d3ddef46f22e8c3124e0df1f325bc6a18dadff39 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 5 Nov 2024 17:51:35 -0800 Subject: [PATCH 161/235] KVM: x86: Unconditionally set irr_pending when updating APICv state Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimately results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case.
The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which, as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky Reported-by: Yong He Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2098dc689088..95c6beb8ce27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; - if (apic->apicv_active) { - /* irr_pending is always true when apicv is activated. */ - apic->irr_pending = true; + /* + * When APICv is enabled, KVM must always search the IRR for a pending + * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU + * isn't running. If APICv is disabled, KVM _should_ search the IRR + * for a pending IRQ. But KVM currently doesn't ensure *all* hardware, + * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching + * the IRR at this time could race with IRQ delivery from hardware that + * still sees APICv as being enabled. + * + * FIXME: Ensure other vCPUs and devices observe the change in APICv + * state prior to updating KVM's metadata caches, so that KVM + * can safely search the IRR and set irr_pending accordingly.
+ */ + apic->irr_pending = true; + + if (apic->apicv_active) apic->isr_count = 1; - } else { - /* - * Don't clear irr_pending, searching the IRR can race with - * updates from the CPU as APICv is still active from hardware's - * perspective. The flag will be cleared as appropriate when - * KVM injects the interrupt. - */ + else apic->isr_count = count_vectors(apic->regs + APIC_ISR); - } + apic->highest_isr_cache = -1; } From aa0d42cacf093a6fcca872edc954f6f812926a17 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 1 Nov 2024 11:50:30 -0700 Subject: [PATCH 162/235] KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter Signed-off-by: Sean Christopherson Reviewed-by: Xiaoyao Li Tested-by: Adrian Hunter Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9886d67d9512..d28618e9277e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -217,9 +217,11 @@ module_param(ple_window_shrink, uint, 0444); static unsigned int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; module_param(ple_window_max, uint, 0444); -/* Default is SYSTEM mode, 1 for host-guest mode */ +/* Default is SYSTEM mode, 1 for host-guest mode (which is BROKEN) */ int __read_mostly pt_mode = PT_MODE_SYSTEM; +#ifdef CONFIG_BROKEN module_param(pt_mode, int, S_IRUGO); +#endif struct x86_pmu_lbr __ro_after_init vmx_lbr_caps; From 8de3e97f3d3d62cd9f3067f073e8ac93261597db Mon Sep 17 00:00:00 2001 From: Liu Peibao Date: Fri, 1 Nov 2024 16:12:43 +0800 Subject: [PATCH 163/235] i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set When the Tx FIFO is empty and the last command has no STOP bit set, the master holds SCL low. If I2C_DYNAMIC_TAR_UPDATE is not set, BIT(13) MST_ON_HOLD of IC_RAW_INTR_STAT is not enabled, causing __i2c_dw_disable() to time out. This is quite similar to commit 2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low"). Also check BIT(7) MST_HOLD_TX_FIFO_EMPTY in IC_STATUS, which is available when IC_STAT_FOR_CLK_STRETCH is set.
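The shape of the fix is a two-register check: abort handling is now triggered by either the interrupt bit or the status bit. A minimal user-space sketch of that decision (illustrative only; the bit positions mirror the patch, abort_needed() here is not the driver's function):

#include <stdbool.h>
#include <stdio.h>

#define INTR_MST_ON_HOLD		(1u << 13)	/* IC_RAW_INTR_STAT */
#define STATUS_MST_HOLD_TX_FIFO_EMPTY	(1u << 7)	/* IC_STATUS */

/* Abort if either register reports the master holding the bus. */
static bool abort_needed(unsigned int raw_intr_stat, unsigned int ic_status)
{
	return (raw_intr_stat & INTR_MST_ON_HOLD) ||
	       (ic_status & STATUS_MST_HOLD_TX_FIFO_EMPTY);
}

int main(void)
{
	/* Controller without I2C_DYNAMIC_TAR_UPDATE: only the status
	 * bit is set, yet the abort is still detected. */
	printf("%d\n", abort_needed(0, STATUS_MST_HOLD_TX_FIFO_EMPTY));
	printf("%d\n", abort_needed(0, 0));
	return 0;
}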
Fixes: 2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low") Co-developed-by: Xiaowu Ding Signed-off-by: Xiaowu Ding Co-developed-by: Angus Chen Signed-off-by: Angus Chen Signed-off-by: Liu Peibao Acked-by: Jarkko Nikula Signed-off-by: Andi Shyti --- drivers/i2c/busses/i2c-designware-common.c | 6 ++++-- drivers/i2c/busses/i2c-designware-core.h | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/i2c/busses/i2c-designware-common.c b/drivers/i2c/busses/i2c-designware-common.c index f31d352d98b5..9d88b4fa03e4 100644 --- a/drivers/i2c/busses/i2c-designware-common.c +++ b/drivers/i2c/busses/i2c-designware-common.c @@ -524,7 +524,7 @@ int i2c_dw_set_sda_hold(struct dw_i2c_dev *dev) void __i2c_dw_disable(struct dw_i2c_dev *dev) { struct i2c_timings *t = &dev->timings; - unsigned int raw_intr_stats; + unsigned int raw_intr_stats, ic_stats; unsigned int enable; int timeout = 100; bool abort_needed; @@ -532,9 +532,11 @@ void __i2c_dw_disable(struct dw_i2c_dev *dev) int ret; regmap_read(dev->map, DW_IC_RAW_INTR_STAT, &raw_intr_stats); + regmap_read(dev->map, DW_IC_STATUS, &ic_stats); regmap_read(dev->map, DW_IC_ENABLE, &enable); - abort_needed = raw_intr_stats & DW_IC_INTR_MST_ON_HOLD; + abort_needed = (raw_intr_stats & DW_IC_INTR_MST_ON_HOLD) || + (ic_stats & DW_IC_STATUS_MASTER_HOLD_TX_FIFO_EMPTY); if (abort_needed) { if (!(enable & DW_IC_ENABLE_ENABLE)) { regmap_write(dev->map, DW_IC_ENABLE, DW_IC_ENABLE_ENABLE); diff --git a/drivers/i2c/busses/i2c-designware-core.h b/drivers/i2c/busses/i2c-designware-core.h index 8e8854ec9882..2d32896d0673 100644 --- a/drivers/i2c/busses/i2c-designware-core.h +++ b/drivers/i2c/busses/i2c-designware-core.h @@ -116,6 +116,7 @@ #define DW_IC_STATUS_RFNE BIT(3) #define DW_IC_STATUS_MASTER_ACTIVITY BIT(5) #define DW_IC_STATUS_SLAVE_ACTIVITY BIT(6) +#define DW_IC_STATUS_MASTER_HOLD_TX_FIFO_EMPTY BIT(7) #define DW_IC_SDA_HOLD_RX_SHIFT 16 #define DW_IC_SDA_HOLD_RX_MASK GENMASK(23, 16) From dc537189b5cf09e61839491fc6a465c5659d7dbd Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Fri, 8 Nov 2024 00:00:19 -0500 Subject: [PATCH 164/235] bcachefs: Fix validate_bset() repair path When we truncate a bset (due to it extending past the end of the btree node), we can't skip the rest of the validation for e.g. the packed format (if it's the first bset in the node). 
Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_io.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/fs/bcachefs/btree_io.c b/fs/bcachefs/btree_io.c index 6296a11ccb09..839d68802e42 100644 --- a/fs/bcachefs/btree_io.c +++ b/fs/bcachefs/btree_io.c @@ -733,11 +733,8 @@ static int validate_bset(struct bch_fs *c, struct bch_dev *ca, c, ca, b, i, NULL, bset_past_end_of_btree_node, "bset past end of btree node (offset %u len %u but written %zu)", - offset, sectors, ptr_written ?: btree_sectors(c))) { + offset, sectors, ptr_written ?: btree_sectors(c))) i->u64s = 0; - ret = 0; - goto out; - } btree_err_on(offset && !i->u64s, -BCH_ERR_btree_node_read_err_fixable, @@ -829,7 +826,6 @@ static int validate_bset(struct bch_fs *c, struct bch_dev *ca, BSET_BIG_ENDIAN(i), write, &bn->format); } -out: fsck_err: printbuf_exit(&buf2); printbuf_exit(&buf1); From bcf77a05fb3d6210026483703bcacb22ed961c99 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Fri, 8 Nov 2024 00:25:18 -0500 Subject: [PATCH 165/235] bcachefs: Fix hidden btree errors when reading roots We silence btree errors in btree_node_scan, since it's probing and errors are expected: add a fake pass so that btree_node_scan is no longer recovery pass 0, and we don't think we're in btree node scan when reading btree roots. Signed-off-by: Kent Overstreet --- fs/bcachefs/recovery_passes.c | 6 ++++++ fs/bcachefs/recovery_passes_types.h | 1 + 2 files changed, 7 insertions(+) diff --git a/fs/bcachefs/recovery_passes.c b/fs/bcachefs/recovery_passes.c index 4bbeac9e0526..dff589ddc984 100644 --- a/fs/bcachefs/recovery_passes.c +++ b/fs/bcachefs/recovery_passes.c @@ -27,6 +27,12 @@ const char * const bch2_recovery_passes[] = { NULL }; +/* Fake recovery pass, so that scan_for_btree_nodes isn't 0: */ +static int bch2_recovery_pass_empty(struct bch_fs *c) +{ + return 0; +} + static int bch2_set_may_go_rw(struct bch_fs *c) { struct journal_keys *keys = &c->journal_keys; diff --git a/fs/bcachefs/recovery_passes_types.h b/fs/bcachefs/recovery_passes_types.h index 9d96c06e365c..94dc20ca2065 100644 --- a/fs/bcachefs/recovery_passes_types.h +++ b/fs/bcachefs/recovery_passes_types.h @@ -13,6 +13,7 @@ * must never change: */ #define BCH_RECOVERY_PASSES() \ + x(recovery_pass_empty, 41, PASS_SILENT) \ x(scan_for_btree_nodes, 37, 0) \ x(check_topology, 4, 0) \ x(accounting_read, 39, PASS_ALWAYS) \ From fb86c42a2a5d44e849ddfbc98b8d2f4f40d36ee3 Mon Sep 17 00:00:00 2001 From: Jiawei Ye Date: Fri, 8 Nov 2024 08:18:52 +0000 Subject: [PATCH 166/235] bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6 In the bpf_out_neigh_v6 function, rcu_read_lock() is used to begin an RCU read-side critical section. However, when unlocking, one branch incorrectly uses a different RCU unlock flavour, rcu_read_unlock_bh(), instead of rcu_read_unlock(). This mismatch in RCU locking flavours can lead to unexpected behavior and potential concurrency issues. This possible bug was identified using a static analysis tool I developed, specifically designed to detect RCU-related issues. This patch corrects the mismatched unlock flavour by replacing the incorrect rcu_read_unlock_bh() with the appropriate rcu_read_unlock(), ensuring that the RCU critical section is properly exited. This change prevents potential synchronization issues and aligns with proper RCU usage patterns.
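The invariant being enforced is simple: every exit path from an RCU read-side critical section must use the unlock flavour matching the lock that opened it. A tiny user-space model of the pairing rule follows; the lock/unlock functions below are local stubs standing in for the kernel primitives from <linux/rcupdate.h>, not real API.

#include <stdio.h>

/* Stubs modeling the pairing rule; each flavour tracks its own depth. */
static int depth, depth_bh;

static void rcu_read_lock(void)      { depth++; }
static void rcu_read_unlock(void)    { depth--; }
static void rcu_read_unlock_bh(void) { depth_bh--; }

int main(void)
{
	rcu_read_lock();
	/* ... lookup under RCU; an early-return branch must also use
	 * rcu_read_unlock(), never rcu_read_unlock_bh() ... */
	rcu_read_unlock();

	/* Using the _bh flavour here instead would leave depth != 0,
	 * i.e. an unbalanced critical section. */
	printf("balanced: %s\n", (depth == 0 && depth_bh == 0) ? "yes" : "no");
	(void)rcu_read_unlock_bh;	/* unused in the fixed flow */
	return 0;
}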
Fixes: 09eed1192cec ("neighbour: switch to standard rcu, instead of rcu_bh") Signed-off-by: Jiawei Ye Acked-by: Yonghong Song Link: https://lore.kernel.org/r/tencent_CFD3D1C3D68B45EA9F52D8EC76D2C4134306@qq.com Signed-off-by: Martin KaFai Lau --- net/core/filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/filter.c b/net/core/filter.c index e31ee8be2de0..fb56567c551e 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2249,7 +2249,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb, rcu_read_unlock(); return ret; } - rcu_read_unlock_bh(); + rcu_read_unlock(); if (dst) IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES); out_drop: From eb72e7fcc83987d5d5595b43222f23b295d5de7f Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 7 Nov 2024 19:20:21 +0000 Subject: [PATCH 167/235] sctp: fix possible UAF in sctp_v6_available() A lockdep report [1] with CONFIG_PROVE_RCU_LIST=y hints that sctp_v6_available() is calling dev_get_by_index_rcu() and ipv6_chk_addr() without holding rcu. [1] ============================= WARNING: suspicious RCU usage 6.12.0-rc5-virtme #1216 Tainted: G W ----------------------------- net/core/dev.c:876 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by sctp_hello/31495: #0: ffff9f1ebbdb7418 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_bind (./arch/x86/include/asm/jump_label.h:27 net/sctp/socket.c:315) sctp stack backtrace: CPU: 7 UID: 0 PID: 31495 Comm: sctp_hello Tainted: G W 6.12.0-rc5-virtme #1216 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: dump_stack_lvl (lib/dump_stack.c:123) lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) dev_get_by_index_rcu (net/core/dev.c:876 (discriminator 7)) sctp_v6_available (net/sctp/ipv6.c:701) sctp sctp_do_bind (net/sctp/socket.c:400 (discriminator 1)) sctp sctp_bind (net/sctp/socket.c:320) sctp inet6_bind_sk (net/ipv6/af_inet6.c:465) ? security_socket_bind (security/security.c:4581 (discriminator 1)) __sys_bind (net/socket.c:1848 net/socket.c:1869) ? do_user_addr_fault (./include/linux/rcupdate.h:347 ./include/linux/rcupdate.h:880 ./include/linux/mm.h:729 arch/x86/mm/fault.c:1340) ? 
do_user_addr_fault (./arch/x86/include/asm/preempt.h:84 (discriminator 13) ./include/linux/rcupdate.h:98 (discriminator 13) ./include/linux/rcupdate.h:882 (discriminator 13) ./include/linux/mm.h:729 (discriminator 13) arch/x86/mm/fault.c:1340 (discriminator 13)) __x64_sys_bind (net/socket.c:1877 (discriminator 1) net/socket.c:1875 (discriminator 1) net/socket.c:1875 (discriminator 1)) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f59b934a1e7 Code: 44 00 00 48 8b 15 39 8c 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 31 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 8c 0c 00 f7 d8 64 89 01 48 All code ======== 0: 44 00 00 add %r8b,(%rax) 3: 48 8b 15 39 8c 0c 00 mov 0xc8c39(%rip),%rdx # 0xc8c43 a: f7 d8 neg %eax c: 64 89 02 mov %eax,%fs:(%rdx) f: b8 ff ff ff ff mov $0xffffffff,%eax 14: eb bd jmp 0xffffffffffffffd3 16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) 1d: 00 00 00 20: 0f 1f 00 nopl (%rax) 23: b8 31 00 00 00 mov $0x31,%eax 28: 0f 05 syscall 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction 30: 73 01 jae 0x33 32: c3 ret 33: 48 8b 0d 09 8c 0c 00 mov 0xc8c09(%rip),%rcx # 0xc8c43 3a: f7 d8 neg %eax 3c: 64 89 01 mov %eax,%fs:(%rcx) 3f: 48 rex.W Code starting with the faulting instruction =========================================== 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 6: 73 01 jae 0x9 8: c3 ret 9: 48 8b 0d 09 8c 0c 00 mov 0xc8c09(%rip),%rcx # 0xc8c19 10: f7 d8 neg %eax 12: 64 89 01 mov %eax,%fs:(%rcx) 15: 48 rex.W RSP: 002b:00007ffe2d0ad398 EFLAGS: 00000202 ORIG_RAX: 0000000000000031 RAX: ffffffffffffffda RBX: 00007ffe2d0ad3d0 RCX: 00007f59b934a1e7 RDX: 000000000000001c RSI: 00007ffe2d0ad3d0 RDI: 0000000000000005 RBP: 0000000000000005 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f59b9253298 R11: 0000000000000202 R12: 00007ffe2d0ada61 R13: 0000000000000000 R14: 0000562926516dd8 R15: 00007f59b9479000 Fixes: 6fe1e52490a9 ("sctp: check ipv6 addr with sk_bound_dev if set") Signed-off-by: Eric Dumazet Cc: Marcelo Ricardo Leitner Acked-by: Xin Long Link: https://patch.msgid.link/20241107192021.2579789-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/sctp/ipv6.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c index f7b809c0d142..38e2fbdcbeac 100644 --- a/net/sctp/ipv6.c +++ b/net/sctp/ipv6.c @@ -683,7 +683,7 @@ static int sctp_v6_available(union sctp_addr *addr, struct sctp_sock *sp) struct sock *sk = &sp->inet.sk; struct net *net = sock_net(sk); struct net_device *dev = NULL; - int type; + int type, res, bound_dev_if; type = ipv6_addr_type(in6); if (IPV6_ADDR_ANY == type) @@ -697,14 +697,21 @@ static int sctp_v6_available(union sctp_addr *addr, struct sctp_sock *sp) if (!(type & IPV6_ADDR_UNICAST)) return 0; - if (sk->sk_bound_dev_if) { - dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); + rcu_read_lock(); + bound_dev_if = READ_ONCE(sk->sk_bound_dev_if); + if (bound_dev_if) { + res = 0; + dev = dev_get_by_index_rcu(net, bound_dev_if); if (!dev) - return 0; + goto out; } - return ipv6_can_nonlocal_bind(net, &sp->inet) || - ipv6_chk_addr(net, in6, dev, 0); + res = ipv6_can_nonlocal_bind(net, &sp->inet) || + ipv6_chk_addr(net, in6, dev, 0); + +out: + rcu_read_unlock(); + return res; } /* This function checks if the address is a valid address to be used for From 
0c0effb07f7d662af3e6f74da4d34241e412029b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micka=C3=ABl=20Sala=C3=BCn?= Date: Sat, 9 Nov 2024 12:08:54 +0100 Subject: [PATCH 168/235] landlock: Refactor filesystem access mask management MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace get_raw_handled_fs_accesses() with a generic landlock_union_access_masks(), and replace get_fs_domain() with a generic landlock_get_applicable_domain(). These helpers will also be useful for other types of access. Cc: Mikhail Ivanov Reviewed-by: Günther Noack Link: https://lore.kernel.org/r/20241109110856.222842-2-mic@digikod.net [mic: Slightly improve doc as suggested by Günther] Signed-off-by: Mickaël Salaün --- security/landlock/fs.c | 31 ++++----------- security/landlock/ruleset.h | 74 ++++++++++++++++++++++++++++++++---- security/landlock/syscalls.c | 2 +- 3 files changed, 75 insertions(+), 32 deletions(-) diff --git a/security/landlock/fs.c b/security/landlock/fs.c index 7d79fc8abe21..e31b97a9f175 100644 --- a/security/landlock/fs.c +++ b/security/landlock/fs.c @@ -388,38 +388,22 @@ static bool is_nouser_or_private(const struct dentry *dentry) unlikely(IS_PRIVATE(d_backing_inode(dentry)))); } -static access_mask_t -get_raw_handled_fs_accesses(const struct landlock_ruleset *const domain) -{ - access_mask_t access_dom = 0; - size_t layer_level; - - for (layer_level = 0; layer_level < domain->num_layers; layer_level++) - access_dom |= - landlock_get_raw_fs_access_mask(domain, layer_level); - return access_dom; -} - static access_mask_t get_handled_fs_accesses(const struct landlock_ruleset *const domain) { /* Handles all initially denied by default access rights. */ - return get_raw_handled_fs_accesses(domain) | + return landlock_union_access_masks(domain).fs | LANDLOCK_ACCESS_FS_INITIALLY_DENIED; } -static const struct landlock_ruleset * -get_fs_domain(const struct landlock_ruleset *const domain) -{ - if (!domain || !get_raw_handled_fs_accesses(domain)) - return NULL; - - return domain; -} +static const struct access_masks any_fs = { + .fs = ~0, +}; static const struct landlock_ruleset *get_current_fs_domain(void) { - return get_fs_domain(landlock_get_current_domain()); + return landlock_get_applicable_domain(landlock_get_current_domain(), + any_fs); } /* @@ -1517,7 +1501,8 @@ static int hook_file_open(struct file *const file) access_mask_t open_access_request, full_access_request, allowed_access, optional_access; const struct landlock_ruleset *const dom = - get_fs_domain(landlock_cred(file->f_cred)->domain); + landlock_get_applicable_domain( + landlock_cred(file->f_cred)->domain, any_fs); if (!dom) return 0; diff --git a/security/landlock/ruleset.h b/security/landlock/ruleset.h index 61bdbc550172..631e24d4ffe9 100644 --- a/security/landlock/ruleset.h +++ b/security/landlock/ruleset.h @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -47,6 +48,15 @@ struct access_masks { access_mask_t scope : LANDLOCK_NUM_SCOPE; }; +union access_masks_all { + struct access_masks masks; + u32 all; +}; + +/* Makes sure all fields are covered. */ +static_assert(sizeof(typeof_member(union access_masks_all, masks)) == + sizeof(typeof_member(union access_masks_all, all))); + typedef u16 layer_mask_t; /* Makes sure all layers can be checked. 
*/ static_assert(BITS_PER_TYPE(layer_mask_t) >= LANDLOCK_MAX_NUM_LAYERS); @@ -260,6 +270,61 @@ static inline void landlock_get_ruleset(struct landlock_ruleset *const ruleset) refcount_inc(&ruleset->usage); } +/** + * landlock_union_access_masks - Return all access rights handled in the + * domain + * + * @domain: Landlock ruleset (used as a domain) + * + * Returns: an access_masks result of the OR of all the domain's access masks. + */ +static inline struct access_masks +landlock_union_access_masks(const struct landlock_ruleset *const domain) +{ + union access_masks_all matches = {}; + size_t layer_level; + + for (layer_level = 0; layer_level < domain->num_layers; layer_level++) { + union access_masks_all layer = { + .masks = domain->access_masks[layer_level], + }; + + matches.all |= layer.all; + } + + return matches.masks; +} + +/** + * landlock_get_applicable_domain - Return @domain if it applies to (handles) + * at least one of the access rights specified + * in @masks + * + * @domain: Landlock ruleset (used as a domain) + * @masks: access masks + * + * Returns: @domain if any access rights specified in @masks is handled, or + * NULL otherwise. + */ +static inline const struct landlock_ruleset * +landlock_get_applicable_domain(const struct landlock_ruleset *const domain, + const struct access_masks masks) +{ + const union access_masks_all masks_all = { + .masks = masks, + }; + union access_masks_all merge = {}; + + if (!domain) + return NULL; + + merge.masks = landlock_union_access_masks(domain); + if (merge.all & masks_all.all) + return domain; + + return NULL; +} + static inline void landlock_add_fs_access_mask(struct landlock_ruleset *const ruleset, const access_mask_t fs_access_mask, @@ -295,19 +360,12 @@ landlock_add_scope_mask(struct landlock_ruleset *const ruleset, ruleset->access_masks[layer_level].scope |= mask; } -static inline access_mask_t -landlock_get_raw_fs_access_mask(const struct landlock_ruleset *const ruleset, - const u16 layer_level) -{ - return ruleset->access_masks[layer_level].fs; -} - static inline access_mask_t landlock_get_fs_access_mask(const struct landlock_ruleset *const ruleset, const u16 layer_level) { /* Handles all initially denied by default access rights. */ - return landlock_get_raw_fs_access_mask(ruleset, layer_level) | + return ruleset->access_masks[layer_level].fs | LANDLOCK_ACCESS_FS_INITIALLY_DENIED; } diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c index f5a0e7182ec0..c097d356fa45 100644 --- a/security/landlock/syscalls.c +++ b/security/landlock/syscalls.c @@ -329,7 +329,7 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset, return -ENOMSG; /* Checks that allowed_access matches the @ruleset constraints. */ - mask = landlock_get_raw_fs_access_mask(ruleset, 0); + mask = ruleset->access_masks[0].fs; if ((path_beneath_attr.allowed_access | mask) != mask) return -EINVAL; From 8376226e5f53e78cd16a2b23577304e43acb3ba4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micka=C3=ABl=20Sala=C3=BCn?= Date: Sat, 9 Nov 2024 12:08:55 +0100 Subject: [PATCH 169/235] landlock: Refactor network access mask management MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace get_raw_handled_net_accesses() and get_current_net_domain() with a call to landlock_get_applicable_domain(). 
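landlock_get_applicable_domain() leans on the union trick introduced in the diff above: overlaying the per-type bitfield struct with one plain integer so all layers and all access types can be OR-ed and tested in a single operation. A compact user-space model (illustrative only; the field widths and names here are not the kernel's definitions):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct access_masks {
	unsigned int fs : 16;
	unsigned int net : 8;
	unsigned int scope : 8;
};

union access_masks_all {
	struct access_masks masks;
	uint32_t all;
};

int main(void)
{
	static_assert(sizeof(struct access_masks) == sizeof(uint32_t),
		      "all fields must be covered");

	union access_masks_all dom = {0};
	union access_masks_all query = { .masks = { .net = 0xff } };

	/* Merge two "layers" with one OR each, regardless of type. */
	dom.all |= (union access_masks_all){ .masks = { .fs = 0x3 } }.all;
	dom.all |= (union access_masks_all){ .masks = { .net = 0x1 } }.all;

	/* The domain applies iff it handles any queried access right. */
	printf("applicable: %s\n", (dom.all & query.all) ? "yes" : "no");
	return 0;
}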
Cc: Konstantin Meskhidze Cc: Mikhail Ivanov Reviewed-by: Günther Noack Link: https://lore.kernel.org/r/20241109110856.222842-3-mic@digikod.net Signed-off-by: Mickaël Salaün --- security/landlock/net.c | 28 ++++++---------------------- 1 file changed, 6 insertions(+), 22 deletions(-) diff --git a/security/landlock/net.c b/security/landlock/net.c index c8bcd29bde09..d5dcc4407a19 100644 --- a/security/landlock/net.c +++ b/security/landlock/net.c @@ -39,27 +39,9 @@ int landlock_append_net_rule(struct landlock_ruleset *const ruleset, return err; } -static access_mask_t -get_raw_handled_net_accesses(const struct landlock_ruleset *const domain) -{ - access_mask_t access_dom = 0; - size_t layer_level; - - for (layer_level = 0; layer_level < domain->num_layers; layer_level++) - access_dom |= landlock_get_net_access_mask(domain, layer_level); - return access_dom; -} - -static const struct landlock_ruleset *get_current_net_domain(void) -{ - const struct landlock_ruleset *const dom = - landlock_get_current_domain(); - - if (!dom || !get_raw_handled_net_accesses(dom)) - return NULL; - - return dom; -} +static const struct access_masks any_net = { + .net = ~0, +}; static int current_check_access_socket(struct socket *const sock, struct sockaddr *const address, @@ -72,7 +54,9 @@ static int current_check_access_socket(struct socket *const sock, struct landlock_id id = { .type = LANDLOCK_KEY_NET_PORT, }; - const struct landlock_ruleset *const dom = get_current_net_domain(); + const struct landlock_ruleset *const dom = + landlock_get_applicable_domain(landlock_get_current_domain(), + any_net); if (!dom) return 0; From 03197e40a22c2641a1f9d1744418cd29f4954b83 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micka=C3=ABl=20Sala=C3=BCn?= Date: Sat, 9 Nov 2024 12:08:56 +0100 Subject: [PATCH 170/235] landlock: Optimize scope enforcement MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Do not walk through the domain hierarchy when the required scope is not supported by this domain. This is the same approach as for filesystem and network restrictions. Cc: Mikhail Ivanov Cc: Tahera Fahimi Reviewed-by: Günther Noack Link: https://lore.kernel.org/r/20241109110856.222842-4-mic@digikod.net Signed-off-by: Mickaël Salaün --- security/landlock/task.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/security/landlock/task.c b/security/landlock/task.c index 4acbd7c40eee..dc7dab78392e 100644 --- a/security/landlock/task.c +++ b/security/landlock/task.c @@ -204,12 +204,17 @@ static bool is_abstract_socket(struct sock *const sock) return false; } +static const struct access_masks unix_scope = { + .scope = LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET, +}; + static int hook_unix_stream_connect(struct sock *const sock, struct sock *const other, struct sock *const newsk) { const struct landlock_ruleset *const dom = - landlock_get_current_domain(); + landlock_get_applicable_domain(landlock_get_current_domain(), + unix_scope); /* Quick return for non-landlocked tasks. 
*/ if (!dom) @@ -225,7 +230,8 @@ static int hook_unix_may_send(struct socket *const sock, struct socket *const other) { const struct landlock_ruleset *const dom = - landlock_get_current_domain(); + landlock_get_applicable_domain(landlock_get_current_domain(), + unix_scope); if (!dom) return 0; @@ -243,6 +249,10 @@ static int hook_unix_may_send(struct socket *const sock, return 0; } +static const struct access_masks signal_scope = { + .scope = LANDLOCK_SCOPE_SIGNAL, +}; + static int hook_task_kill(struct task_struct *const p, struct kernel_siginfo *const info, const int sig, const struct cred *const cred) @@ -256,6 +266,7 @@ static int hook_task_kill(struct task_struct *const p, } else { dom = landlock_get_current_domain(); } + dom = landlock_get_applicable_domain(dom, signal_scope); /* Quick return for non-landlocked tasks. */ if (!dom) @@ -279,7 +290,8 @@ static int hook_file_send_sigiotask(struct task_struct *tsk, /* Lock already held by send_sigio() and send_sigurg(). */ lockdep_assert_held(&fown->lock); - dom = landlock_file(fown->file)->fown_domain; + dom = landlock_get_applicable_domain( + landlock_file(fown->file)->fown_domain, signal_scope); /* Quick return for unowned socket. */ if (!dom) From a6250aa251eacaf3ebfcfe152a96a727fd483ecd Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sat, 9 Nov 2024 10:43:55 -1000 Subject: [PATCH 171/235] sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx() sched_ext dispatches tasks from the BPF scheduler from balance_scx() and thus every pick_task_scx() call must be preceded by balance_scx(). While this usually holds, due to a bug, there are cases where the fair class's balance() returns true indicating that it has tasks to run on the CPU and thus terminating balance() calls but fails to actually find the next task to run when pick_task() is called. In such cases, pick_task_scx() can be called without preceding balance_scx(). Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep running the previous task if possible and avoid stalling from entering idle without balancing. Signed-off-by: Tejun Heo Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org --- kernel/sched/core.c | 13 ++++++++----- kernel/sched/ext.c | 44 +++++++++++++++++++++++++++++++------------- kernel/sched/sched.h | 5 +++-- 3 files changed, 42 insertions(+), 20 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index aeb595514461..a910a5b4c274 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5914,12 +5914,15 @@ static void prev_balance(struct rq *rq, struct task_struct *prev, #ifdef CONFIG_SCHED_CLASS_EXT /* - * SCX requires a balance() call before every pick_next_task() including - * when waking up from SCHED_IDLE. If @start_class is below SCX, start - * from SCX instead. + * SCX requires a balance() call before every pick_task() including when + * waking up from SCHED_IDLE. If @start_class is below SCX, start from + * SCX instead. Also, set a flag to detect missing balance() call. 
*/ - if (scx_enabled() && sched_class_above(&ext_sched_class, start_class)) - start_class = &ext_sched_class; + if (scx_enabled()) { + rq->scx.flags |= SCX_RQ_BAL_PENDING; + if (sched_class_above(&ext_sched_class, start_class)) + start_class = &ext_sched_class; + } #endif /* diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 3bdb08fc2056..19f9cb3a4190 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2634,7 +2634,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) lockdep_assert_rq_held(rq); rq->scx.flags |= SCX_RQ_IN_BALANCE; - rq->scx.flags &= ~SCX_RQ_BAL_KEEP; + rq->scx.flags &= ~(SCX_RQ_BAL_PENDING | SCX_RQ_BAL_KEEP); if (static_branch_unlikely(&scx_ops_cpu_preempt) && unlikely(rq->scx.cpu_released)) { @@ -2948,12 +2948,11 @@ static struct task_struct *pick_task_scx(struct rq *rq) { struct task_struct *prev = rq->curr; struct task_struct *p; + bool prev_on_scx = prev->sched_class == &ext_sched_class; + bool keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP; + bool kick_idle = false; /* - * If balance_scx() is telling us to keep running @prev, replenish slice - * if necessary and keep running @prev. Otherwise, pop the first one - * from the local DSQ. - * * WORKAROUND: * * %SCX_RQ_BAL_KEEP should be set iff $prev is on SCX as it must just @@ -2962,22 +2961,41 @@ static struct task_struct *pick_task_scx(struct rq *rq) * which then ends up calling pick_task_scx() without preceding * balance_scx(). * - * For now, ignore cases where $prev is not on SCX. This isn't great and - * can theoretically lead to stalls. However, for switch_all cases, this - * happens only while a BPF scheduler is being loaded or unloaded, and, - * for partial cases, fair will likely keep triggering this CPU. + * Keep running @prev if possible and avoid stalling from entering idle + * without balancing. * - * Once fair is fixed, restore WARN_ON_ONCE(). + * Once fair is fixed, remove the workaround and trigger WARN_ON_ONCE() + * if pick_task_scx() is called without preceding balance_scx(). */ - if ((rq->scx.flags & SCX_RQ_BAL_KEEP) && - prev->sched_class == &ext_sched_class) { + if (unlikely(rq->scx.flags & SCX_RQ_BAL_PENDING)) { + if (prev_on_scx) { + keep_prev = true; + } else { + keep_prev = false; + kick_idle = true; + } + } else if (unlikely(keep_prev && !prev_on_scx)) { + /* only allowed during transitions */ + WARN_ON_ONCE(scx_ops_enable_state() == SCX_OPS_ENABLED); + keep_prev = false; + } + + /* + * If balance_scx() is telling us to keep running @prev, replenish slice + * if necessary and keep running @prev. Otherwise, pop the first one + * from the local DSQ. 
+ */ + if (keep_prev) { p = prev; if (!p->scx.slice) p->scx.slice = SCX_SLICE_DFL; } else { p = first_local_task(rq); - if (!p) + if (!p) { + if (kick_idle) + scx_bpf_kick_cpu(cpu_of(rq), SCX_KICK_IDLE); return NULL; + } if (unlikely(!p->scx.slice)) { if (!scx_rq_bypassing(rq) && !scx_warned_zero_slice) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6085ef50febf..4d79804631e4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -751,8 +751,9 @@ enum scx_rq_flags { */ SCX_RQ_ONLINE = 1 << 0, SCX_RQ_CAN_STOP_TICK = 1 << 1, - SCX_RQ_BAL_KEEP = 1 << 2, /* balance decided to keep current */ - SCX_RQ_BYPASSING = 1 << 3, + SCX_RQ_BAL_PENDING = 1 << 2, /* balance hasn't run yet */ + SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */ + SCX_RQ_BYPASSING = 1 << 4, SCX_RQ_IN_WAKEUP = 1 << 16, SCX_RQ_IN_BALANCE = 1 << 17, From e68da664d379f352d41d7955712c44e0a738e4ab Mon Sep 17 00:00:00 2001 From: Stefan Wahren Date: Fri, 8 Nov 2024 12:43:43 +0100 Subject: [PATCH 172/235] net: vertexcom: mse102x: Fix tx_bytes calculation The tx_bytes should consider the actual size of the Ethernet frames without the SPI encapsulation. But we still need to take care of Ethernet padding. Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support") Signed-off-by: Stefan Wahren Link: https://patch.msgid.link/20241108114343.6174-3-wahrenst@gmx.net Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/vertexcom/mse102x.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/vertexcom/mse102x.c b/drivers/net/ethernet/vertexcom/mse102x.c index 2c37957478fb..89dc4c401a8d 100644 --- a/drivers/net/ethernet/vertexcom/mse102x.c +++ b/drivers/net/ethernet/vertexcom/mse102x.c @@ -437,13 +437,15 @@ static void mse102x_tx_work(struct work_struct *work) mse = &mses->mse102x; while ((txb = skb_dequeue(&mse->txq))) { + unsigned int len = max_t(unsigned int, txb->len, ETH_ZLEN); + mutex_lock(&mses->lock); ret = mse102x_tx_pkt_spi(mse, txb, work_timeout); mutex_unlock(&mses->lock); if (ret) { mse->ndev->stats.tx_dropped++; } else { - mse->ndev->stats.tx_bytes += txb->len; + mse->ndev->stats.tx_bytes += len; mse->ndev->stats.tx_packets++; } From 252e01e68241d33bfe0ed1fc333220d9bd8b06df Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 7 Nov 2024 16:47:31 -0800 Subject: [PATCH 173/235] selftests: net: add netlink-dumps to .gitignore Commit 55d42a0c3f9c ("selftests: net: add a test for closing a netlink socket ith dump in progress") added a new test but did not add it to gitignore. Reviewed-by: Joe Damato Link: https://patch.msgid.link/20241108004731.2979878-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index 217d8b7a7365..59fe07ee2df9 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -19,6 +19,7 @@ log.txt msg_oob msg_zerocopy ncdevmem +netlink-dumps nettest psock_fanout psock_snd From ace149e0830c380ddfce7e466fe860ca502fe4ee Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Fri, 13 Sep 2024 13:57:04 -0400 Subject: [PATCH 174/235] filemap: Fix bounds checking in filemap_read() If the caller supplies an iocb->ki_pos value that is close to the filesystem upper limit, and an iterator with a count that causes us to overflow that limit, then filemap_read() enters an infinite loop. 
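To make the failure mode concrete: truncating the iterator to s_maxbytes
bounds the byte count but not the end position, so ki_pos + count can still
land past the limit, and once ki_pos reaches it the loop can no longer make
progress. A small standalone sketch of the two bounds (plain C; the names
mirror the kernel fields, but this is not kernel code):

#include <stdint.h>
#include <stdio.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

int main(void)
{
	const uint64_t s_maxbytes = 1ULL << 40;		/* example fs limit */
	const uint64_t ki_pos = s_maxbytes - 100;	/* start near the limit */
	const uint64_t count = 200;		/* request crossing the limit */

	/* Old bound: count capped at s_maxbytes, which changes nothing here,
	 * so the read keeps trying to proceed past the limit. */
	uint64_t old_count = min_u64(count, s_maxbytes);

	/* New bound: count capped at what remains before the limit. */
	uint64_t new_count = min_u64(count, s_maxbytes - ki_pos);

	printf("old end: %llu (limit + %llu)\n",
	       (unsigned long long)(ki_pos + old_count),
	       (unsigned long long)(ki_pos + old_count - s_maxbytes));
	printf("new end: %llu (exactly the limit)\n",
	       (unsigned long long)(ki_pos + new_count));
	return 0;
}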
This behaviour was discovered when testing xfstests generic/525 with the "localio" optimisation for loopback NFS mounts. Reported-by: Mike Snitzer Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Tested-by: Mike Snitzer Signed-off-by: Trond Myklebust Signed-off-by: Linus Torvalds --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index 36d22968be9a..56fa431c52af 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2625,7 +2625,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, if (unlikely(!iov_iter_count(iter))) return 0; - iov_iter_truncate(iter, inode->i_sb->s_maxbytes); + iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos); folio_batch_init(&fbatch); do { From 2d5404caa8c7bb5c4e0435f94b28834ae5456623 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 10 Nov 2024 14:19:35 -0800 Subject: [PATCH 175/235] Linux 6.12-rc7 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index b8efbfe9da94..79192a3024bf 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ VERSION = 6 PATCHLEVEL = 12 SUBLEVEL = 0 -EXTRAVERSION = -rc6 +EXTRAVERSION = -rc7 NAME = Baby Opossum Posse # *DOCUMENTATION* From 0b6ec0c5ac6c9e80a6157cbc5631802c81a674d7 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 10 Nov 2024 22:01:04 -0500 Subject: [PATCH 176/235] bcachefs: Fix assertion pop in topology repair Fixes: baefd3f849ed ("bcachefs: btree_cache.freeable list fixes") Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_gc.c | 2 +- fs/bcachefs/btree_update_interior.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c index 0ca3feeb42c8..81dcf9e512c0 100644 --- a/fs/bcachefs/btree_gc.c +++ b/fs/bcachefs/btree_gc.c @@ -182,7 +182,7 @@ static int set_node_max(struct bch_fs *c, struct btree *b, struct bpos new_max) bch2_btree_node_drop_keys_outside_node(b); mutex_lock(&c->btree_cache.lock); - bch2_btree_node_hash_remove(&c->btree_cache, b); + __bch2_btree_node_hash_remove(&c->btree_cache, b); bkey_copy(&b->key, &new->k_i); ret = __bch2_btree_node_hash_insert(&c->btree_cache, b); diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c index 22740b605f0a..d596ef93239f 100644 --- a/fs/bcachefs/btree_update_interior.c +++ b/fs/bcachefs/btree_update_interior.c @@ -2398,7 +2398,8 @@ static int __bch2_btree_node_update_key(struct btree_trans *trans, if (new_hash) { mutex_lock(&c->btree_cache.lock); bch2_btree_node_hash_remove(&c->btree_cache, new_hash); - bch2_btree_node_hash_remove(&c->btree_cache, b); + + __bch2_btree_node_hash_remove(&c->btree_cache, b); bkey_copy(&b->key, new_key); ret = __bch2_btree_node_hash_insert(&c->btree_cache, b); From 2642084f26b5a5e9353fa530efb30f49e752185d Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 10 Nov 2024 23:28:33 -0500 Subject: [PATCH 177/235] bcachefs: Allow for unknown key types in backpointers fsck We can't assume that btrees only contain keys of a given type - even if they only have a single key type listed in the allowed key types for that btree; this is a forwards compatibility issue. 
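The shape of the fix generalizes: validate the key type before downcasting,
and treat unknown types as a no-op rather than an error, so older kernels
stay tolerant of btrees written by newer ones. A minimal sketch of that
defensive pattern, with hypothetical type names rather than bcachefs's:

#include <stdio.h>

enum key_type { KEY_TYPE_deleted, KEY_TYPE_backpointer, KEY_TYPE_future };

struct key { enum key_type type; /* ... payload ... */ };

/* Returns 0 and does nothing for key types this version does not handle. */
static int check_one_key(const struct key *k)
{
	if (k->type != KEY_TYPE_backpointer)
		return 0;	/* skip, don't assert: forward compatibility */

	/* Only now is it safe to interpret the payload as a backpointer. */
	printf("checking backpointer\n");
	return 0;
}

int main(void)
{
	const struct key keys[] = {
		{ .type = KEY_TYPE_backpointer },
		{ .type = KEY_TYPE_future },	/* written by a newer version */
	};

	for (unsigned int i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		check_one_key(&keys[i]);
	return 0;
}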
Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/backpointers.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/bcachefs/backpointers.c b/fs/bcachefs/backpointers.c index ed344400c8bd..654a58132a4d 100644 --- a/fs/bcachefs/backpointers.c +++ b/fs/bcachefs/backpointers.c @@ -952,9 +952,13 @@ int bch2_check_extents_to_backpointers(struct bch_fs *c) static int check_one_backpointer(struct btree_trans *trans, struct bbpos start, struct bbpos end, - struct bkey_s_c_backpointer bp, + struct bkey_s_c bp_k, struct bkey_buf *last_flushed) { + if (bp_k.k->type != KEY_TYPE_backpointer) + return 0; + + struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k); struct bch_fs *c = trans->c; struct btree_iter iter; struct bbpos pos = bp_to_bbpos(*bp.v); @@ -1009,9 +1013,7 @@ static int bch2_check_backpointers_to_extents_pass(struct btree_trans *trans, POS_MIN, BTREE_ITER_prefetch, k, NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ progress_update_iter(trans, &progress, &iter, "backpointers_to_extents"); - check_one_backpointer(trans, start, end, - bkey_s_c_to_backpointer(k), - &last_flushed); + check_one_backpointer(trans, start, end, k, &last_flushed); })); bch2_bkey_buf_exit(&last_flushed, c); From e7ac4daeed91a25382091e73818ea0cddb1afd5e Mon Sep 17 00:00:00 2001 From: Barry Song Date: Thu, 7 Nov 2024 14:12:46 +1300 Subject: [PATCH 178/235] mm: count zeromap read and set for swapout and swapin MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When the proportion of folios from the zeromap is small, missing their accounting may not significantly impact profiling. However, it's easy to construct a scenario where this becomes an issue - for example, allocating 1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out and swap-in counts seem to vanish into a black hole, potentially causing semantic ambiguity. On the other hand, Usama reported that zero-filled pages can exceed 10% in workloads utilizing zswap, while Hailong noted that some apps on Android have more than 6% zero-filled pages. Before commit 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"), both zswap and zRAM implemented similar optimizations, leading to these optimized-out pages being counted in either zswap or zRAM counters (with pswpin/pswpout also increasing for zRAM). With zeromap functioning prior to both zswap and zRAM, userspace will no longer detect these swap-out and swap-in actions. We have three ways to address this: 1. Introduce a dedicated counter specifically for the zeromap. 2. Use pswpin/pswpout accounting, treating the zero map as a standard backend. This approach aligns with zRAM's current handling of same-page fills at the device level. However, it would mean losing the optimized-out page counters previously available in zRAM and would not align with systems using zswap. Additionally, as noted by Nhat Pham, pswpin/pswpout counters apply only to I/O done directly to the backend device. 3. Count zeromap pages under zswap, aligning with system behavior when zswap is enabled. However, this would not be consistent with zRAM, nor would it align with systems lacking both zswap and zRAM. Given the complications with options 2 and 3, this patch selects option 1. We can find these counters in /proc/vmstat (counters for the whole system) and memcg's memory.stat (counters for the memcg of interest). 
For example: $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat swpin_zero 1648 swpout_zero 33536 $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat swpin_zero 3905 swpout_zero 3985 This patch does not address any specific zeromap bug, but the missing swpout and swpin counts for zero-filled pages can be highly confusing and may mislead user-space agents that rely on changes in these counters as indicators. Therefore, we add a Fixes tag to encourage the inclusion of this counter in any kernel versions with zeromap. Many thanks to Kanchana for the contribution of changing count_objcg_event() to count_objcg_events() to support large folios[1], which has now been incorporated into this patch. [1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") Co-developed-by: Kanchana P Sridhar Signed-off-by: Barry Song Reviewed-by: Nhat Pham Reviewed-by: Chengming Zhou Acked-by: Johannes Weiner Cc: Usama Arif Cc: Yosry Ahmed Cc: Hailong Liu Cc: David Hildenbrand Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Cc: Shakeel Butt Cc: Andi Kleen Cc: Baolin Wang Cc: Chris Li Cc: "Huang, Ying" Cc: Kairui Song Cc: Ryan Roberts Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 9 +++++++++ include/linux/memcontrol.h | 12 +++++++----- include/linux/vm_event_item.h | 2 ++ mm/memcontrol.c | 4 ++++ mm/page_io.c | 16 ++++++++++++++++ mm/vmstat.c | 2 ++ mm/zswap.c | 6 +++--- 7 files changed, 43 insertions(+), 8 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 69af2173555f..6d02168d78be 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1599,6 +1599,15 @@ The following nested keys are defined. pglazyfreed (npn) Amount of reclaimed lazyfree pages + swpin_zero + Number of pages swapped into memory and filled with zero, where I/O + was optimized out because the page content was detected to be zero + during swapout. + + swpout_zero + Number of zero-filled pages swapped out with I/O skipped due to the + content being detected as zero. + zswpin Number of pages moved in to memory from zswap. 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 34d2da05f2f1..e1b41554a5fb 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1760,8 +1760,9 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) struct mem_cgroup *mem_cgroup_from_slab_obj(void *p); -static inline void count_objcg_event(struct obj_cgroup *objcg, - enum vm_event_item idx) +static inline void count_objcg_events(struct obj_cgroup *objcg, + enum vm_event_item idx, + unsigned long count) { struct mem_cgroup *memcg; @@ -1770,7 +1771,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); - count_memcg_events(memcg, idx, 1); + count_memcg_events(memcg, idx, count); rcu_read_unlock(); } @@ -1825,8 +1826,9 @@ static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p) return NULL; } -static inline void count_objcg_event(struct obj_cgroup *objcg, - enum vm_event_item idx) +static inline void count_objcg_events(struct obj_cgroup *objcg, + enum vm_event_item idx, + unsigned long count) { } diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index aed952d04132..f70d0958095c 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_SWAP SWAP_RA, SWAP_RA_HIT, + SWPIN_ZERO, + SWPOUT_ZERO, #ifdef CONFIG_KSM KSM_SWPIN_COPY, #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 06df2af97415..53db98d2c4a1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -431,6 +431,10 @@ static const unsigned int memcg_vm_event_stat[] = { PGDEACTIVATE, PGLAZYFREE, PGLAZYFREED, +#ifdef CONFIG_SWAP + SWPIN_ZERO, + SWPOUT_ZERO, +#endif #ifdef CONFIG_ZSWAP ZSWPIN, ZSWPOUT, diff --git a/mm/page_io.c b/mm/page_io.c index 69536a2b3c13..01749b99fb54 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio) static void swap_zeromap_folio_set(struct folio *folio) { + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); struct swap_info_struct *sis = swp_swap_info(folio->swap); + int nr_pages = folio_nr_pages(folio); swp_entry_t entry; unsigned int i; @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio) entry = page_swap_entry(folio_page(folio, i)); set_bit(swp_offset(entry), sis->zeromap); } + + count_vm_events(SWPOUT_ZERO, nr_pages); + if (objcg) { + count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); + obj_cgroup_put(objcg); + } } static void swap_zeromap_folio_clear(struct folio *folio) @@ -503,6 +511,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) static bool swap_read_folio_zeromap(struct folio *folio) { int nr_pages = folio_nr_pages(folio); + struct obj_cgroup *objcg; bool is_zeromap; /* @@ -517,6 +526,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) if (!is_zeromap) return false; + objcg = get_obj_cgroup_from_folio(folio); + count_vm_events(SWPIN_ZERO, nr_pages); + if (objcg) { + count_objcg_events(objcg, SWPIN_ZERO, nr_pages); + obj_cgroup_put(objcg); + } + folio_zero_range(folio, 0, folio_size(folio)); folio_mark_uptodate(folio); return true; diff --git a/mm/vmstat.c b/mm/vmstat.c index b5a4cea423e1..ac6a5aa34eab 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1415,6 +1415,8 @@ const char * const vmstat_text[] = { #ifdef CONFIG_SWAP "swap_ra", "swap_ra_hit", + "swpin_zero", + "swpout_zero", #ifdef CONFIG_KSM "ksm_swpin_copy", #endif diff --git a/mm/zswap.c b/mm/zswap.c index 
162013952074..0030ce8fecfc 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1053,7 +1053,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, count_vm_event(ZSWPWB); if (entry->objcg) - count_objcg_event(entry->objcg, ZSWPWB); + count_objcg_events(entry->objcg, ZSWPWB, 1); zswap_entry_free(entry); @@ -1483,7 +1483,7 @@ bool zswap_store(struct folio *folio) if (objcg) { obj_cgroup_charge_zswap(objcg, entry->length); - count_objcg_event(objcg, ZSWPOUT); + count_objcg_events(objcg, ZSWPOUT, 1); } /* @@ -1577,7 +1577,7 @@ bool zswap_load(struct folio *folio) count_vm_event(ZSWPIN); if (entry->objcg) - count_objcg_event(entry->objcg, ZSWPIN); + count_objcg_events(entry->objcg, ZSWPIN, 1); if (swapcache) { zswap_entry_free(entry); From 1a1030d10a6335bb5e6cdb24fc9388d3d9bcc1ac Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 7 Nov 2024 13:36:10 +0100 Subject: [PATCH 179/235] cpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling() Notice that hybrid_init_cpu_capacity_scaling() only needs to hold hybrid_capacity_lock around __hybrid_init_cpu_capacity_scaling() calls, so introduce a "locked" wrapper around the latter and call it from the former. This makes it possible to drop a local variable and a label that are no longer needed. Also, rename __hybrid_init_cpu_capacity_scaling() to __hybrid_refresh_cpu_capacity_scaling() for consistency. Interestingly enough, this fixes a locking issue introduced by commit 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") that put an arch_enable_hybrid_capacity_scale() call under hybrid_capacity_lock, which was a mistake because the latter is acquired in CPU hotplug paths and so it cannot be held around cpus_read_lock() calls. Link: https://lore.kernel.org/linux-pm/SJ1PR11MB6129EDBF22F8A90FC3A3EDC8B9582@SJ1PR11MB6129.namprd11.prod.outlook.com/ Fixes: 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") Signed-off-by: Rafael J. Wysocki Reported-by: "Borah, Chaitanya Kumar" Link: https://patch.msgid.link/12554508.O9o76ZdvQC@rjwysocki.net [ rjw: Changelog update ] Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/intel_pstate.c | 37 ++++++++++++++++------------------ 1 file changed, 17 insertions(+), 20 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index cd2ac1ba53d2..400337f3b572 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -1028,26 +1028,29 @@ static void hybrid_update_cpu_capacity_scaling(void) } } -static void __hybrid_init_cpu_capacity_scaling(void) +static void __hybrid_refresh_cpu_capacity_scaling(void) { hybrid_max_perf_cpu = NULL; hybrid_update_cpu_capacity_scaling(); } +static void hybrid_refresh_cpu_capacity_scaling(void) +{ + guard(mutex)(&hybrid_capacity_lock); + + __hybrid_refresh_cpu_capacity_scaling(); +} + static void hybrid_init_cpu_capacity_scaling(bool refresh) { - bool disable_itmt = false; - - mutex_lock(&hybrid_capacity_lock); - /* * If hybrid_max_perf_cpu is set at this point, the hybrid CPU capacity * scaling has been enabled already and the driver is just changing the * operation mode. */ if (refresh) { - __hybrid_init_cpu_capacity_scaling(); - goto unlock; + hybrid_refresh_cpu_capacity_scaling(); + return; } /* @@ -1056,19 +1059,13 @@ static void hybrid_init_cpu_capacity_scaling(bool refresh) * do not do that when SMT is in use. 
*/ if (hwp_is_hybrid && !sched_smt_active() && arch_enable_hybrid_capacity_scale()) { - __hybrid_init_cpu_capacity_scaling(); - disable_itmt = true; - } - -unlock: - mutex_unlock(&hybrid_capacity_lock); - - /* - * Disabling ITMT causes sched domains to be rebuilt to disable asym - * packing and enable asym capacity. - */ - if (disable_itmt) + hybrid_refresh_cpu_capacity_scaling(); + /* + * Disabling ITMT causes sched domains to be rebuilt to disable asym + * packing and enable asym capacity. + */ sched_clear_itmt_support(); + } } static bool hybrid_clear_max_perf_cpu(void) @@ -1404,7 +1401,7 @@ static void intel_pstate_update_limits_for_all(void) mutex_lock(&hybrid_capacity_lock); if (hybrid_max_perf_cpu) - __hybrid_init_cpu_capacity_scaling(); + __hybrid_refresh_cpu_capacity_scaling(); mutex_unlock(&hybrid_capacity_lock); } From 42964e4b5e3ac95090bdd23ed7da2a941ccd902c Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Mon, 11 Nov 2024 16:48:18 +0100 Subject: [PATCH 180/235] dm-bufio: fix warnings about duplicate slab caches The commit 4c39529663b9 adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio code. The dm-bufio code allocates a slab cache with each client. It is not possible to preallocate the caches in the module init function because the size of auxiliary per-buffer data is not known at this point. So, this commit changes dm-bufio so that it appends a unique atomic value to the cache name, to avoid the warnings. Signed-off-by: Mikulas Patocka Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y") --- drivers/md/dm-bufio.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index d478aafa02c9..23e0b71b991e 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -2471,7 +2471,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign int r; unsigned int num_locks; struct dm_bufio_client *c; - char slab_name[27]; + char slab_name[64]; + static atomic_t seqno = ATOMIC_INIT(0); if (!block_size || block_size & ((1 << SECTOR_SHIFT) - 1)) { DMERR("%s: block size not specified or is not multiple of 512b", __func__); @@ -2522,7 +2523,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign (block_size < PAGE_SIZE || !is_power_of_2(block_size))) { unsigned int align = min(1U << __ffs(block_size), (unsigned int)PAGE_SIZE); - snprintf(slab_name, sizeof(slab_name), "dm_bufio_cache-%u", block_size); + snprintf(slab_name, sizeof(slab_name), "dm_bufio_cache-%u-%u", + block_size, atomic_inc_return(&seqno)); c->slab_cache = kmem_cache_create(slab_name, block_size, align, SLAB_RECLAIM_ACCOUNT, NULL); if (!c->slab_cache) { @@ -2531,9 +2533,11 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign } } if (aux_size) - snprintf(slab_name, sizeof(slab_name), "dm_bufio_buffer-%u", aux_size); + snprintf(slab_name, sizeof(slab_name), "dm_bufio_buffer-%u-%u", + aux_size, atomic_inc_return(&seqno)); else - snprintf(slab_name, sizeof(slab_name), "dm_bufio_buffer"); + snprintf(slab_name, sizeof(slab_name), "dm_bufio_buffer-%u", + atomic_inc_return(&seqno)); c->slab_buffer = kmem_cache_create(slab_name, sizeof(struct dm_buffer) + aux_size, 0, SLAB_RECLAIM_ACCOUNT, NULL); if (!c->slab_buffer) { From 346dbf1b1345476a6524512892cceb931bee3039 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Mon, 11 Nov 2024 16:51:02 +0100 Subject: [PATCH 181/235] dm-cache: 
fix warnings about duplicate slab caches The commit 4c39529663b9 adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache code. The dm-cache code allocates a slab cache for each device. This commit changes it to allocate just one slab cache in the module init function. Signed-off-by: Mikulas Patocka Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y") --- drivers/md/dm-cache-background-tracker.c | 25 ++++++------------------ drivers/md/dm-cache-background-tracker.h | 8 ++++++++ drivers/md/dm-cache-target.c | 25 +++++++++++++++++++----- 3 files changed, 34 insertions(+), 24 deletions(-) diff --git a/drivers/md/dm-cache-background-tracker.c b/drivers/md/dm-cache-background-tracker.c index 9c5308298cf1..f3051bd7d2df 100644 --- a/drivers/md/dm-cache-background-tracker.c +++ b/drivers/md/dm-cache-background-tracker.c @@ -11,12 +11,6 @@ #define DM_MSG_PREFIX "dm-background-tracker" -struct bt_work { - struct list_head list; - struct rb_node node; - struct policy_work work; -}; - struct background_tracker { unsigned int max_work; atomic_t pending_promotes; @@ -26,10 +20,10 @@ struct background_tracker { struct list_head issued; struct list_head queued; struct rb_root pending; - - struct kmem_cache *work_cache; }; +struct kmem_cache *btracker_work_cache = NULL; + struct background_tracker *btracker_create(unsigned int max_work) { struct background_tracker *b = kmalloc(sizeof(*b), GFP_KERNEL); @@ -48,12 +42,6 @@ struct background_tracker *btracker_create(unsigned int max_work) INIT_LIST_HEAD(&b->queued); b->pending = RB_ROOT; - b->work_cache = KMEM_CACHE(bt_work, 0); - if (!b->work_cache) { - DMERR("couldn't create mempool for background work items"); - kfree(b); - b = NULL; - } return b; } @@ -66,10 +54,9 @@ void btracker_destroy(struct background_tracker *b) BUG_ON(!list_empty(&b->issued)); list_for_each_entry_safe (w, tmp, &b->queued, list) { list_del(&w->list); - kmem_cache_free(b->work_cache, w); + kmem_cache_free(btracker_work_cache, w); } - kmem_cache_destroy(b->work_cache); kfree(b); } EXPORT_SYMBOL_GPL(btracker_destroy); @@ -180,7 +167,7 @@ static struct bt_work *alloc_work(struct background_tracker *b) if (max_work_reached(b)) return NULL; - return kmem_cache_alloc(b->work_cache, GFP_NOWAIT); + return kmem_cache_alloc(btracker_work_cache, GFP_NOWAIT); } int btracker_queue(struct background_tracker *b, @@ -203,7 +190,7 @@ int btracker_queue(struct background_tracker *b, * There was a race, we'll just ignore this second * bit of work for the same oblock. */ - kmem_cache_free(b->work_cache, w); + kmem_cache_free(btracker_work_cache, w); return -EINVAL; } @@ -244,7 +231,7 @@ void btracker_complete(struct background_tracker *b, update_stats(b, &w->work, -1); rb_erase(&w->node, &b->pending); list_del(&w->list); - kmem_cache_free(b->work_cache, w); + kmem_cache_free(btracker_work_cache, w); } EXPORT_SYMBOL_GPL(btracker_complete); diff --git a/drivers/md/dm-cache-background-tracker.h b/drivers/md/dm-cache-background-tracker.h index 5b8f5c667b81..09c8fc59f7bb 100644 --- a/drivers/md/dm-cache-background-tracker.h +++ b/drivers/md/dm-cache-background-tracker.h @@ -26,6 +26,14 @@ * protected with a spinlock. 
*/ +struct bt_work { + struct list_head list; + struct rb_node node; + struct policy_work work; +}; + +extern struct kmem_cache *btracker_work_cache; + struct background_work; struct background_tracker; diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c index 40709310e327..849eb6333e98 100644 --- a/drivers/md/dm-cache-target.c +++ b/drivers/md/dm-cache-target.c @@ -10,6 +10,7 @@ #include "dm-bio-record.h" #include "dm-cache-metadata.h" #include "dm-io-tracker.h" +#include "dm-cache-background-tracker.h" #include #include @@ -2263,7 +2264,7 @@ static int parse_cache_args(struct cache_args *ca, int argc, char **argv, /*----------------------------------------------------------------*/ -static struct kmem_cache *migration_cache; +static struct kmem_cache *migration_cache = NULL; #define NOT_CORE_OPTION 1 @@ -3445,22 +3446,36 @@ static int __init dm_cache_init(void) int r; migration_cache = KMEM_CACHE(dm_cache_migration, 0); - if (!migration_cache) - return -ENOMEM; + if (!migration_cache) { + r = -ENOMEM; + goto err; + } + + btracker_work_cache = kmem_cache_create("dm_cache_bt_work", + sizeof(struct bt_work), __alignof__(struct bt_work), 0, NULL); + if (!btracker_work_cache) { + r = -ENOMEM; + goto err; + } r = dm_register_target(&cache_target); if (r) { - kmem_cache_destroy(migration_cache); - return r; + goto err; } return 0; + +err: + kmem_cache_destroy(migration_cache); + kmem_cache_destroy(btracker_work_cache); + return r; } static void __exit dm_cache_exit(void) { dm_unregister_target(&cache_target); kmem_cache_destroy(migration_cache); + kmem_cache_destroy(btracker_work_cache); } module_init(dm_cache_init); From 073d89808c065ac4c672c0a613a71b27a80691cb Mon Sep 17 00:00:00 2001 From: Wang Liang Date: Thu, 7 Nov 2024 10:34:05 +0800 Subject: [PATCH 182/235] net: fix data-races around sk->sk_forward_alloc Syzkaller reported this warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0 Modules linked in: CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78 FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __warn+0x88/0x130 ? inet_sock_destruct+0x1c5/0x1e0 ? report_bug+0x18e/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? inet_sock_destruct+0x1c5/0x1e0 __sk_destruct+0x2a/0x200 rcu_do_batch+0x1aa/0x530 ? rcu_do_batch+0x13b/0x530 rcu_core+0x159/0x2f0 handle_softirqs+0xd3/0x2b0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ksoftirqd+0x25/0x30 smpboot_thread_fn+0xdd/0x1d0 kthread+0xd3/0x100 ? 
__pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 ---[ end trace 0000000000000000 ]--- It's possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add() concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked, which triggers a data-race around sk->sk_forward_alloc: tcp_v6_rcv tcp_v6_do_rcv skb_clone_and_charge_r sk_rmem_schedule __sk_mem_schedule sk_forward_alloc_add() skb_set_owner_r sk_mem_charge sk_forward_alloc_add() __kfree_skb skb_release_all skb_release_head_state sock_rfree sk_mem_uncharge sk_forward_alloc_add() sk_mem_reclaim // set local var reclaimable __sk_mem_reclaim sk_forward_alloc_add() In this syzkaller testcase, two threads call tcp_v6_do_rcv() with skb->truesize=768; sk_forward_alloc changes like this: (cpu 1) | (cpu 2) | sk_forward_alloc ... | ... | 0 __sk_mem_schedule() | | +4096 = 4096 | __sk_mem_schedule() | +4096 = 8192 sk_mem_charge() | | -768 = 7424 | sk_mem_charge() | -768 = 6656 ... | ... | sk_mem_uncharge() | | +768 = 7424 reclaimable=7424 | | | sk_mem_uncharge() | +768 = 8192 | reclaimable=8192 | __sk_mem_reclaim() | | -4096 = 4096 | __sk_mem_reclaim() | -8192 = -4096 != 0 skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when sk->sk_state is TCP_LISTEN; that happens later, in tcp_v6_syn_recv_sock(). Fix the same issue in dccp_v6_do_rcv(). Suggested-by: Eric Dumazet Reviewed-by: Eric Dumazet Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Signed-off-by: Wang Liang Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski --- net/dccp/ipv6.c | 2 +- net/ipv6/tcp_ipv6.c | 4 +--- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index da5dba120bc9..d6649246188d 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -618,7 +618,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) by tcp. Feel free to propose better solution. --ANK (980728) */ - if (np->rxopt.all) + if (np->rxopt.all && sk->sk_state != DCCP_LISTEN) opt_skb = skb_clone_and_charge_r(skb, sk); if (sk->sk_state == DCCP_OPEN) { /* Fast path */ diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index d71ab4e1efe1..c9de5ef8f267 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1618,7 +1618,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) by tcp. Feel free to propose better solution. 
--ANK (980728) */ - if (np->rxopt.all) + if (np->rxopt.all && sk->sk_state != TCP_LISTEN) opt_skb = skb_clone_and_charge_r(skb, sk); if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ @@ -1656,8 +1656,6 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) if (reason) goto reset; } - if (opt_skb) - __kfree_skb(opt_skb); return 0; } } else From 66edc3a5894c74f8887c8af23b97593a0dd0df4d Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Wed, 6 Nov 2024 19:53:54 +0000 Subject: [PATCH 183/235] mm: page_alloc: move mlocked flag clearance into free_pages_prepare() Syzbot reported a bad page state problem caused by a page being freed using free_page() still having a mlocked flag at free_pages_prepare() stage: BUG: Bad page state in process syz.5.504 pfn:61f45 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45 flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 prep_new_page mm/page_alloc.c:1545 [inline] get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457 __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99 kvm_create_vm virt/kvm/kvm_main.c:1235 [inline] kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline] kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530 __do_compat_sys_ioctl fs/ioctl.c:1007 [inline] __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e page last free pid 8399 tgid 8399 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1108 [inline] free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686 folios_put_refs+0x76c/0x860 mm/swap.c:1007 free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline] tlb_batch_pages_flush mm/mmu_gather.c:149 [inline] tlb_flush_mmu_free mm/mmu_gather.c:366 [inline] tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373 tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465 exit_mmap+0x496/0xc40 mm/mmap.c:1926 __mmput+0x115/0x390 kernel/fork.c:1348 exit_mm+0x220/0x310 kernel/exit.c:571 do_exit+0x9b2/0x28e0 kernel/exit.c:926 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Modules linked in: CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: __dump_stack 
lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 bad_page+0x176/0x1d0 mm/page_alloc.c:501 free_page_is_bad mm/page_alloc.c:918 [inline] free_pages_prepare mm/page_alloc.c:1100 [inline] free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638 kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline] kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386 kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143 __fput+0x23f/0x880 fs/file_table.c:431 task_work_run+0x24f/0x310 kernel/task_work.c:239 exit_task_work include/linux/task_work.h:43 [inline] do_exit+0xa2f/0x28e0 kernel/exit.c:939 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf745d579 Code: Unable to access opcode bytes at 0xf745d54f. RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4 RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 The problem was originally introduced by commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance"): it was focused on handling pagecache and anonymous memory and wasn't suitable for lower-level get_page()/free_page() APIs used, for example, by KVM, as with this reproducer. Fix it by moving the mlocked flag clearance down to free_pages_prepare(). The bug itself is fairly old and harmless (aside from generating these warnings and a small memory leak - "bad" pages are stopped from being allocated again). Link: https://lkml.kernel.org/r/20241106195354.270757-1-roman.gushchin@linux.dev Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance") Signed-off-by: Roman Gushchin Reported-by: syzbot+e985d3026c4fd041578e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com Acked-by: Hugh Dickins Cc: Matthew Wilcox Cc: Sean Christopherson Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- mm/page_alloc.c | 15 +++++++++++++++ mm/swap.c | 14 -------------- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c6c7bb3ea71b..216fbbfbedcf 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1048,6 +1048,7 @@ __always_inline bool free_pages_prepare(struct page *page, bool skip_kasan_poison = should_skip_kasan_poison(page); bool init = want_init_on_free(); bool compound = PageCompound(page); + struct folio *folio = page_folio(page); VM_BUG_ON_PAGE(PageTail(page), page); @@ -1057,6 +1058,20 @@ __always_inline bool free_pages_prepare(struct page *page, if (memcg_kmem_online() && PageMemcgKmem(page)) __memcg_kmem_uncharge_page(page, order); + /* + * In rare cases, when truncation or holepunching raced with + * munlock after VM_LOCKED was cleared, Mlocked may still be + * found set here. This does not indicate a problem, unless + * "unevictable_pgs_cleared" appears worryingly large. 
+ */ + if (unlikely(folio_test_mlocked(folio))) { + long nr_pages = folio_nr_pages(folio); + + __folio_clear_mlocked(folio); + zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages); + count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); + } + if (unlikely(PageHWPoison(page)) && !order) { /* Do not let hwpoison pages hit pcplists/buddy */ reset_page_owner(page, order); diff --git a/mm/swap.c b/mm/swap.c index b8e3259ea2c4..59f30a981c6f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -78,20 +78,6 @@ static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp, lruvec_del_folio(*lruvecp, folio); __folio_clear_lru_flags(folio); } - - /* - * In rare cases, when truncation or holepunching raced with - * munlock after VM_LOCKED was cleared, Mlocked may still be - * found set here. This does not indicate a problem, unless - * "unevictable_pgs_cleared" appears worryingly large. - */ - if (unlikely(folio_test_mlocked(folio))) { - long nr_pages = folio_nr_pages(folio); - - __folio_clear_mlocked(folio); - zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages); - count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); - } } /* From cd45e963e44b0f10d90b9e6c0e8b4f47f3c92471 Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Thu, 7 Nov 2024 01:07:32 +0900 Subject: [PATCH 184/235] nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints". This series fixes null pointer dereference bugs that occur when using nilfs2 and two block-related tracepoints. This patch (of 2): It has been reported that when using "block:block_touch_buffer" tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because since the tracepoint was added in touch_buffer(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, the block_device structure is set after the function returns to the caller. Here, touch_buffer() is used to mark the folio/page that owns the buffer head as accessed, but the common search helper for folio/page used by the caller function was optimized to mark the folio/page as accessed when it was reimplemented a long time ago, eliminating the need to call touch_buffer() here in the first place. So this solves the issue by eliminating the touch_buffer() call itself. 
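In miniature, the hazard was instrumentation that unconditionally
dereferences bh->b_bdev->bd_dev on a buffer head whose device pointer has
not been assigned yet. A simplified userspace sketch of that ordering
problem (stand-in structs, not the kernel's):

#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures involved. */
struct block_device { unsigned int bd_dev; };
struct buffer_head { struct block_device *b_bdev; };

/* What the tracepoint effectively did on every touch_buffer() call. */
static void trace_touch_buffer(struct buffer_head *bh)
{
	if (!bh->b_bdev) {	/* without this guard: NULL dereference */
		printf("b_bdev not set yet, skipping trace\n");
		return;
	}
	printf("touched buffer on dev %u\n", bh->b_bdev->bd_dev);
}

int main(void)
{
	static struct block_device bdev = { .bd_dev = 8 };
	struct buffer_head bh = { .b_bdev = NULL };

	/* The lookup path handed out bh before b_bdev was assigned ... */
	trace_touch_buffer(&bh);

	/* ... and the caller only set it after the helper returned. */
	bh.b_bdev = &bdev;
	trace_touch_buffer(&bh);
	return 0;
}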
Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi Reported-by: Ubisectech Sirius Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2 Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Cc: Tejun Heo Cc: Signed-off-by: Andrew Morton --- fs/nilfs2/page.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index 10def4b55995..296dbf9cca22 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -39,7 +39,6 @@ static struct buffer_head *__nilfs_get_folio_block(struct folio *folio, first_block = (unsigned long)index << (PAGE_SHIFT - blkbits); bh = get_nth_bh(bh, block - first_block); - touch_buffer(bh); wait_on_buffer(bh); return bh; } From 2026559a6c4ce34db117d2db8f710fe2a9420d5a Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Thu, 7 Nov 2024 01:07:33 +0900 Subject: [PATCH 185/235] nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty() may cause a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because, since the tracepoint was added in mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, nilfs_grab_buffer(), which grabs a buffer to read (or create) a block of metadata, including b-tree node blocks, does not set the block device, but instead does so only if the buffer is not in the "uptodate" state for each of its caller block reading functions. However, if the uptodate flag is set on a folio/page, and the buffer heads are detached from it by try_to_free_buffers(), and new buffer heads are then attached by create_empty_buffers(), the uptodate flag may be restored to each buffer without the block device being set to bh->b_bdev, and mark_buffer_dirty() may be called later in that state, resulting in the bug mentioned above. Fix this issue by making nilfs_grab_buffer() always set the block device of the super block structure to the buffer head, regardless of the state of the buffer's uptodate flag. 
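Stated generally, the rule the fix applies is to (re)establish invariant
fields at the single point where an object is handed out, rather than
lazily behind a state flag that other code can restore. A simplified
before/after sketch, using stub helpers in place of the real lookup path:

#include <stdio.h>

struct block_device { unsigned int bd_dev; };
struct buffer_head { struct block_device *b_bdev; int uptodate; };
struct super_block { struct block_device *s_bdev; };

/* Stub: hands back a recycled buffer head that is already uptodate,
 * as after try_to_free_buffers() + create_empty_buffers(). */
static struct buffer_head *lookup_bh(void)
{
	static struct buffer_head bh = { .b_bdev = NULL, .uptodate = 1 };
	return &bh;
}

/* Before the fix: the device is only set on the !uptodate path, so the
 * recycled buffer keeps b_bdev == NULL. */
static struct buffer_head *grab_buffer_buggy(struct super_block *sb)
{
	struct buffer_head *bh = lookup_bh();

	if (!bh->uptodate)
		bh->b_bdev = sb->s_bdev;
	return bh;
}

/* After the fix: set unconditionally at the single hand-out point. */
static struct buffer_head *grab_buffer_fixed(struct super_block *sb)
{
	struct buffer_head *bh = lookup_bh();

	bh->b_bdev = sb->s_bdev;
	return bh;
}

int main(void)
{
	struct block_device bdev = { .bd_dev = 8 };
	struct super_block sb = { .s_bdev = &bdev };

	printf("buggy: b_bdev %p\n", (void *)grab_buffer_buggy(&sb)->b_bdev);
	printf("fixed: b_bdev %p\n", (void *)grab_buffer_fixed(&sb)->b_bdev);
	return 0;
}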
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi Cc: Tejun Heo Cc: Ubisectech Sirius Cc: Signed-off-by: Andrew Morton --- fs/nilfs2/btnode.c | 2 -- fs/nilfs2/gcinode.c | 4 +--- fs/nilfs2/mdt.c | 1 - fs/nilfs2/page.c | 1 + 4 files changed, 2 insertions(+), 6 deletions(-) diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c index 57b4af5ad646..501ad7be5174 100644 --- a/fs/nilfs2/btnode.c +++ b/fs/nilfs2/btnode.c @@ -68,7 +68,6 @@ nilfs_btnode_create_block(struct address_space *btnc, __u64 blocknr) goto failed; } memset(bh->b_data, 0, i_blocksize(inode)); - bh->b_bdev = inode->i_sb->s_bdev; bh->b_blocknr = blocknr; set_buffer_mapped(bh); set_buffer_uptodate(bh); @@ -133,7 +132,6 @@ int nilfs_btnode_submit_block(struct address_space *btnc, __u64 blocknr, goto found; } set_buffer_mapped(bh); - bh->b_bdev = inode->i_sb->s_bdev; bh->b_blocknr = pblocknr; /* set block address for read */ bh->b_end_io = end_buffer_read_sync; get_bh(bh); diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c index 1c9ae36a03ab..ace22253fed0 100644 --- a/fs/nilfs2/gcinode.c +++ b/fs/nilfs2/gcinode.c @@ -83,10 +83,8 @@ int nilfs_gccache_submit_read_data(struct inode *inode, sector_t blkoff, goto out; } - if (!buffer_mapped(bh)) { - bh->b_bdev = inode->i_sb->s_bdev; + if (!buffer_mapped(bh)) set_buffer_mapped(bh); - } bh->b_blocknr = pbn; bh->b_end_io = end_buffer_read_sync; get_bh(bh); diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c index ceb7dc0b5bad..2db6350b5ac2 100644 --- a/fs/nilfs2/mdt.c +++ b/fs/nilfs2/mdt.c @@ -89,7 +89,6 @@ static int nilfs_mdt_create_block(struct inode *inode, unsigned long block, if (buffer_uptodate(bh)) goto failed_bh; - bh->b_bdev = sb->s_bdev; err = nilfs_mdt_insert_new_block(inode, block, bh, init_block); if (likely(!err)) { get_bh(bh); diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index 296dbf9cca22..6dd8b854cd1f 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -63,6 +63,7 @@ struct buffer_head *nilfs_grab_buffer(struct inode *inode, folio_put(folio); return NULL; } + bh->b_bdev = inode->i_sb->s_bdev; return bh; } From 23aab037106d46e6168ce1214a958ce9bf317f2e Mon Sep 17 00:00:00 2001 From: Dmitry Antipov Date: Wed, 6 Nov 2024 12:21:00 +0300 Subject: [PATCH 186/235] ocfs2: fix UBSAN warning in ocfs2_verify_volume() Syzbot has reported the following splat triggered by UBSAN: UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10 shift exponent 32768 is too large for 32-bit type 'int' CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: dump_stack_lvl+0x241/0x360 ? __pfx_dump_stack_lvl+0x10/0x10 ? __pfx__printk+0x10/0x10 ? __asan_memset+0x23/0x50 ? lockdep_init_map_type+0xa1/0x910 __ubsan_handle_shift_out_of_bounds+0x3c8/0x420 ocfs2_fill_super+0xf9c/0x5750 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? validate_chain+0x11e/0x5920 ? __lock_acquire+0x1384/0x2050 ? __pfx_validate_chain+0x10/0x10 ? string+0x26a/0x2b0 ? widen_string+0x3a/0x310 ? string+0x26a/0x2b0 ? bdev_name+0x2b1/0x3c0 ? pointer+0x703/0x1210 ? __pfx_pointer+0x10/0x10 ? __pfx_format_decode+0x10/0x10 ? __lock_acquire+0x1384/0x2050 ? vsnprintf+0x1ccd/0x1da0 ? snprintf+0xda/0x120 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x14f/0x370 ? __pfx_snprintf+0x10/0x10 ? set_blocksize+0x1f9/0x360 ? 
sb_set_blocksize+0x98/0xf0 ? setup_bdev_super+0x4e6/0x5d0 mount_bdev+0x20c/0x2d0 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_mount_bdev+0x10/0x10 ? vfs_parse_fs_string+0x190/0x230 ? __pfx_vfs_parse_fs_string+0x10/0x10 legacy_get_tree+0xf0/0x190 ? __pfx_ocfs2_mount+0x10/0x10 vfs_get_tree+0x92/0x2b0 do_new_mount+0x2be/0xb40 ? __pfx_do_new_mount+0x10/0x10 __se_sys_mount+0x2d6/0x3c0 ? __pfx___se_sys_mount+0x10/0x10 ? do_syscall_64+0x100/0x230 ? __x64_sys_mount+0x20/0xc0 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f37cae96fda Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240 RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000 R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0 R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000 For a really damaged superblock, the value of 'i_super.s_blocksize_bits' may exceed the maximum possible shift for an underlying 'int'. So add an extra check whether the aforementioned field represents the valid block size, which is 512 bytes, 1K, 2K, or 4K. Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Dmitry Antipov Reported-by: syzbot+56f7cd1abe4b8e475180@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180 Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/super.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 3d404624bb96..c79b4291777f 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -2319,6 +2319,7 @@ static int ocfs2_verify_volume(struct ocfs2_dinode *di, struct ocfs2_blockcheck_stats *stats) { int status = -EAGAIN; + u32 blksz_bits; if (memcmp(di->i_signature, OCFS2_SUPER_BLOCK_SIGNATURE, strlen(OCFS2_SUPER_BLOCK_SIGNATURE)) == 0) { @@ -2333,11 +2334,15 @@ static int ocfs2_verify_volume(struct ocfs2_dinode *di, goto out; } status = -EINVAL; - if ((1 << le32_to_cpu(di->id2.i_super.s_blocksize_bits)) != blksz) { + /* Acceptable block sizes are 512 bytes, 1K, 2K and 4K. 
*/ blksz_bits = le32_to_cpu(di->id2.i_super.s_blocksize_bits); + if (blksz_bits < 9 || blksz_bits > 12) { + mlog(ML_ERROR, "found superblock with incorrect block " - "size: found %u, should be %u\n", - 1 << le32_to_cpu(di->id2.i_super.s_blocksize_bits), - blksz); + "size bits: found %u, should be 9, 10, 11, or 12\n", + blksz_bits); + } else if ((1 << blksz_bits) != blksz) { + mlog(ML_ERROR, "found superblock with incorrect block " + "size: found %u, should be %u\n", 1 << blksz_bits, blksz); } else if (le16_to_cpu(di->id2.i_super.s_major_rev_level) != OCFS2_MAJOR_REV_LEVEL || le16_to_cpu(di->id2.i_super.s_minor_rev_level) != From 247d720b2c5d22f7281437fd6054a138256986ba Mon Sep 17 00:00:00 2001 From: Hajime Tazaki Date: Sat, 9 Nov 2024 07:28:34 +0900 Subject: [PATCH 187/235] nommu: pass NULL argument to vma_iter_prealloc() When deleting a vma entry from a maple tree, it has to pass NULL to vma_iter_prealloc() in order to calculate the internal state of the tree, but it passed a wrong argument. As a result, nommu kernels crashed upon accessing a vma iterator, such as acct_collect() reading the size of vma entries after do_munmap(). This commit fixes this issue by passing the right argument to the preallocation call. Link: https://lkml.kernel.org/r/20241108222834.3625217-1-thehajime@gmail.com Fixes: b5df09226450 ("mm: set up vma iterator for vma_iter_prealloc() calls") Signed-off-by: Hajime Tazaki Reviewed-by: Liam R. Howlett Cc: Signed-off-by: Andrew Morton --- mm/nommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/nommu.c b/mm/nommu.c index e9b5f527ab5b..9cb6e99215e2 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -573,7 +573,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma) VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_start); vma_iter_config(&vmi, vma->vm_start, vma->vm_end); - if (vma_iter_prealloc(&vmi, vma)) { + if (vma_iter_prealloc(&vmi, NULL)) { pr_warn("Allocation of vma tree for process %d failed\n", current->pid); return -ENOMEM; From f2685c00c3222305f5b6740a8b16ea044640283a Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Thu, 7 Nov 2024 21:03:30 +0000 Subject: [PATCH 188/235] net: fix SO_DEVMEM_DONTNEED looping too long Exit early if we're freeing more than 1024 frags, to prevent looping too long. Also minor code cleanups: - Flip checks to reduce indentation. - Use sizeof(*tokens) everywhere for consistency. Cc: Yi Lai Signed-off-by: Mina Almasry Acked-by: Stanislav Fomichev Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com Signed-off-by: Jakub Kicinski --- net/core/sock.c | 42 ++++++++++++++++++++++++------------------ 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 039be95c40cf..da50df485090 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1052,32 +1052,34 @@ static int sock_reserve_memory(struct sock *sk, int bytes) #ifdef CONFIG_PAGE_POOL -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in - * 1 syscall. The limit exists to limit the amount of memory the kernel - * allocates to copy these tokens. +/* This is the number of tokens and frags that the user can SO_DEVMEM_DONTNEED + * in 1 syscall. The limit exists to limit the amount of memory the kernel + * allocates to copy these tokens, and to prevent looping over the frags for + * too long. 
*/ #define MAX_DONTNEED_TOKENS 128 +#define MAX_DONTNEED_FRAGS 1024 static noinline_for_stack int sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) { unsigned int num_tokens, i, j, k, netmem_num = 0; struct dmabuf_token *tokens; + int ret = 0, num_frags = 0; netmem_ref netmems[16]; - int ret = 0; if (!sk_is_tcp(sk)) return -EBADF; - if (optlen % sizeof(struct dmabuf_token) || + if (optlen % sizeof(*tokens) || optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS) return -EINVAL; - tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL); + num_tokens = optlen / sizeof(*tokens); + tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL); if (!tokens) return -ENOMEM; - num_tokens = optlen / sizeof(struct dmabuf_token); if (copy_from_sockptr(tokens, optval, optlen)) { kvfree(tokens); return -EFAULT; @@ -1086,24 +1088,28 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) xa_lock_bh(&sk->sk_user_frags); for (i = 0; i < num_tokens; i++) { for (j = 0; j < tokens[i].token_count; j++) { + if (++num_frags > MAX_DONTNEED_FRAGS) + goto frag_limit_reached; + netmem_ref netmem = (__force netmem_ref)__xa_erase( &sk->sk_user_frags, tokens[i].token_start + j); - if (netmem && - !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) { - netmems[netmem_num++] = netmem; - if (netmem_num == ARRAY_SIZE(netmems)) { - xa_unlock_bh(&sk->sk_user_frags); - for (k = 0; k < netmem_num; k++) - WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); - netmem_num = 0; - xa_lock_bh(&sk->sk_user_frags); - } - ret++; + if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem))) + continue; + + netmems[netmem_num++] = netmem; + if (netmem_num == ARRAY_SIZE(netmems)) { + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); + netmem_num = 0; + xa_lock_bh(&sk->sk_user_frags); } + ret++; } } +frag_limit_reached: xa_unlock_bh(&sk->sk_user_frags); for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); From 102d1404c385611c574498b1e0d1f3762e253359 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Thu, 7 Nov 2024 21:03:31 +0000 Subject: [PATCH 189/235] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Document new behavior when the number of frags passed is too big. Signed-off-by: Mina Almasry Link: https://patch.msgid.link/20241107210331.3044434-2-almasrymina@google.com Signed-off-by: Jakub Kicinski --- Documentation/networking/devmem.rst | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst index a55bf21f671c..d95363645331 100644 --- a/Documentation/networking/devmem.rst +++ b/Documentation/networking/devmem.rst @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner. Failure to do so will exhaust the limited dmabuf that is bound to the RX queue and will lead to packet drops. +The user must pass no more than 128 tokens, with no more than 1024 total frags +among the token->token_count across all the tokens. If the user provides more +than 1024 frags, the kernel will free up to 1024 frags and return early. + +The kernel returns the number of actual frags freed. The number of frags freed +can be less than the tokens provided by the user in case of: + +(a) an internal kernel leak bug. +(b) the user passed more than 1024 frags. 
Implementation & Caveats ======================== From 581302298524e9d77c4c44ff5156a6cd112227ae Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Fri, 8 Nov 2024 11:58:16 +0100 Subject: [PATCH 190/235] mptcp: error out earlier on disconnect Eric reported a division by zero splat in the MPTCP protocol: Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted 6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163 Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8 0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 7c 24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89 RSP: 0018:ffffc900041f7930 EFLAGS: 00010293 RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004 RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67 R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80 R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000 FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493 mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline] mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289 inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885 sock_recvmsg_nosec net/socket.c:1051 [inline] sock_recvmsg+0x1b2/0x250 net/socket.c:1073 __sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265 __do_sys_recvfrom net/socket.c:2283 [inline] __se_sys_recvfrom net/socket.c:2279 [inline] __x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7feb5d857559 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559 RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000 R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef and provided a nice reproducer. The root cause is the current bad handling of racing disconnect. After the blamed commit below, sk_wait_data() can return (with error) with the underlying socket disconnected and a zero rcv_mss. Catch the error and return without performing any additional operations on the current socket. 
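For reference, the divide-by-zero comes from the window rounding performed on the TCP subflow: __tcp_select_window() rounds the offered window down to a multiple of the cached rcv_mss, and a socket that raced with disconnect has rcv_mss == 0. A deliberately simplified sketch of that failing shape (illustrative only, not the actual kernel code):

	/* Hypothetical reduction of the rounding step: dividing by a
	 * rcv_mss that was zeroed by a racing disconnect is what trips
	 * the "divide error" splat above.
	 */
	static unsigned int round_window_to_mss(unsigned int free_space,
						unsigned int rcv_mss)
	{
		return (free_space / rcv_mss) * rcv_mss;
	}

Hence the fix below checks the return value of sk_wait_data() and bails out before any further per-subflow processing can run.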
Reported-by: Eric Dumazet Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting") Signed-off-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski --- net/mptcp/protocol.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index d263091659e0..95a5a3da3944 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2205,7 +2205,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, cmsg_flags = MPTCP_CMSG_INQ; while (copied < len) { - int bytes_read; + int err, bytes_read; bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags); if (unlikely(bytes_read < 0)) { @@ -2267,9 +2267,16 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, } pr_debug("block timeout %ld\n", timeo); - sk_wait_data(sk, &timeo, NULL); + mptcp_rcv_space_adjust(msk, copied); + err = sk_wait_data(sk, &timeo, NULL); + if (err < 0) { + err = copied ? : err; + goto out_err; + } } + mptcp_rcv_space_adjust(msk, copied); + out_err: if (cmsg_flags && copied >= 0) { if (cmsg_flags & MPTCP_CMSG_TS) @@ -2285,8 +2292,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n", msk, skb_queue_empty_lockless(&sk->sk_receive_queue), skb_queue_empty(&msk->receive_queue), copied); - if (!(flags & MSG_PEEK)) - mptcp_rcv_space_adjust(msk, copied); release_sock(sk); return copied; From ce7356ae35943cc6494cc692e62d51a734062b7d Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Fri, 8 Nov 2024 11:58:17 +0100 Subject: [PATCH 191/235] mptcp: cope racing subflow creation in mptcp_rcv_space_adjust Additional active subflows, i.e. those created by the in-kernel path manager, are included in the subflow list before starting the 3whs. A racing recvmsg() spooling data received on an already established subflow would unconditionally call tcp_cleanup_rbuf() on all the current subflows, potentially hitting a divide by zero error on the newly created ones. Explicitly check that the subflow is in a suitable state before invoking tcp_cleanup_rbuf(). Fixes: c76c6956566f ("mptcp: call tcp_cleanup_rbuf on subflows") Signed-off-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski --- net/mptcp/protocol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 95a5a3da3944..48d480982b78 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2082,7 +2082,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied) slow = lock_sock_fast(ssk); WRITE_ONCE(ssk->sk_rcvbuf, rcvbuf); WRITE_ONCE(tcp_sk(ssk)->window_clamp, window_clamp); - tcp_cleanup_rbuf(ssk, 1); + if (tcp_can_send_ack(ssk)) + tcp_cleanup_rbuf(ssk, 1); unlock_sock_fast(ssk, slow); } } From 1220965d619178713844ef365beb9d9b88267e13 Mon Sep 17 00:00:00 2001 From: Chiara Meiohas Date: Thu, 7 Nov 2024 20:35:21 +0200 Subject: [PATCH 192/235] net/mlx5: E-switch, unload IB representors when unloading ETH representors IB representors depend on ETH representors, so the IB representors should not exist without the ETH ones. When unloading the ETH representors, the corresponding IB representors should also be unloaded. 
The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") introduced the use of the ib_device_set_netdev API in IB representors. ib_device_set_netdev() increments the refcount of the representor's netdev when loading an IB representor and decrements it when unloading. Without the unloading of the IB representor, the refcount of the representor's netdev remains greater than 0, preventing it from being unregistered. The patch uncovered an underlying bug where the ETH representor is unloaded without unloading the IB representor. This issue happened when using multiport E-switch and rebooting, causing the shutdown to hang when unloading the ETH representor because the refcount of the representor's netdevice was greater than 0. Call trace: unregister_netdevice: waiting for eth3 to become free. Usage count = 2 ref_tracker: eth%d@00000000661d60f7 has 1/1 users at ib_device_set_netdev+0x160/0x2d0 [ib_core] mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib] mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core] mlx5_mpesw_work+0x236/0x330 [mlx5_core] process_one_work+0x169/0x320 worker_thread+0x288/0x3a0 kthread+0xb8/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas Reviewed-by: Mark Bloch Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c index f24f91d213f2..8cf61ae8b89d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c @@ -2527,8 +2527,11 @@ static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw, struct mlx5_eswitch_rep *rep, u8 rep_type) { if (atomic_cmpxchg(&rep->rep_data[rep_type].state, - REP_LOADED, REP_REGISTERED) == REP_LOADED) + REP_LOADED, REP_REGISTERED) == REP_LOADED) { + if (rep_type == REP_ETH) + __esw_offloads_unload_rep(esw, rep, REP_IB); esw->offloads.rep_ops[rep_type]->unload(rep); + } } static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type) From d0989c9d2b3a89ae5e4ad45fe6d7bbe449fc49fe Mon Sep 17 00:00:00 2001 From: Parav Pandit Date: Thu, 7 Nov 2024 20:35:22 +0200 Subject: [PATCH 193/235] net/mlx5: Fix msix vectors to respect platform limit The number of PCI vectors allocated by the platform (which may be fewer than requested) is currently not honored when creating the SF pool; only the PCI MSI-X capability is considered. As a result, when a platform allocates fewer vectors (in non-dynamic mode) than requested, the PF and SF pools end up with an invalid vector range. This causes incorrect SF vector accounting, which leads to the following call trace when an invalid IRQ vector is allocated. This issue is resolved by ensuring that the platform's vector limit is respected for both the SF and PF pools. 
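A minimal sketch of the approach (simplified from the diff below, which also threads the remaining count through irq_pools_init()):

	/* 'n' is the number of vectors the platform actually granted,
	 * which may be less than 'total_vec' on non-dynamic platforms.
	 */
	n = pci_alloc_irq_vectors(dev->pdev, 1, req_vec, PCI_IRQ_MSIX);
	if (n < 0)
		return n;

	if (!pci_msix_can_alloc_dyn(dev->pdev)) {
		/* Non-dynamic mode: clamp both pools to what was granted. */
		pcif_vec = min_t(int, n, pcif_vec);
		total_vec = min_t(int, n, total_vec);
	}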
Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core] RIP: 0010:pci_irq_vector+0x23/0x80 RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246 RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001 RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308 RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860 R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160 R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0 FS: 0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000 ? rescuer_thread+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356 mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0 mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Parav Pandit Signed-off-by: Amir Tzin Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 32 ++++++++++++++++--- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c index 81a9232a03e1..7db9cab9bedf 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c @@ -593,9 +593,11 @@ static void irq_pool_free(struct mlx5_irq_pool *pool) kvfree(pool); } -static int irq_pools_init(struct mlx5_core_dev *dev, int sf_vec, int pcif_vec) +static int irq_pools_init(struct mlx5_core_dev *dev, int sf_vec, int pcif_vec, + bool dynamic_vec) { struct mlx5_irq_table *table = dev->priv.irq_table; + int sf_vec_available = sf_vec; int num_sf_ctrl; int err; @@ -616,6 +618,13 @@ static int irq_pools_init(struct mlx5_core_dev *dev, int sf_vec, int pcif_vec) num_sf_ctrl = DIV_ROUND_UP(mlx5_sf_max_functions(dev), MLX5_SFS_PER_CTRL_IRQ); num_sf_ctrl = min_t(int, MLX5_IRQ_CTRL_SF_MAX, num_sf_ctrl); + if (!dynamic_vec && (num_sf_ctrl + 1) > sf_vec_available) { + mlx5_core_dbg(dev, + "Not enough IRQs for SFs control and completion pool, required=%d avail=%d\n", + num_sf_ctrl + 1, sf_vec_available); + return 0; + } + table->sf_ctrl_pool = irq_pool_alloc(dev, pcif_vec, num_sf_ctrl, "mlx5_sf_ctrl", MLX5_EQ_SHARE_IRQ_MIN_CTRL, @@ -624,9 +633,11 @@ static int irq_pools_init(struct mlx5_core_dev *dev, int sf_vec, int pcif_vec) err = PTR_ERR(table->sf_ctrl_pool); goto err_pf; } - /* init sf_comp_pool */ + sf_vec_available -= num_sf_ctrl; + + /* init sf_comp_pool, remaining vectors are for the SF completions */ table->sf_comp_pool = irq_pool_alloc(dev, pcif_vec + num_sf_ctrl, - sf_vec - num_sf_ctrl, "mlx5_sf_comp", + sf_vec_available, "mlx5_sf_comp", MLX5_EQ_SHARE_IRQ_MIN_COMP, MLX5_EQ_SHARE_IRQ_MAX_COMP); if (IS_ERR(table->sf_comp_pool)) { @@ -715,6 +726,7 @@ int 
mlx5_irq_table_get_num_comp(struct mlx5_irq_table *table) int mlx5_irq_table_create(struct mlx5_core_dev *dev) { int num_eqs = mlx5_max_eq_cap_get(dev); + bool dynamic_vec; int total_vec; int pcif_vec; int req_vec; @@ -724,21 +736,31 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev) if (mlx5_core_is_sf(dev)) return 0; + /* PCI PF vectors usage is limited by online cpus, device EQs and + * PCI MSI-X capability. + */ pcif_vec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + 1; pcif_vec = min_t(int, pcif_vec, num_eqs); + pcif_vec = min_t(int, pcif_vec, pci_msix_vec_count(dev->pdev)); total_vec = pcif_vec; if (mlx5_sf_max_functions(dev)) total_vec += MLX5_MAX_MSIX_PER_SF * mlx5_sf_max_functions(dev); total_vec = min_t(int, total_vec, pci_msix_vec_count(dev->pdev)); - pcif_vec = min_t(int, pcif_vec, pci_msix_vec_count(dev->pdev)); req_vec = pci_msix_can_alloc_dyn(dev->pdev) ? 1 : total_vec; n = pci_alloc_irq_vectors(dev->pdev, 1, req_vec, PCI_IRQ_MSIX); if (n < 0) return n; - err = irq_pools_init(dev, total_vec - pcif_vec, pcif_vec); + /* Further limit vectors of the pools based on platform for non dynamic case */ + dynamic_vec = pci_msix_can_alloc_dyn(dev->pdev); + if (!dynamic_vec) { + pcif_vec = min_t(int, n, pcif_vec); + total_vec = min_t(int, n, total_vec); + } + + err = irq_pools_init(dev, total_vec - pcif_vec, pcif_vec, dynamic_vec); if (err) pci_free_irq_vectors(dev->pdev); From 9ca314419930f9135727e39d77e66262d5f7bef6 Mon Sep 17 00:00:00 2001 From: Mark Bloch Date: Thu, 7 Nov 2024 20:35:23 +0200 Subject: [PATCH 194/235] net/mlx5: fs, lock FTE when checking if active The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. 
[ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? 
generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch Reviewed-by: Maor Gottlieb Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/mellanox/mlx5/core/fs_core.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 8505d5e241e1..6e4f8aaf8d2f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -2105,13 +2105,22 @@ lookup_fte_locked(struct mlx5_flow_group *g, fte_tmp = NULL; goto out; } - if (!fte_tmp->node.active) { - tree_put_node(&fte_tmp->node, false); - fte_tmp = NULL; - goto out; - } nested_down_write_ref_node(&fte_tmp->node, FS_LOCK_CHILD); + + if (!fte_tmp->node.active) { + up_write_ref_node(&fte_tmp->node, false); + + if (take_write) + up_write_ref_node(&g->node, false); + else + up_read_ref_node(&g->node); + + tree_put_node(&fte_tmp->node, false); + + return NULL; + } + out: if (take_write) up_write_ref_node(&g->node, false); From dd6e972cc5890d91d6749bb48e3912721c4e4b25 Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Thu, 7 Nov 2024 20:35:24 +0200 Subject: [PATCH 195/235] net/mlx5e: kTLS, Fix incorrect page refcounting The kTLS tx handling code is using a mix of get_page() and page_ref_inc() APIs to increment the page reference. But on the release path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used. This is an issue when using pages from large folios: the get_page() references are stored on the folio page while the page_ref_inc() references are stored directly in the given page. On release the folio page will be dereferenced too many times. This was found while doing kTLS testing with sendfile() + ZC when the served file was read from NFS on a kernel with NFS large folios support (commit 49b29a573da8 ("nfs: add support for large folios")). 
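The invariant being restored, shown as an illustrative sketch (not part of the patch; assumes an skb_frag_t *frag as in tx_sync_info_get()): a page reference must be released on the same counter it was acquired on, and for a page inside a large folio the two acquire APIs touch different counters:

	struct page *p = skb_frag_page(frag);	/* may be a tail page of a large folio */

	get_page(p);		/* increments the folio's (head page's) refcount */
	page_ref_inc(p);	/* increments this exact page's refcount */

	put_page(p);		/* decrements the folio refcount: balances get_page() */
	put_page(p);		/* WRONG: decrements the folio again, underflowing it,
				 * while the page_ref_inc() above is never undone.
				 * Only page_ref_dec(p) balances it, which is why the
				 * fix uses page_ref_inc()/page_ref_dec() consistently.
				 */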
Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size") Signed-off-by: Dragos Tatulea Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c index d61be26a4df1..3db31cc10719 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c @@ -660,7 +660,7 @@ tx_sync_info_get(struct mlx5e_ktls_offload_context_tx *priv_tx, while (remaining > 0) { skb_frag_t *frag = &record->frags[i]; - get_page(skb_frag_page(frag)); + page_ref_inc(skb_frag_page(frag)); remaining -= skb_frag_size(frag); info->frags[i++] = *frag; } @@ -763,7 +763,7 @@ void mlx5e_ktls_tx_handle_resync_dump_comp(struct mlx5e_txqsq *sq, stats = sq->stats; mlx5e_tx_dma_unmap(sq->pdev, dma); - put_page(wi->resync_dump_frag_page); + page_ref_dec(wi->resync_dump_frag_page); stats->tls_dump_packets++; stats->tls_dump_bytes += wi->num_bytes; } @@ -816,12 +816,12 @@ mlx5e_ktls_tx_handle_ooo(struct mlx5e_ktls_offload_context_tx *priv_tx, err_out: for (; i < info.nr_frags; i++) - /* The put_page() here undoes the page ref obtained in tx_sync_info_get(). + /* The page_ref_dec() here undoes the page ref obtained in tx_sync_info_get(). * Page refs obtained for the DUMP WQEs above (by page_ref_add) will be * released only upon their completions (or in mlx5e_free_txqsq_descs, * if channel closes). */ - put_page(skb_frag_page(&info.frags[i])); + page_ref_dec(skb_frag_page(&info.frags[i])); return MLX5E_KTLS_SYNC_FAIL; } From c079389878debf767dc4e52fe877b9117258dfe2 Mon Sep 17 00:00:00 2001 From: William Tu Date: Thu, 7 Nov 2024 20:35:25 +0200 Subject: [PATCH 196/235] net/mlx5e: clear xdp features on non-uplink representors Non-uplink representor ports do not support XDP. The patch clears the xdp features by checking whether net_device_ops.ndo_bpf is set. 
Verify using the netlink tool: $ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get Representor netdev before the patch: {'ifindex': 8, 'xdp-features': {'basic', 'ndo-xmit', 'ndo-xmit-sg', 'redirect', 'rx-sg', 'xsk-zerocopy'}, 'xdp-rx-metadata-features': set(), 'xdp-zc-max-segs': 1, 'xsk-features': set()}, With the patch: {'ifindex': 8, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Signed-off-by: William Tu Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index e601324a690a..13a3fa8dc0cb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -4267,7 +4267,8 @@ void mlx5e_set_xdp_feature(struct net_device *netdev) struct mlx5e_params *params = &priv->channels.params; xdp_features_t val; - if (params->packet_merge.type != MLX5E_PACKET_MERGE_NONE) { + if (!netdev->netdev_ops->ndo_bpf || + params->packet_merge.type != MLX5E_PACKET_MERGE_NONE) { xdp_clear_features_flag(netdev); return; } From e99c6873229fe0482e7ceb7d5600e32d623ed9d9 Mon Sep 17 00:00:00 2001 From: Moshe Shemesh Date: Thu, 7 Nov 2024 20:35:26 +0200 Subject: [PATCH 197/235] net/mlx5e: CT: Fix null-ptr-deref in add rule err flow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the error flow of mlx5_tc_ct_entry_add_rule(), in case the ct_rule_add() callback returns an error, zone_rule->attr is used uninitialized. Fix it to use attr, which has the needed pointer value. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000110 RIP: 0010:mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] … Call Trace: ? __die+0x20/0x70 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] ? mlx5_tc_ct_entry_add_rule+0x1d5/0x2f0 [mlx5_core] mlx5_tc_ct_block_flow_offload+0xc6a/0xf90 [mlx5_core] ? nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] flow_offload_work_handler+0x142/0x320 [nf_flow_table] ? finish_task_switch.isra.0+0x15b/0x2b0 process_one_work+0x16c/0x320 worker_thread+0x28c/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xb8/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Fixes: 7fac5c2eced3 ("net/mlx5: CT: Avoid reusing modify header context for natted entries") Signed-off-by: Moshe Shemesh Reviewed-by: Cosmin Ratiu Reviewed-by: Yevgeny Kliteynik Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c index dcfccaaa8d91..92d5cfec3dc0 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c @@ -866,7 +866,7 @@ mlx5_tc_ct_entry_add_rule(struct mlx5_tc_ct_priv *ct_priv, return 0; err_rule: - mlx5_tc_ct_entry_destroy_mod_hdr(ct_priv, zone_rule->attr, zone_rule->mh); + mlx5_tc_ct_entry_destroy_mod_hdr(ct_priv, attr, zone_rule->mh); mlx5_put_label_mapping(ct_priv, attr->ct_attr.ct_labels_id); err_mod_hdr: kfree(attr); From d1ac33934a66e8d58a52668999bf9e8f59e56c81 Mon Sep 17 00:00:00 2001 From: Carolina Jubran Date: Thu, 7 Nov 2024 20:35:27 +0200 Subject: [PATCH 198/235] net/mlx5e: Disable loopback self-test on multi-PF netdev In Multi-PF (Socket Direct) configurations, when a loopback packet is sent through one of the secondary devices, it will always be received on the primary device. This causes the loopback layer to fail in identifying the loopback packet as the devices are different. To avoid false test failures, disable the loopback self-test in Multi-PF configurations. Fixes: ed29705e4ed1 ("net/mlx5: Enable SD feature") Signed-off-by: Carolina Jubran Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/20241107183527.676877-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c index 5bf8318cc48b..1d60465cc2ca 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c @@ -36,6 +36,7 @@ #include "en.h" #include "en/port.h" #include "eswitch.h" +#include "lib/mlx5.h" static int mlx5e_test_health_info(struct mlx5e_priv *priv) { @@ -247,6 +248,9 @@ static int mlx5e_cond_loopback(struct mlx5e_priv *priv) if (is_mdev_switchdev_mode(priv->mdev)) return -EOPNOTSUPP; + if (mlx5_get_sd(priv->mdev)) + return -EOPNOTSUPP; + return 0; } From a6654a40a852a4ca18aacced4cf5ca87997818d7 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Tue, 12 Nov 2024 16:35:36 +0800 Subject: [PATCH 199/235] LoongArch: For all possible CPUs setup logical-physical CPU mapping In order to support ACPI-based physical CPU hotplug, cpu_logical_map() is supposed to work for all "possible" CPUs: some drivers want to use cpu_logical_map() for all "possible" CPUs, while currently we only set up the logical-physical mapping for "present" CPUs. This lack of mapping also means cpu_to_node() cannot work for hot-added CPUs. All "possible" CPUs are listed in the MADT, and the "present" subset is marked as ACPI_MADT_ENABLED. To set up the logical-physical CPU mapping for all possible CPUs and keep the present CPUs contiguous in cpu_present_mask, we parse the MADT twice: the first pass handles CPUs with ACPI_MADT_ENABLED and the second pass handles CPUs without ACPI_MADT_ENABLED. 
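Conceptually, the MADT is now walked twice with two handlers (sketch only; the real handlers are in the diff below):

	/* Pass 1: enabled (present) CPUs claim logical ids first, keeping
	 * cpu_present_mask contiguous; pass 2 maps the remaining possible
	 * but disabled CPUs.
	 */
	acpi_table_parse_madt(ACPI_MADT_TYPE_CORE_PIC,
			      acpi_parse_p1_processor, MAX_CORE_PIC);
	acpi_table_parse_madt(ACPI_MADT_TYPE_CORE_PIC,
			      acpi_parse_p2_processor, MAX_CORE_PIC);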
The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls cpu_number_map() rather than set_processor_mask() now. Reported-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kernel/acpi.c | 81 +++++++++++++++++++++++------------- arch/loongarch/kernel/smp.c | 3 +- 2 files changed, 55 insertions(+), 29 deletions(-) diff --git a/arch/loongarch/kernel/acpi.c b/arch/loongarch/kernel/acpi.c index f1a74b80f22c..382a09a7152c 100644 --- a/arch/loongarch/kernel/acpi.c +++ b/arch/loongarch/kernel/acpi.c @@ -58,48 +58,48 @@ void __iomem *acpi_os_ioremap(acpi_physical_address phys, acpi_size size) return ioremap_cache(phys, size); } -static int cpu_enumerated = 0; - #ifdef CONFIG_SMP -static int set_processor_mask(u32 id, u32 flags) +static int set_processor_mask(u32 id, u32 pass) { - int nr_cpus; - int cpu, cpuid = id; + int cpu = -1, cpuid = id; - if (!cpu_enumerated) - nr_cpus = NR_CPUS; - else - nr_cpus = nr_cpu_ids; - - if (num_processors >= nr_cpus) { + if (num_processors >= NR_CPUS) { pr_warn(PREFIX "nr_cpus limit of %i reached." - " processor 0x%x ignored.\n", nr_cpus, cpuid); + " processor 0x%x ignored.\n", NR_CPUS, cpuid); return -ENODEV; } + if (cpuid == loongson_sysconf.boot_cpu_id) cpu = 0; - else - cpu = find_first_zero_bit(cpumask_bits(cpu_present_mask), NR_CPUS); - if (!cpu_enumerated) - set_cpu_possible(cpu, true); - - if (flags & ACPI_MADT_ENABLED) { + switch (pass) { + case 1: /* Pass 1 handle enabled processors */ + if (cpu < 0) + cpu = find_first_zero_bit(cpumask_bits(cpu_present_mask), NR_CPUS); num_processors++; set_cpu_present(cpu, true); - __cpu_number_map[cpuid] = cpu; - __cpu_logical_map[cpu] = cpuid; - } else + break; + case 2: /* Pass 2 handle disabled processors */ + if (cpu < 0) + cpu = find_first_zero_bit(cpumask_bits(cpu_possible_mask), NR_CPUS); disabled_cpus++; + break; + default: + return cpu; + } + + set_cpu_possible(cpu, true); + __cpu_number_map[cpuid] = cpu; + __cpu_logical_map[cpu] = cpuid; return cpu; } #endif static int __init -acpi_parse_processor(union acpi_subtable_headers *header, const unsigned long end) +acpi_parse_p1_processor(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_core_pic *processor = NULL; @@ -110,12 +110,29 @@ acpi_parse_processor(union acpi_subtable_headers *header, const unsigned long en acpi_table_print_madt_entry(&header->common); #ifdef CONFIG_SMP acpi_core_pic[processor->core_id] = *processor; - set_processor_mask(processor->core_id, processor->flags); + if (processor->flags & ACPI_MADT_ENABLED) + set_processor_mask(processor->core_id, 1); #endif return 0; } +static int __init +acpi_parse_p2_processor(union acpi_subtable_headers *header, const unsigned long end) +{ + struct acpi_madt_core_pic *processor = NULL; + + processor = (struct acpi_madt_core_pic *)header; + if (BAD_MADT_ENTRY(processor, end)) + return -EINVAL; + +#ifdef CONFIG_SMP + if (!(processor->flags & ACPI_MADT_ENABLED)) + set_processor_mask(processor->core_id, 2); +#endif + + return 0; +} static int __init acpi_parse_eio_master(union acpi_subtable_headers *header, const unsigned long end) { @@ -143,12 +160,14 @@ static void __init acpi_process_madt(void) } #endif acpi_table_parse_madt(ACPI_MADT_TYPE_CORE_PIC, - acpi_parse_processor, MAX_CORE_PIC); + acpi_parse_p1_processor, MAX_CORE_PIC); + + acpi_table_parse_madt(ACPI_MADT_TYPE_CORE_PIC, + acpi_parse_p2_processor, MAX_CORE_PIC); acpi_table_parse_madt(ACPI_MADT_TYPE_EIO_PIC, acpi_parse_eio_master, MAX_IO_PICS); - cpu_enumerated = 1; loongson_sysconf.nr_cpus = 
num_processors; } @@ -310,6 +329,10 @@ static int __ref acpi_map_cpu2node(acpi_handle handle, int cpu, int physid) int nid; nid = acpi_get_node(handle); + + if (nid != NUMA_NO_NODE) + nid = early_cpu_to_node(cpu); + if (nid != NUMA_NO_NODE) { set_cpuid_to_node(physid, nid); node_set(nid, numa_nodes_parsed); @@ -324,12 +347,14 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id, int *pcpu { int cpu; - cpu = set_processor_mask(physid, ACPI_MADT_ENABLED); - if (cpu < 0) { + cpu = cpu_number_map(physid); + if (cpu < 0 || cpu >= nr_cpu_ids) { pr_info(PREFIX "Unable to map lapic to logical cpu number\n"); - return cpu; + return -ERANGE; } + num_processors++; + set_cpu_present(cpu, true); acpi_map_cpu2node(handle, cpu, physid); *pcpu = cpu; diff --git a/arch/loongarch/kernel/smp.c b/arch/loongarch/kernel/smp.c index 9afc2d8b3414..c0b498467ffa 100644 --- a/arch/loongarch/kernel/smp.c +++ b/arch/loongarch/kernel/smp.c @@ -331,11 +331,11 @@ void __init loongson_prepare_cpus(unsigned int max_cpus) int i = 0; parse_acpi_topology(); + cpu_data[0].global_id = cpu_logical_map(0); for (i = 0; i < loongson_sysconf.nr_cpus; i++) { set_cpu_present(i, true); csr_mail_send(0, __cpu_logical_map[i], 0); - cpu_data[i].global_id = __cpu_logical_map[i]; } per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; @@ -380,6 +380,7 @@ void loongson_init_secondary(void) cpu_logical_map(cpu) / loongson_sysconf.cores_per_package; cpu_data[cpu].core = pptt_enabled ? cpu_data[cpu].core : cpu_logical_map(cpu) % loongson_sysconf.cores_per_package; + cpu_data[cpu].global_id = cpu_logical_map(cpu); } void loongson_smp_finish(void) From 30cec747d6bf2c3e915c075d76d9712e54cde0a6 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Tue, 12 Nov 2024 16:35:36 +0800 Subject: [PATCH 200/235] LoongArch: Fix early_numa_add_cpu() usage for FDT systems early_numa_add_cpu() applies on physical CPU id rather than logical CPU id, so use cpuid instead of cpu. Cc: stable@vger.kernel.org Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0") Reported-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/loongarch/kernel/smp.c b/arch/loongarch/kernel/smp.c index c0b498467ffa..5d59e9ce2772 100644 --- a/arch/loongarch/kernel/smp.c +++ b/arch/loongarch/kernel/smp.c @@ -302,7 +302,7 @@ static void __init fdt_smp_setup(void) __cpu_number_map[cpuid] = cpu; __cpu_logical_map[cpu] = cpuid; - early_numa_add_cpu(cpu, 0); + early_numa_add_cpu(cpuid, 0); set_cpuid_to_node(cpuid, 0); } From c859900a841b0a6cd9a73d16426465e44cdde29c Mon Sep 17 00:00:00 2001 From: Yuli Wang Date: Tue, 12 Nov 2024 16:35:39 +0800 Subject: [PATCH 201/235] LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS This is a trivial cleanup, commit c62da0c35d58518d ("mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS") has unified default values of VM_DATA_DEFAULT_FLAGS across different platforms. Apply the same consistency to LoongArch. 
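For reference, the generic definition adopted here comes from include/linux/mm.h (added by the commit cited above; quoted here for convenience) and expands to the same flags the old open-coded version produced:

	#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)

	#define VM_DATA_FLAGS_TSK_EXEC	(VM_READ | VM_WRITE | TASK_EXEC | \
					 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)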
Suggested-by: Wentao Guan Signed-off-by: Yuli Wang Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/page.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h index e85df33f11c7..8f21567a3188 100644 --- a/arch/loongarch/include/asm/page.h +++ b/arch/loongarch/include/asm/page.h @@ -113,10 +113,7 @@ struct page *tlb_virt_to_page(unsigned long kaddr); extern int __virt_addr_valid(volatile void *kaddr); #define virt_addr_valid(kaddr) __virt_addr_valid((volatile void *)(kaddr)) -#define VM_DATA_DEFAULT_FLAGS \ - (VM_READ | VM_WRITE | \ - ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0) | \ - VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC #include #include From a410656643ce4844ba9875aa4e87a7779308259b Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Tue, 12 Nov 2024 16:35:39 +0800 Subject: [PATCH 202/235] LoongArch: Make KASAN work with 5-level page-tables Make KASAN work with 5-level page-tables, including: 1. Implement and use __pgd_none() and kasan_p4d_offset(). 2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict the loop conditions of kasan_p4d_populate() and kasan_pud_populate() to avoid unnecessary population. Cc: stable@vger.kernel.org Signed-off-by: Huacai Chen --- arch/loongarch/mm/kasan_init.c | 26 +++++++++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/arch/loongarch/mm/kasan_init.c b/arch/loongarch/mm/kasan_init.c index 427d6b1aec09..4a0d1880dd71 100644 --- a/arch/loongarch/mm/kasan_init.c +++ b/arch/loongarch/mm/kasan_init.c @@ -13,6 +13,13 @@ static pgd_t kasan_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE); +#ifdef __PAGETABLE_P4D_FOLDED +#define __pgd_none(early, pgd) (0) +#else +#define __pgd_none(early, pgd) (early ? (pgd_val(pgd) == 0) : \ +(__pa(pgd_val(pgd)) == (unsigned long)__pa(kasan_early_shadow_p4d))) +#endif + #ifdef __PAGETABLE_PUD_FOLDED #define __p4d_none(early, p4d) (0) #else @@ -142,6 +149,19 @@ static pud_t *__init kasan_pud_offset(p4d_t *p4dp, unsigned long addr, int node, return pud_offset(p4dp, addr); } +static p4d_t *__init kasan_p4d_offset(pgd_t *pgdp, unsigned long addr, int node, bool early) +{ + if (__pgd_none(early, pgdp_get(pgdp))) { + phys_addr_t p4d_phys = early ? 
+ __pa_symbol(kasan_early_shadow_p4d) : kasan_alloc_zeroed_page(node); + if (!early) + memcpy(__va(p4d_phys), kasan_early_shadow_p4d, sizeof(kasan_early_shadow_p4d)); + pgd_populate(&init_mm, pgdp, (p4d_t *)__va(p4d_phys)); + } + + return p4d_offset(pgdp, addr); +} + static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr, unsigned long end, int node, bool early) { @@ -178,19 +198,19 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr, do { next = pud_addr_end(addr, end); kasan_pmd_populate(pudp, addr, next, node, early); - } while (pudp++, addr = next, addr != end); + } while (pudp++, addr = next, addr != end && __pud_none(early, READ_ONCE(*pudp))); } static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr, unsigned long end, int node, bool early) { unsigned long next; - p4d_t *p4dp = p4d_offset(pgdp, addr); + p4d_t *p4dp = kasan_p4d_offset(pgdp, addr, node, early); do { next = p4d_addr_end(addr, end); kasan_pud_populate(p4dp, addr, next, node, early); - } while (p4dp++, addr = next, addr != end); + } while (p4dp++, addr = next, addr != end && __p4d_none(early, READ_ONCE(*p4dp))); } static void __init kasan_pgd_populate(unsigned long addr, unsigned long end, From 227ca9f6f6aeb8aa8f0c10430b955f1fe2aeab91 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Tue, 12 Nov 2024 16:35:39 +0800 Subject: [PATCH 203/235] LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks like a user space address. For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large for Loongson-2K series whose cpu_vabits = 39. Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <= 39, we just disable KASAN via early return in kasan_init(). Otherwise we get a boot failure. Moreover, we change KASAN_SHADOW_END from the first address after KASAN shadow area to the last address in KASAN shadow area, in order to avoid the end address exactly overflow to 0 (which is a legal case). We don't need to worry about alignment because pgd_addr_end() can handle it. Cc: stable@vger.kernel.org Reviewed-by: Jiaxun Yang Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/kasan.h | 2 +- arch/loongarch/mm/kasan_init.c | 15 +++++++++++++-- 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/arch/loongarch/include/asm/kasan.h b/arch/loongarch/include/asm/kasan.h index c6bce5fbff57..cb74a47f620e 100644 --- a/arch/loongarch/include/asm/kasan.h +++ b/arch/loongarch/include/asm/kasan.h @@ -51,7 +51,7 @@ /* KAsan shadow memory start right after vmalloc. 
*/ #define KASAN_SHADOW_START round_up(KFENCE_AREA_END, PGDIR_SIZE) #define KASAN_SHADOW_SIZE (XKVRANGE_VC_SHADOW_END - XKPRANGE_CC_KASAN_OFFSET) -#define KASAN_SHADOW_END round_up(KASAN_SHADOW_START + KASAN_SHADOW_SIZE, PGDIR_SIZE) +#define KASAN_SHADOW_END (round_up(KASAN_SHADOW_START + KASAN_SHADOW_SIZE, PGDIR_SIZE) - 1) #define XKPRANGE_CC_SHADOW_OFFSET (KASAN_SHADOW_START + XKPRANGE_CC_KASAN_OFFSET) #define XKPRANGE_UC_SHADOW_OFFSET (KASAN_SHADOW_START + XKPRANGE_UC_KASAN_OFFSET) diff --git a/arch/loongarch/mm/kasan_init.c b/arch/loongarch/mm/kasan_init.c index 4a0d1880dd71..7277b7583e1b 100644 --- a/arch/loongarch/mm/kasan_init.c +++ b/arch/loongarch/mm/kasan_init.c @@ -238,7 +238,7 @@ static void __init kasan_map_populate(unsigned long start, unsigned long end, asmlinkage void __init kasan_early_init(void) { BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_START, PGDIR_SIZE)); - BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE)); + BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END + 1, PGDIR_SIZE)); } static inline void kasan_set_pgd(pgd_t *pgdp, pgd_t pgdval) @@ -253,7 +253,7 @@ static void __init clear_pgds(unsigned long start, unsigned long end) * swapper_pg_dir. pgd_clear() can't be used * here because it's nop on 2,3-level pagetable setups */ - for (; start < end; start += PGDIR_SIZE) + for (; start < end; start = pgd_addr_end(start, end)) kasan_set_pgd((pgd_t *)pgd_offset_k(start), __pgd(0)); } @@ -262,6 +262,17 @@ void __init kasan_init(void) u64 i; phys_addr_t pa_start, pa_end; + /* + * If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will + * overflow UINTPTR_MAX and then looks like a user space address. + * For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too + * large for Loongson-2K series whose cpu_vabits = 39. + */ + if (KASAN_SHADOW_END < vm_map_base) { + pr_warn("PGDIR_SIZE too large for cpu_vabits, KernelAddressSanitizer disabled.\n"); + return; + } + /* * PGD was populated as invalid_pmd_table or invalid_pud_table * in pagetable_init() which depends on how many levels of page From 139d42ca51018c1d43ab5f35829179f060d1ab31 Mon Sep 17 00:00:00 2001 From: Kanglong Wang Date: Tue, 12 Nov 2024 16:35:39 +0800 Subject: [PATCH 204/235] LoongArch: Add WriteCombine shadow mapping in KASAN Currently, the kernel cannot boot when ARCH_IOREMAP, ARCH_WRITECOMBINE and KASAN are enabled together, because DMW2, which is configured as 0xa000000000000000 for WriteCombine, is now used by the kernel, but KASAN has no segment mapping for it. This patch fixes this issue. Solution: Add the relevant definitions for WriteCombine (DMW2) in KASAN. Cc: stable@vger.kernel.org Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()") Signed-off-by: Kanglong Wang Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/kasan.h | 11 ++++++++++- arch/loongarch/mm/kasan_init.c | 5 +++++ 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/loongarch/include/asm/kasan.h b/arch/loongarch/include/asm/kasan.h index cb74a47f620e..7f52bd31b9d4 100644 --- a/arch/loongarch/include/asm/kasan.h +++ b/arch/loongarch/include/asm/kasan.h @@ -25,6 +25,7 @@ /* 64-bit segment value. 
*/ #define XKPRANGE_UC_SEG (0x8000) #define XKPRANGE_CC_SEG (0x9000) +#define XKPRANGE_WC_SEG (0xa000) #define XKVRANGE_VC_SEG (0xffff) /* Cached */ @@ -41,10 +42,17 @@ #define XKPRANGE_UC_SHADOW_SIZE (XKPRANGE_UC_SIZE >> KASAN_SHADOW_SCALE_SHIFT) #define XKPRANGE_UC_SHADOW_END (XKPRANGE_UC_KASAN_OFFSET + XKPRANGE_UC_SHADOW_SIZE) +/* WriteCombine */ +#define XKPRANGE_WC_START WRITECOMBINE_BASE +#define XKPRANGE_WC_SIZE XRANGE_SIZE +#define XKPRANGE_WC_KASAN_OFFSET XKPRANGE_UC_SHADOW_END +#define XKPRANGE_WC_SHADOW_SIZE (XKPRANGE_WC_SIZE >> KASAN_SHADOW_SCALE_SHIFT) +#define XKPRANGE_WC_SHADOW_END (XKPRANGE_WC_KASAN_OFFSET + XKPRANGE_WC_SHADOW_SIZE) + /* VMALLOC (Cached or UnCached) */ #define XKVRANGE_VC_START MODULES_VADDR #define XKVRANGE_VC_SIZE round_up(KFENCE_AREA_END - MODULES_VADDR + 1, PGDIR_SIZE) -#define XKVRANGE_VC_KASAN_OFFSET XKPRANGE_UC_SHADOW_END +#define XKVRANGE_VC_KASAN_OFFSET XKPRANGE_WC_SHADOW_END #define XKVRANGE_VC_SHADOW_SIZE (XKVRANGE_VC_SIZE >> KASAN_SHADOW_SCALE_SHIFT) #define XKVRANGE_VC_SHADOW_END (XKVRANGE_VC_KASAN_OFFSET + XKVRANGE_VC_SHADOW_SIZE) @@ -55,6 +63,7 @@ #define XKPRANGE_CC_SHADOW_OFFSET (KASAN_SHADOW_START + XKPRANGE_CC_KASAN_OFFSET) #define XKPRANGE_UC_SHADOW_OFFSET (KASAN_SHADOW_START + XKPRANGE_UC_KASAN_OFFSET) +#define XKPRANGE_WC_SHADOW_OFFSET (KASAN_SHADOW_START + XKPRANGE_WC_KASAN_OFFSET) #define XKVRANGE_VC_SHADOW_OFFSET (KASAN_SHADOW_START + XKVRANGE_VC_KASAN_OFFSET) extern bool kasan_early_stage; diff --git a/arch/loongarch/mm/kasan_init.c b/arch/loongarch/mm/kasan_init.c index 7277b7583e1b..d2681272d8f0 100644 --- a/arch/loongarch/mm/kasan_init.c +++ b/arch/loongarch/mm/kasan_init.c @@ -62,6 +62,9 @@ void *kasan_mem_to_shadow(const void *addr) case XKPRANGE_UC_SEG: offset = XKPRANGE_UC_SHADOW_OFFSET; break; + case XKPRANGE_WC_SEG: + offset = XKPRANGE_WC_SHADOW_OFFSET; + break; case XKVRANGE_VC_SEG: offset = XKVRANGE_VC_SHADOW_OFFSET; break; @@ -86,6 +89,8 @@ const void *kasan_shadow_to_mem(const void *shadow_addr) if (addr >= XKVRANGE_VC_SHADOW_OFFSET) return (void *)(((addr - XKVRANGE_VC_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT) + XKVRANGE_VC_START); + else if (addr >= XKPRANGE_WC_SHADOW_OFFSET) + return (void *)(((addr - XKPRANGE_WC_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT) + XKPRANGE_WC_START); + else if (addr >= XKPRANGE_UC_SHADOW_OFFSET) return (void *)(((addr - XKPRANGE_UC_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT) + XKPRANGE_UC_START); else if (addr >= XKPRANGE_CC_SHADOW_OFFSET) From 6ce031e5d6f475d476bab55ab7d8ea168fedc4c1 Mon Sep 17 00:00:00 2001 From: Bibo Mao Date: Tue, 12 Nov 2024 16:35:39 +0800 Subject: [PATCH 205/235] LoongArch: Fix AP booting issue in VM mode Native IPI is used for AP booting, because it is the booting interface between the OS and the BIOS firmware. The paravirt IPI is only used inside the OS, and native IPI is necessary to boot the AP. When booting an AP, we write the kernel entry address into the AP's HW mailbox and send an IPI interrupt to it. The AP executes the idle instruction and waits for interrupts or SW events, then clears the IPI interrupt and jumps to the kernel entry from the HW mailbox. Between writing the HW mailbox and sending the IPI, the AP can be woken up by SW events and jump to the kernel entry, so the ACTION_BOOT_CPU IPI interrupt will remain pending during AP booting. And the native IPI interrupt handler needs to be registered so that it can clear the pending native IPI, else there will be endless interrupts during the AP booting stage. Here the native IPI interrupt is initialized even if the paravirt IPI is used. 
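The core of the fix, as a sketch (the full change is in the diff below): save the native smp_ops before overriding them, and route only ACTION_BOOT_CPU through the native path:

	static struct smp_ops native_ops;	/* copy of mp_ops saved in pv_ipi_init() */

	static void pv_send_ipi_single(int cpu, unsigned int action)
	{
		if (unlikely(action == ACTION_BOOT_CPU)) {
			/* AP booting must go through the HW mailbox path */
			native_ops.send_ipi_single(cpu, action);
			return;
		}
		/* ... paravirt fast path via hypercall ... */
	}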
Cc: stable@vger.kernel.org Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side") Signed-off-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kernel/paravirt.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index a5fc61f8b348..e5a39bbad078 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -51,11 +51,18 @@ static u64 paravt_steal_clock(int cpu) } #ifdef CONFIG_SMP +static struct smp_ops native_ops; + static void pv_send_ipi_single(int cpu, unsigned int action) { int min, old; irq_cpustat_t *info = &per_cpu(irq_stat, cpu); + if (unlikely(action == ACTION_BOOT_CPU)) { + native_ops.send_ipi_single(cpu, action); + return; + } + old = atomic_fetch_or(BIT(action), &info->message); if (old) return; @@ -75,6 +82,11 @@ static void pv_send_ipi_mask(const struct cpumask *mask, unsigned int action) if (cpumask_empty(mask)) return; + if (unlikely(action == ACTION_BOOT_CPU)) { + native_ops.send_ipi_mask(mask, action); + return; + } + action = BIT(action); for_each_cpu(i, mask) { info = &per_cpu(irq_stat, i); @@ -147,6 +159,8 @@ static void pv_init_ipi(void) { int r, swi; + /* Init native ipi irq for ACTION_BOOT_CPU */ + native_ops.init_ipi(); swi = get_percpu_irq(INT_SWI0); if (swi < 0) panic("SWI0 IRQ mapping failed\n"); @@ -193,6 +207,7 @@ int __init pv_ipi_init(void) return 0; #ifdef CONFIG_SMP + native_ops = mp_ops; mp_ops.init_ipi = pv_init_ipi; mp_ops.send_ipi_single = pv_send_ipi_single; mp_ops.send_ipi_mask = pv_send_ipi_mask; From 657d4282d8c4ac2349472529c9a6f20c503d1aee Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Mon, 11 Nov 2024 16:01:38 -0500 Subject: [PATCH 206/235] bcachefs: Fix journal_entry_dev_usage_to_text() overrun If the jset_entry_dev_usage is malformed, and too small, our nr_entries calculation will be incorrect - just bail out. Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/journal_io.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c index ccaafa90f4f4..fb35dd336331 100644 --- a/fs/bcachefs/journal_io.c +++ b/fs/bcachefs/journal_io.c @@ -708,6 +708,9 @@ static void journal_entry_dev_usage_to_text(struct printbuf *out, struct bch_fs container_of(entry, struct jset_entry_dev_usage, entry); unsigned i, nr_types = jset_entry_dev_usage_nr_types(u); + if (vstruct_bytes(entry) < sizeof(*u)) + return; + prt_printf(out, "dev=%u", le32_to_cpu(u->dev)); printbuf_indent_add(out, 2); From 840c2fbcc5cd33ba8fab180f09da0bb7f354ea71 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Mon, 11 Nov 2024 16:15:15 -0500 Subject: [PATCH 207/235] bcachefs: Fix assertion pop in bch2_ptr_swab() This runs on extents that haven't yet been validated, so we don't want to assert that we have a valid entry type. 
Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/extents.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/bcachefs/extents.c b/fs/bcachefs/extents.c index c4e91d123849..37e3d69bec06 100644 --- a/fs/bcachefs/extents.c +++ b/fs/bcachefs/extents.c @@ -1364,7 +1364,7 @@ void bch2_ptr_swab(struct bkey_s k) for (entry = ptrs.start; entry < ptrs.end; entry = extent_entry_next(entry)) { - switch (extent_entry_type(entry)) { + switch (__extent_entry_type(entry)) { case BCH_EXTENT_ENTRY_ptr: break; case BCH_EXTENT_ENTRY_crc32: @@ -1384,6 +1384,9 @@ void bch2_ptr_swab(struct bkey_s k) break; case BCH_EXTENT_ENTRY_rebalance: break; + default: + /* Bad entry type: will be caught by validate() */ + return; } } } From d7b0ff5a866724c3ad21f2628c22a63336deec3f Mon Sep 17 00:00:00 2001 From: Michal Luczaj Date: Thu, 7 Nov 2024 21:46:12 +0100 Subject: [PATCH 208/235] virtio/vsock: Fix accept_queue memory leak As the final stages of socket destruction may be delayed, it is possible that virtio_transport_recv_listen() will be called after the accept_queue has been flushed, but before the SOCK_DONE flag has been set. As a result, sockets enqueued after the flush would remain unremoved, leading to a memory leak. vsock_release __vsock_release lock virtio_transport_release virtio_transport_close schedule_delayed_work(close_work) sk_shutdown = SHUTDOWN_MASK (!) flush accept_queue release virtio_transport_recv_pkt vsock_find_bound_socket lock if flag(SOCK_DONE) return virtio_transport_recv_listen child = vsock_create_connected (!) vsock_enqueue_accept(child) release close_work lock virtio_transport_do_close set_flag(SOCK_DONE) virtio_transport_remove_sock vsock_remove_sock vsock_remove_bound release Introduce a sk_shutdown check to disallow vsock_enqueue_accept() during socket destruction. unreferenced object 0xffff888109e3f800 (size 2040): comm "kworker/5:2", pid 371, jiffies 4294940105 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 28 00 0b 40 00 00 00 00 00 00 00 00 00 00 00 00 (..@............ backtrace (crc 9e5f4e84): [] kmem_cache_alloc_noprof+0x2c1/0x360 [] sk_prot_alloc+0x30/0x120 [] sk_alloc+0x2c/0x4b0 [] __vsock_create.constprop.0+0x2a/0x310 [] virtio_transport_recv_pkt+0x4dc/0x9a0 [] vsock_loopback_work+0xfd/0x140 [] process_one_work+0x20c/0x570 [] worker_thread+0x1bf/0x3a0 [] kthread+0xdd/0x110 [] ret_from_fork+0x2d/0x50 [] ret_from_fork_asm+0x1a/0x30 Fixes: 3fe356d58efa ("vsock/virtio: discard packets only when socket is really closed") Reviewed-by: Stefano Garzarella Signed-off-by: Michal Luczaj Signed-off-by: Paolo Abeni --- net/vmw_vsock/virtio_transport_common.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index ccbd2bc0d210..cd075f608d4f 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1512,6 +1512,14 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb, return -ENOMEM; } + /* __vsock_release() might have already flushed accept_queue. + * Subsequent enqueues would lead to a memory leak. 
+ */ + if (sk->sk_shutdown == SHUTDOWN_MASK) { + virtio_transport_reset_no_sock(t, skb); + return -ESHUTDOWN; + } + child = vsock_create_connected(sk); if (!child) { virtio_transport_reset_no_sock(t, skb); From fbf7085b3ad1c7cc0677834c90f985f1b4f77a33 Mon Sep 17 00:00:00 2001 From: Michal Luczaj Date: Thu, 7 Nov 2024 21:46:13 +0100 Subject: [PATCH 209/235] vsock: Fix sk_error_queue memory leak Kernel queues MSG_ZEROCOPY completion notifications on the error queue. Where they remain, until explicitly recv()ed. To prevent memory leaks, clean up the queue when the socket is destroyed. unreferenced object 0xffff8881028beb00 (size 224): comm "vsock_test", pid 1218, jiffies 4294694897 hex dump (first 32 bytes): 90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff ..!.......!..... 00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff ..........!..... backtrace (crc 6c7031ca): [] kmem_cache_alloc_node_noprof+0x2f7/0x370 [] __alloc_skb+0x132/0x180 [] sock_omalloc+0x4b/0x80 [] msg_zerocopy_realloc+0x9e/0x240 [] virtio_transport_send_pkt_info+0x412/0x4c0 [] virtio_transport_stream_enqueue+0x43/0x50 [] vsock_connectible_sendmsg+0x373/0x450 [] ____sys_sendmsg+0x365/0x3a0 [] ___sys_sendmsg+0x84/0xd0 [] __sys_sendmsg+0x47/0x80 [] do_syscall_64+0x93/0x180 [] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Michal Luczaj Reviewed-by: Stefano Garzarella Acked-by: Arseniy Krasnov Signed-off-by: Paolo Abeni --- net/vmw_vsock/af_vsock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 35681adedd9a..dfd29160fe11 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk) { struct vsock_sock *vsk = vsock_sk(sk); + /* Flush MSG_ZEROCOPY leftovers. */ + __skb_queue_purge(&sk->sk_error_queue); + vsock_deassign_transport(vsk); /* When clearing these addresses, there's no need to set the family and From 60cf6206a1f513512f5d73fa4d3dbbcad2e7dcd6 Mon Sep 17 00:00:00 2001 From: Michal Luczaj Date: Thu, 7 Nov 2024 21:46:14 +0100 Subject: [PATCH 210/235] virtio/vsock: Improve MSG_ZEROCOPY error handling Add a missing kfree_skb() to prevent memory leaks. Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Reviewed-by: Stefano Garzarella Signed-off-by: Michal Luczaj Acked-by: Arseniy Krasnov Signed-off-by: Paolo Abeni --- net/vmw_vsock/virtio_transport_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index cd075f608d4f..e2e6a30b759b 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, if (virtio_transport_init_zcopy_skb(vsk, skb, info->msg, can_zcopy)) { + kfree_skb(skb); ret = -ENOMEM; break; } From 7967dc8f797f454d4f4acec15c7df0cdf4801617 Mon Sep 17 00:00:00 2001 From: Luiz Augusto von Dentz Date: Fri, 8 Nov 2024 11:19:54 -0500 Subject: [PATCH 211/235] Bluetooth: hci_core: Fix calling mgmt_device_connected Since 61a939c68ee0 ("Bluetooth: Queue incoming ACL data until BT_CONNECTED state is reached") there is no long the need to call mgmt_device_connected as ACL data will be queued until BT_CONNECTED state. 
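(A note on the two vsock MSG_ZEROCOPY fixes above: completion notifications are consumed from the error queue by userspace, so anything never reaped stays on sk_error_queue, which is what the purge in vsock_sk_destruct() now handles at destruction time. A minimal, assumed userspace sketch of the reaping side:)

	#include <sys/socket.h>

	/* Reap one MSG_ZEROCOPY completion from the error queue. */
	static void reap_zerocopy_completion(int fd)
	{
		char control[128];
		struct msghdr msg = { 0 };

		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
			return;
		/* Completions arrive as SO_EE_ORIGIN_ZEROCOPY cmsgs;
		 * walk CMSG_FIRSTHDR()/CMSG_NXTHDR() to read the range. */
	}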
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219458 Link: https://github.com/bluez/bluez/issues/1014 Fixes: 333b4fd11e89 ("Bluetooth: L2CAP: Fix uaf in l2cap_connect") Signed-off-by: Luiz Augusto von Dentz --- net/bluetooth/hci_core.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c index 96d097b21d13..0ac354db8177 100644 --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -3788,8 +3788,6 @@ static void hci_acldata_packet(struct hci_dev *hdev, struct sk_buff *skb) hci_dev_lock(hdev); conn = hci_conn_hash_lookup_handle(hdev, handle); - if (conn && hci_dev_test_flag(hdev, HCI_MGMT)) - mgmt_device_connected(hdev, conn, NULL, 0); hci_dev_unlock(hdev); if (conn) { From d5359a7f583ab9b7706915213b54deac065bcb81 Mon Sep 17 00:00:00 2001 From: Kiran K Date: Tue, 22 Oct 2024 14:41:34 +0530 Subject: [PATCH 212/235] Bluetooth: btintel: Direct exception event to bluetooth stack Have exception event part of HCI traces which helps for debug. snoop traces: > HCI Event: Vendor (0xff) plen 79 Vendor Prefix (0x8780) Intel Extended Telemetry (0x03) Unknown extended telemetry event type (0xde) 01 01 de Unknown extended subevent 0x07 01 01 de 07 01 de 06 1c ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 05 14 ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 43 10 ef be ad de ef be ad de ef be ad de ef be ad de Fixes: af395330abed ("Bluetooth: btintel: Add Intel devcoredump support") Signed-off-by: Kiran K Signed-off-by: Luiz Augusto von Dentz --- drivers/bluetooth/btintel.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/bluetooth/btintel.c b/drivers/bluetooth/btintel.c index 438b92967bc3..30a32ebbcc68 100644 --- a/drivers/bluetooth/btintel.c +++ b/drivers/bluetooth/btintel.c @@ -3288,13 +3288,12 @@ static int btintel_diagnostics(struct hci_dev *hdev, struct sk_buff *skb) case INTEL_TLV_TEST_EXCEPTION: /* Generate devcoredump from exception */ if (!hci_devcd_init(hdev, skb->len)) { - hci_devcd_append(hdev, skb); + hci_devcd_append(hdev, skb_clone(skb, GFP_ATOMIC)); hci_devcd_complete(hdev); } else { bt_dev_err(hdev, "Failed to generate devcoredump"); - kfree_skb(skb); } - return 0; + break; default: bt_dev_err(hdev, "Invalid exception type %02X", tlv->val[0]); } From 94efde1d15399f5c88e576923db9bcd422d217f2 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Mon, 4 Nov 2024 19:29:44 -0800 Subject: [PATCH 213/235] mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases commit 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()") created a new constraint on the pin_user_pages*() API family: a potentially large internal allocation must now occur, for FOLL_LONGTERM cases. A user-visible consequence has now appeared: user space can no longer pin more than 2GB of memory anymore on x86_64. That's because, on a 4KB PAGE_SIZE system, when user space tries to (indirectly, via a device driver that calls pin_user_pages()) pin 2GB, this requires an allocation of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for kmalloc(). In addition to the directly visible effect described above, there is also the problem of adding an unnecessary allocation. The **pages array argument has already been allocated, and there is no need for a redundant **folios array allocation in this case. Fix this by avoiding the new allocation entirely. This is done by referring to either the original page[i] within **pages, or to the associated folio. 
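Concretely, the helper pattern is (condensed from the diff below):

	struct pages_or_folios {
		union {
			struct page **pages;
			struct folio **folios;
			void **entries;
		};
		bool has_folios;
		long nr_entries;
	};

	static struct folio *pofs_get_folio(struct pages_or_folios *pofs, long i)
	{
		if (pofs->has_folios)
			return pofs->folios[i];
		return page_folio(pofs->pages[i]);
	}

Callers hand in whichever array they already have, and the migration code gets at the folio either way, with no intermediate allocation.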
Thanks to David Hildenbrand for suggesting this approach and for providing the initial implementation (which I've tested and adjusted slightly) as well. [jhubbard@nvidia.com: whitespace tweak, per David] Link: https://lkml.kernel.org/r/131cf9c8-ebc0-4cbb-b722-22fa8527bf3c@nvidia.com [jhubbard@nvidia.com: bypass pofs_get_folio(), per Oscar] Link: https://lkml.kernel.org/r/c1587c7f-9155-45be-bd62-1e36c0dd6923@nvidia.com Link: https://lkml.kernel.org/r/20241105032944.141488-2-jhubbard@nvidia.com Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()") Signed-off-by: John Hubbard Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Cc: Vivek Kasireddy Cc: Dave Airlie Cc: Gerd Hoffmann Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Cc: Arnd Bergmann Cc: Daniel Vetter Cc: Dongwon Kim Cc: Hugh Dickins Cc: Junxiao Chang Cc: Signed-off-by: Andrew Morton --- mm/gup.c | 116 ++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 77 insertions(+), 39 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 4637dab7b54f..ad0c8922dac3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2273,20 +2273,57 @@ struct page *get_dump_page(unsigned long addr) #endif /* CONFIG_ELF_CORE */ #ifdef CONFIG_MIGRATION + +/* + * An array of either pages or folios ("pofs"). Although it may seem tempting to + * avoid this complication, by simply interpreting a list of folios as a list of + * pages, that approach won't work in the longer term, because eventually the + * layouts of struct page and struct folio will become completely different. + * Furthermore, this pof approach avoids excessive page_folio() calls. + */ +struct pages_or_folios { + union { + struct page **pages; + struct folio **folios; + void **entries; + }; + bool has_folios; + long nr_entries; +}; + +static struct folio *pofs_get_folio(struct pages_or_folios *pofs, long i) +{ + if (pofs->has_folios) + return pofs->folios[i]; + return page_folio(pofs->pages[i]); +} + +static void pofs_clear_entry(struct pages_or_folios *pofs, long i) +{ + pofs->entries[i] = NULL; +} + +static void pofs_unpin(struct pages_or_folios *pofs) +{ + if (pofs->has_folios) + unpin_folios(pofs->folios, pofs->nr_entries); + else + unpin_user_pages(pofs->pages, pofs->nr_entries); +} + /* * Returns the number of collected folios. Return value is always >= 0. */ static unsigned long collect_longterm_unpinnable_folios( - struct list_head *movable_folio_list, - unsigned long nr_folios, - struct folio **folios) + struct list_head *movable_folio_list, + struct pages_or_folios *pofs) { unsigned long i, collected = 0; struct folio *prev_folio = NULL; bool drain_allow = true; - for (i = 0; i < nr_folios; i++) { - struct folio *folio = folios[i]; + for (i = 0; i < pofs->nr_entries; i++) { + struct folio *folio = pofs_get_folio(pofs, i); if (folio == prev_folio) continue; @@ -2327,16 +2364,15 @@ static unsigned long collect_longterm_unpinnable_folios( * Returns -EAGAIN if all folios were successfully migrated or -errno for * failure (or partial success). 
*/ -static int migrate_longterm_unpinnable_folios( - struct list_head *movable_folio_list, - unsigned long nr_folios, - struct folio **folios) +static int +migrate_longterm_unpinnable_folios(struct list_head *movable_folio_list, + struct pages_or_folios *pofs) { int ret; unsigned long i; - for (i = 0; i < nr_folios; i++) { - struct folio *folio = folios[i]; + for (i = 0; i < pofs->nr_entries; i++) { + struct folio *folio = pofs_get_folio(pofs, i); if (folio_is_device_coherent(folio)) { /* @@ -2344,7 +2380,7 @@ static int migrate_longterm_unpinnable_folios( * convert the pin on the source folio to a normal * reference. */ - folios[i] = NULL; + pofs_clear_entry(pofs, i); folio_get(folio); gup_put_folio(folio, 1, FOLL_PIN); @@ -2363,8 +2399,8 @@ static int migrate_longterm_unpinnable_folios( * calling folio_isolate_lru() which takes a reference so the * folio won't be freed if it's migrating. */ - unpin_folio(folios[i]); - folios[i] = NULL; + unpin_folio(folio); + pofs_clear_entry(pofs, i); } if (!list_empty(movable_folio_list)) { @@ -2387,12 +2423,26 @@ static int migrate_longterm_unpinnable_folios( return -EAGAIN; err: - unpin_folios(folios, nr_folios); + pofs_unpin(pofs); putback_movable_pages(movable_folio_list); return ret; } +static long +check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs) +{ + LIST_HEAD(movable_folio_list); + unsigned long collected; + + collected = collect_longterm_unpinnable_folios(&movable_folio_list, + pofs); + if (!collected) + return 0; + + return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs); +} + /* * Check whether all folios are *allowed* to be pinned indefinitely (long term). * Rather confusingly, all folios in the range are required to be pinned via @@ -2417,16 +2467,13 @@ static int migrate_longterm_unpinnable_folios( static long check_and_migrate_movable_folios(unsigned long nr_folios, struct folio **folios) { - unsigned long collected; - LIST_HEAD(movable_folio_list); + struct pages_or_folios pofs = { + .folios = folios, + .has_folios = true, + .nr_entries = nr_folios, + }; - collected = collect_longterm_unpinnable_folios(&movable_folio_list, - nr_folios, folios); - if (!collected) - return 0; - - return migrate_longterm_unpinnable_folios(&movable_folio_list, - nr_folios, folios); + return check_and_migrate_movable_pages_or_folios(&pofs); } /* @@ -2436,22 +2483,13 @@ static long check_and_migrate_movable_folios(unsigned long nr_folios, static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages) { - struct folio **folios; - long i, ret; + struct pages_or_folios pofs = { + .pages = pages, + .has_folios = false, + .nr_entries = nr_pages, + }; - folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL); - if (!folios) { - unpin_user_pages(pages, nr_pages); - return -ENOMEM; - } - - for (i = 0; i < nr_pages; i++) - folios[i] = page_folio(pages[i]); - - ret = check_and_migrate_movable_folios(nr_pages, folios); - - kfree(folios); - return ret; + return check_and_migrate_movable_pages_or_folios(&pofs); } #else static long check_and_migrate_movable_pages(unsigned long nr_pages, From a3477c9e02cc9d62a7c8bfc4e7453f5af9a175aa Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 10 Nov 2024 13:11:21 -0800 Subject: [PATCH 214/235] mm/thp: fix deferred split queue not partially_mapped: fix Though even more elusive than before, list_del corruption has still been seen on THP's deferred split queue. The idea in commit e66f3185fa04 was right, but its implementation wrong. 
The context omitted an important comment just before the critical test: "split_folio() removes folio from list on success." In ignoring that comment, when a THP split succeeded, the code went on to release the preceding safe folio, preserving instead an irrelevant (formerly head) folio: which gives no safety because it's not on the list. Fix the logic. Link: https://lkml.kernel.org/r/3c995a30-31ce-0998-1b9f-3a2cb9354c91@google.com Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped") Signed-off-by: Hugh Dickins Acked-by: Usama Arif Reviewed-by: Zi Yan Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: Johannes Weiner Cc: Kefeng Wang Cc: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Nhat Pham Cc: Ryan Roberts Cc: Shakeel Butt Cc: Wei Yang Cc: Yang Shi Signed-off-by: Andrew Morton --- mm/huge_memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 03fd4bc39ea1..5734d5d5060f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3790,7 +3790,9 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, * in the case it was underused, then consider it used and * don't add it back to split_queue. */ - if (!did_split && !folio_test_partially_mapped(folio)) { + if (did_split) { + ; /* folio already removed from list */ + } else if (!folio_test_partially_mapped(folio)) { list_del_init(&folio->_deferred_list); removed++; } else { From fae1980347bfd23325099b69db6638b94149a94c Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Sun, 10 Nov 2024 00:49:03 -0600 Subject: [PATCH 215/235] selftests: hugetlb_dio: fixup check for initial conditions to skip in the start This test verifies that a hugepage, used as a user buffer for DIO operations, is correctly freed upon unmapping. To test this, we read the count of free hugepages before and after the mmap, DIO, and munmap operations, then check if the free hugepage count is the same. Reading free hugepages before the test was removed by commit 0268d4579901 ('selftests: hugetlb_dio: check for initial conditions to skip at the start'), causing the test to always fail. This patch adds back reading the free hugepages before starting the test. With this patch, the tests are now passing. Test results without this patch: ./tools/testing/selftests/mm/hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 1 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 2 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 3 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 4 : Huge pages not freed! # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0 Test results with this patch: /tools/testing/selftests/mm/hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 1 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 2 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 3 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 4 : Huge pages freed successfully ! 
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 Link: https://lkml.kernel.org/r/20241110064903.23626-1-donettom@linux.ibm.com Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions to skip in the start") Signed-off-by: Donet Tom Cc: Muhammad Usama Anjum Cc: Shuah Khan Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/hugetlb_dio.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c index 60001c142ce9..432d5af15e66 100644 --- a/tools/testing/selftests/mm/hugetlb_dio.c +++ b/tools/testing/selftests/mm/hugetlb_dio.c @@ -44,6 +44,13 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) if (fd < 0) ksft_exit_fail_perror("Error opening file\n"); + /* Get the free huge pages before allocation */ + free_hpage_b = get_free_hugepages(); + if (free_hpage_b == 0) { + close(fd); + ksft_exit_skip("No free hugepage, exiting!\n"); + } + /* Allocate a hugetlb page */ orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0); if (orig_buffer == MAP_FAILED) { From 29ce8b8a4fa74e841342c8b8f8941848a3c6f29f Mon Sep 17 00:00:00 2001 From: Si-Wei Liu Date: Mon, 21 Oct 2024 16:40:39 +0300 Subject: [PATCH 216/235] vdpa/mlx5: Fix PA offset with unaligned starting iotlb map When calculating the physical address range based on the iotlb and mr [start,end) ranges, the offset of mr->start relative to map->start is not taken into account. This leads to some incorrect and duplicate mappings. For the case when mr->start < map->start the code is already correct: the range in [mr->start, map->start) was handled by a different iteration. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Cc: stable@vger.kernel.org Signed-off-by: Si-Wei Liu Signed-off-by: Dragos Tatulea Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- drivers/vdpa/mlx5/core/mr.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 2dd21e0b399e..7d0c83b5b071 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -373,7 +373,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr struct page *pg; unsigned int nsg; int sglen; - u64 pa; + u64 pa, offset; u64 paend; struct scatterlist *sg; struct device *dma = mvdev->vdev.dma_dev; @@ -396,8 +396,10 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr sg = mr->sg_head.sgl; for (map = vhost_iotlb_itree_first(iotlb, mr->start, mr->end - 1); map; map = vhost_iotlb_itree_next(map, mr->start, mr->end - 1)) { - paend = map->addr + maplen(map, mr); - for (pa = map->addr; pa < paend; pa += sglen) { + offset = mr->start > map->start ? mr->start - map->start : 0; + pa = map->addr + offset; + paend = map->addr + offset + maplen(map, mr); + for (; pa < paend; pa += sglen) { pg = pfn_to_page(__phys_to_pfn(pa)); if (!sg) { mlx5_vdpa_warn(mvdev, "sg null. 
start 0x%llx, end 0x%llx\n", From dcf32ea7ecede94796fb30231b3969d7c838374c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 7 Nov 2024 09:08:36 -0500 Subject: [PATCH 217/235] mm: swapfile: fix cluster reclaim work crash on rotational devices syzbot and Daan report a NULL pointer crash in the new full swap cluster reclaim work: > Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI > KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] > CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 > Workqueue: events swap_reclaim_work > RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49 > Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00 > RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202 > RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078 > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008 > RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000 > R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000 > FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > > __list_del_entry_valid include/linux/list.h:124 [inline] > __list_del_entry include/linux/list.h:215 [inline] > list_move_tail include/linux/list.h:310 [inline] > swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748 > swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779 The syzbot console output indicates a virtual environment where swapfile is on a rotational device. In this case, clusters aren't actually used, and si->full_clusters is not initialized. Daan's report is from qemu, so likely rotational too. Make sure to only schedule the cluster reclaim work when clusters are actually in use. Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/ Link: https://github.com/systemd/systemd/issues/35044 Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters") Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com Signed-off-by: Johannes Weiner Reported-by: Daan De Meyer Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 46bd4b1a3c07..9c85bd46ab7f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -929,7 +929,7 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, si->highest_bit = 0; del_from_avail_list(si); - if (vm_swap_full()) + if (si->cluster_info && vm_swap_full()) schedule_work(&si->reclaim_work); } } From 50d325bb05cef24a2105e40e7cace5e2b237236d Mon Sep 17 00:00:00 2001 From: Wander Lairson Costa Date: Wed, 6 Nov 2024 08:14:26 -0300 Subject: [PATCH 218/235] Revert "igb: Disable threaded IRQ for igb_msix_other" This reverts commit 338c4d3902feb5be49bfda530a72c7ab860e2c9f. 
Sebastian noticed the ISR indirectly acquires spin_locks, which are sleeping locks under PREEMPT_RT, which leads to kernel splats. Fixes: 338c4d3902feb ("igb: Disable threaded IRQ for igb_msix_other") Reported-by: Sebastian Andrzej Siewior Signed-off-by: Wander Lairson Costa Reviewed-by: Sebastian Andrzej Siewior Acked-by: Przemek Kitszel Link: https://patch.msgid.link/20241106111427.7272-1-wander@redhat.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index b83df5f94b1f..f1d088168723 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -907,7 +907,7 @@ static int igb_request_msix(struct igb_adapter *adapter) int i, err = 0, vector = 0, free_vector = 0; err = request_irq(adapter->msix_entries[vector].vector, - igb_msix_other, IRQF_NO_THREAD, netdev->name, adapter); + igb_msix_other, 0, netdev->name, adapter); if (err) goto err_out; From 2b99b2532593b5a4c7dc6bff2486e98d211a8596 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Mon, 11 Nov 2024 11:03:21 +0100 Subject: [PATCH 219/235] MAINTAINERS: Re-add cancelled Renesas driver sections MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Removing full driver sections also removed mailing list entries, causing submitters of future patches to forget CCing these mailing lists. Hence re-add the sections for the Renesas Ethernet AVB, R-Car SATA, and SuperH Ethernet drivers. Add people who volunteered to maintain these drivers (thanks a lot!), and mark all of them as supported. Fixes: 6e90b675cf942e50 ("MAINTAINERS: Remove some entries due to various compliance requirements.") Signed-off-by: Geert Uytterhoeven Acked-by: Greg Kroah-Hartman Reviewed-by: Simon Horman Acked-by: Niklas Cassel Acked-by: Niklas Söderlund Reviewed-by: Paul Barker Link: https://patch.msgid.link/4b2105332edca277f07ffa195796975e9ddce994.1731319098.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski --- MAINTAINERS | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 1d30b808d221..2bc7cc4769f5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19578,6 +19578,17 @@ S: Supported F: Documentation/devicetree/bindings/i2c/renesas,iic-emev2.yaml F: drivers/i2c/busses/i2c-emev2.c +RENESAS ETHERNET AVB DRIVER +M: Paul Barker +M: Niklas Söderlund +L: netdev@vger.kernel.org +L: linux-renesas-soc@vger.kernel.org +S: Supported +F: Documentation/devicetree/bindings/net/renesas,etheravb.yaml +F: drivers/net/ethernet/renesas/Kconfig +F: drivers/net/ethernet/renesas/Makefile +F: drivers/net/ethernet/renesas/ravb* + RENESAS ETHERNET SWITCH DRIVER R: Yoshihiro Shimoda L: netdev@vger.kernel.org @@ -19627,6 +19638,14 @@ F: Documentation/devicetree/bindings/i2c/renesas,rmobile-iic.yaml F: drivers/i2c/busses/i2c-rcar.c F: drivers/i2c/busses/i2c-sh_mobile.c +RENESAS R-CAR SATA DRIVER +M: Geert Uytterhoeven +L: linux-ide@vger.kernel.org +L: linux-renesas-soc@vger.kernel.org +S: Supported +F: Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml +F: drivers/ata/sata_rcar.c + RENESAS R-CAR THERMAL DRIVERS M: Niklas Söderlund L: linux-renesas-soc@vger.kernel.org @@ -19702,6 +19721,17 @@ S: Supported F: Documentation/devicetree/bindings/i2c/renesas,rzv2m.yaml F: drivers/i2c/busses/i2c-rzv2m.c +RENESAS SUPERH ETHERNET DRIVER +M: Niklas Söderlund +L: 
netdev@vger.kernel.org +L: linux-renesas-soc@vger.kernel.org +S: Supported +F: Documentation/devicetree/bindings/net/renesas,ether.yaml +F: drivers/net/ethernet/renesas/Kconfig +F: drivers/net/ethernet/renesas/Makefile +F: drivers/net/ethernet/renesas/sh_eth* +F: include/linux/sh_eth.h + RENESAS USB PHY DRIVER M: Yoshihiro Shimoda L: linux-renesas-soc@vger.kernel.org From 73af53d82076bbe184d9ece9e14b0dc8599e6055 Mon Sep 17 00:00:00 2001 From: Alexandre Ferrieux Date: Sun, 10 Nov 2024 18:28:36 +0100 Subject: [PATCH 220/235] net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes. To generate hnode handles (in gen_new_htid()), u32 uses IDR and encodes the returned small integer into a structured 32-bit word. Unfortunately, at disposal time, the needed decoding is not done. As a result, idr_remove() fails, and the IDR fills up. Since its size is 2048, the following script ends up with "Filter already exists": tc filter add dev myve $FILTER1 tc filter add dev myve $FILTER2 for i in {1..2048} do echo $i tc filter del dev myve $FILTER2 tc filter add dev myve $FILTER2 done This patch adds the missing decoding logic for handles that deserve it. Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles") Reviewed-by: Eric Dumazet Acked-by: Jamal Hadi Salim Signed-off-by: Alexandre Ferrieux Tested-by: Victor Nogueira Link: https://patch.msgid.link/20241110172836.331319-1-alexandre.ferrieux@orange.com Signed-off-by: Jakub Kicinski --- net/sched/cls_u32.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 9412d88a99bc..d3a03c57545b 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -92,6 +92,16 @@ struct tc_u_common { long knodes; }; +static u32 handle2id(u32 h) +{ + return ((h & 0x80000000) ? ((h >> 20) & 0x7FF) : h); +} + +static u32 id2handle(u32 id) +{ + return (id | 0x800U) << 20; +} + static inline unsigned int u32_hash_fold(__be32 key, const struct tc_u32_sel *sel, u8 fshift) @@ -310,7 +320,7 @@ static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr) int id = idr_alloc_cyclic(&tp_c->handle_idr, ptr, 1, 0x7FF, GFP_KERNEL); if (id < 0) return 0; - return (id | 0x800U) << 20; + return id2handle(id); } static struct hlist_head *tc_u_common_hash; @@ -360,7 +370,7 @@ static int u32_init(struct tcf_proto *tp) return -ENOBUFS; refcount_set(&root_ht->refcnt, 1); - root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000; + root_ht->handle = tp_c ? 
gen_new_htid(tp_c, root_ht) : id2handle(0); root_ht->prio = tp->prio; root_ht->is_root = true; idr_init(&root_ht->handle_idr); @@ -612,7 +622,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, if (phn == ht) { u32_clear_hw_hnode(tp, ht, extack); idr_destroy(&ht->handle_idr); - idr_remove(&tp_c->handle_idr, ht->handle); + idr_remove(&tp_c->handle_idr, handle2id(ht->handle)); RCU_INIT_POINTER(*hn, ht->next); kfree_rcu(ht, rcu); return 0; @@ -989,7 +999,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, err = u32_replace_hw_hnode(tp, ht, userflags, extack); if (err) { - idr_remove(&tp_c->handle_idr, handle); + idr_remove(&tp_c->handle_idr, handle2id(handle)); kfree(ht); return err; } From 27184f8905ba680f22abf1707fbed24036a67119 Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Wed, 13 Nov 2024 07:54:12 +0200 Subject: [PATCH 221/235] tpm: Opt-in in disable PCR integrity protection The initial HMAC session feature added TPM bus encryption and/or integrity protection to various in-kernel TPM operations. This can cause performance bottlenecks with IMA, as it heavily utilizes PCR extend operations. In order to mitigate this performance issue, introduce a kernel command-line parameter to the TPM driver for disabling the integrity protection for PCR extend operations (i.e. TPM2_PCR_Extend). Cc: James Bottomley Link: https://lore.kernel.org/linux-integrity/20241015193916.59964-1-zohar@linux.ibm.com/ Fixes: 6519fea6fd37 ("tpm: add hmac checks to tpm2_pcr_extend()") Tested-by: Mimi Zohar Co-developed-by: Roberto Sassu Signed-off-by: Roberto Sassu Co-developed-by: Mimi Zohar Signed-off-by: Mimi Zohar Signed-off-by: Jarkko Sakkinen --- .../admin-guide/kernel-parameters.txt | 9 ++++ drivers/char/tpm/tpm-buf.c | 20 ++++++++ drivers/char/tpm/tpm2-cmd.c | 30 ++++++++--- drivers/char/tpm/tpm2-sessions.c | 51 ++++++++++--------- include/linux/tpm.h | 3 ++ 5 files changed, 82 insertions(+), 31 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1666576acc0e..d401577b5a6a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6727,6 +6727,15 @@ torture.verbose_sleep_duration= [KNL] Duration of each verbose-printk() sleep in jiffies. + tpm.disable_pcr_integrity= [HW,TPM] + Do not protect PCR registers from unintended physical + access, or interposers in the bus by the means of + having an integrity protected session wrapped around + TPM2_PCR_Extend command. Consider this in a situation + where TPM is heavily utilized by IMA, thus protection + causing a major performance hit, and the space where + machines are deployed is by other means guarded. + tpm_suspend_pcr=[HW,TPM] Format: integer pcr id Specify that at suspend time, the tpm driver diff --git a/drivers/char/tpm/tpm-buf.c b/drivers/char/tpm/tpm-buf.c index cad0048bcc3c..e49a19fea3bd 100644 --- a/drivers/char/tpm/tpm-buf.c +++ b/drivers/char/tpm/tpm-buf.c @@ -146,6 +146,26 @@ void tpm_buf_append_u32(struct tpm_buf *buf, const u32 value) } EXPORT_SYMBOL_GPL(tpm_buf_append_u32); +/** + * tpm_buf_append_handle() - Add a handle + * @chip: &tpm_chip instance + * @buf: &tpm_buf instance + * @handle: a TPM object handle + * + * Add a handle to the buffer, and increase the count tracking the number of + * handles in the command buffer. Works only for command buffers. 
+ */ +void tpm_buf_append_handle(struct tpm_chip *chip, struct tpm_buf *buf, u32 handle) +{ + if (buf->flags & TPM_BUF_TPM2B) { + dev_err(&chip->dev, "Invalid buffer type (TPM2B)\n"); + return; + } + + tpm_buf_append_u32(buf, handle); + buf->handles++; +} + /** * tpm_buf_read() - Read from a TPM buffer * @buf: &tpm_buf instance diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c index 1e856259219e..dfdcbd009720 100644 --- a/drivers/char/tpm/tpm2-cmd.c +++ b/drivers/char/tpm/tpm2-cmd.c @@ -14,6 +14,10 @@ #include "tpm.h" #include +static bool disable_pcr_integrity; +module_param(disable_pcr_integrity, bool, 0444); +MODULE_PARM_DESC(disable_pcr_integrity, "Disable integrity protection of TPM2_PCR_Extend"); + static struct tpm2_hash tpm2_hash_map[] = { {HASH_ALGO_SHA1, TPM_ALG_SHA1}, {HASH_ALGO_SHA256, TPM_ALG_SHA256}, @@ -232,18 +236,26 @@ int tpm2_pcr_extend(struct tpm_chip *chip, u32 pcr_idx, int rc; int i; - rc = tpm2_start_auth_session(chip); - if (rc) - return rc; + if (!disable_pcr_integrity) { + rc = tpm2_start_auth_session(chip); + if (rc) + return rc; + } rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_PCR_EXTEND); if (rc) { - tpm2_end_auth_session(chip); + if (!disable_pcr_integrity) + tpm2_end_auth_session(chip); return rc; } - tpm_buf_append_name(chip, &buf, pcr_idx, NULL); - tpm_buf_append_hmac_session(chip, &buf, 0, NULL, 0); + if (!disable_pcr_integrity) { + tpm_buf_append_name(chip, &buf, pcr_idx, NULL); + tpm_buf_append_hmac_session(chip, &buf, 0, NULL, 0); + } else { + tpm_buf_append_handle(chip, &buf, pcr_idx); + tpm_buf_append_auth(chip, &buf, 0, NULL, 0); + } tpm_buf_append_u32(&buf, chip->nr_allocated_banks); @@ -253,9 +265,11 @@ int tpm2_pcr_extend(struct tpm_chip *chip, u32 pcr_idx, chip->allocated_banks[i].digest_size); } - tpm_buf_fill_hmac_session(chip, &buf); + if (!disable_pcr_integrity) + tpm_buf_fill_hmac_session(chip, &buf); rc = tpm_transmit_cmd(chip, &buf, 0, "attempting extend a PCR value"); - rc = tpm_buf_check_hmac_response(chip, &buf, rc); + if (!disable_pcr_integrity) + rc = tpm_buf_check_hmac_response(chip, &buf, rc); tpm_buf_destroy(&buf); diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c index 0739830904b2..52d3523042fb 100644 --- a/drivers/char/tpm/tpm2-sessions.c +++ b/drivers/char/tpm/tpm2-sessions.c @@ -237,9 +237,7 @@ void tpm_buf_append_name(struct tpm_chip *chip, struct tpm_buf *buf, #endif if (!tpm2_chip_auth(chip)) { - tpm_buf_append_u32(buf, handle); - /* count the number of handles in the upper bits of flags */ - buf->handles++; + tpm_buf_append_handle(chip, buf, handle); return; } @@ -272,6 +270,31 @@ void tpm_buf_append_name(struct tpm_chip *chip, struct tpm_buf *buf, } EXPORT_SYMBOL_GPL(tpm_buf_append_name); +void tpm_buf_append_auth(struct tpm_chip *chip, struct tpm_buf *buf, + u8 attributes, u8 *passphrase, int passphrase_len) +{ + /* offset tells us where the sessions area begins */ + int offset = buf->handles * 4 + TPM_HEADER_SIZE; + u32 len = 9 + passphrase_len; + + if (tpm_buf_length(buf) != offset) { + /* not the first session so update the existing length */ + len += get_unaligned_be32(&buf->data[offset]); + put_unaligned_be32(len, &buf->data[offset]); + } else { + tpm_buf_append_u32(buf, len); + } + /* auth handle */ + tpm_buf_append_u32(buf, TPM2_RS_PW); + /* nonce */ + tpm_buf_append_u16(buf, 0); + /* attributes */ + tpm_buf_append_u8(buf, 0); + /* passphrase */ + tpm_buf_append_u16(buf, passphrase_len); + tpm_buf_append(buf, passphrase, passphrase_len); +} + /** * 
tpm_buf_append_hmac_session() - Append a TPM session element * @chip: the TPM chip structure @@ -309,26 +332,8 @@ void tpm_buf_append_hmac_session(struct tpm_chip *chip, struct tpm_buf *buf, #endif if (!tpm2_chip_auth(chip)) { - /* offset tells us where the sessions area begins */ - int offset = buf->handles * 4 + TPM_HEADER_SIZE; - u32 len = 9 + passphrase_len; - - if (tpm_buf_length(buf) != offset) { - /* not the first session so update the existing length */ - len += get_unaligned_be32(&buf->data[offset]); - put_unaligned_be32(len, &buf->data[offset]); - } else { - tpm_buf_append_u32(buf, len); - } - /* auth handle */ - tpm_buf_append_u32(buf, TPM2_RS_PW); - /* nonce */ - tpm_buf_append_u16(buf, 0); - /* attributes */ - tpm_buf_append_u8(buf, 0); - /* passphrase */ - tpm_buf_append_u16(buf, passphrase_len); - tpm_buf_append(buf, passphrase, passphrase_len); + tpm_buf_append_auth(chip, buf, attributes, passphrase, + passphrase_len); return; } diff --git a/include/linux/tpm.h b/include/linux/tpm.h index 587b96b4418e..20a40ade8030 100644 --- a/include/linux/tpm.h +++ b/include/linux/tpm.h @@ -421,6 +421,7 @@ void tpm_buf_append_u32(struct tpm_buf *buf, const u32 value); u8 tpm_buf_read_u8(struct tpm_buf *buf, off_t *offset); u16 tpm_buf_read_u16(struct tpm_buf *buf, off_t *offset); u32 tpm_buf_read_u32(struct tpm_buf *buf, off_t *offset); +void tpm_buf_append_handle(struct tpm_chip *chip, struct tpm_buf *buf, u32 handle); /* * Check if TPM device is in the firmware upgrade mode. @@ -505,6 +506,8 @@ void tpm_buf_append_name(struct tpm_chip *chip, struct tpm_buf *buf, void tpm_buf_append_hmac_session(struct tpm_chip *chip, struct tpm_buf *buf, u8 attributes, u8 *passphrase, int passphraselen); +void tpm_buf_append_auth(struct tpm_chip *chip, struct tpm_buf *buf, + u8 attributes, u8 *passphrase, int passphraselen); static inline void tpm_buf_append_hmac_session_opt(struct tpm_chip *chip, struct tpm_buf *buf, u8 attributes, From 423893fcbe7e9adc875bce4e55b9b25fc1424977 Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Wed, 13 Nov 2024 20:35:39 +0200 Subject: [PATCH 222/235] tpm: Disable TPM on tpm2_create_primary() failure The earlier bug fix misplaced the error-label when dealing with the tpm2_create_primary() return value, which the original completely ignored. Cc: stable@vger.kernel.org Reported-by: Christoph Anton Mitterer Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1087331 Fixes: cc7d8594342a ("tpm: Rollback tpm2_load_null()") Signed-off-by: Jarkko Sakkinen --- drivers/char/tpm/tpm2-sessions.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c index 52d3523042fb..b0f13c8ea79c 100644 --- a/drivers/char/tpm/tpm2-sessions.c +++ b/drivers/char/tpm/tpm2-sessions.c @@ -953,10 +953,13 @@ static int tpm2_load_null(struct tpm_chip *chip, u32 *null_key) /* Deduce from the name change TPM interference: */ dev_err(&chip->dev, "null key integrity check failed\n"); tpm2_flush_context(chip, tmp_null_key); - chip->flags |= TPM_CHIP_FLAG_DISABLE; err: - return rc ? 
-ENODEV : 0; + if (rc) { + chip->flags |= TPM_CHIP_FLAG_DISABLE; + rc = -ENODEV; + } + return rc; } /** From e0266319413d5d687ba7b6df7ca99e4b9724a4f2 Mon Sep 17 00:00:00 2001 From: Geliang Tang Date: Tue, 12 Nov 2024 20:18:33 +0100 Subject: [PATCH 223/235] mptcp: update local address flags when setting it Just like in-kernel pm, when userspace pm does set_flags, it needs to send out MP_PRIO signal, and also modify the flags of the corresponding address entry in the local address list. This patch implements the missing logic. Traverse all address entries on userspace_pm_local_addr_list to find the local address entry, if bkup is true, set the flags of this entry with FLAG_BACKUP, otherwise, clear FLAG_BACKUP. Fixes: 892f396c8e68 ("mptcp: netlink: issue MP_PRIO signals from userspace PMs") Cc: stable@vger.kernel.org Signed-off-by: Geliang Tang Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-1-b835580cefa8@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/pm_userspace.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/net/mptcp/pm_userspace.c b/net/mptcp/pm_userspace.c index 56dfea9862b7..3f888bfe1462 100644 --- a/net/mptcp/pm_userspace.c +++ b/net/mptcp/pm_userspace.c @@ -560,6 +560,7 @@ int mptcp_userspace_pm_set_flags(struct sk_buff *skb, struct genl_info *info) struct nlattr *token = info->attrs[MPTCP_PM_ATTR_TOKEN]; struct nlattr *attr = info->attrs[MPTCP_PM_ATTR_ADDR]; struct net *net = sock_net(skb->sk); + struct mptcp_pm_addr_entry *entry; struct mptcp_sock *msk; int ret = -EINVAL; struct sock *sk; @@ -601,6 +602,17 @@ int mptcp_userspace_pm_set_flags(struct sk_buff *skb, struct genl_info *info) if (loc.flags & MPTCP_PM_ADDR_FLAG_BACKUP) bkup = 1; + spin_lock_bh(&msk->pm.lock); + list_for_each_entry(entry, &msk->pm.userspace_pm_local_addr_list, list) { + if (mptcp_addresses_equal(&entry->addr, &loc.addr, false)) { + if (bkup) + entry->flags |= MPTCP_PM_ADDR_FLAG_BACKUP; + else + entry->flags &= ~MPTCP_PM_ADDR_FLAG_BACKUP; + } + } + spin_unlock_bh(&msk->pm.lock); + lock_sock(sk); ret = mptcp_pm_nl_mp_prio_send_ack(msk, &loc.addr, &rem.addr, bkup); release_sock(sk); From f642c5c4d528d11bd78b6c6f84f541cd3c0bea86 Mon Sep 17 00:00:00 2001 From: Geliang Tang Date: Tue, 12 Nov 2024 20:18:34 +0100 Subject: [PATCH 224/235] mptcp: hold pm lock when deleting entry When traversing userspace_pm_local_addr_list and deleting an entry from it in mptcp_pm_nl_remove_doit(), msk->pm.lock should be held. This patch holds this lock before mptcp_userspace_pm_lookup_addr_by_id() and releases it after list_move() in mptcp_pm_nl_remove_doit(). 
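In sketch form, condensed from the diff below:

	spin_lock_bh(&msk->pm.lock);
	match = mptcp_userspace_pm_lookup_addr_by_id(msk, id_val);
	if (match)
		list_move(&match->list, &free_list);
	spin_unlock_bh(&msk->pm.lock);

so both the lookup (which walks the list) and the removal happen under the same lock.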
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE") Cc: stable@vger.kernel.org Signed-off-by: Geliang Tang Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-2-b835580cefa8@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/pm_userspace.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/mptcp/pm_userspace.c b/net/mptcp/pm_userspace.c index 3f888bfe1462..e35178f5205f 100644 --- a/net/mptcp/pm_userspace.c +++ b/net/mptcp/pm_userspace.c @@ -308,14 +308,17 @@ int mptcp_pm_nl_remove_doit(struct sk_buff *skb, struct genl_info *info) lock_sock(sk); + spin_lock_bh(&msk->pm.lock); match = mptcp_userspace_pm_lookup_addr_by_id(msk, id_val); if (!match) { GENL_SET_ERR_MSG(info, "address with specified id not found"); + spin_unlock_bh(&msk->pm.lock); release_sock(sk); goto out; } list_move(&match->list, &free_list); + spin_unlock_bh(&msk->pm.lock); mptcp_pm_remove_addrs(msk, &free_list); From db3eab8110bc0520416101b6a5b52f44a43fb4cf Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Tue, 12 Nov 2024 20:18:35 +0100 Subject: [PATCH 225/235] mptcp: pm: use _rcu variant under rcu_read_lock In mptcp_pm_create_subflow_or_signal_addr(), rcu_read_(un)lock() are used as expected to iterate over the list of local addresses, but list_for_each_entry() was used instead of list_for_each_entry_rcu() in __lookup_addr(). It is important to use this variant which adds the required READ_ONCE() (and diagnostic checks if enabled). Because __lookup_addr() is also used in mptcp_pm_nl_set_flags() where it is called under the pernet->lock and not rcu_read_lock(), an extra condition is then passed to help the diagnostic checks making sure either the associated spin lock or the RCU lock is held. Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk") Cc: stable@vger.kernel.org Reviewed-by: Geliang Tang Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-3-b835580cefa8@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/pm_netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c index db586a5b3866..45a2b5f05d38 100644 --- a/net/mptcp/pm_netlink.c +++ b/net/mptcp/pm_netlink.c @@ -524,7 +524,8 @@ __lookup_addr(struct pm_nl_pernet *pernet, const struct mptcp_addr_info *info) { struct mptcp_pm_addr_entry *entry; - list_for_each_entry(entry, &pernet->local_addr_list, list) { + list_for_each_entry_rcu(entry, &pernet->local_addr_list, list, + lockdep_is_held(&pernet->lock)) { if (mptcp_addresses_equal(&entry->addr, info, entry->addr.port)) return entry; } From 671154f174e0e7f242507cd074497661deb41bfd Mon Sep 17 00:00:00 2001 From: "Russell King (Oracle)" Date: Tue, 12 Nov 2024 16:20:00 +0000 Subject: [PATCH 226/235] net: phylink: ensure PHY momentary link-fails are handled Normally, phylib won't notify changes in quick succession. However, as a result of commit 3e43b903da04 ("net: phy: Immediately call adjust_link if only tx_lpi_enabled changes") this is no longer true - it is now possible that phy_link_down() and phy_link_up() will both complete before phylink's resolver has run, which means it'll miss that pl->phy_state.link momentarily became false. 
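The problematic interleaving, in simplified sketch form:

	/*
	 *   phylink_phy_change(up = false)   phy_state.link = false
	 *   phylink_phy_change(up = true)    phy_state.link = true
	 *   phylink_resolve()                sees link == true only;
	 *                                    the down event is lost
	 *
	 * The fix latches the down event so the resolver acts on it:
	 */
	pl->phy_state.link = up;
	if (!up)
		pl->link_failed = true;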
Rename "mac_link_dropped" to be more generic "link_failed" since it will cover more than the MAC/PCS end of the link failing, and arrange to set this in phylink_phy_change() if we notice that the PHY reports that the link is down. This will ensure that we capture an EEE reconfiguration event. Fixes: 3e43b903da04 ("net: phy: Immediately call adjust_link if only tx_lpi_enabled changes") Signed-off-by: Russell King (Oracle) Reviewed-by: Oleksij Rempel Link: https://patch.msgid.link/E1tAtcW-002RBS-LB@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski --- drivers/net/phy/phylink.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 4309317de3d1..3e9957b6aa14 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -78,7 +78,7 @@ struct phylink { unsigned int pcs_neg_mode; unsigned int pcs_state; - bool mac_link_dropped; + bool link_failed; bool using_mac_select_pcs; struct sfp_bus *sfp_bus; @@ -1475,9 +1475,9 @@ static void phylink_resolve(struct work_struct *w) cur_link_state = pl->old_link_state; if (pl->phylink_disable_state) { - pl->mac_link_dropped = false; + pl->link_failed = false; link_state.link = false; - } else if (pl->mac_link_dropped) { + } else if (pl->link_failed) { link_state.link = false; retrigger = true; } else { @@ -1572,7 +1572,7 @@ static void phylink_resolve(struct work_struct *w) phylink_link_up(pl, link_state); } if (!link_state.link && retrigger) { - pl->mac_link_dropped = false; + pl->link_failed = false; queue_work(system_power_efficient_wq, &pl->resolve); } mutex_unlock(&pl->state_mutex); @@ -1835,6 +1835,8 @@ static void phylink_phy_change(struct phy_device *phydev, bool up) pl->phy_state.pause |= MLO_PAUSE_RX; pl->phy_state.interface = phydev->interface; pl->phy_state.link = up; + if (!up) + pl->link_failed = true; mutex_unlock(&pl->state_mutex); phylink_run_resolve(pl); @@ -2158,7 +2160,7 @@ EXPORT_SYMBOL_GPL(phylink_disconnect_phy); static void phylink_link_changed(struct phylink *pl, bool up, const char *what) { if (!up) - pl->mac_link_dropped = true; + pl->link_failed = true; phylink_run_resolve(pl); phylink_dbg(pl, "%s link %s\n", what, up ? "up" : "down"); } @@ -2792,7 +2794,7 @@ int phylink_ethtool_set_pauseparam(struct phylink *pl, * link will cycle. */ if (manual_changed) { - pl->mac_link_dropped = true; + pl->link_failed = true; phylink_run_resolve(pl); } From 3342dc8b4623d835e7dd76a15cec2e5a94fe2f93 Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Tue, 12 Nov 2024 11:03:47 +0800 Subject: [PATCH 227/235] samples: pktgen: correct dev to DEV In the pktgen_sample01_simple.sh script, the device variable is uppercase 'DEV' instead of lowercase 'dev'. Because of this typo, the script cannot enable UDP tx checksum. Fixes: 460a9aa23de6 ("samples: pktgen: add UDP tx checksum support") Signed-off-by: Wei Fang Reviewed-by: Simon Horman Acked-by: Jesper Dangaard Brouer Link: https://patch.msgid.link/20241112030347.1849335-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- samples/pktgen/pktgen_sample01_simple.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/pktgen/pktgen_sample01_simple.sh b/samples/pktgen/pktgen_sample01_simple.sh index cdb9f497f87d..66cb707479e6 100755 --- a/samples/pktgen/pktgen_sample01_simple.sh +++ b/samples/pktgen/pktgen_sample01_simple.sh @@ -76,7 +76,7 @@ if [ -n "$DST_PORT" ]; then pg_set $DEV "udp_dst_max $UDP_DST_MAX" fi -[ ! -z "$UDP_CSUM" ] && pg_set $dev "flag UDPCSUM" +[ ! 
-z "$UDP_CSUM" ] && pg_set $DEV "flag UDPCSUM" # Setup random UDP port src range pg_set $DEV "flag UDPSRC_RND" From e28acc9c1ccfcb24c08e020828f69d0a915b06ae Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Fri, 8 Nov 2024 06:08:36 -0800 Subject: [PATCH 228/235] ipmr: Fix access to mfc_cache_list without lock held Accessing `mr_table->mfc_cache_list` is protected by an RCU lock. In the following code flow, the RCU read lock is not held, causing the following error when `RCU_PROVE` is not held. The same problem might show up in the IPv6 code path. 6.12.0-rc5-kbuilder-01145-gbac17284bdcb #33 Tainted: G E N ----------------------------- net/ipv4/ipmr_base.c:313 RCU-list traversed in non-reader section!! rcu_scheduler_active = 2, debug_locks = 1 2 locks held by RetransmitAggre/3519: #0: ffff88816188c6c0 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x8a/0x290 #1: ffffffff83fcf7a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_dumpit+0x6b/0x90 stack backtrace: lockdep_rcu_suspicious mr_table_dump ipmr_rtm_dumproute rtnl_dump_all rtnl_dumpit netlink_dump __netlink_dump_start rtnetlink_rcv_msg netlink_rcv_skb netlink_unicast netlink_sendmsg This is not a problem per see, since the RTNL lock is held here, so, it is safe to iterate in the list without the RCU read lock, as suggested by Eric. To alleviate the concern, modify the code to use list_for_each_entry_rcu() with the RTNL-held argument. The annotation will raise an error only if RTNL or RCU read lock are missing during iteration, signaling a legitimate problem, otherwise it will avoid this false positive. This will solve the IPv6 case as well, since ip6mr_rtm_dumproute() calls this function as well. Signed-off-by: Breno Leitao Reviewed-by: David Ahern Link: https://patch.msgid.link/20241108-ipmr_rcu-v2-1-c718998e209b@debian.org Signed-off-by: Jakub Kicinski --- net/ipv4/ipmr_base.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c index 271dc03fc6db..f0af12a2f70b 100644 --- a/net/ipv4/ipmr_base.c +++ b/net/ipv4/ipmr_base.c @@ -310,7 +310,8 @@ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, if (filter->filter_set) flags |= NLM_F_DUMP_FILTERED; - list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) { + list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list, + lockdep_rtnl_is_held()) { if (e < s_e) goto next_entry; if (filter->dev && From a03b18a71c128846360cc81ac6fdb0e7d41597b4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?N=C3=ADcolas=20F=2E=20R=2E=20A=2E=20Prado?= Date: Sat, 9 Nov 2024 10:16:32 -0500 Subject: [PATCH 229/235] net: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The mediatek,mac-wol property is being handled backwards to what is described in the binding: it currently enables PHY WOL when the property is present and vice versa. Invert the driver logic so it matches the binding description. Fixes: fd1d62d80ebc ("net: stmmac: replace the use_phy_wol field with a flag") Signed-off-by: Nícolas F. R. A. 
Prado Link: https://patch.msgid.link/20241109-mediatek-mac-wol-noninverted-v2-1-0e264e213878@collabora.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c index 2a9132d6d743..001857c294fb 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c @@ -589,9 +589,9 @@ static int mediatek_dwmac_common_data(struct platform_device *pdev, plat->mac_interface = priv_plat->phy_mode; if (priv_plat->mac_wol) - plat->flags |= STMMAC_FLAG_USE_PHY_WOL; - else plat->flags &= ~STMMAC_FLAG_USE_PHY_WOL; + else + plat->flags |= STMMAC_FLAG_USE_PHY_WOL; plat->riwt_off = 1; plat->maxmtu = ETH_DATA_LEN; plat->host_dma_width = priv_plat->variant->dma_bit_mask; From eb94b7bb10109a14a5431a67e5d8e31cfa06b395 Mon Sep 17 00:00:00 2001 From: Michal Luczaj Date: Mon, 11 Nov 2024 00:17:34 +0100 Subject: [PATCH 230/235] net: Make copy_safe_from_sockptr() match documentation copy_safe_from_sockptr() return copy_from_sockptr() return copy_from_sockptr_offset() return copy_from_user() copy_from_user() does not return an error on fault. Instead, it returns a number of bytes that were not copied. Have it handled. Patch has a side effect: it un-breaks garbage input handling of nfc_llcp_setsockopt() and mISDN's data_sock_setsockopt(). Fixes: 6309863b31dd ("net: add copy_safe_from_sockptr() helper") Signed-off-by: Michal Luczaj Link: https://patch.msgid.link/20241111-sockptr-copy-ret-fix-v1-1-a520083a93fb@rbox.co Signed-off-by: Jakub Kicinski --- include/linux/sockptr.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h index fc5a206c4043..195debe2b1db 100644 --- a/include/linux/sockptr.h +++ b/include/linux/sockptr.h @@ -77,7 +77,9 @@ static inline int copy_safe_from_sockptr(void *dst, size_t ksize, { if (optlen < ksize) return -EINVAL; - return copy_from_sockptr(dst, optval, ksize); + if (copy_from_sockptr(dst, optval, ksize)) + return -EFAULT; + return 0; } static inline int copy_struct_from_sockptr(void *dst, size_t ksize, From 5b366eae71937ae7412365340b431064625f9617 Mon Sep 17 00:00:00 2001 From: Vitalii Mordan Date: Fri, 8 Nov 2024 20:33:34 +0300 Subject: [PATCH 231/235] stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines If the clock dwmac->tx_clk was not enabled in intel_eth_plat_probe, it should not be disabled in any path. Conversely, if it was enabled in intel_eth_plat_probe, it must be disabled in all error paths to ensure proper cleanup. Found by Linux Verification Center (linuxtesting.org) with Klever. 
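The rule being enforced, as a sketch (have_tx_clk is a hypothetical stand-in for the driver's dwmac->data->tx_clk_en condition):

	if (have_tx_clk) {
		ret = clk_prepare_enable(dwmac->tx_clk);
		if (ret)
			return ret;
	}

	ret = remaining_probe_steps();
	if (ret)
		goto err_tx_clk_disable;

	return 0;

err_tx_clk_disable:
	if (have_tx_clk)
		clk_disable_unprepare(dwmac->tx_clk);
	return ret;

That is, every error path after the enable disables the clock, and paths where the clock was never enabled leave it alone.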
Fixes: 9efc9b2b04c7 ("net: stmmac: Add dwmac-intel-plat for GBE driver") Signed-off-by: Vitalii Mordan Link: https://patch.msgid.link/20241108173334.2973603-1-mordan@ispras.ru Signed-off-by: Jakub Kicinski --- .../stmicro/stmmac/dwmac-intel-plat.c | 25 +++++++++++++------ 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c index d68f0c4e7835..9739bc9867c5 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c @@ -108,7 +108,12 @@ static int intel_eth_plat_probe(struct platform_device *pdev) if (IS_ERR(dwmac->tx_clk)) return PTR_ERR(dwmac->tx_clk); - clk_prepare_enable(dwmac->tx_clk); + ret = clk_prepare_enable(dwmac->tx_clk); + if (ret) { + dev_err(&pdev->dev, + "Failed to enable tx_clk\n"); + return ret; + } /* Check and configure TX clock rate */ rate = clk_get_rate(dwmac->tx_clk); @@ -119,7 +124,7 @@ static int intel_eth_plat_probe(struct platform_device *pdev) if (ret) { dev_err(&pdev->dev, "Failed to set tx_clk\n"); - return ret; + goto err_tx_clk_disable; } } } @@ -133,7 +138,7 @@ static int intel_eth_plat_probe(struct platform_device *pdev) if (ret) { dev_err(&pdev->dev, "Failed to set clk_ptp_ref\n"); - return ret; + goto err_tx_clk_disable; } } } @@ -149,12 +154,15 @@ static int intel_eth_plat_probe(struct platform_device *pdev) } ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res); - if (ret) { - clk_disable_unprepare(dwmac->tx_clk); - return ret; - } + if (ret) + goto err_tx_clk_disable; return 0; + +err_tx_clk_disable: + if (dwmac->data->tx_clk_en) + clk_disable_unprepare(dwmac->tx_clk); + return ret; } static void intel_eth_plat_remove(struct platform_device *pdev) @@ -162,7 +170,8 @@ static void intel_eth_plat_remove(struct platform_device *pdev) struct intel_dwmac *dwmac = get_stmmac_bsp_priv(&pdev->dev); stmmac_pltfr_remove(pdev); - clk_disable_unprepare(dwmac->tx_clk); + if (dwmac->data->tx_clk_en) + clk_disable_unprepare(dwmac->tx_clk); } static struct platform_driver intel_eth_plat_driver = { From dc065076ee7768377d7c16af7d1b0767782d8c98 Mon Sep 17 00:00:00 2001 From: Meghana Malladi Date: Mon, 11 Nov 2024 15:28:42 +0530 Subject: [PATCH 232/235] net: ti: icssg-prueth: Fix 1 PPS sync The first PPS latch time needs to be calculated by the driver (in rounded off seconds) and configured as the start time offset for the cycle. After synchronizing two PTP clocks running as master/slave, missing this would cause master and slave to start immediately with some milliseconds drift which causes the PPS signal to never synchronize with the PTP master. 
Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
Signed-off-by: Meghana Malladi
Reviewed-by: Vadim Fedorenko
Reviewed-by: MD Danish Anwar
Link: https://patch.msgid.link/20241111095842.478833-1-m-malladi@ti.com
Signed-off-by: Paolo Abeni
---
 drivers/net/ethernet/ti/icssg/icssg_prueth.c | 13 +++++++++++--
 drivers/net/ethernet/ti/icssg/icssg_prueth.h | 12 ++++++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index 5c20ceb164df..fe2fd1bfc904 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include <linux/io-64-nonatomic-hi-lo.h>
 #include
 #include
 #include
@@ -411,6 +412,8 @@ static int prueth_perout_enable(void *clockops_data,
 	struct prueth_emac *emac = clockops_data;
 	u32 reduction_factor = 0, offset = 0;
 	struct timespec64 ts;
+	u64 current_cycle;
+	u64 start_offset;
 	u64 ns_period;
 
 	if (!on)
@@ -449,8 +452,14 @@ static int prueth_perout_enable(void *clockops_data,
 	writel(reduction_factor, emac->prueth->shram.va +
 	       TIMESYNC_FW_WC_SYNCOUT_REDUCTION_FACTOR_OFFSET);
 
-	writel(0, emac->prueth->shram.va +
-	       TIMESYNC_FW_WC_SYNCOUT_START_TIME_CYCLECOUNT_OFFSET);
+	current_cycle = icssg_read_time(emac->prueth->shram.va +
+					TIMESYNC_FW_WC_CYCLECOUNT_OFFSET);
+
+	/* Rounding of current_cycle count to next second */
+	start_offset = roundup(current_cycle, MSEC_PER_SEC);
+
+	hi_lo_writeq(start_offset, emac->prueth->shram.va +
+		     TIMESYNC_FW_WC_SYNCOUT_START_TIME_CYCLECOUNT_OFFSET);
 
 	return 0;
 }

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
index 8722bb4a268a..f5c1d473e9f9 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
@@ -330,6 +330,18 @@ static inline int prueth_emac_slice(struct prueth_emac *emac)
 extern const struct ethtool_ops icssg_ethtool_ops;
 extern const struct dev_pm_ops prueth_dev_pm_ops;
 
+static inline u64 icssg_read_time(const void __iomem *addr)
+{
+	u32 low, high;
+
+	do {
+		high = readl(addr + 4);
+		low = readl(addr);
+	} while (high != readl(addr + 4));
+
+	return low + ((u64)high << 32);
+}
+
 /* Classifier helpers */
 void icssg_class_set_mac_addr(struct regmap *miig_rt, int slice, u8 *mac);
 void icssg_class_set_host_mac_addr(struct regmap *miig_rt, const u8 *mac);

From 8eb36164d1a6769a20ed43033510067ff3dab9ee Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 11 Nov 2024 10:16:49 +0000
Subject: [PATCH 233/235] bonding: add ns target multicast address to slave
 device
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit 4598380f9c54 ("bonding: fix ns validation on backup slaves")
tried to resolve the issue where backup slaves couldn't be brought up
when receiving IPv6 Neighbor Solicitation (NS) messages. However, this
fix only worked for drivers that receive all multicast messages, such
as the veth interface.

For standard drivers, the NS multicast message is silently dropped
because the slave device is not a member of the NS target multicast
group.

To address this, we need to make the slave device join the NS target
multicast group, ensuring it can receive these IPv6 NS messages to
validate the slave’s status properly.

There are three policies before joining the multicast group:
1. All settings must be under active-backup mode (alb and tlb do not
   support arp_validate), with backup slaves and slaves supporting
   multicast.
2. We can add or remove multicast groups when arp_validate changes.
3. Other operations, such as enslaving, releasing, or setting NS
   targets, need to be guarded by arp_validate.

Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu
Reviewed-by: Nikolay Aleksandrov
Signed-off-by: Paolo Abeni
---
 drivers/net/bonding/bond_main.c    | 16 +++++-
 drivers/net/bonding/bond_options.c | 82 +++++++++++++++++++++++++++++-
 include/net/bond_options.h         |  2 +
 3 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b1bffd8e9a95..15e0f14d0d49 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1008,6 +1008,8 @@ static void bond_hw_addr_swap(struct bonding *bond, struct slave *new_active,
 
 		if (bond->dev->flags & IFF_UP)
 			bond_hw_addr_flush(bond->dev, old_active->dev);
+
+		bond_slave_ns_maddrs_add(bond, old_active);
 	}
 
 	if (new_active) {
@@ -1024,6 +1026,8 @@ static void bond_hw_addr_swap(struct bonding *bond, struct slave *new_active,
 			dev_mc_sync(new_active->dev, bond->dev);
 			netif_addr_unlock_bh(bond->dev);
 		}
+
+		bond_slave_ns_maddrs_del(bond, new_active);
 	}
 }
@@ -2341,6 +2345,11 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 	bond_compute_features(bond);
 	bond_set_carrier(bond);
 
+	/* Needs to be called before bond_select_active_slave(), which will
+	 * remove the maddrs if the slave is selected as active slave.
+	 */
+	bond_slave_ns_maddrs_add(bond, new_slave);
+
 	if (bond_uses_primary(bond)) {
 		block_netpoll_tx();
 		bond_select_active_slave(bond);
@@ -2350,7 +2359,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 	if (bond_mode_can_use_xmit_hash(bond))
 		bond_update_slave_arr(bond, NULL);
 
-
 	if (!slave_dev->netdev_ops->ndo_bpf ||
 	    !slave_dev->netdev_ops->ndo_xdp_xmit) {
 		if (bond->xdp_prog) {
@@ -2548,6 +2556,12 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (oldcurrent == slave)
 		bond_change_active_slave(bond, NULL);
 
+	/* Must be called after bond_change_active_slave() as the slave
+	 * might change from an active slave to a backup slave. Then it is
+	 * necessary to clear the maddrs on the backup slave.
+	 */
+	bond_slave_ns_maddrs_del(bond, slave);
+
 	if (bond_is_lb(bond)) {
 		/* Must be called only after the slave has been
 		 * detached from the list and the curr_active_slave

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 95d59a18c022..327b6ecdc77e 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -15,6 +15,7 @@
 #include
 #include
+#include <net/ndisc.h>
 
 static int bond_option_active_slave_set(struct bonding *bond,
 					const struct bond_opt_value *newval);
@@ -1234,6 +1235,68 @@ static int bond_option_arp_ip_targets_set(struct bonding *bond,
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
+static bool slave_can_set_ns_maddr(const struct bonding *bond, struct slave *slave)
+{
+	return BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP &&
+	       !bond_is_active_slave(slave) &&
+	       slave->dev->flags & IFF_MULTICAST;
+}
+
+static void slave_set_ns_maddrs(struct bonding *bond, struct slave *slave, bool add)
+{
+	struct in6_addr *targets = bond->params.ns_targets;
+	char slot_maddr[MAX_ADDR_LEN];
+	int i;
+
+	if (!slave_can_set_ns_maddr(bond, slave))
+		return;
+
+	for (i = 0; i < BOND_MAX_NS_TARGETS; i++) {
+		if (ipv6_addr_any(&targets[i]))
+			break;
+
+		if (!ndisc_mc_map(&targets[i], slot_maddr, slave->dev, 0)) {
+			if (add)
+				dev_mc_add(slave->dev, slot_maddr);
+			else
+				dev_mc_del(slave->dev, slot_maddr);
+		}
+	}
+}
+
+void bond_slave_ns_maddrs_add(struct bonding *bond, struct slave *slave)
+{
+	if (!bond->params.arp_validate)
+		return;
+	slave_set_ns_maddrs(bond, slave, true);
+}
+
+void bond_slave_ns_maddrs_del(struct bonding *bond, struct slave *slave)
+{
+	if (!bond->params.arp_validate)
+		return;
+	slave_set_ns_maddrs(bond, slave, false);
+}
+
+static void slave_set_ns_maddr(struct bonding *bond, struct slave *slave,
+			       struct in6_addr *target, struct in6_addr *slot)
+{
+	char target_maddr[MAX_ADDR_LEN], slot_maddr[MAX_ADDR_LEN];
+
+	if (!bond->params.arp_validate || !slave_can_set_ns_maddr(bond, slave))
+		return;
+
+	/* remove the previous maddr from slave */
+	if (!ipv6_addr_any(slot) &&
+	    !ndisc_mc_map(slot, slot_maddr, slave->dev, 0))
+		dev_mc_del(slave->dev, slot_maddr);
+
+	/* add new maddr on slave if target is set */
+	if (!ipv6_addr_any(target) &&
+	    !ndisc_mc_map(target, target_maddr, slave->dev, 0))
+		dev_mc_add(slave->dev, target_maddr);
+}
+
 static void _bond_options_ns_ip6_target_set(struct bonding *bond, int slot,
 					    struct in6_addr *target,
 					    unsigned long last_rx)
@@ -1243,8 +1306,10 @@ static void _bond_options_ns_ip6_target_set(struct bonding *bond, int slot,
 	struct slave *slave;
 
 	if (slot >= 0 && slot < BOND_MAX_NS_TARGETS) {
-		bond_for_each_slave(bond, slave, iter)
+		bond_for_each_slave(bond, slave, iter) {
 			slave->target_last_arp_rx[slot] = last_rx;
+			slave_set_ns_maddr(bond, slave, target, &targets[slot]);
+		}
 		targets[slot] = *target;
 	}
 }
@@ -1296,15 +1361,30 @@ static int bond_option_ns_ip6_targets_set(struct bonding *bond,
 {
 	return -EPERM;
 }
+
+static void slave_set_ns_maddrs(struct bonding *bond, struct slave *slave, bool add) {}
+
+void bond_slave_ns_maddrs_add(struct bonding *bond, struct slave *slave) {}
+
+void bond_slave_ns_maddrs_del(struct bonding *bond, struct slave *slave) {}
 #endif
 
 static int bond_option_arp_validate_set(struct bonding *bond,
 					const struct bond_opt_value *newval)
 {
+	bool changed = !!bond->params.arp_validate != !!newval->value;
+	struct list_head *iter;
+	struct slave *slave;
+
 	netdev_dbg(bond->dev, "Setting arp_validate to %s (%llu)\n",
 		   newval->string, newval->value);
 	bond->params.arp_validate = newval->value;
 
+	if (changed) {
+		bond_for_each_slave(bond, slave, iter)
+			slave_set_ns_maddrs(bond, slave, !!bond->params.arp_validate);
+	}
+
 	return 0;
 }

diff --git a/include/net/bond_options.h b/include/net/bond_options.h
index 473a0147769e..18687ccf0638 100644
--- a/include/net/bond_options.h
+++ b/include/net/bond_options.h
@@ -161,5 +161,7 @@ void bond_option_arp_ip_targets_clear(struct bonding *bond);
 #if IS_ENABLED(CONFIG_IPV6)
 void bond_option_ns_ip6_targets_clear(struct bonding *bond);
 #endif
+void bond_slave_ns_maddrs_add(struct bonding *bond, struct slave *slave);
+void bond_slave_ns_maddrs_del(struct bonding *bond, struct slave *slave);
 
 #endif /* _NET_BOND_OPTIONS_H */

From 86fb6173d11e773a00a5b6d1b7bd17caff8692b8 Mon Sep 17 00:00:00 2001
From: Hangbin Liu
Date: Mon, 11 Nov 2024 10:16:50 +0000
Subject: [PATCH 234/235] selftests: bonding: add ns multicast group testing

Add a test to make sure the backup slaves join the correct multicast
group when arp_validate is enabled and ns_ip6_target is set.

Here is the result:

TEST: arp_validate (active-backup ns_ip6_target arp_validate 0)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 1)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 2)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 3)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 4)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 5)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 6)  [ OK ]
TEST: arp_validate (join mcast group)                            [ OK ]

Signed-off-by: Hangbin Liu
Reviewed-by: Nikolay Aleksandrov
Signed-off-by: Paolo Abeni
---
 .../drivers/net/bonding/bond_options.sh       | 54 ++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/net/bonding/bond_options.sh b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
index 41d0859feb7d..edc56e2cc606 100755
--- a/tools/testing/selftests/drivers/net/bonding/bond_options.sh
+++ b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
@@ -11,6 +11,8 @@ ALL_TESTS="
 
 lib_dir=$(dirname "$0")
 source ${lib_dir}/bond_topo_3d1c.sh
+c_maddr="33:33:00:00:00:10"
+g_maddr="33:33:00:00:02:54"
 
 skip_prio()
 {
@@ -240,6 +242,54 @@ arp_validate_test()
 	done
 }
 
+# Testing correct multicast groups are added to slaves for ns targets
+arp_validate_mcast()
+{
+	RET=0
+	local arp_valid=$(cmd_jq "ip -n ${s_ns} -j -d link show bond0" ".[].linkinfo.info_data.arp_validate")
+	local active_slave=$(cmd_jq "ip -n ${s_ns} -d -j link show bond0" ".[].linkinfo.info_data.active_slave")
+
+	for i in $(seq 0 2); do
+		maddr_list=$(ip -n ${s_ns} maddr show dev eth${i})
+
+		# arp_valid == 0 or active_slave should not join any maddrs
+		if { [ "$arp_valid" == "null" ] || [ "eth${i}" == ${active_slave} ]; } && \
+			echo "$maddr_list" | grep -qE "${c_maddr}|${g_maddr}"; then
+			RET=1
+			check_err 1 "arp_valid $arp_valid active_slave $active_slave, eth$i has mcast group"
+		# arp_valid != 0 and backup_slave should join both maddrs
+		elif [ "$arp_valid" != "null" ] && [ "eth${i}" != ${active_slave} ] && \
+		     ( ! echo "$maddr_list" | grep -q "${c_maddr}" || \
+		       ! echo "$maddr_list" | grep -q "${g_maddr}"); then
+			RET=1
+			check_err 1 "arp_valid $arp_valid active_slave $active_slave, eth$i has mcast group"
+		fi
+	done
+
+	# Do failover
+	ip -n ${s_ns} link set ${active_slave} down
+	# wait for active link change
+	slowwait 2 active_slave_changed $active_slave
+	active_slave=$(cmd_jq "ip -n ${s_ns} -d -j link show bond0" ".[].linkinfo.info_data.active_slave")
+
+	for i in $(seq 0 2); do
+		maddr_list=$(ip -n ${s_ns} maddr show dev eth${i})
+
+		# arp_valid == 0 or active_slave should not join any maddrs
+		if { [ "$arp_valid" == "null" ] || [ "eth${i}" == ${active_slave} ]; } && \
+			echo "$maddr_list" | grep -qE "${c_maddr}|${g_maddr}"; then
+			RET=1
+			check_err 1 "arp_valid $arp_valid active_slave $active_slave, eth$i has mcast group"
+		# arp_valid != 0 and backup_slave should join both maddrs
+		elif [ "$arp_valid" != "null" ] && [ "eth${i}" != ${active_slave} ] && \
+		     ( ! echo "$maddr_list" | grep -q "${c_maddr}" || \
+		       ! echo "$maddr_list" | grep -q "${g_maddr}"); then
+			RET=1
+			check_err 1 "arp_valid $arp_valid active_slave $active_slave, eth$i has mcast group"
+		fi
+	done
+}
+
 arp_validate_arp()
 {
 	local mode=$1
@@ -261,8 +311,10 @@ arp_validate_ns()
 	fi
 
 	for val in $(seq 0 6); do
-		arp_validate_test "mode $mode arp_interval 100 ns_ip6_target ${g_ip6} arp_validate $val"
+		arp_validate_test "mode $mode arp_interval 100 ns_ip6_target ${g_ip6},${c_ip6} arp_validate $val"
 		log_test "arp_validate" "$mode ns_ip6_target arp_validate $val"
+		arp_validate_mcast
+		log_test "arp_validate" "join mcast group"
 	done
 }

From ca34aceb322bfcd6ab498884f1805ee12f983259 Mon Sep 17 00:00:00 2001
From: Alexandre Ferrieux
Date: Wed, 13 Nov 2024 11:04:28 +0100
Subject: [PATCH 235/235] net: sched: u32: Add test case for systematic hnode
 IDR leaks

Add a tdc test case to exercise the just-fixed systematic leak of IDR
entries in u32 hnode disposal. Given the IDR in question is confined
to the range [1..0x7FF], it is sufficient to create/delete the same
filter 2048 times to fill it up and get a nonzero exit status from
"tc filter add".
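
For illustration, a kernel-style sketch (not code from cls_u32; the
example_* names are placeholders) of why 2048 iterations are enough to
expose a leak in an IDR confined to [1..0x7FF]:

	#include <linux/idr.h>

	static DEFINE_IDR(example_idr);

	static int example_alloc_id(void *entry)
	{
		/* lowest free ID in [1, 0x7FF]; -ENOSPC once all are held */
		return idr_alloc(&example_idr, entry, 1, 0x800, GFP_KERNEL);
	}

	static void example_free_id(int id)
	{
		/* the pairing a leaky disposal path forgets */
		idr_remove(&example_idr, id);
	}

With at most 2047 usable IDs, leaking one ID per create/delete cycle
must hit -ENOSPC within 2048 cycles, which makes "tc filter add" in the
batched loop below exit nonzero on a kernel that still has the leak.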
Signed-off-by: Alexandre Ferrieux
Acked-by: Jamal Hadi Salim
Reviewed-by: Victor Nogueira
Link: https://patch.msgid.link/20241113100428.360460-1-alexandre.ferrieux@orange.com
Signed-off-by: Paolo Abeni
---
 .../tc-testing/tc-tests/filters/u32.json      | 24 +++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/filters/u32.json b/tools/testing/selftests/tc-testing/tc-tests/filters/u32.json
index 24bd0c2a3014..b2ca9d4e991b 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/filters/u32.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/filters/u32.json
@@ -329,5 +329,29 @@
         "teardown": [
             "$TC qdisc del dev $DEV1 parent root drr"
         ]
+    },
+    {
+        "id": "1234",
+        "name": "Exercise IDR leaks by creating/deleting a filter many (2048) times",
+        "category": [
+            "filter",
+            "u32"
+        ],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            "$TC qdisc add dev $DEV1 parent root handle 10: drr",
+            "$TC filter add dev $DEV1 parent 10:0 protocol ip prio 2 u32 match ip src 0.0.0.2/32 action drop",
+            "$TC filter add dev $DEV1 parent 10:0 protocol ip prio 3 u32 match ip src 0.0.0.3/32 action drop"
+        ],
+        "cmdUnderTest": "bash -c 'for i in {1..2048} ;do echo filter delete dev $DEV1 pref 3;echo filter add dev $DEV1 parent 10:0 protocol ip prio 3 u32 match ip src 0.0.0.3/32 action drop;done | $TC -b -'",
+        "expExitCode": "0",
+        "verifyCmd": "$TC filter show dev $DEV1",
+        "matchPattern": "protocol ip pref 3 u32",
+        "matchCount": "3",
+        "teardown": [
+            "$TC qdisc del dev $DEV1 parent root drr"
+        ]
+    }
 ]