linux-next/Documentation/admin-guide/reporting-regressions.rst
Thorsten Leemhuis d2b40ba2cc docs: *-regressions.rst: explain how quickly issues should be handled
Add a section with a few rules of thumb about how
quickly developers should address regressions to
Documentation/process/handling-regressions.rst; additionally,
add a short paragraph about this to the companion document
Documentation/admin-guide/reporting-regressions.rst as well.

The rules of thumb were written after studying the quotes from Linus
found in handling-regressions.rst and especially influenced by
statements like "Users are literally the _only_ thing that matters" and
"without users, your program is not a program, it's a pointless piece of
code that you might as well throw away". The author interpreted those in
perspective to how the various Linux kernel series are maintained
currently and what those practices might mean for users running into a
regression on a small or big kernel update.

That for example lead to the paragraph starting with "Aim to get fixes
for regressions mainlined within one week after identifying the culprit,
if the regression was introduced in a stable/longterm release or the
devel cycle for the latest mainline release". Some might see this as
pretty high bar, but on the other hand something like that is needed to
not leave users out in the cold for too long -- which can quickly happen
when updating to the latest stable series, as the previous one is
normally stamped "End of Life" about three or four weeks after a new
mainline release. This makes a lot of users switch during this
timeframe. Any of them thus risk running into regressions not promptly
fixed; even worse, once the previous stable series is EOLed for real,
users that face a regression might be left with only three options:

 (1) continue running an outdated and thus potentially insecure kernel
     version from an abandoned stable series

 (2) run the kernel with the regression

 (3) downgrade to an earlier longterm series still supported

This is better avoided, as (1) puts users and their data in danger, (2)
will only be possible if it's a minor regression that doesn't interfere
with booting or serious usage, and (3) might be regression itself or
impossible on the particular machine, as the users might require drivers
or features only introduced after the latest longterm series branched
of.

In the end this lead to the aforementioned "Aim to fix regression within
one week" part. It's also the reason for the "Try to resolve any
regressions introduced in the current development cycle before its
end.".

Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/a7b717b52c0d54cdec9b6daf56ed6669feddee2c.1644994117.git.linux@leemhuis.info
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-02-24 12:57:25 -07:00

452 lines
22 KiB
ReStructuredText

.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
.. [see the bottom of this file for redistribution information]
Reporting regressions
+++++++++++++++++++++
"*We don't cause regressions*" is the first rule of Linux kernel development;
Linux founder and lead developer Linus Torvalds established it himself and
ensures it's obeyed.
This document describes what the rule means for users and how the Linux kernel's
development model ensures to address all reported regressions; aspects relevant
for kernel developers are left to Documentation/process/handling-regressions.rst.
The important bits (aka "TL;DR")
================================
#. It's a regression if something running fine with one Linux kernel works worse
or not at all with a newer version. Note, the newer kernel has to be compiled
using a similar configuration; the detailed explanations below describes this
and other fine print in more detail.
#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst,
it already covers all aspects important for regressions and repeated
below for convenience. Two of them are important: start your report's subject
with "[REGRESSION]" and CC or forward it to `the regression mailing list
<https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
#. Optional, but recommended: when sending or forwarding your report, make the
Linux kernel regression tracking bot "regzbot" track the issue by specifying
when the regression started like this::
#regzbot introduced v5.13..v5.14-rc1
All the details on Linux kernel regressions relevant for users
==============================================================
The important basics
--------------------
What is a "regression" and what is the "no regressions rule"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's a regression if some application or practical use case running fine with
one Linux kernel works worse or not at all with a newer version compiled using a
similar configuration. The "no regressions rule" forbids this to take place; if
it happens by accident, developers that caused it are expected to quickly fix
the issue.
It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
It's also a regression if a perfectly working application suddenly shows erratic
behavior with a newer kernel version; such issues can be caused by changes in
procfs, sysfs, or one of the many other interfaces Linux provides to userland
software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
be built from a configuration similar to the one from 5.13. This can be achieved
using ``make olddefconfig``, as explained in more detail below.
Note the "practical use case" in the first sentence of this section: developers
despite the "no regressions" rule are free to change any aspect of the kernel
and even APIs or ABIs to userland, as long as no existing application or use
case breaks.
Also be aware the "no regressions" rule covers only interfaces the kernel
provides to the userland. It thus does not apply to kernel-internal interfaces
like the module API, which some externally developed drivers use to hook into
the kernel.
How do I report a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Just report the issue as outlined in
Documentation/admin-guide/reporting-issues.rst, it already describes the
important points. The following aspects outlined there are especially relevant
for regressions:
* When checking for existing reports to join, also search the `archives of the
Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
`regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
* Start your report's subject with "[REGRESSION]".
* In your report, clearly mention the last kernel version that worked fine and
the first broken one. Ideally try to find the exact change causing the
regression using a bisection, as explained below in more detail.
* Remember to let the Linux regressions mailing list
(regressions@lists.linux.dev) know about your report:
* If you report the regression by mail, CC the regressions list.
* If you report your regression to some bug tracker, forward the submitted
report by mail to the regressions list while CCing the maintainer and the
mailing list for the subsystem in question.
If it's a regression within a stable or longterm series (e.g.
v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
<https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
In case you performed a successful bisection, add everyone to the CC the
culprit's commit message mentions in lines starting with "Signed-off-by:".
When CCing for forwarding your report to the list, consider directly telling the
aforementioned Linux kernel regression tracking bot about your report. To do
that, include a paragraph like this in your mail::
#regzbot introduced: v5.13..v5.14-rc1
Regzbot will then consider your mail a report for a regression introduced in the
specified version range. In above case Linux v5.13 still worked fine and Linux
v5.14-rc1 was the first version where you encountered the issue. If you
performed a bisection to find the commit that caused the regression, specify the
culprit's commit-id instead::
#regzbot introduced: 1f2e3d4c5d
Placing such a "regzbot command" is in your interest, as it will ensure the
report won't fall through the cracks unnoticed. If you omit this, the Linux
kernel's regressions tracker will take care of telling regzbot about your
regression, as long as you send a copy to the regressions mailing lists. But the
regression tracker is just one human which sometimes has to rest or occasionally
might even enjoy some time away from computers (as crazy as that might sound).
Relying on this person thus will result in an unnecessary delay before the
regressions becomes mentioned `on the list of tracked and unresolved Linux
kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
weekly regression reports sent by regzbot. Such delays can result in Linus
Torvalds being unaware of important regressions when deciding between "continue
development or call this finished and release the final?".
Are really all regressions fixed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nearly all of them are, as long as the change causing the regression (the
"culprit commit") is reliably identified. Some regressions can be fixed without
this, but often it's required.
Who needs to find the root cause of a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Developers of the affected code area should try to locate the culprit on their
own. But for them that's often impossible to do with reasonable effort, as quite
a lot of issues only occur in a particular environment outside the developer's
reach -- for example, a specific hardware platform, firmware, Linux distro,
system's configuration, or application. That's why in the end it's often up to
the reporter to locate the culprit commit; sometimes users might even need to
run additional tests afterwards to pinpoint the exact root cause. Developers
should offer advice and reasonably help where they can, to make this process
relatively easy and achievable for typical users.
How can I find the culprit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Perform a bisection, as roughly outlined in
Documentation/admin-guide/reporting-issues.rst and described in more detail by
Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
in many cases finds the culprit relatively quickly. If it's hard or
time-consuming to reliably reproduce the issue, consider teaming up with other
affected users to narrow down the search range together.
Who can I ask for advice when it comes to regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
issue might better be dealt with in private, feel free to omit the list.
Additional details about regressions
------------------------------------
What is the goal of the "no regressions rule"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users should feel safe when updating kernel versions and not have to worry
something might break. This is in the interest of the kernel developers to make
updating attractive: they don't want users to stay on stable or longterm Linux
series that are either abandoned or more than one and a half years old. That's
in everybody's interest, as `those series might have known bugs, security
issues, or other problematic aspects already fixed in later versions
<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
Additionally, the kernel developers want to make it simple and appealing for
users to test the latest pre-release or regular release. That's also in
everybody's interest, as it's a lot easier to track down and fix problems, if
they are reported shortly after being introduced.
Is the "no regressions" rule really adhered in practice?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's taken really seriously, as can be seen by many mailing list posts from
Linux creator and lead developer Linus Torvalds, some of which are quoted in
Documentation/process/handling-regressions.rst.
Exceptions to this rule are extremely rare; in the past developers almost always
turned out to be wrong when they assumed a particular situation was warranting
an exception.
Who ensures the "no regressions" is actually followed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The subsystem maintainers should take care of that, which are watched and
supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
Greg Kroah-Hartman et al. for various stable/longterm series.
All of them are helped by people trying to ensure no regression report falls
through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
the Linux kernel's "regressions tracker"; to facilitate this work he relies on
regzbot, the Linux kernel regression tracking bot. That's why you want to bring
your report on the radar of these people by CCing or forwarding each report to
the regressions mailing list, ideally with a "regzbot command" in your mail to
get it tracked immediately.
How quickly are regressions normally fixed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Developers should fix any reported regression as quickly as possible, to provide
affected users with a solution in a timely manner and prevent more users from
running into the issue; nevertheless developers need to take enough time and
care to ensure regression fixes do not cause additional damage.
The answer thus depends on various factors like the impact of a regression, its
age, or the Linux series in which it occurs. In the end though, most regressions
should be fixed within two weeks.
Is it a regression, if the issue can be avoided by updating some software?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Almost always: yes. If a developer tells you otherwise, ask the regression
tracker for advice as outlined above.
Is it a regression, if a newer kernel works slower or consumes more energy?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, but the difference has to be significant. A five percent slow-down in a
micro-benchmark thus is unlikely to qualify as regression, unless it also
influences the results of a broad benchmark by more than one percent. If in
doubt, ask for advice.
Is it a regression, if an external kernel module breaks when updating Linux?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No, as the "no regression" rule is about interfaces and services the Linux
kernel provides to the userland. It thus does not cover building or running
externally developed kernel modules, as they run in kernel-space and hook into
the kernel using internal interfaces occasionally changed.
How are regressions handled that are caused by security fixes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In extremely rare situations security issues can't be fixed without causing
regressions; those fixes are given way, as they are the lesser evil in the end.
Luckily this middling almost always can be avoided, as key developers for the
affected area and often Linus Torvalds himself try very hard to fix security
issues without causing regressions.
If you nevertheless face such a case, check the mailing list archives if people
tried their best to avoid the regression. If not, report it; if in doubt, ask
for advice as outlined above.
What happens if fixing a regression is impossible without causing another?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sadly these things happen, but luckily not very often; if they occur, expert
developers of the affected code area should look into the issue to find a fix
that avoids regressions or at least their impact. If you run into such a
situation, do what was outlined already for regressions caused by security
fixes: check earlier discussions if people already tried their best and ask for
advice if in doubt.
A quick note while at it: these situations could be avoided, if people would
regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each
development cycle a test run. This is best explained by imagining a change
integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
the same time is a hard requirement for some other improvement applied for
5.15-rc1. All these changes often can simply be reverted and the regression thus
solved, if someone finds and reports it before 5.15 is released. A few days or
weeks later this solution can become impossible, as some software might have
started to rely on aspects introduced by one of the follow-up changes: reverting
all changes would then cause a regression for users of said software and thus is
out of the question.
Is it a regression, if some feature I relied on was removed months ago?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is, but often it's hard to fix such regressions due to the aspects outlined
in the previous section. It hence needs to be dealt with on a case-by-case
basis. This is another reason why it's in everybody's interest to regularly test
mainline pre-releases.
Does the "no regression" rule apply if I seem to be the only affected person?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It does, but only for practical usage: the Linux developers want to be free to
remove support for hardware only to be found in attics and museums anymore.
Note, sometimes regressions can't be avoided to make progress -- and the latter
is needed to prevent Linux from stagnation. Hence, if only very few users seem
to be affected by a regression, it for the greater good might be in their and
everyone else's interest to lettings things pass. Especially if there is an
easy way to circumvent the regression somehow, for example by updating some
software or using a kernel parameter created just for this purpose.
Does the regression rule apply for code in the staging tree as well?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Not according to the `help text for the configuration option covering all
staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
which since its early days states::
Please note that these drivers are under heavy development, may or
may not work, and may contain userspace interfaces that most likely
will be changed in the near future.
The staging developers nevertheless often adhere to the "no regressions" rule,
but sometimes bend it to make progress. That's for example why some users had to
deal with (often negligible) regressions when a WiFi driver from the staging
tree was replaced by a totally different one written from scratch.
Why do later versions have to be "compiled with a similar configuration"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because the Linux kernel developers sometimes integrate changes known to cause
regressions, but make them optional and disable them in the kernel's default
configuration. This trick allows progress, as the "no regressions" rule
otherwise would lead to stagnation.
Consider for example a new security feature blocking access to some kernel
interfaces often abused by malware, which at the same time are required to run a
few rarely used applications. The outlined approach makes both camps happy:
people using these applications can leave the new security feature off, while
everyone else can enable it without running into trouble.
How to create a configuration similar to the one of an older kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Start your machine with a known-good kernel and configure the newer Linux
version with ``make olddefconfig``. This makes the kernel's build scripts pick
up the configuration file (the ".config" file) from the running kernel as base
for the new one you are about to compile; afterwards they set all new
configuration options to their default value, which should disable new features
that might cause regressions.
Can I report a regression I found with pre-compiled vanilla kernels?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need to ensure the newer kernel was compiled with a similar configuration
file as the older one (see above), as those that built them might have enabled
some known-to-be incompatible feature for the newer kernel. If in doubt, report
the matter to the kernel's provider and ask for advice.
More about regression tracking with "regzbot"
---------------------------------------------
What is regression tracking and why should I care about it?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rules like "no regressions" need someone to ensure they are followed, otherwise
they are broken either accidentally or on purpose. History has shown this to be
true for Linux kernel development as well. That's why Thorsten Leemhuis, the
Linux Kernel's regression tracker, and some people try to ensure all regression
are fixed by keeping an eye on them until they are resolved. Neither of them are
paid for this, that's why the work is done on a best effort basis.
Why and how are Linux kernel regressions tracked using a bot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tracking regressions completely manually has proven to be quite hard due to the
distributed and loosely structured nature of Linux kernel development process.
That's why the Linux kernel's regression tracker developed regzbot to facilitate
the work, with the long term goal to automate regression tracking as much as
possible for everyone involved.
Regzbot works by watching for replies to reports of tracked regressions.
Additionally, it's looking out for posted or committed patches referencing such
reports with "Link:" tags; replies to such patch postings are tracked as well.
Combined this data provides good insights into the current state of the fixing
process.
How to see which regressions regzbot tracks currently?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
What kind of issues are supposed to be tracked by regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The bot is meant to track regressions, hence please don't involve regzbot for
regular issues. But it's okay for the Linux kernel's regression tracker if you
involve regzbot to track severe issues, like reports about hangs, corrupted
data, or internal errors (Panic, Oops, BUG(), warning, ...).
How to change aspects of a tracked regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By using a 'regzbot command' in a direct or indirect reply to the mail with the
report. The easiest way to do that: find the report in your "Sent" folder or the
mailing list archive and reply to it using your mailer's "Reply-all" function.
In that mail, use one of the following commands in a stand-alone paragraph (IOW:
use blank lines to separate one or multiple of these commands from the rest of
the mail's text).
* Update when the regression started to happen, for example after performing a
bisection::
#regzbot introduced: 1f2e3d4c5d
* Set or update the title::
#regzbot title: foo
* Monitor a discussion or bugzilla.kernel.org ticket where additions aspects of
the issue or a fix are discussed:::
#regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
#regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
* Point to a place with further details of interest, like a mailing list post
or a ticket in a bug tracker that are slightly related, but about a different
topic::
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
* Mark a regression as invalid::
#regzbot invalid: wasn't a regression, problem has always existed
Regzbot supports a few other commands primarily used by developers or people
tracking regressions. They and more details about the aforementioned regzbot
commands can be found in the `getting started guide
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
for regzbot.
..
end-of-content
..
This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
of the file. If you want to distribute this text under CC-BY-4.0 only,
please use "The Linux kernel developers" for author attribution and link
this as source:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
..
Note: Only the content of this RST file as found in the Linux kernel sources
is available under CC-BY-4.0, as versions of this text that were processed
(for example by the kernel's build system) might contain content taken from
files which use a more restrictive license.