linux-stable/kernel/time/timer_migration.h

151 lines
5.6 KiB
C
Raw Normal View History

timers: Implement the hierarchical pull model Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de
2024-02-22 10:37:10 +00:00
/* SPDX-License-Identifier: GPL-2.0-only */
#ifndef _KERNEL_TIME_MIGRATION_H
#define _KERNEL_TIME_MIGRATION_H
/* Per group capacity. Must be a power of 2! */
#define TMIGR_CHILDREN_PER_GROUP 8
/**
* struct tmigr_event - a timer event associated to a CPU
* @nextevt: The node to enqueue an event in the parent group queue
* @cpu: The CPU to which this event belongs
* @ignore: Hint whether the event could be ignored; it is set when
* CPU or group is active;
*/
struct tmigr_event {
struct timerqueue_node nextevt;
unsigned int cpu;
bool ignore;
};
/**
* struct tmigr_group - timer migration hierarchy group
* @lock: Lock protecting the event information and group hierarchy
* information during setup
timers/migration: Do not rely always on group->parent When reading group->parent without holding the group lock it is racy against CPUs coming online the first time and thereby creating another level of the hierarchy. This is not a problem when this value is read once to decide whether to abort a propagation or not. The worst outcome is an unnecessary/early CPU wake up. But it is racy when reading it several times during a single 'action' (like activation, deactivation, checking for remote timer expiry,...) and relying on the consitency of this value without holding the lock. This happens at the moment e.g. in tmigr_inactive_up() which is also calling tmigr_udpate_events(). Code relys on group->parent not to change during this 'action'. Update parent struct member description to explain the above only once. Remove parent pointer checks when they are not mandatory (like update of data->childmask). Remove a warning, which would be nice but the trigger of this warning is not reliable and add expand the data structure member description instead. Expand a comment, why it is safe to rely on parent pointer here (inside hierarchy update). Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-1-757baa7803fe@linutronix.de
2024-07-16 14:19:19 +00:00
* @parent: Pointer to the parent group. Pointer is updated when a
* new hierarchy level is added because of a CPU coming
* online the first time. Once it is set, the pointer will
* not be removed or updated. When accessing parent pointer
* lock less to decide whether to abort a propagation or
* not, it is not a problem. The worst outcome is an
* unnecessary/early CPU wake up. But do not access parent
* pointer several times in the same 'action' (like
* activation, deactivation, check for remote expiry,...)
* without holding the lock as it is not ensured that value
* will not change.
timers: Implement the hierarchical pull model Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de
2024-02-22 10:37:10 +00:00
* @groupevt: Next event of the group which is only used when the
* group is !active. The group event is then queued into
* the parent timer queue.
* Ignore bit of @groupevt is set when the group is active.
* @next_expiry: Base monotonic expiry time of the next event of the
* group; It is used for the racy lockless check whether a
* remote expiry is required; it is always reliable
* @events: Timer queue for child events queued in the group
* @migr_state: State of the group (see union tmigr_state)
* @level: Hierarchy level of the group; Required during setup
* @numa_node: Required for setup only to make sure CPU and low level
* group information is NUMA local. It is set to NUMA node
* as long as the group level is per NUMA node (level <
* tmigr_crossnode_level); otherwise it is set to
* NUMA_NO_NODE
* @num_children: Counter of group children to make sure the group is only
* filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
* only
* @childmask: childmask of the group in the parent group; is set
* during setup and will never change; can be read
* lockless
* @list: List head that is added to the per level
* tmigr_level_list; is required during setup when a
* new group needs to be connected to the existing
* hierarchy groups
*/
struct tmigr_group {
raw_spinlock_t lock;
struct tmigr_group *parent;
struct tmigr_event groupevt;
u64 next_expiry;
struct timerqueue_head events;
atomic_t migr_state;
unsigned int level;
int numa_node;
unsigned int num_children;
u8 childmask;
struct list_head list;
};
/**
* struct tmigr_cpu - timer migration per CPU group
* @lock: Lock protecting the tmigr_cpu group information
* @online: Indicates whether the CPU is online; In deactivate path
* it is required to know whether the migrator in the top
* level group is to be set offline, while a timer is
* pending. Then another online CPU needs to be notified to
* take over the migrator role. Furthermore the information
* is required in CPU hotplug path as the CPU is able to go
* idle before the timer migration hierarchy hotplug AP is
* reached. During this phase, the CPU has to handle the
* global timers on its own and must not act as a migrator.
* @idle: Indicates whether the CPU is idle in the timer migration
* hierarchy
* @remote: Is set when timers of the CPU are expired remotely
* @tmgroup: Pointer to the parent group
* @childmask: childmask of tmigr_cpu in the parent group
* @wakeup: Stores the first timer when the timer migration
* hierarchy is completely idle and remote expiry was done;
* is returned to timer code in the idle path and is only
* used in idle path.
* @cpuevt: CPU event which could be enqueued into the parent group
*/
struct tmigr_cpu {
raw_spinlock_t lock;
bool online;
bool idle;
bool remote;
struct tmigr_group *tmgroup;
u8 childmask;
u64 wakeup;
struct tmigr_event cpuevt;
};
/**
* union tmigr_state - state of tmigr_group
* @state: Combined version of the state - only used for atomic
* read/cmpxchg function
* @struct: Split version of the state - only use the struct members to
* update information to stay independent of endianness
*/
union tmigr_state {
u32 state;
/**
* struct - split state of tmigr_group
* @active: Contains each childmask bit of the active children
* @migrator: Contains childmask of the child which is migrator
* @seq: Sequence counter needs to be increased when an update
* to the tmigr_state is done. It prevents a race when
* updates in the child groups are propagated in changed
* order. Detailed information about the scenario is
* given in the documentation at the begin of
* timer_migration.c.
*/
struct {
u8 active;
u8 migrator;
u16 seq;
} __packed;
};
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
extern void tmigr_handle_remote(void);
extern bool tmigr_requires_handle_remote(void);
extern void tmigr_cpu_activate(void);
extern u64 tmigr_cpu_deactivate(u64 nextevt);
extern u64 tmigr_cpu_new_timer(u64 nextevt);
extern u64 tmigr_quick_check(u64 nextevt);
#else
static inline void tmigr_handle_remote(void) { }
static inline bool tmigr_requires_handle_remote(void) { return false; }
static inline void tmigr_cpu_activate(void) { }
#endif
#endif