WIP: backport drm fair scheduler to 6.15 #6

aviallon · 2025-06-24T11:46:13Z

This also backports some DRM fixes.
This is not yet tested, patches required quite some modifications to be applied.

[ Upstream commit 8af7a91 ] Driver tests now require GRE tunnels, while we don't configure them with YNL, YNL will complain when it sees link types it doesn't recognize. Teach it decoding ip6gre tunnels. The attrs are largely the same as IPv4 GRE. Correct the type of encap-limit, but note that this attr is only used in ip6gre, so the mistake didn't matter until now. Fixes: 0d0f417 ("selftests: drv-net: add a simple TSO test") Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Donald Hunter <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit db9ae3b ] Enable threaded NAPI by default for WireGuard devices in response to low performance behavior that we observed when multiple tunnels (and thus multiple wg devices) are deployed on a single host. This affects any kind of multi-tunnel deployment, regardless of whether the tunnels share the same endpoints or not (i.e., a VPN concentrator type of gateway would also be affected). The problem is caused by the fact that, in case of a traffic surge that involves multiple tunnels at the same time, the polling of the NAPI instance of all these wg devices tends to converge onto the same core, causing underutilization of the CPU and bottlenecking performance. This happens because NAPI polling is hosted by default in softirq context, but the WireGuard driver only raises this softirq after the rx peer queue has been drained, which doesn't happen during high traffic. In this case, the softirq already active on a core is reused instead of raising a new one. As a result, once two or more tunnel softirqs have been scheduled on the same core, they remain pinned there until the surge ends. In our experiments, this almost always leads to all tunnel NAPIs being handled on a single core shortly after a surge begins, limiting scalability to less than 3× the performance of a single tunnel, despite plenty of unused CPU cores being available. The proposed mitigation is to enable threaded NAPI for all WireGuard devices. This moves the NAPI polling context to a dedicated per-device kernel thread, allowing the scheduler to balance the load across all available cores. On our 32-core gateways, enabling threaded NAPI yields a ~4× performance improvement with 16 tunnels, increasing throughput from ~13 Gbps to ~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck) jumps from 20% to 100%. We have found no performance regressions in any scenario we tested. Single-tunnel throughput remains unchanged. More details are available in our Netdev paper. Link: https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf Signed-off-by: Mirco Barone <[email protected]> Fixes: e7096c1 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit c68804c ] The device type for IPv4 GRE is "gre" not "ipgre", unlike for IPv6 which uses "ip6gre". Not sure how I missed this when writing the test, perhaps because all HW I have access to is on an IPv6-only network. Fixes: 0d0f417 ("selftests: drv-net: add a simple TSO test") Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit e6854be ] Commit 846742f ("selftests: drv-net: add a warning for bkg + shell + terminate") added a warning for bkg() used with terminate=True. The tso test was missed as we didn't have it running anywhere in NIPA. Add exit_wait=True, to avoid: # Warning: combining shell and terminate is risky! # SIGTERM may not reach the child on zsh/ksh! getting printed twice for every variant. Fixes: 0d0f417 ("selftests: drv-net: add a simple TSO test") Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 535caac ] from_cleanup_net() reads cleanup_net_task locklessly. Add READ_ONCE()/WRITE_ONCE() annotations to avoid a potential KCSAN warning, even if the race is harmless. Fixes: 0734d7c ("net: expedite synchronize_net() for cleanup_net()") Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit feafc73 ] At the time rtnl_create_link() is running, dev->netdev_ops is NULL, we must not use netdev_lock_ops() or risk a NULL deref if CONFIG_NET_SHAPER is defined. Use netif_set_group() instead of dev_set_group(). RIP: 0010:netdev_need_ops_lock include/net/netdev_lock.h:33 [inline] RIP: 0010:netdev_lock_ops include/net/netdev_lock.h:41 [inline] RIP: 0010:dev_set_group+0xc0/0x230 net/core/dev_api.c:82 Call Trace: <TASK> rtnl_create_link+0x748/0xd10 net/core/rtnetlink.c:3674 rtnl_newlink_create+0x25c/0xb00 net/core/rtnetlink.c:3813 __rtnl_newlink net/core/rtnetlink.c:3940 [inline] rtnl_newlink+0x16d6/0x1c70 net/core/rtnetlink.c:4055 rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6944 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x75b/0x8d0 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] Reported-by: [email protected] Closes: https://lore.kernel.org/netdev/[email protected]/T/#u Signed-off-by: Eric Dumazet <[email protected]> Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations") Acked-by: Stanislav Fomichev <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 7632fed ] The kernel currently validates that the length of the provided nexthop address does not exceed the specified length. This can lead to the kernel reading uninitialized memory if user space provided a shorter length than the specified one. Fix by validating that the provided length exactly matches the specified one. Fixes: d1df6fd ("ipv6: sr: define core operations for seg6local lightweight tunnel") Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 8cf8cde ] In xe_vm_close_and_put() we need to be able to call xe_svm_fini(), however during vm creation we can call this on the error path, before having actually initialised the svm state, leading to various splats followed by a fatal NPD. Fixes: 6fd979c ("drm/xe: Add SVM init / close / fini to faulting VMs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4967 Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 4f296d7) Signed-off-by: Thomas Hellström <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 2182f35 ] The XE driver can be built with or without VSEC support, but fails to link as built-in if vsec is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `xe_vsec_init': (.text+0x1e83e16): undefined reference to `intel_vsec_register' The normal fix for this is to add a 'depends on INTEL_VSEC || !INTEL_VSEC', forcing XE to be a loadable module as well, but that causes a circular dependency: symbol DRM_XE depends on INTEL_VSEC symbol INTEL_VSEC depends on X86_PLATFORM_DEVICES symbol X86_PLATFORM_DEVICES is selected by DRM_XE The problem here is selecting a symbol from another subsystem, so change that as well and rephrase the 'select' into the corresponding dependency. Since X86_PLATFORM_DEVICES is 'default y', there is no change to defconfig builds here. Fixes: 0c45e76 ("drm/xe/vsec: Support BMG devices") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit e4931f8) Signed-off-by: Thomas Hellström <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 5cc3325 ] For preempt_fence mode VM's we're rejecting eviction of shared bos during VM_BIND. However, since we do this in the move() callback, we're getting an eviction failure warning from TTM. The TTM callback intended for these things is eviction_valuable(). However, the latter doesn't pass in the struct ttm_operation_ctx needed to determine whether the caller needs this. Instead, attach the needed information to the vm under the vm->resv, until we've been able to update TTM to provide the needed information. And add sufficient lockdep checks to prevent misuse and races. v2: - Fix a copy-paste error in xe_vm_clear_validating() v3: - Fix kerneldoc errors. Signed-off-by: Thomas Hellström <[email protected]> Fixes: 0af944f ("drm/xe: Reject BO eviction if BO is bound to current VM") Reviewed-by: Matthew Brost <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 9d55586) Signed-off-by: Thomas Hellström <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 6bf4d56 ] The define of the extension type was accidentally used instead of the one of the property itself. They're both zero, so no functional issue, but we should use the correct define for code correctness. Fixes: 41a97c4 ("drm/xe/pxp/uapi: Add API to mark a BO as using PXP") Signed-off-by: Daniele Ceraolo Spurio <[email protected]> Cc: John Harrison <[email protected]> Reviewed-by: John Harrison <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 1d891ee) Signed-off-by: Thomas Hellström <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 69a58ef ] The expected flow of operations when using PXP is to query the PXP status and wait for it to transition to "ready" before attempting to create an exec_queue. This flow is followed by the Mesa driver, but there is no guarantee that an incorrectly coded (or malicious) app will not attempt to create the queue first without querying the status. Therefore, we need to clarify what the expected behavior of the queue creation ioctl is in this scenario. Currently, the ioctl always fails with an -EBUSY code no matter the error, but for consistency it is better to distinguish between "failed to init" (-EIO) and "not ready" (-EBUSY), the same way the query ioctl does. Note that, while this is a change in the return code of an ioctl, the behavior of the ioctl in this particular corner case was not clearly spec'd, so no one should have been relying on it (and we know that Mesa, which is the only known userspace for this, didn't). v2: Minor rework of the doc (Rodrigo) Fixes: 72d4796 ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues") Signed-off-by: Daniele Ceraolo Spurio <[email protected]> Cc: John Harrison <[email protected]> Cc: José Roberto de Souza <[email protected]> Reviewed-by: José Roberto de Souza <[email protected]> Reviewed-by: John Harrison <[email protected]> Acked-by: Rodrigo Vivi <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 21784ca) Signed-off-by: Thomas Hellström <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

…ess handling [ Upstream commit 61a74ad ] Use copy_from_user_nofault() and copy_to_user_nofault() instead of copy_from/to_user functions in the misaligned access trap handlers. The following bug report was found when executing misaligned memory accesses: BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:162 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 115, name: two preempt_count: 0, expected: 0 CPU: 0 UID: 0 PID: 115 Comm: two Not tainted 6.14.0-rc5 torvalds#24 Hardware name: riscv-virtio,qemu (DT) Call Trace: [<ffffffff800160ea>] dump_backtrace+0x1c/0x24 [<ffffffff80002304>] show_stack+0x28/0x34 [<ffffffff80010fae>] dump_stack_lvl+0x4a/0x68 [<ffffffff80010fe0>] dump_stack+0x14/0x1c [<ffffffff8004e44e>] __might_resched+0xfa/0x104 [<ffffffff8004e496>] __might_sleep+0x3e/0x62 [<ffffffff801963c4>] __might_fault+0x1c/0x24 [<ffffffff80425352>] _copy_from_user+0x28/0xaa [<ffffffff8000296c>] handle_misaligned_store+0x204/0x254 [<ffffffff809eae82>] do_trap_store_misaligned+0x24/0xee [<ffffffff809f4f1a>] handle_exception+0x146/0x152 Fixes: b686ecd ("riscv: misaligned: Restrict user access to kernel memory") Fixes: 4413815 ("riscv: misaligned: remove CONFIG_RISCV_M_MODE specific code") Signed-off-by: Zong Li <[email protected]> Signed-off-by: Nylon Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 663d0c1 ] The vop freq_to_gear() may return a gear greater than the negotiated max gear. Return the negotiated max gear if the mapped gear is greater. Fixes: c02fe9e ("scsi: ufs: qcom: Implement the freq_to_gear_speed() vop") Signed-off-by: Ziqi Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reported-by: Neil Armstrong <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Tested-by: Neil Armstrong <[email protected]> Reviewed-by: Bean Huo <[email protected]> Tested-by: Loïc Minier <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 8c5bcb3 ] On some platforms, the devfreq OPP freq may be different than the unipro core clock freq. Implement ufs_qcom_opp_freq_to_clk_freq() and use it to find the unipro core clk freq. Fixes: c02fe9e ("scsi: ufs: qcom: Implement the freq_to_gear_speed() vop") Signed-off-by: Can Guo <[email protected]> Co-developed-by: Ziqi Chen <[email protected]> Signed-off-by: Ziqi Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reported-by: Luca Weiss <[email protected]> Closes: https://lore.kernel.org/linux-arm-msm/[email protected]/ Tested-by: Luca Weiss <[email protected]> Reviewed-by: Bean Huo <[email protected]> Tested-by: Loïc Minier <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 7831003 ] Prevent calling phy_exit() before phy_init() to avoid abnormal power count and the following warning during boot up. [5.146763] phy phy-1d80000.phy.0: phy_power_on was called before phy_init Fixes: 7bac656 ("scsi: ufs: qcom: Power off the PHY if it was already powered on in ufs_qcom_power_up_sequence()") Signed-off-by: Nitin Rawat <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Konrad Dybcio <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit ff0045d ] RPM manipulation in hda_codec_probe_complete()'s error path is superfluous and leads to RPM usage count underflow if the build-controls operation fails. hda_codec_probe_complete() is called in: 1) hda_codec_probe() for all non-HDMI codecs 2) in card->late_probe() for HDMI codecs Error path for hda_codec_probe() takes care of bus' RPM already. For 2) if late_probe() fails, ASoC performs card cleanup what triggers hda_codec_remote() - same treatment is in 1). Fixes: b5df2a7 ("ASoC: codecs: Add HD-Audio codec driver") Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 9ad1f3c ] The procedure handling IPC timeouts and EXCEPTION_CAUGHT notification shall cancel any D0IX work before proceeding with DSP recovery. If SET_D0IX called from delayed_work is the failing IPC the procedure will deadlock. Conditionally skip cancelling the work to fix that. Fixes: 335c4cb ("ASoC: Intel: avs: D0ix power state support") Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 318c9ee ] Starting with LNL platform, Intel HDAudio Links carry IDs specifying non-HDAudio transfer type they help facilitate e.g.: 0xC0 for I2S as defined by AZX_REG_ML_LEPTR_ID_INTEL_SSP. The mechanism accounts for LEPTR register as it is Reserved if LCAP.ALT for given Link equals 0. Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Acked-by: Liam Girdwood <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Stable-dep-of: 347c8d6 ("ASoC: Intel: avs: Fix PPLCxFMT calculation") Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 7166437 ] Starting from LNL platform the so-called non-HDAudio transfer types, e.g.: I2S/DMIC, utilize HDAudio LINK DMA rather than GPDMA for the data streaming. In essence, all transfer types now utilize HDAudio Link. Most of the existing code can be reused with the major difference being HDAudio Link query method: - fetch the Link by codec.addr in standard HDAudio transfer case - fetch the Link by LEPTR.ID in non-HDAudio transfer case To make the unification happen, store pointer to the Link in dma_data and utilize it in the common code. And to avoid confusion in transfer-type naming between cAVS-ACE 1.x (SkyLake till MeteorLake) and ACE 2.0+ architecture (LunarLake onward), use: - 'hda' for typical HDAudio transfer case - 'nonhda' for non-HDAudio transfer case, cAVS-ACE 1.x - 'althda' for non-HDAudio transfer case, ACE 2.0+ Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Acked-by: Liam Girdwood <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Stable-dep-of: 347c8d6 ("ASoC: Intel: avs: Fix PPLCxFMT calculation") Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 347c8d6 ] HDAudio transfer types utilize SDxFMT for front-end (HOST) and PPLCxFMT for back-end (LINK) side when setting up the stream. BE's substream->runtime duplicates FE runtime so switch to using BE's hw_params to address incorrect format values on the LINK side when FE and BE formats differ. The problem is introduced with commit d070002 ("ASoC: Intel: avs: HDA PCM BE operations") but the code has been shuffled around since then so direct 'Fixes:' tag does not apply. Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 2f78724 ] Search result of avs_dai_find_path_template() shall be verified before being used. As 'template' is already known when avs_hw_constraints_init() is fired, drop the search entirely. Fixes: f2f8474 ("ASoC: Intel: avs: Constrain path based on BE capabilities") Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit acd2563 ] A number of Vendor Specific registers utilized on cAVS architecture (SkyLake till RaptorLake) are not present on ACE hardware (MeteorLake onward). Similarly, certain recommended procedures do not apply. Adjust existing code to be ACE-friendly. Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Acked-by: Liam Girdwood <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Stable-dep-of: 9e3285b ("ASoC: Intel: avs: Fix paths in MODULE_FIRMWARE hints") Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit b9a3ec6 ] Starting with LunarLake (LNL) and onward, some hardware capabilities are visible to the sound driver directly. At the same time, these may no longer be visible to the AudioDSP firmware. Update resource allocation function to rely on the registers when possible. Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Acked-by: Liam Girdwood <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Stable-dep-of: 9e3285b ("ASoC: Intel: avs: Fix paths in MODULE_FIRMWARE hints") Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 75f3c60 ] The firmware status and error registers are not part of SRAM on ACE platforms. As these registers take part in IPC on ACE and cAVS platforms both, relocate the field denoting their offset to Host-IPC descriptor. In consequence, code remains cohesive with the ACE specs while still maintaining high readability for the cAVS platforms. Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Acked-by: Liam Girdwood <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Stable-dep-of: 9e3285b ("ASoC: Intel: avs: Fix paths in MODULE_FIRMWARE hints") Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 9e3285b ] The binaries for cAVS architecture are located in "intel/avs" subdirectory, not "intel". Fixes: 94aa347 ("ASoC: Intel: avs: Add MODULE_FIRMWARE to inform about FW") Reviewed-by: Cezary Rojewski <[email protected]> Signed-off-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 5f342ae ] All memory operations shall be checked. Fixes: f2f8474 ("ASoC: Intel: avs: Constrain path based on BE capabilities") Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 93e246b ] The first element of the returned array stores its length. If it is 0, any manipulation beyond the element at index 0 ends with null-ptr-deref. Fixes: 5a565ba ("ASoC: Intel: avs: Probing and firmware tracing over debugfs") Reviewed-by: Amadeusz Sławiński <[email protected]> Signed-off-by: Cezary Rojewski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit bae071a ] The removed dai_link->platform component cause a fail which is exposed at runtime. (ex: when a sound tool is used) This patch re-adds the dai_link->platform component to have a full card registered. Before this patch: $ aplay -l **** List of PLAYBACK Hardware Devices **** card 1: HDMI [HDMI], device 0: HDMI snd-soc-dummy-dai-0 [] Subdevices: 1/1 Subdevice #0: subdevice #0 $ speaker-test -D plughw:1,0 -t sine speaker-test 1.2.8 Playback device is plughw:1,0 Stream parameters are 48000Hz, S16_LE, 1 channels Sine wave rate is 440.0000Hz Playback open error: -22,Invalid argument After this patch which restores the platform component: $ aplay -l **** List of PLAYBACK Hardware Devices **** card 0: HDMI [HDMI], device 0: HDMI snd-soc-dummy-dai-0 [HDMI snd-soc-dummy-dai-0] Subdevices: 0/1 Subdevice #0: subdevice #0 -> Resolve the playback error. Fixes: 3b0db24 ("ASoC: ti: remove unnecessary dai_link->platform") Signed-off-by: Yuuki NAGAO <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 334d7c4 ] If iov_offset is non-zero, then we need to consider iov_offset in length calculation, otherwise we might pass smaller IOs such as 512 bytes, in below scenario [1]. This issue is reproducible using lib-uring test/fixed-seg.c application with fixed buffer on a 512 LBA formatted device. [1] At present we pass the alignment check, for 512 LBA formatted devices, len_mask = 511 when IO is smaller, i->count = 512 has an offset, i->io_offset = 3584 with bvec values, bvec->bv_offset = 256, bvec->bv_len = 3840. In short, the first 256 bytes are in the current page, next 256 bytes are in the another page. Ideally we expect to fail the IO. I can think of 2 userspace scenarios where we experience this. a: From userspace, we observe a different behaviour when device LBA size is 512 vs 4096 bytes. For 4096 LBA formatted device, I see the same liburing test [2] failing, whereas 512 the test passes without this. This is reproducible everytime. [2] https://github.com/axboe/liburing/ b: Although I was not able to reproduce the below condition, but I suspect below case should be possible from user space for devices with 512 LBA formatted device. Lets say from userspace while allocating a virtually single chunk of memory, if we get 2 physical chunk of memory, and IO happens to be at the boundary of first physical chunk with length crossing first chunk, then we allow IOs to proceed and hence we might map wrong physical address length and proceed with IO rather than failing. : --- a/test/fixed-seg.c : +++ b/test/fixed-seg.c : @@ -64,7 +64,7 @@ static int test(struct io_uring *ring, int fd, int : vec_off) : return T_EXIT_FAIL; : } : : - ret = read_it(ring, fd, 4096, vec_off); : + ret = read_it(ring, fd, 4096, 7*512 + 256); : if (ret) { : fprintf(stderr, "4096 0 failed\n"); : return T_EXIT_FAIL; Effectively this is a write crossing the page boundary. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2263639 ("iov_iter: streamline iovec/bvec alignment iteration") Reviewed-by: Jens Axboe <[email protected]> Reviewed-by: Anuj Gupta <[email protected]> Signed-off-by: Nitesh Shetty <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

Signed-off-by: Piotr Gorski <[email protected]>

This reverts commit 891fe0d. Signed-off-by: Eric Naim <[email protected]>

Signed-off-by: Piotr Gorski <[email protected]>

drm_sched_backend_ops.run_job() returns a dma_fence for the scheduler. That fence is signalled by the driver once the hardware completed the associated job. The scheduler does not increment the reference count on that fence, but implicitly expects to inherit this fence from run_job(). This is relatively subtle and prone to misunderstandings. This implies that, to keep a reference for itself, a driver needs to call dma_fence_get() in addition to dma_fence_init() in that callback. It's further complicated by the fact that the scheduler even decrements the refcount in drm_sched_run_job_work() since it created a new reference in drm_sched_fence_scheduled(). It does, however, still use its pointer to the fence after calling dma_fence_put() - which is safe because of the aforementioned new reference, but actually still violates the refcounting rules. Move the call to dma_fence_put() to the position behind the last usage of the fence. Suggested-by: Danilo Krummrich <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Reviewed-by: Danilo Krummrich <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

This reverts commit 44d2f31. The function drm_sched_job_arm() is indeed the point of no return. The background is that it is nearly impossible for the driver to correctly retract the fence and signal it in the order enforced by the dma_fence framework. The code in drm_sched_job_cleanup() is for the purpose to cleanup after the job was armed through drm_sched_job_arm() *and* processed by the scheduler. We can certainly improve the documentation, but removing the warning is clearly not a good idea. Signed-off-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

The documentation for drm_sched_job_arm() and especially drm_sched_job_cleanup() does not make it very clear why drm_sched_job_arm() is a point of no return, which it indeed is. Make the nature of drm_sched_job_arm() in the docu as clear as possible. Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

…ests Implement a mock scheduler backend and add some basic test to exercise the core scheduler code paths. Mock backend (kind of like a very simple mock GPU) can either process jobs by tests manually advancing the "timeline" job at a time, or alternatively jobs can be configured with a time duration in which case they get completed asynchronously from the unit test code. Core scheduler classes are subclassed to support this mock implementation. The tests added are just a few simple submission patterns. Signed-off-by: Tvrtko Ursulin <[email protected]> Suggested-by: Philipp Stanner <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Add a very simple timeout test which submits a single job and verifies that the timeout handling will run if the backend failed to complete the job in time. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Add some basic tests for exercising entity priority handling. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Add a basic test for exercising modifying the entities scheduler list at runtime. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Add a basic test for checking whether scheduler respects the configured credit limit. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

When an entity from application B is killed, drm_sched_entity_kill() removes all jobs belonging to that entity through drm_sched_entity_kill_jobs_work(). If application A's job depends on a scheduled fence from application B's job, and that fence is not properly signaled during the killing process, application A's dependency cannot be cleared. This leads to application A hanging indefinitely while waiting for a dependency that will never be resolved. Fix this issue by ensuring that scheduled fences are properly signaled when an entity is killed, allowing dependent applications to continue execution. Signed-off-by: Lin.Cao <[email protected]> Reviewed-by: Philipp Stanner <[email protected]> Signed-off-by: Christian König <[email protected]> Link: https://lore.kernel.org/r/[email protected]

To make evaluating different scheduling policies easier (no need for external benchmarks) and perfectly repeatable, lets add some synthetic workloads built upon mock scheduler unit test infrastructure. Focus is on two parallel clients (two threads) submitting different job patterns and logging their progress and some overall metrics. This is repeated for both scheduler credit limit 1 and 2. Example test output: Normal and low: pct1 cps1 qd1; pct2 cps2 qd2 + 0ms: 0 0 0; 0 0 0 + 104ms: 100 1240 112; 100 1240 125 + 209ms: 100 0 99; 100 0 125 + 313ms: 100 0 86; 100 0 125 + 419ms: 100 0 73; 100 0 125 + 524ms: 100 0 60; 100 0 125 + 628ms: 100 0 47; 100 0 125 + 731ms: 100 0 34; 100 0 125 + 836ms: 100 0 21; 100 0 125 + 939ms: 100 0 8; 100 0 125 + 1043ms: ; 100 0 120 + 1147ms: ; 100 0 107 + 1252ms: ; 100 0 94 + 1355ms: ; 100 0 81 + 1459ms: ; 100 0 68 + 1563ms: ; 100 0 55 + 1667ms: ; 100 0 42 + 1771ms: ; 100 0 29 + 1875ms: ; 100 0 16 + 1979ms: ; 100 0 3 0: prio=normal sync=0 elapsed_ms=1015ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,222,978 us latency_time(min,avg,max)=134,222,978 us 1: prio=low sync=0 elapsed_ms=2009ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,215,806 us latency_time(min,avg,max)=134,215,806 us There we have two clients represented in the two respective columns, with their progress logged roughly every 100 milliseconds. The metrics are: - pct - Percentage progress of the job submit part - cps - Cycles per second - qd - Queue depth - number of submitted unfinished jobs The cycles per second metric is inherent to the fact that workload patterns are a data driven cycling sequence of: - Submit 1..N jobs - Wait for Nth job to finish (optional) - Sleep (optional) - Repeat from start In this particular example we have a normal priority and a low priority clients both spamming the scheduler with 8ms jobs with no sync and no sleeping. Hence they build a very deep queues and we can see how the low priority client is completely starved until the normal finishes. Note that the PCT and CPS metrics are irrelevant for "unsync" clients since they manage to complete all of their cycles instantaneously. A different example would be: Heavy and interactive: pct1 cps1 qd1; pct2 cps2 qd2 + 0ms: 0 0 0; 0 0 0 + 106ms: 5 40 3; 5 40 0 + 209ms: 9 40 0; 9 40 0 + 314ms: 14 50 3; 14 50 0 + 417ms: 18 40 0; 18 40 0 + 522ms: 23 50 3; 23 50 0 + 625ms: 27 40 0; 27 40 1 + 729ms: 32 50 0; 32 50 0 + 833ms: 36 40 1; 36 40 0 + 937ms: 40 40 0; 40 40 0 + 1041ms: 45 50 0; 45 50 0 + 1146ms: 49 40 1; 49 40 1 + 1249ms: 54 50 0; 54 50 0 + 1353ms: 58 40 1; 58 40 0 + 1457ms: 62 40 0; 62 40 1 + 1561ms: 67 50 0; 67 50 0 + 1665ms: 71 40 1; 71 40 0 + 1772ms: 76 50 0; 76 50 0 + 1877ms: 80 40 1; 80 40 0 + 1981ms: 84 40 0; 84 40 0 + 2085ms: 89 50 0; 89 50 0 + 2189ms: 93 40 1; 93 40 0 + 2293ms: 97 40 0; 97 40 1 In this case client one is submitting 3x 2.5ms jobs, waiting for the 3rd and then sleeping for 2.5ms (in effect causing 75% GPU load, minus the overheads). Second client is submitting 1ms jobs, waiting for each to finish and sleeping for 9ms (effective 10% GPU load). Here we can see the PCT and CPS reflecting real progress. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Cc: Pierre-Eric Pelloux-Prayer <[email protected]> Acked-by: Christian König <[email protected]>

This time round we explore the rate of submitted job queue processing with multiple identical parallel clients. Example test output: 3 clients: t cycle: min avg max : ... + 0ms 0 0 0 : 0 0 0 + 102ms 2 2 2 : 2 2 2 + 208ms 5 6 6 : 6 5 5 + 310ms 8 9 9 : 9 9 8 ... + 2616ms 82 83 83 : 83 83 82 + 2717ms 83 83 83 : 83 83 83 avg_max_min_delta(x100)=60 Every 100ms for the duration of the test test logs how many jobs each client had completed, prefixed by minimum, average and maximum numbers. When finished overall average delta between max and min is output as a rough indicator to scheduling fairness. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Cc: Pierre-Eric Pelloux-Prayer <[email protected]> Acked-by: Christian König <[email protected]>

Move work queue allocation into a helper for a more streamlined function body. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Currently the job free work item will lock sched->job_list_lock first time to see if there are any jobs, free a single job, and then lock again to decide whether to re-queue itself if there are more finished jobs. Since drm_sched_get_finished_job() already looks at the second job in the queue we can simply add the signaled check and have it return the presence of more jobs to free to the caller. That way the work item does not have to lock the list again and repeat the signaled check. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Reduce to one spin_unlock for hopefully a little bit clearer flow in the function. It may appear that there is a behavioural change with the drm_sched_start_timeout_unlocked() now not being called if there were initially no jobs on the pending list, and then some appeared after unlock, however if the code would rely on the TDR handler restarting itself then it would fail to do that if the job arrived on the pending list after the check. Also fix one stale comment while touching the function. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Extract out two copies of the identical code to function epilogue to make it smaller and more readable. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Round-robin being the non-default policy and unclear how much it is used, we can notice that it can be implemented using the FIFO data structures if we only invent a fake submit timestamp which is monotonically increasing inside drm_sched_rq instances. So instead of remembering which was the last entity the scheduler worker picked, we can bump the picked one to the bottom of the tree, achieving the same round-robin behaviour. Advantage is that we can consolidate to a single code path and remove a bunch of code. Downside is round-robin mode now needs to lock on the job pop path but that should not be visible. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Move the code dealing with entities entering and exiting run queues to helpers to logically separate it from jobs entering and exiting entities. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Lets move all the code dealing with struct drm_sched_rq into a separate compilation unit. Advantage being sched_main.c is left with a clearer set of responsibilities. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

To implement fair scheduling we will need as accurate as possible view into per entity GPU time utilisation. Because sched fence execution time are only adjusted for accuracy in the free worker we need to process completed jobs as soon as possible so the metric is most up to date when view from the submission side of things. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

To implement fair scheduling we need a view into the GPU time consumed by entities. Problem we have is that jobs and entities objects have decoupled lifetimes, where at the point we have a view into accurate GPU time, we cannot link back to the entity any longer. Solve this by adding a light weight entity stats object which is reference counted by both entity and the job and hence can safely be used from either side. With that, the only other thing we need is to add a helper for adding the job's GPU time into the respective entity stats object, and call it once the accurate GPU time has been calculated. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

There is no need to keep entities with no jobs in the tree so lets remove it once the last job is consumed. This keeps the tree smaller which is nicer and more efficient as entities are removed and re-added on every popped job. Apart from that, the upcoming fair scheduling algorithm will rely on the tree only containing runnable entities. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Fair scheduling policy is built upon the same concepts as the well known CFS kernel scheduler - entity run queue is sorted by the virtual GPU time consumed by entities in a way that the entity with least vruntime runs first. It is able to avoid total priority starvation, which is one of the problems with FIFO, and it also eliminates the need for per priority run queues. As it scales the actual GPU runtime by an exponential factor as the priority decreases, therefore the virtual runtime for low priority entities grows faster than for normal priority, pushing them further down the runqueue order for the same real GPU time spent. Apart from this fundamental fairness, fair policy is especially strong in oversubscription workloads where it is able to give more GPU time to short and bursty workloads when they are running in parallel with GPU heavy clients submitting deep job queues. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Cc: Pierre-Eric Pelloux-Prayer <[email protected]>

If the new fair policy is at least as good as FIFO and we can afford to remove round-robin, we can simplify the scheduler code by making the scheduler to run queue relationship always 1:1 and remove some code. Also, now that the FIFO policy is gone the tree of entities is not a FIFO tree any more so rename it to just the tree. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

There is no reason to queue just a single job if scheduler can take more and re-queue the worker to queue more. We can simply feed the hardware with as much as it can take in one go and hopefully win some latency. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

Now that the run queue to scheduler relationship is always 1:1 we can embed it (the run queue) directly in the scheduler struct and save on some allocation error handling code and such. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]>

kuba-moo and others added 30 commits June 19, 2025 15:40

sirlucjan and others added 29 commits June 24, 2025 14:06

iosched-6.15: back to v1.5.8

05f81de

Signed-off-by: Piotr Gorski <[email protected]>

iosched-6.15: bump ADIOS to v1.5.12

896330e

Signed-off-by: Piotr Gorski <[email protected]>

Revert "HID: amd_sfh: Add support for tablet mode switch sensors"

f7538cd

This reverts commit 891fe0d. Signed-off-by: Eric Naim <[email protected]>

iosched-6.15: bump ADIOS to v2.0.0

37230c0

Signed-off-by: Piotr Gorski <[email protected]>

aviallon force-pushed the aviallon/6.15/drm-fair-scheduler branch from aef20a9 to 7677fde Compare June 24, 2025 12:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: backport drm fair scheduler to 6.15 #6

WIP: backport drm fair scheduler to 6.15 #6

Uh oh!

aviallon commented Jun 24, 2025 •

edited

Loading

Uh oh!

Uh oh!

WIP: backport drm fair scheduler to 6.15 #6

Are you sure you want to change the base?

WIP: backport drm fair scheduler to 6.15 #6

Uh oh!

Conversation

aviallon commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

aviallon commented Jun 24, 2025 •

edited

Loading