So I just pushed a kernel fix for Asahi Linux to (hopefully) fix random kernel panics.
The fix? Increase kernel stacks to 32K.
We were running out of stack. It turns out that when you have zram enabled and are running out of physical RAM, a memory allocation can trigger a ridiculous call-chain through zram and back into the allocator. This, combined with one or two large-ish stack frames in our GPU driver (2-3K), was simply overflowing the kernel stack.
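For context, the arm64 kernel stack size comes from THREAD_SHIFT in arch/arm64/include/asm/memory.h, so a bump like this is essentially a one-line change. A simplified sketch, not the exact patch (the surrounding defines vary by kernel version, and e.g. KASAN doubles the stack on top of this):

```c
/* arch/arm64/include/asm/memory.h (simplified sketch, not the exact patch) */

/* was (14 + KASAN_THREAD_SHIFT): 2^14 = 16K stacks */
#define MIN_THREAD_SHIFT	(15 + KASAN_THREAD_SHIFT)	/* 2^15 = 32K stacks */
```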
Here's the thing though: if we were hitting this with simple GPU stuff (which, yes, has a few large stack frames because Rust, but it's a shallow call stack, and all it does is a regular memory allocation that triggers the rest of the chain all the way to the overflow), I guarantee there are kernel call paths that would also run out of stack, today, in upstream kernels with zram (i.e. vanilla Fedora setups).
I'm honestly baffled that, in this day and age, 1) people still think 16K is acceptable, and 2) we still haven't figured out dynamically sized Linux kernel stacks. If we're so close to the edge that a couple KB of extra stack from Rust nonsense causes kernel panics, then long-tail corner cases of complex subsystem layering are already going over the edge, and people's machines are already crashing, just perhaps less often.
I know there was talk of dynamic kernel stacks recently, and one of the issues was that implementing them is hard on x86 due to a series of bad decisions made many years ago, including the double-fault model and the fact that the x86 CPU implicitly pushes to the stack on faults. Of course, none of this is a problem for ARM64, so maybe we should just implement it here first and let the x86 people figure something out for their architecture on their own ;).
But on the other hand, why not increase stacks to 32K? ARM64 got bumped to 16K in 2013, over 10 years ago. Minimum RAM size has at least doubled since then, so it stands to reason that doubling the kernel stack size is entirely acceptable. Consider a typical GUI app with ~30 threads: with 32K stacks, that's 30 × 32K = 960K, less than 1MB of RAM, and any random GUI app is already going to use many times that in graphics surfaces alone.
Of course, the hyperscalers will complain because they run services that spawn a billion threads (hi Java) and they like to multiply the RAM usage increase by the size of their fleet to justify their opinions (even though all of this is inherently relative anyway). But the hyperscalers are running custom kernels anyway, so they can crank the size down to 16K if they really want to (or 8K, I heard Google still uses that).
@marcan you should be able to get a compile-time warning for functions that use excessive stack frames by setting CONFIG_FRAME_WARN to a lower value. The default for arm64 is 2048 bytes, but around 1300 is probably a better cut-off to see the worst offenders without too much output overall.
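For anyone following along, this is a plain Kconfig knob (under "Kernel hacking"), and the build feeds it to GCC as -Wframe-larger-than=, so oversized frames show up as ordinary compile-time warnings:

```
# .config (or via menuconfig: Kernel hacking -> Warn for stack frames larger than)
CONFIG_FRAME_WARN=1300
# becomes -Wframe-larger-than=1300, producing warnings of the form:
#   warning: the frame size of NNNN bytes is larger than 1300 bytes [-Wframe-larger-than=]
```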
It looks like we are missing a warning flag for the Rust compiler, which I would have expected to complain about a >2K stack frame. I tried passing -Cllvm-args=-fwarn-stack-size=2048, but that doesn't work.
@arnd The bigger question is how do you figure out *what* is bloating the stack. But yeah, I don't even know if there is a working warning flag for Rust.
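One standard tool that can answer the "what is bloating the stack" question at runtime is the ftrace stack tracer (assuming a kernel built with CONFIG_STACK_TRACER=y); it records the deepest stack it has seen, with a per-frame byte breakdown:

```
echo 1 > /proc/sys/kernel/stack_tracer_enabled
# ... run the workload ...
cat /sys/kernel/tracing/stack_max_size   # deepest stack observed, in bytes
cat /sys/kernel/tracing/stack_trace      # which frames contributed, and how much
```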
My point still stands though, if being a few kB deep into the stack and doing a GFP_KERNEL alloc panics, we have bigger problems lurking than our Rust stacks. I'm pretty sure I could repro this entirely upstream sans Rust with the right setup. The Rust side might have been a few chunky frames, but there were dozens of frames above that as the kernel took a detour into the allocator, through zram, back into the allocator, and into a stack overflow panic.
@marcan I did some simple analysis earlier this year, playing around with a patch that artificially shrinks the stack at runtime, and then running various workloads until they crashed. The common theme was clearly getting into memory reclaim from a deep call chain. One idea I had was to change the slab allocator so it would do the reclaim in a separate thread with a fresh stack, but I did not investigate further at that point. Maybe @vbabka has some other ideas here.
@vbabka right, it's not slab but __alloc_pages_direct_reclaim() that is in most call chains, e.g. https://pastebin.com/raw/KZWvmhNB for a typical syzkaller report with reduced stack.
The question is whether we can force this to take an asynchronous path (kswapd or something new) all the time to avoid stacking a random fs/blkdev call chain on top of a random kmalloc/alloc_pages/... call.
fw_get_filesystem_firmware() is similarly responsible for most of the other stack overruns in syzkaller, followed by nl80211().
@vbabka right, only got one backtrace that ends up in ext4 from __alloc_pages_direct_reclaim, but not in writeback: https://pastebin.com/raw/juKwfnBM
Not sure what happens with swap files (instead of a partition), would that go through fs code?
I guess we could detect constrained threads in __perform_reclaim() by checking the amount of free space left on the stack, and, if it's too low, call try_to_free_pages() via queue_work_on(system_unbound_wq, ...); wait_for_completion(); instead (rough sketch below).
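A minimal sketch of that idea (hypothetical code: stack_free_bytes() and STACK_RESERVE are invented here for illustration, and the real __perform_reclaim() carries more state such as PF_MEMALLOC handling and PSI accounting):

```c
#include <linux/completion.h>
#include <linux/sched/task_stack.h>
#include <linux/workqueue.h>

#define STACK_RESERVE	(4 * 1024)	/* invented low-water mark */

struct reclaim_work {
	struct work_struct work;
	struct zonelist *zonelist;
	nodemask_t *nodemask;
	gfp_t gfp_mask;
	unsigned int order;
	unsigned long progress;
	struct completion done;
};

/* Free bytes below the stack pointer; assumes a downward-growing stack. */
static unsigned long stack_free_bytes(void)
{
	return current_stack_pointer - (unsigned long)end_of_stack(current);
}

static void reclaim_workfn(struct work_struct *work)
{
	struct reclaim_work *rw = container_of(work, struct reclaim_work, work);

	/* Workqueue workers start out with a nearly empty stack. */
	rw->progress = try_to_free_pages(rw->zonelist, rw->order,
					 rw->gfp_mask, rw->nodemask);
	complete(&rw->done);
}

/* Would be called from __perform_reclaim() in place of try_to_free_pages(). */
static unsigned long reclaim_with_fresh_stack(gfp_t gfp_mask, unsigned int order,
					      struct zonelist *zonelist,
					      nodemask_t *nodemask)
{
	struct reclaim_work rw = {
		.zonelist = zonelist,
		.nodemask = nodemask,
		.gfp_mask = gfp_mask,
		.order = order,
	};

	/* Plenty of stack left: reclaim inline, as today. */
	if (stack_free_bytes() >= STACK_RESERVE)
		return try_to_free_pages(zonelist, order, gfp_mask, nodemask);

	/* Low on stack: bounce reclaim to a worker with a fresh stack. */
	INIT_WORK_ONSTACK(&rw.work, reclaim_workfn);
	init_completion(&rw.done);
	queue_work(system_unbound_wq, &rw.work);
	wait_for_completion(&rw.done);
	destroy_work_on_stack(&rw.work);
	return rw.progress;
}
```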