<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 19, 2024, 3:51 PM ron minnich <<a href="mailto:rminnich@gmail.com">rminnich@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">to reiterate: you can avoid resetting the machine, and for all the x million systems in data centers around the world, we do avoid resetting the machine. <div><br></div><div>But that comes with its own set of issues, including kernel version x not being able to boot kernel version y (very common with linux, no problem on plan 9); and hardware not behaving well, since few people write drivers that properly reset hardware; or the hardware can't be cleaned up absent a reset (most common problem areas are NICs and graphics). Very few linux drivers can properly shut down hardware for a kexec. The IOMMU and MSIx added a whole new world of fun. Sometimes it feels like it works by accident. </div><div><br></div><div>Also recall that side channels across a kexec are an issue that has to be considered. At Google we've considered them by, e.g., turning on "zero on free" and "zero on alloc" in the kernel that will kexec, only using a small amount of memory in the first kernel (32GiB is small! Ha!), among other things. But since DRAM SPD and Voltage regulator module FLASH provide places to hide things, it's getting messy.<br></div><div><br></div><div>So it's not as simple as "reset bad, not reset good". Hardware is poorly designed, or the drivers are poorly written, and absent a reset, the kexec'ed kernel may fail to boot -- lockup is common, panic is common. That said, no system I know of implements kexec with a reset in the middle. </div><div><br></div><div>Short history: in the Linux world, kernel boots kernel was done, in 1999, by LOBOS (me) and Eric Hendriks (Two Kernel Monte), and Alpha Power (DBLX). Werner Almesberger did his own thing ca. 
2000 called (iirc) bootimg. Eric Biederman looked at LOBOS, did not like it, and wrote kexec, I believe around 2001. Plan 9 got kernel boots kernel around that time. As usual, the Plan 9 implementation was the most compact and cleanest. This paper <a href="https://ieeexplore.ieee.org/document/1392643" target="_blank" rel="noreferrer">https://ieeexplore.ieee.org/document/1392643</a> compares them.</div><div><br></div><div>The AlphaPower DBLX code was lost when the company went under. They made a heroic effort to get it to sourceforge but things happened too fast. DBLX means "direct boot linux" -- the acronym reads better.<br></div><div><br></div><div>The first kexec was a very general interface, with a Himalayan learning curve. At some point an Intel engineer found kexec confusing and wrote an entirely new type of kexec, with a different API, that many people found easier. </div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">The kexec I'm using is like that. "Load the memory you want and give us a start address." is all the instructions you get. God speed. Best of luck. Have fun storming the castle.</div><div dir="auto"><br></div><div dir="auto">I had to read a ton of code to find the details. Then I needed to set it up like FreeBSD's regular loaders do. And once I guessed wrong about 200 times, I was up and limping. 
I definitely wouldn't call kexec easy to learn or code to...</div><div dir="auto"><br></div><div dir="auto">And it was a blast...</div><div dir="auto"><br></div><div dir="auto">Warner</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>so kexec has been around for 20 years, and we're still getting the hang of it, and there are still people who claim that it will never fully work.</div><div><br></div><div>Anyway, we're far afield of the original question, but it was very interesting to read how far back the idea goes! PDP-7, who knew?</div><div><br></div><div>p.s. as to an unrelated discussion: kernels have been self modifying code since at least module loaders became a thing -- that's almost 40 years. Today, especially for risc-v, Linux is aggressively self-modifying; there's no option for some risc-v SoC if you want them to work correctly. Linux rewrites the entire kernel text in early boot stages. You can consider that the last stage of linker optimization occurs in Linux early boot code.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 19, 2024 at 12:13 AM Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank" rel="noreferrer">imp@bsdimp.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 19, 2024, 1:05 AM Bakul Shah via TUHS <<a href="mailto:tuhs@tuhs.org" target="_blank" rel="noreferrer">tuhs@tuhs.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Can you not avoid resetting the machine? This can be treated almost as sleep in the old kernel, wakeup in the new one! 
You do have to reset devices individually (which may not always work if it requires assistance from some undocumented firmware).</div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Kexec does just this. The new kernel boots without going through the reset vector. The old kernel keeps a tiny bit of code around that tears down all the protections, etc and hands off to the new kernel a mostly reset machine.. but it doesn't go through the firmware to do it... it was the original reason for it in linux: fast reboot times.</div><div dir="auto"><br></div><div dir="auto">Warner</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><blockquote type="cite"><div>On Sep 18, 2024, at 4:58 PM, ron minnich <<a href="mailto:rminnich@gmail.com" rel="noreferrer noreferrer" target="_blank">rminnich@gmail.com</a>> wrote:</div><br><div><div><div dir="ltr">well, yes, on many systems, there's a lot that runs before the kernel. But if you have a risc-v system with oreboot, you own the system. The problem is that on most of these systems a reset will stop the dram clock for a little bit, or glitch clock enable, or dram power, or whatever. New systems are not designed to allow this.<div><br></div><div>Ideally, we could force a reset of everything save memory, but modern systems are not designed in this way. Most annoying.</div></div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Sep 18, 2024 at 4:38 PM Bakul Shah <<a href="mailto:bakul@iitbombay.org" rel="noreferrer noreferrer" target="_blank">bakul@iitbombay.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>I would prefer old kernel to new kernel handoff if it can be made to work reliably. 
Nowadays there are a lot of things that run before the kernel gets control. <br id="m_8370537176490513364m_-2345776952850080622m_-3736241497013644723m_6871379616508925004m_-8242461479569419083lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On Sep 18, 2024, at 3:38 PM, ron minnich <<a href="mailto:rminnich@gmail.com" rel="noreferrer noreferrer" target="_blank">rminnich@gmail.com</a>> wrote:</div><br><div><div dir="ltr">Interesting about the amiga. I'm assuming their firmware zeros memory on reset, so you have to do handoff from kernel to kernel, not via a reset and so on?<div><br></div><div>What was particularly nice about the V6/PDP-11 case: we were able to yank reset, which let us cleanly reset/disable devices, because everything was in memory when we got back. I miss the simplicity of the old machines.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Sep 18, 2024 at 3:07 PM Christian Hopps <<a href="mailto:chopps@chopps.org" rel="noreferrer noreferrer" target="_blank">chopps@chopps.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
We had/have this functionality in the Amiga port of NetBSD.<br>
<br>
It is implemented as a `/dev/reload` device and you copy a kernel image to it. In locore.s there's code that copies the kernel image over top of the running kernel and then restarts. I believe for it to work nothing below the copy code in locore.s can change :)<br>
<br>
Thanks,<br>
Chris.<br>
<br>
Phil Budne <<a href="mailto:phil@ultimate.com" rel="noreferrer noreferrer" target="_blank">phil@ultimate.com</a>> writes:<br>
<br>
> ron minnich wrote:<br>
>> But I'm wondering: is Ed's work in 1977 the first "kernel boots kernel" or<br>
>> was there something before?<br>
><br>
> There was! The PDP-7 UNIX listings contain a program trysys.s<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/trysys.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/trysys.s</a><br>
> that reboots the system by reading a.out into user memory (in the high<br>
> 4K of core), then copying it to low memory and jumping to the entry<br>
> point. The name suggests its original intended use was to test a new<br>
> system (kernel).<br>
><br>
> P.S.<br>
> Normal bootable system images seem to have been stored in reserved<br>
> tracks of the (fixed head) disk (that are inaccessible via system calls):<br>
><br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/maksys.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/maksys.s</a><br>
> reads a.out and uses I/O instructions to write it out.<br>
><br>
> P.P.S.<br>
> Accordingly, I put together a "paper tape" for booting the system:<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/pbboot.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/pbboot.s</a><br>
><br>
> P.P.P.S.<br>
> The system (kernel) is 3K words, the last 1K of low memory<br>
> used for the character table for the vector graphics controller.<br>
><br>
> The definitions for the table are compiled by<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/cas.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/cas.s</a><br>
> from definition file<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/cas.in" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/cas.in</a><br>
> (after, ISTR, figuring out the ordering of the listing pages!)<br>
><br>
> I don't think we ever figured out how the initial character table<br>
> is loaded into core. One thing that was missing from the table<br>
> was the dispatch array, which I recreated:<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/chrtbl.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/chrtbl.s</a><br>
><br>
> The system (kernel) could be built for a "cold start", reloading the<br>
> disk (prone to head crashes?) from paper tape? But I don't think<br>
> anyone ever reconstructed the procedure for rebuilding a disk that way.<br>
><br>
> The disk was two sided, and the running system only used one side:<br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dsksav.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dsksav.s</a><br>
> <a href="https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dskres.s" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dskres.s</a><br>
> appear to be programs to save and restore the filesystem from the<br>
> "other" side of the disk.<br>
<br>
</blockquote></div>
</div></blockquote></div><br></div></blockquote></div>
</div>
</div></blockquote></div><br></div></blockquote></div></div></div>
</blockquote></div>
</blockquote></div></div></div>