https://github.com/modular/mojo/pull/3945
The very short version of the proposal, since I know there has been some confusion about parts of it. I’m going to assume some familiarity with Rust’s async model, since I think my explanations of it were part of what made the initial proposal hard to understand.
Start with Rust’s async model. Then use move constructors to throw out Pin, one of the pain points of Rust async, since Mojo can use them to handle self-referential structs. After that, use linear types to enable sound scoped tasks, which Rust can’t do because it can’t guarantee that sub-tasks complete before the current task does (explanation by Conrad Ludgate). For me, this fixes most of the language-level issues with async, since we can then design a “try-await” that handles cancellation at a later point.
Now, I think the ecosystem is the cause of a lot of the issues that people attribute to Rust’s async design. The inability to have scoped tasks means that many, many API bounds require Send + Sync + 'static, and in my opinion this bound is the leading cause of Arc<AsyncMutex<T>> proliferation in Rust. The bound exists because of work-stealing executors, which can move tasks between threads at any suspension point. Executors that do not perform work stealing, such as Glommio, do not have this issue, since the compiler can provide much better lifetimes for things that don’t cross thread boundaries. However, work stealing is very useful when you don’t know exactly how much CPU time your tasks will take, which covers most web development. This is why I want the default to be “keep this coroutine on this thread”, with separate APIs for “pick a thread to run this whole thing on” and “this can swap between threads whenever it wants”. That should keep most of the benefits of work stealing without imposing annoying lifetime requirements.
Additionally, I want to have pluggable “subsystems” which can handle various types of async operations: for example, one for async CUDA operations, one for epoll, one for io_uring, one for kqueue, and so on. These can be enabled as needed, but critically they can be added by libraries, and this is part of why wakers need to exist: so that we don’t need to expose internal details of the stdlib executor to these subsystems. This preserves our ability to evolve the stdlib async executor over time.

If Mojo doesn’t have this, and async IO needs to be built directly into the language runtime, Mojo will end up with the Go problem, where we can’t adopt new APIs quickly (Go still can’t use io_uring), and every odd piece of hardware will need support in the stdlib. That means everything from Intel DLB (essentially a hardware offload for actor-system queue management), to RDMA, to POSIX AIO, to io_uring would need to live inside the stdlib. Another good example is the Nvidia BlueField-3 Data Path Accelerator, a collection of 16 RV64IMAC cores with 16-way SMT (yes, you read that right: 256 hardware threads) that use hardware cooperative scheduling. This is quite literally hardware designed to run async code. But building a single async executor that can handle that, and hard real time in an RTOS on a small SBC or microcontroller, and a soft core on an FPGA which gets fully deactivated until IO is ready, while still being usable on a quad-socket x86 server with 512 cores, is something I think should be avoided.

Shipping a standard executor is fine, but either it needs to handle every wacky use-case thrown at it, or we need the ability to swap it out. I think the best path forward is the ability to swap out components, with libraries built against a capability-based interface so that they still function on alternative executors.