https://github.com/modular/mojo/pull/3945
The very short version of the proposal, since I know there has been some confusion about parts of it. I’m going to assume some familiarity with Rust’s async model, since I think my explanations of it were part of what made the initial proposal hard to understand.
Start with Rust’s async model. Then use move constructors to throw out Pin, one of the pain points of Rust async, since Mojo can use them to handle self-referential structs. After that, use linear types to enable sound scoped tasks, which Rust can’t do because it can’t guarantee that sub-tasks complete before the current task does (explanation by Conrad Ludgate). For me, this fixes most of the language-level issues with async, since we can then design a “try-await” that handles cancellation at a later point.
Now, I think the ecosystem is the cause of a lot of the issues that people attribute to Rust’s async design. The inability to have scoped tasks means that many, many API bounds require Send + Sync + 'static, and in my opinion this bound is the leading cause of Arc<AsyncMutex<T>> proliferation in Rust. The bound exists because of work-stealing executors, which can move tasks between threads at any suspension point. Executors that do not perform work stealing, such as Glommio, do not have this issue, since the compiler can provide much better lifetimes for things that don’t cross thread boundaries. However, work stealing is very useful when you don’t know exactly how much CPU time your tasks will take, which covers most web development. This is why I want the default to be “keep this coroutine on this thread”, with separate APIs for “pick a thread to run this whole thing on” and “this can swap between threads whenever it wants”. That should keep most of the benefits of work stealing without imposing annoying lifetime requirements.
Additionally, I want to have pluggable “subsystems” which can handle various types of async operations: for example, one for async CUDA operations, one for epoll, one for io_uring, one for kqueue, and so on. These can be enabled as needed, but critically they can be added by libraries, and this is part of why wakers need to exist: so that we don’t need to expose internal details of the stdlib executor to these subsystems. This preserves our ability to evolve the stdlib async executor over time.

If Mojo doesn’t have this, and async IO needs to be built directly into the language runtime, Mojo will end up with the Go problem, where we can’t adopt new APIs quickly (Go still can’t use io_uring), and every odd piece of hardware will need support in the stdlib. That means everything from Intel DLB (essentially a hardware offload for actor-system queue management), to RDMA, to POSIX AIO, to io_uring would need to live inside the stdlib. Another good example is the Nvidia BlueField-3 Data Path Accelerator, a collection of 16 RV64IMAC cores with 16-way SMT (yes, you read that right: 256 hardware threads) that use hardware cooperative scheduling. This is quite literally hardware designed to run async code. But building a single async executor that can handle that, and hard real time in an RTOS on a small SBC or microcontroller, and a soft core on an FPGA which gets fully deactivated until IO is ready, while still being usable on a quad-socket x86 server with 512 cores, is something I think should be avoided.

Shipping a standard executor is fine, but either it needs to handle every wacky use-case thrown at it, or we need the ability to swap it out. I think the best path forward is the ability to swap out components, with libraries built against a capability-based interface so that they still function on alternative executors.