in-place construction seems surprisingly simple?
— 2024-06-22

introduction

I've been thinking a little bit about self-referential types recently, and one of their requirements is that a type, when constructed, has a fixed location in memory ¹. That's needed because a reference is a pointer to a memory address - and if the value it points to moves to a different address, the pointer is invalidated, which can lead to undefined behavior.

Eric Holk and I toyed around for a little while with the idea of offset-schemes, where we only track offsets from the start of structs. But that gets weird once you involve non-linear memory, such as when using references into the heap. Using real, actual addresses is probably the better solution since that should work in all cases - even if it brings its own set of challenges.

Where this becomes tricky is that returning data from a function constitutes a move. A constructor which creates a type within itself and returns it will change the address of the type it constructs. Take this program, which constructs a type Cat in the function Cat::new (playground):

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let cat = Cat::new(4);
    dbg!(addr_of!(cat));      // ← second call to `addr_of!`
}

If we run this program we get the following output:

[src/main.rs:7:9] addr_of!(this) = 0x00007ffe0b3575d7 # first call
[src/main.rs:14:5] addr_of!(cat) = 0x00007ffe0b357747 # second call

This shows that the address of Cat within Cat::new is different from the address once it has been returned from the function. Returning types from functions means changing addresses. Languages like C++ have ways to work around this by providing something called move constructors, which enable types to update their own internal addresses when they are moved in memory. But instead, what if we could just construct types in-place so they didn't have to move in the first place?
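To make the hazard concrete, here is a minimal sketch of a type that stores a pointer to one of its own fields. The names `Tracked`, `make`, and `self_ptr_is_stale` are made up for this illustration, and the stored pointer is only ever compared, never dereferenced. Moving the value into a `Box` is used because it forces a relocation from the stack to the heap, so the recorded address can no longer match.

```rust
use std::ptr::{self, addr_of};

/// Hypothetical self-referential type: `age_ptr` is meant to point at `age`.
struct Tracked {
    age: u8,
    age_ptr: *const u8,
}

/// Builds a `Tracked` on the stack and records the address of its own field.
fn make(age: u8) -> Tracked {
    let mut this = Tracked { age, age_ptr: ptr::null() };
    this.age_ptr = addr_of!(this.age);
    this
}

/// Returns `true` when the stored pointer no longer matches the field's address.
fn self_ptr_is_stale(value: &Tracked) -> bool {
    value.age_ptr != addr_of!(value.age)
}

fn main() {
    // Moving the value into a `Box` relocates it from the stack to the heap.
    let boxed = Box::new(make(4));
    // The stored pointer still holds the old stack address; we only compare it.
    println!("stale after move: {}", self_ptr_is_stale(&boxed));
}
```

Nothing in the type system flags the stale pointer here, which is why a fixed location has to be guaranteed some other way.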

in-place construction

We already perform in-place construction when desugaring async {} blocks, so this is something we know how to do. And in the ecosystem there are also the moveit and ouroboros crates. These are all great, but they all do additional things like "self-references" or "move constructors". Constructing in-place rather than returning from a function can be useful just for the sake of reducing copies - so let's roll our own version of just that.

The way we can do this is by creating a stable location in memory we can store our value in. Rather than returning a value, a constructor should take a mutable reference to this memory location and write directly into it instead. And because we're starting with a location and only writing into it later, this location needs to be MaybeUninit. If we put those pieces together, we can adapt our earlier example to the following (playground):

use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe {
            addr_of_mut!((*this).age).write(age);
            dbg!(addr_of!(*this));   // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot));      // ← first call to `addr_of!`
    Cat::new(4, &mut slot);
    let cat: &mut Cat = unsafe { slot.assume_init_mut() };
    dbg!(addr_of!(*cat));      // ← third call to `addr_of!`
}

If we run the program it will print the following:

[src/main.rs:15:5] addr_of!(slot) = 0x00007ffc9daa590f  # first call
[src/main.rs:9:9] addr_of!(*this) = 0x00007ffc9daa590f  # second call
[src/main.rs:18:5] addr_of!(*cat) = 0x00007ffc9daa590f  # third call

To folks who aren't used to writing unsafe Rust this might look a little overwhelming. But what we've done is a fairly mechanical translation. Rather than returning the type Self, we've created a MaybeUninit<Self> and passed it by-reference. The constructor then writes into it, initializing the memory. From that point onward, all references to Cat are valid and can be assumed to be initialized.

Unfortunately, calling assume_init to get an owned Cat out of the slot defeats the purpose, because the compiler treats that as a move - which makes sense since it takes one type by value and returns another. But that's mostly a limitation of how we're doing things - not what we're doing.
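One possible way around that limitation, assuming a heap allocation is acceptable, is to put the slot in a `Box` and reinterpret the allocation once it has been initialized. This is only a sketch: `boxed_cat` is a made-up helper, and it relies on `MaybeUninit<Cat>` having the same layout as `Cat`. The box is owned and can be passed around, while the `Cat` itself stays at the address it was constructed at.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Cat {
    age: u8,
}

impl Cat {
    /// Same out-pointer style constructor as in the example above.
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        // SAFETY: `this` points to writable memory that is valid for a `Cat`.
        unsafe { addr_of_mut!((*this).age).write(age) };
    }
}

/// Hypothetical helper: constructs a `Cat` directly in a heap slot and
/// returns the slot's address alongside the owned, initialized box.
fn boxed_cat(age: u8) -> (usize, Box<Cat>) {
    let mut slot: Box<MaybeUninit<Cat>> = Box::new(MaybeUninit::uninit());
    let before = (*slot).as_ptr() as usize;
    Cat::new(age, &mut slot);
    // SAFETY: `Cat::new` initialized every field, and `MaybeUninit<Cat>` has
    // the same layout as `Cat`, so the allocation can be reinterpreted.
    let cat = unsafe { Box::from_raw(Box::into_raw(slot) as *mut Cat) };
    (before, cat)
}

fn main() {
    let (before, cat) = boxed_cat(4);
    let after = &*cat as *const Cat as usize;
    println!("age = {}, same address: {}", cat.age, before == after);
}
```

Moving the `Box` only moves the pointer, so the address of the `Cat` it owns is unaffected.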

indirect in-place construction

Now what happens if there is a degree of indirection? What if, rather than constructing just a Cat, we want to construct a Cat inside of a Bed? We would have to take the memory location of the outer type, and use that as the location for the inner type. Let's extend our first example by doing exactly that (playground):

use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> Self {
        let cat = Cat::new(4);
        Self { cat }
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let bed = Bed::new();
    dbg!(addr_of!(bed));      // ← second call to `addr_of!`
    dbg!(addr_of!(bed.cat));  // ← third call to `addr_of!`
}

If we run the program it will print the following:

[src/main.rs:15:9] addr_of!(this) = 0x00007fff3910702f
[src/main.rs:22:5] addr_of!(bed) = 0x00007fff391071b7
[src/main.rs:23:5] addr_of!(bed.cat) = 0x00007fff391071b7

Adapting our return-based example to preserve referential stability is once again very mechanical. Rather than returning Self from a function, we pass a mutable reference to MaybeUninit<Self>. In order for Cat to be constructed in Bed, all we have to do is make sure Bed contains a slot for Cat to be written to. Put together we end up with the following (playground):

use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Bed { cat: MaybeUninit<Cat> }
impl Bed {
    fn new(slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        Cat::new(4, unsafe { &mut (*this).cat });
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { 
            addr_of_mut!((*this).age).write(age);
            dbg!(addr_of!(*this)); // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot));      // ← first call to `addr_of!`
    Bed::new(&mut slot);
    let bed: &mut Bed = unsafe { slot.assume_init_mut() };
    dbg!(addr_of!(*bed));      // ← third call to `addr_of!`
}

If we run the program, it will print the following addresses. These are all the same because Cat is the only field inside of Bed, so they happen to point to the same memory location:

[src/main.rs:23:5] addr_of!(slot) = 0x00007fff8271d86f   # first call
[src/main.rs:17:9] addr_of!(*this) = 0x00007fff8271d86f  # second call
[src/main.rs:26:5] addr_of!(*bed) = 0x00007fff8271d86f   # third call
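The addresses only coincide because Bed has a single field. As a sketch of the more general case, the variant below adds a made-up `id` field and uses `#[repr(C)]` purely so the field order (and therefore the offset) is predictable. It also adds a hypothetical `Bed::cat` accessor, since the field's type stays `MaybeUninit<Cat>` and reading it requires `assume_init_ref`.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Cat {
    age: u8,
}

impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        // SAFETY: `this` points to writable memory that is valid for a `Cat`.
        unsafe { addr_of_mut!((*this).age).write(age) };
    }
}

/// `repr(C)` pins the field order so `cat` sits after `id`
/// (an assumption made only to keep the offset predictable in this sketch).
#[repr(C)]
struct Bed {
    id: u32, // hypothetical extra field
    cat: MaybeUninit<Cat>,
}

impl Bed {
    fn new(id: u32, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        // SAFETY: `this` is valid for writes; `cat` may stay uninitialized
        // until `Cat::new` fills it because its type is `MaybeUninit<Cat>`.
        unsafe {
            addr_of_mut!((*this).id).write(id);
            Cat::new(4, &mut (*this).cat);
        }
    }

    /// Accessor that is sound only if every `Bed` was set up by `Bed::new`.
    fn cat(&self) -> &Cat {
        unsafe { self.cat.assume_init_ref() }
    }
}

/// Byte distance between the start of `bed` and its `cat` field.
fn cat_offset(bed: &Bed) -> usize {
    bed.cat() as *const Cat as usize - bed as *const Bed as usize
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    Bed::new(1, &mut slot);
    // SAFETY: `Bed::new` wrote `id` and initialized the inner `cat` slot.
    let bed: &Bed = unsafe { slot.assume_init_ref() };
    println!("id = {}, cat age = {}, offset = {}", bed.id, bed.cat().age, cat_offset(bed));
}
```

The inner constructor still writes straight into the outer slot; it just does so at an offset from the start of Bed rather than at the very same address.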

future possibilities

If we squint here it's not hard to see how this could be converted into a language feature. In this post we've mechanically performed a transformation by hand. Rather than returning a type T from a function, we're taking a &mut MaybeUninit<T> and writing into that. It feels like it's basically just a spicy return; and it seems like something we could introduce some kind of notation for.

Though admittedly things get trickier once we also want to enable self-references, phased initialization, immovable types, and so on. But those all depend on being able to write to a fixed place in memory - and it feels like perhaps these are concepts which we can decouple from one another? Anyway, if we just take in-place construction as a feature, I think we might be able to get away with something like this for our first example:

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self { // ← new notation
        Self { age }
    }
}

fn main() {
    let cat = Cat::new(4);
    // ^ cat was constructed in-place
}

This is obviously not a new idea - but it stood out to me how simple the actual underpinnings of in-place construction seem. The change from a regular return to an in-place return feels mechanical in nature - and that seems like a good sign. For good measure let's also adapt our second example:

use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> #[in_place] Self {         // ← new notation
        Self { cat: Cat::new(4) }
        //     ^ cat was constructed in-place
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self {  // ← new notation
        Self { age }
    }
}

fn main() {
    let bed = Bed::new();
    // ^ bed was constructed in-place
}

Admittedly I haven't read up on the 10-year discourse of placement-new, so I assume there are plenty of details left out that make this hard in the general sense. Things like heap-addresses and intermediate references. But for the simplest case? It seems surprisingly doable. Not quite something which we can proc-macro - but not far off either. And maybe scoping that as a first target would be enough? I don't know.
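To get a feel for how far plain macros can go today, here is a rough sketch using a declarative macro instead of a proc-macro. The macro name `write_in_place` and the extra `lives` field are made up for this example. The macro writes each listed field into the slot, but it cannot check that every field was actually listed, and that missing guarantee is the part a language feature would have to provide.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

/// Hypothetical helper macro: writes each listed field directly into the slot.
/// It does NOT verify that every field was listed; that stays the caller's job.
macro_rules! write_in_place {
    ($slot:expr, { $($field:ident : $value:expr),* $(,)? }) => {{
        let this = $slot.as_mut_ptr();
        $(
            // Evaluate the caller's expression outside of the unsafe block.
            let value = $value;
            // SAFETY: `this` comes from a live `&mut MaybeUninit<_>`, so it is
            // valid for writes; `write` does not read or drop old contents.
            unsafe { addr_of_mut!((*this).$field).write(value) };
        )*
    }};
}

struct Cat {
    age: u8,
    lives: u8, // hypothetical second field
}

impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        write_in_place!(slot, { age: age, lives: 9 });
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    Cat::new(4, &mut slot);
    // SAFETY: `Cat::new` wrote both fields.
    let cat: &Cat = unsafe { slot.assume_init_ref() };
    println!("age = {}, lives = {}", cat.age, cat.lives);
}
```

Forgetting a field in the macro call would still compile, which is a decent illustration of why this can only be "not far off" rather than a full solution.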

the connection to super let

edit 2024-06-25: This section was added after first publishing this post.

Jack Huey reached out after publishing this post, and mentioned there might be a connection with the super let feature. I think that's a super interesting point, and it's not hard to see why! Take this example from Mara's post:

let writer = {
    println!("opening file...");
    let filename = "hello.txt";
    super let file = File::create(filename).unwrap();
    Writer::new(&file)
};

The super let file notation here allows the file's lifetime to be scoped to the outer scope, making it valid for Writer to take a reference to file and return. Without super let this would result in a lifetime error. Its vibes are very similar to the #[in_place] notation we posited in this post.
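For comparison, a similar effect can be had in today's Rust by declaring the variable in the outer scope and initializing it later inside the block. The sketch below uses made-up stand-ins for `File` and `Writer` so it runs without touching the filesystem; `target_name` is likewise just a hypothetical wrapper. The uninitialized `let file;` plays a role much like the slot from earlier in this post: the storage lives in the outer scope, and the inner block only fills it in.

```rust
/// Hypothetical stand-in for `std::fs::File`.
struct File {
    name: String,
}

/// Hypothetical writer that borrows the file it writes to.
struct Writer<'a> {
    file: &'a File,
}

impl<'a> Writer<'a> {
    fn new(file: &'a File) -> Self {
        Self { file }
    }
}

fn target_name() -> String {
    // Declared in the outer scope, initialized later inside the block.
    let file;
    let writer = {
        let filename = "hello.txt";
        file = File { name: filename.to_string() };
        // Borrowing `file` is fine here: it outlives the block.
        Writer::new(&file)
    };
    writer.file.name.clone()
}

fn main() {
    println!("writing to {}", target_name());
}
```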

Perhaps a synthesis of both features could exist to create a form of "generalized super scope" feature? It definitely seems like there might be some kind of connection. Like, we could imagine writing something like super let to denote a "value which is allocated in the caller's frame". And a function returning super Type to signal from the type signature that this is actually an out-pointer ².

Shout out to James Munns for teaching me about the term "outpointer" - that's apparently the term used for this pattern in C and C++, where C++23 also ships a std::out_ptr library helper (not a keyword).

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> super Self {
        super let this = Self { age };  // declare in the caller's frame
        dbg!(addr_of!(this));
        this
    }
}

It's worth noting though that this is not an endorsement for actually going with super let or super Type - it's merely speculation about how there might be a connection between both features. I think it's fun, and the connection between both seems worthy of further exploration!

conclusion

In this post I've shown how we can construct types in-place using MaybeUninit, and I was surprised by how simple it ended up being. I mostly wanted to have gone through the motions at least once - and now I have, and that was fun!

edit 2024-06-22: Thanks to Jordan Rose and Simon Sapin for helping un-break the unsafe pointer code in an earlier version of this post. Goes to show: Rust's pointer ergonomics really could use an overhaul.
