in-place construction seems surprisingly simple?
— 2024-06-22
- introduction
- in-place construction
- indirect in-place construction
- future possibilities
- the connection to super let
- conclusion
introduction
I've been thinking a little bit about self-referential types recently, and one of their requirements is that a type, when constructed, has a fixed location in memory [1]. That's needed because a reference is a pointer to a memory address - and if the memory address it points to changes, that would invalidate the pointer, which can lead to undefined behavior.
[1] Eric Holk and I toyed around for a little while with the idea of offset-schemes, where we only track offsets from the start of structs. But that gets weird once you involve non-linear memory, such as when using references into the heap. Using real, actual addresses is probably the better solution since that should work in all cases - even if it brings its own set of challenges.
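To make the problem concrete, here is a minimal sketch of what goes wrong when a value that points into itself gets moved. The `SelfRef` type and the `pointer_survives_move` function are made up for this illustration. The stale pointer is only compared, never dereferenced, so the example itself stays free of undefined behavior.

```rust
use std::ptr::addr_of;

// Hypothetical type for illustration: `ptr` is meant to point at `value`.
struct SelfRef {
    value: u8,
    ptr: *const u8,
}

// Returns whether the internal pointer still points at `value` after a move.
fn pointer_survives_move() -> bool {
    let mut on_stack = SelfRef { value: 4, ptr: std::ptr::null() };
    on_stack.ptr = addr_of!(on_stack.value); // points into the stack slot

    // Moving the value to the heap copies the bytes, including the old address.
    let on_heap = Box::new(on_stack);

    // Compare the stored address with the field's new address (no dereference).
    std::ptr::eq(on_heap.ptr, addr_of!(on_heap.value))
}
```

Calling `pointer_survives_move()` returns `false`: the stored pointer still holds the old stack address, while `value` now lives on the heap.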
Where this becomes tricky is that returning data from a function constitutes a move. A constructor which creates a type within itself and returns it will change the address of the type it constructs. Take this program, which constructs a type Cat in the function Cat::new (playground):
use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let cat = Cat::new(4);
    dbg!(addr_of!(cat)); // ← second call to `addr_of!`
}
If we run this program we get the following output:
[src/main.rs:7:9] addr_of!(this) = 0x00007ffe0b3575d7 # first call
[src/main.rs:14:5] addr_of!(cat) = 0x00007ffe0b357747 # second call
This shows that the address of Cat within Cat::new is different from the address once it has been returned from the function. Returning types from functions means changing addresses. Languages like C++ have ways to work around this by providing something called move constructors, which enable types to update their own internal addresses when they are moved in memory. But instead, what if we could just construct types in-place so they didn't have to move in the first place?
in-place construction
We already perform in-place construction when desugaring async {} blocks, so this is something we know how to do. And in the ecosystem there are also the moveit and ouroboros crates. These are all great, but they all do additional things like "self-references" or "move constructors". Constructing in-place rather than returning from a function can be useful just for the sake of reducing copies - so let's roll our own version of just that.
The way we can do this is by creating a stable location in memory we can store our value in. Rather than returning a value, a constructor should take a mutable reference to this memory location and write directly into it instead. And because we're starting with a location and writing into it later, this location needs to be a MaybeUninit. If we put those pieces together, we can adapt our earlier example to the following (playground):
use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe {
            addr_of_mut!((*this).age).write(age);
            dbg!(addr_of!(*this)); // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot)); // ← first call to `addr_of!`
    Cat::new(4, &mut slot);
    let cat: &mut Cat = unsafe { slot.assume_init_mut() };
    dbg!(addr_of!(*cat)); // ← third call to `addr_of!`
}
If we run the program it will print the following:
[src/main.rs:15:5] addr_of!(slot) = 0x00007ffc9daa590f # first call
[src/main.rs:9:9] addr_of!(*this) = 0x00007ffc9daa590f # second call
[src/main.rs:18:5] addr_of!(*cat) = 0x00007ffc9daa590f # third call
To folks who aren't used to writing unsafe Rust this might look a little overwhelming. But what we've done is a fairly mechanical translation. Rather than returning the type Self, we've created a MaybeUninit<Self> and passed it by-reference. The constructor then writes into it, initializing the memory. From that point onward, all references to Cat are valid and can be assumed to be initialized.
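One detail the translation leaves implicit: every field has to be written before the slot is treated as initialized, and MaybeUninit never runs destructors on its own. The sketch below extends Cat with a hypothetical `name: String` field (not part of the example above) to show both points. `describe_in_place` is likewise a made-up helper.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

// `name` is a hypothetical extra field, added to give Cat something to drop.
struct Cat { age: u8, name: String }

impl Cat {
    fn new(age: u8, name: &str, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe {
            // Every field must be written before the slot counts as initialized.
            addr_of_mut!((*this).age).write(age);
            addr_of_mut!((*this).name).write(name.to_owned());
        }
    }
}

fn describe_in_place() -> String {
    let mut slot = MaybeUninit::uninit();
    Cat::new(4, "Chashu", &mut slot);
    // SAFETY: `Cat::new` wrote both fields.
    let cat: &mut Cat = unsafe { slot.assume_init_mut() };
    let description = format!("{} is {}", cat.name, cat.age);
    // MaybeUninit does not drop its contents, so we drop the Cat by hand.
    // SAFETY: the slot is initialized and is not used again afterwards.
    unsafe { slot.assume_init_drop() };
    description
}
```

Without the `assume_init_drop` call the String's heap allocation would simply leak, which is safe but probably not what we want.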
Unfortunately calling assume_init to get at the actual value of Cat is not an option here, because the compiler treats that as a move - which makes sense, since it takes one type by value and returns another. But that's mostly a limitation of how we're doing things - not what we're doing.
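A small sketch of the distinction, under the assumption that we only ever need a reference to the value: `MaybeUninit::assume_init` has the signature `unsafe fn assume_init(self) -> T`, so it hands the value back by move, while `assume_init_ref` and `assume_init_mut` borrow the slot and leave the value where it is. The function name `address_is_kept_by_reference` is made up for this example.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Cat { age: u8 }

// Returns true if the Cat seen through `assume_init_ref` lives at the slot's address.
fn address_is_kept_by_reference() -> bool {
    let mut slot: MaybeUninit<Cat> = MaybeUninit::uninit();
    let slot_addr: *const Cat = slot.as_ptr();

    // Initialize the only field in place.
    unsafe { addr_of_mut!((*slot.as_mut_ptr()).age).write(4) };

    // SAFETY: the only field of Cat was written above.
    let cat: &Cat = unsafe { slot.assume_init_ref() };
    cat.age == 4 && std::ptr::eq(slot_addr, cat as *const Cat)
}
```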
indirect in-place construction
Now what happens if there is a degree of indirection? What if rather than construct just a Cat, we want to construct a Cat inside of a Bed? We would have to take the memory location of the outer type, and use that as the location for the inner type. Let's extend our first example by doing exactly that (playground):
use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> Self {
        let cat = Cat::new(4);
        Self { cat }
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let bed = Bed::new();
    dbg!(addr_of!(bed)); // ← second call to `addr_of!`
    dbg!(addr_of!(bed.cat)); // ← third call to `addr_of!`
}
If we run the program it will print the following:
[src/main.rs:15:9] addr_of!(this) = 0x00007fff3910702f
[src/main.rs:22:5] addr_of!(bed) = 0x00007fff391071b7
[src/main.rs:23:5] addr_of!(bed.cat) = 0x00007fff391071b7
Adapting our return-based example to preserve referential stability is once again very mechanical. Rather than returning Self from a function, we pass a mutable reference to MaybeUninit<Self>. In order for Cat to be constructed in Bed, all we have to do is make sure Bed contains a slot for Cat to be written to. Put together we end up with the following (playground):
use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Bed { cat: MaybeUninit<Cat> }
impl Bed {
    fn new(slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        Cat::new(4, unsafe { &mut (*this).cat });
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe {
            addr_of_mut!((*this).age).write(age);
            dbg!(addr_of!(*this)); // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot)); // ← first call to `addr_of!`
    Bed::new(&mut slot);
    let bed: &mut Bed = unsafe { slot.assume_init_mut() };
    dbg!(addr_of!(*bed)); // ← third call to `addr_of!`
}
If we run the program it will print the following addresses. These are all the same because Cat is the only field inside of Bed, so they happen to point to the same memory location:
[src/main.rs:23:5] addr_of!(slot) = 0x00007fff8271d86f # first call
[src/main.rs:17:9] addr_of!(*this) = 0x00007fff8271d86f # second call
[src/main.rs:26:5] addr_of!(*bed) = 0x00007fff8271d86f # third call
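To see that the matching addresses really are a consequence of Cat being the only field, here is a sketch with one extra field in front of it. The `id: u32` field, the `#[repr(C)]` attribute (used so the field order is predictable) and the `cat_offset_in_bed` function are all assumptions added for this illustration.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Cat { age: u8 }

impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { addr_of_mut!((*this).age).write(age) };
    }
}

// Hypothetical variant of Bed: `id` sits in front of the slot for Cat.
#[repr(C)] // keep declaration order so the offset is predictable
struct Bed { id: u32, cat: MaybeUninit<Cat> }

impl Bed {
    fn new(id: u32, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe {
            addr_of_mut!((*this).id).write(id);
            // Hand the inner slot to Cat's constructor, as before.
            Cat::new(4, &mut (*this).cat);
        }
    }
}

// Returns how many bytes the Cat sits past the start of the Bed.
fn cat_offset_in_bed() -> usize {
    let mut slot = MaybeUninit::<Bed>::uninit();
    Bed::new(1, &mut slot);
    // SAFETY: `Bed::new` wrote `id` and initialized the inner Cat.
    let bed: &Bed = unsafe { slot.assume_init_ref() };
    assert_eq!(bed.id, 1);
    assert_eq!(unsafe { bed.cat.assume_init_ref() }.age, 4);
    let bed_addr = bed as *const Bed as usize;
    let cat_addr = bed.cat.as_ptr() as usize;
    cat_addr - bed_addr
}
```

With this layout the Cat is written four bytes past the start of the Bed, directly behind the `u32`. The construction is still in-place; the addresses just no longer coincide.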
future possibilities
If we squint here it's not hard to see how this could be converted into a language feature. In this post we've mechanically performed a transformation by hand. Rather than returning a type T from a function, we're taking a &mut MaybeUninit<T> and writing into that. It feels like it's basically just a spicy return, and it seems like something we could introduce some kind of notation for.
Though admittedly things get trickier once we also want to enable self-references, perform phased initialization, support immovable types, and so on. But those all depend on being able to write to a fixed place in memory - and it feels like perhaps these are concepts which we can decouple from one another? Anyway, if we just take in-place construction as a feature, I think we might be able to get away with something like this for our first example:
use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self { // ← new notation
        Self { age }
    }
}

fn main() {
    let cat = Cat::new(4);
    //  ^ cat was constructed in-place
}
This is obviously not a new idea - but it stood out to me how simple the actual underpinnings of in-place construction seem. The change from a regular return to an in-place return feels mechanical in nature - and that seems like a good sign. For good measure let's also adapt our second example:
use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> #[in_place] Self { // ← new notation
        Self { cat: Cat::new(4) }
        //     ^ cat was constructed in-place
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self { // ← new notation
        Self { age }
    }
}

fn main() {
    let bed = Bed::new();
    //  ^ bed was constructed in-place
}
Admittedly I haven't read up on the 10-year discourse of placement-new, so I assume there are plenty of details left out that make this hard in the general sense. Things like heap-addresses and intermediate references. But for the simplest case? It seems surprisingly doable. Not quite something which we can proc-macro - but not far off either. And maybe scoping that as a first target would be enough? I don't know.
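For what it's worth, the hand-written out-pointer constructor does seem to carry over to a slot on the heap without changes to Cat::new itself. The sketch below is an assumption-laden illustration rather than a proposal: `boxed_in_place` and `heap_address_is_stable` are hypothetical helpers, and the cast from `Box<MaybeUninit<T>>` to `Box<T>` relies on MaybeUninit<T> having the same layout as T.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Cat { age: u8 }

impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { addr_of_mut!((*this).age).write(age) };
    }
}

/// Hypothetical helper: allocate an uninitialized heap slot, let `init`
/// write into it, and hand the result back as a `Box<T>`.
///
/// Safety: `init` must fully initialize the slot it is given.
unsafe fn boxed_in_place<T>(init: impl FnOnce(&mut MaybeUninit<T>)) -> Box<T> {
    let mut slot: Box<MaybeUninit<T>> = Box::new(MaybeUninit::uninit());
    init(&mut *slot);
    // MaybeUninit<T> has the same layout as T, so the allocation can be reused.
    unsafe { Box::from_raw(Box::into_raw(slot).cast::<T>()) }
}

// Returns whether the Cat kept its address when the Box was moved, plus its age.
fn heap_address_is_stable() -> (bool, u8) {
    // SAFETY: `Cat::new` writes the only field of Cat.
    let cat: Box<Cat> = unsafe { boxed_in_place(|slot| Cat::new(4, slot)) };
    let before: *const Cat = &*cat;
    let moved = cat; // moves the Box (a pointer), not the Cat behind it
    let after: *const Cat = &*moved;
    (std::ptr::eq(before, after), moved.age)
}
```

The Cat is written once into its final heap location, and moving the Box afterwards leaves that address untouched.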
the connection to super let
edit 2024-06-25: This section was added after first publishing this post.
Jack Huey reached out after I published this post, and mentioned there might be a connection with the super let feature. I think that's a super interesting point, and it's not hard to see why! Take this example from Mara's post:
let writer = {
    println!("opening file...");
    let filename = "hello.txt";
    super let file = File::create(filename).unwrap();
    Writer::new(&file)
};
The super let file notation here allows the file's lifetime to be scoped to the outer scope, making it valid for Writer to take a reference to file and return. Without super let this would result in a lifetime error. Its vibes are very similar to the #[in_place] notation we posited in this post.
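As far as I know super let is a proposal and not something that compiles on stable Rust. What does compile is the manual version of the same idea: declare the binding in the outer scope and initialize it from inside the block, much like the MaybeUninit slot we created in the caller earlier. In the sketch below `Writer` is a hypothetical stand-in that borrows a String instead of a File, and `outer_scope_workaround` is a made-up name.

```rust
// Hypothetical stand-in for the Writer from the example: it only borrows.
struct Writer<'a> { name: &'a String }

impl<'a> Writer<'a> {
    fn new(name: &'a String) -> Self {
        Self { name }
    }

    fn name_len(&self) -> usize {
        self.name.len()
    }
}

fn outer_scope_workaround() -> usize {
    // Declared in the outer scope, initialized later: the "slot" lives here.
    let file_name;
    let writer = {
        file_name = String::from("hello.txt");
        // Borrowing is fine: `file_name` outlives the block.
        Writer::new(&file_name)
    };
    writer.name_len()
}
```

Declaring `file_name` inside the block instead would fail to compile, because the borrow held by `writer` would outlive it.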
Perhaps a synthesis of both features could exist to create a form of "generalized super scope" feature? There definitely appears to be some kind of connection. Like, we could imagine writing something like super let to denote a "value which is allocated in the caller's frame". And a function returning super Type to signal from the type signature that this is actually an out-pointer [2].
[2] Shout out to James Munns for teaching me about the term "outpointer" - that's apparently the term C++ uses for this (C++23 has a std::out_ptr library helper for it, rather than a keyword).
use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> super Self {
        super let this = Self { age }; // declare in the caller's frame
        dbg!(addr_of!(this));
        this
    }
}
It's worth noting though that this is not an endorsement of actually going with super let or super Type - but merely speculation about how there might be a possible connection between both features. I think it's fun, and the connection between both seems worthy of further exploration!
conclusion
In this post I've shown how we can construct types in-place using MaybeUninit, and I was surprised by how simple it ended up being. I mostly wanted to have gone through the motions at least once - and now I have, and that was fun!
edit 2024-06-22: Thanks to Jordan Rose and Simon Sapin for helping un-break the unsafe pointer code in an earlier version of this post. Goes to show: Rust's pointer ergonomics really could use an overhaul.