by Gui Andrade
This blog post will outline the creation of dynstack, a stack data structure that stores trait objects unboxed to minimize the number of heap allocations necessary.
Rust, not being an object-oriented language, doesn't quite do inheritance like the others. In C++ or Java, subclasses extend superclasses. In Rust, structs implement traits.
This section is designed to provide background about how C++ and Rust implement polymorphism. If you're already familiar with vtables and fat pointers, jump ahead to Part 2.
Consider a typical example of inheritance in C++:
#include <cstdio>

class Animal {
public:
    virtual void MakeSound() = 0;
};

class Dog : public Animal {
public:
    void MakeSound() override {
        printf("Ruff\n");
    }
};
With C++ inheritance, we can think of `Dog` as having an implicit private member, `DogVtable *vtable`, with the following definition:
struct DogVtable {
    void (*MakeSound)(Dog*);
    // Some miscellaneous stuff.
};
If we upcast `Dog` to `Animal` and call the `MakeSound` method, we in essence are doing the following:
struct AnimalVtable {
    void (*MakeSound)(Animal*);
    // etc.
};

Dog d;
Dog *dog = &d;
dog->vtable->MakeSound(dog); // prints "Ruff"

Animal *animal = &d;
animal->vtable->MakeSound(animal); // prints "Ruff"
Note why the second invocation works: `animal` is `dog`; they both point to the same spot in memory. Assuming both `Dog` and `Animal` have their `vtable` members at the same offset, and assuming both vtables have the same layout (spoiler: they do), both lines are identical.
If you can make sense of x86 assembly, I encourage you to play around with this assembly explorer for C++ virtual calls.
In Rust, things are a bit different. Traits have no data members, and any pre-implemented trait functions are duplicated among implementers. This means that unless you're using a trait object, Rust doesn't use vtables.
Trait objects complicate things, though. Imagine you want to have a vector of items that are all `Animal`s, but may otherwise be of different types.
// Consider the following definitions:
trait Animal {
    fn make_sound(&self);
}

struct Dog;
impl Animal for Dog {
    fn make_sound(&self) { println!("Ruff"); }
}

struct Cat;
impl Animal for Cat {
    fn make_sound(&self) { println!("Meow"); }
}

struct Chinchilla;
impl Animal for Chinchilla {
    fn make_sound(&self) { println!("..."); }
}
// Accompanied with the following code:
let mut animals = Vec::<dyn Animal>::new();
animals.push(Dog);
animals.push(Cat);
animals.push(Chinchilla);

for animal in animals {
    animal.make_sound();
}
Sadly, this example doesn't work. And it doesn't work because `dyn Animal` isn't a real object; it's a trait object.
You can think of a trait object as a sort of upcasted representation of a struct. Obviously, the only methods you can call on a trait object are the ones defined in its respective trait.
And there are two other things that are important to keep in mind with trait objects:
1. Trait objects are dynamically sized.
2. Pointers to trait objects are fat pointers.

The first point is pretty easy to reason about. Clearly, all of `Dog`, `Cat`, and `Chinchilla` have no members, so they have size 0. But maybe `Kangaroo` has a `&Joey` member, which makes it size 8. What size does `dyn Animal` have, then?
Well, it doesn't have a (static) size. It's dynamically sized, so we can only actually know at runtime for some specific instantiation. This also means that we can't store trait objects on the stack, because Rust doesn't permit variable stack usage (recursion aside).
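We can at least measure the size of pointers to trait objects. A minimal sketch, reusing the `Animal` and `Dog` definitions from above:

```rust
use std::mem::size_of;

trait Animal {
    fn make_sound(&self);
}

struct Dog;
impl Animal for Dog {
    fn make_sound(&self) { println!("Ruff"); }
}

fn main() {
    // A reference to a sized struct is a single machine word...
    assert_eq!(size_of::<&Dog>(), size_of::<usize>());
    // ...but a reference to a trait object is two words: data + vtable.
    assert_eq!(size_of::<&dyn Animal>(), 2 * size_of::<usize>());
}
```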
Trait object pointers, meanwhile, are fat pointers. On a 64-bit system, such a fat pointer occupies 128 bits: one 64-bit address for the data, and one for the vtable.
Aside: The representation for a fat pointer, as of Rust 1.32, is a packed `(*mut Struct, *const Vtable)` tuple. Unfortunately, there's no guarantee that fat pointers will continue to be represented this way in the future. Still, there's no foreseeable reason it would change.
To illustrate how these come together, let's write some Rust pseudocode that parallels the C++ example above:
let d: Dog = Dog;
let dog: &Dog = &d;
Dog::make_sound(dog); // prints "Ruff"
let animal: &dyn Animal = &d;
let (animal_data, animal_vtable) = decompose_fat_pointer(animal);
((*animal_vtable).make_sound)(animal_data); // prints "Ruff"
Again, `animal_data` is `dog`; they're both pointers to the exact same data. Also note that no vtable gets used unless we upcast to `Animal`.
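For the curious, the `decompose_fat_pointer` pseudocode above can be approximated in real Rust. This sketch leans on the unspecified `(data, vtable)` representation mentioned earlier, so it's illustrative rather than portable:

```rust
use std::mem;

trait Animal {
    fn make_sound(&self);
}

struct Dog;
impl Animal for Dog {
    fn make_sound(&self) { println!("Ruff"); }
}

fn main() {
    let d = Dog;
    let animal: &dyn Animal = &d;

    // The data half of the fat pointer: casting to a thin pointer
    // discards the vtable component. This part is well-defined.
    let data = animal as *const dyn Animal as *const ();
    assert_eq!(data, &d as *const Dog as *const ());

    // Splitting out both halves relies on the unspecified (data, vtable)
    // representation, hence the transmute.
    let (data2, _vtable): (*const (), *const ()) =
        unsafe { mem::transmute(animal) };
    assert_eq!(data2, data);
}
```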
Again, if you can make sense of x86 assembly, I encourage you to play around with this assembly explorer for Rust virtual calls.
Generally, if we want to store trait objects, we have to allocate them on the heap.
The `Box` smart pointer handles this nicely for us.
let mut animals = Vec::<Box<dyn Animal>>::new();
animals.push(Box::new(Dog));
animals.push(Box::new(Cat));
animals.push(Box::new(Chinchilla));

for animal in animals {
    animal.make_sound();
}
And it works!
The only problem here is that we need a separate heap allocation every single time we push to the `Vec`. This is bad for performance!
A `Vec` of size-N unboxed objects is laid out the following way in memory:

Offset   | Data
=========|=======
+0       | item0
+N       | item1
+2N      | item2
...      | ...
While a `Vec` of boxed trait objects is laid out like so:

Offset   | Data
=========|=======================
+0       | &item0, &item0_vtable
+16      | &item1, &item1_vtable
+32      | &item2, &item2_vtable
...      | ...
I propose a data structure called `DynStack`, which takes elements from both of the above. `DynStack` has two interior data structures:
1. An item descriptor table
2. Contiguous item memory
The item memory is conceptually simple. Consider pushing four trait objects to the stack, where the first three have sizes I, J, and K. This results in the following layout:

Offset                  | Data
========================|=======
+0                      | item0
+I (+ padding)          | item1
+I + J (+ padding)      | item2
+I + J + K (+ padding)  | item3
The item descriptor table is laid out similarly to the `Vec` of boxed trait objects, with one important caveat: in place of pointers to individual items, we store offsets into the item memory.

Offset   | Data
=========|================================
+0       | 0, &item0_vtable
+16      | I + padding, &item1_vtable
+32      | I + J + padding, &item2_vtable
Indexing an item still has runtime O(1); we just look up the descriptor entry at `i`:
let (offset, vtable_ptr) = self.descriptors[i];
return create_fat_pointer(&self.data[offset], vtable_ptr)
More importantly, the number of memory allocations is now O(log N), as both the item descriptor table and contiguous item memory double in size upon reaching capacity.
DynStack
How can we implement `DynStack::push`? Note that we need to move the trait object into our data structure.
Our structure definition:
struct DynStack<T: ?Sized> {
    // ...
}
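Based on the two-part design from earlier, the elided fields might look something like the following. This is my sketch, not the crate's actual definition:

```rust
use std::marker::PhantomData;

// A sketch of the fields implied by the two-part design above; the real
// crate's layout may differ.
struct DynStack<T: ?Sized> {
    // Contiguous item memory: every item's bytes, packed with padding.
    data: Vec<u8>,
    // Item descriptor table: (offset into `data`, vtable pointer).
    descriptors: Vec<(usize, *const ())>,
    // We store no `T` values directly, only their erased bytes.
    _marker: PhantomData<*mut T>,
}
```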
And recall that we can't take trait object parameters by value.
impl<T: ?Sized> DynStack<T> {
    fn push<I>(&mut self, item: I) where I: T {}
    // error[E0404]: expected trait, found type parameter `T`
}
`impl T` has the same issue. Turns out, there's no way to parametrize your type over a trait :(
CoerceUnsized
Turns out, Rust has a trait available for this exact purpose.
impl<T: ?Sized> DynStack<T> {
    fn push<I>(&mut self, mut item: I)
        where *mut I: CoerceUnsized<*mut T>
    {
        // ...
    }
}
And it works! Unfortunately, `CoerceUnsized` is unstable, and it doesn't look like it'll be stabilized any time soon. Damn.
First, I define a `push` function that just takes in a mutable reference to `T`.
impl<T: ?Sized> DynStack<T> {
    unsafe fn push(&mut self, item: &mut T) {
        // ...
    }
}
Then, a safe macro that calls `push` while moving the value.
macro_rules! dyn_push {
    { $stack:expr, $item:expr } => {{
        let mut t = $item;
        unsafe { $stack.push(&mut t) };
        ::std::mem::forget(t);
    }}
}
We need to make sure to `forget` the item because we don't want to call its destructor twice; the `DynStack` will `Drop` all its items later.
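To see what the `forget` is protecting against, here's a small sketch with a drop-counting type; the bytes moved into the stack will be dropped by the stack later, so the local binding must not drop them too:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Loud;
impl Drop for Loud {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let t = Loud;
    // Simulate what dyn_push! does after moving `t`'s bytes into the
    // stack's storage: suppress the local's destructor entirely.
    std::mem::forget(t);
    assert_eq!(DROPS.load(Ordering::SeqCst), 0);
}
```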
We just have one problem here: `Vec`. Turns out, `&mut Vec<K>` will coerce into `&mut [K]`, so `push` will take ownership of the slice, and not the vector, if we do the following:
let mut stack: DynStack<[u8]> = DynStack::new();
let item = vec![0u8; 10000000];
dyn_push!(stack, item);
And it will then leak the `Vec`'s memory with the call to `forget`.
All we have to do is change `push`'s `&mut T` argument to `*mut T`:
impl<T: ?Sized> DynStack<T> {
    unsafe fn push(&mut self, item: *mut T) {
        // ...
    }
}
This allows Rust to correctly error on the vector-pushing code above, because `&mut Vec<u8>` won't coerce to `*mut [u8]`.
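The reason this works: deref coercion from `Vec<u8>` to `[u8]` fires when the target is a reference, but never when the target is a raw pointer. A quick sketch:

```rust
fn take_ref(_: &mut [u8]) {}
fn take_ptr(_: *mut [u8]) {}

fn main() {
    let mut v = vec![0u8; 4];
    take_ref(&mut v);     // OK: &mut Vec<u8> deref-coerces to &mut [u8]
    take_ptr(&mut v[..]); // OK: an explicit reborrow of the slice
    // take_ptr(&mut v);  // error: no coercion from &mut Vec<u8> to *mut [u8]
}
```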
push
This part is pretty simple. I'll spare you the details, but here's the pseudocode:
fn push(item_ptr) {
    next_offset = align_to(self.data.end, item_ptr)
    self.data.grow_to_accommodate(next_offset + sizeof *item_ptr)
    self.descriptors.push((next_offset, vtable_of(item_ptr)))
    self.data.copy_into(item_ptr, next_offset)
}
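As a rough translation of that pseudocode into real Rust, here is a hypothetical `push_bytes` helper; it uses a `Vec<u8>` for the item memory for simplicity, while the real implementation manages its own raw allocation:

```rust
use std::ptr;

/// Hypothetical sketch of the copy step: append `size` bytes from `item`,
/// first rounding the end of storage up to `align` (a power of two).
unsafe fn push_bytes(data: &mut Vec<u8>, item: *const u8, size: usize, align: usize) -> usize {
    let offset = (data.len() + align - 1) & !(align - 1);
    data.resize(offset + size, 0);
    ptr::copy_nonoverlapping(item, data.as_mut_ptr().add(offset), size);
    offset // recorded in the descriptor table alongside the vtable pointer
}

fn main() {
    let mut data = Vec::new();
    let a = 7u8;
    let b = 0x1234_5678u32;
    unsafe {
        let off_a = push_bytes(&mut data, &a as *const u8, 1, 1);
        let off_b = push_bytes(&mut data, &b as *const u32 as *const u8, 4, 4);
        assert_eq!(off_a, 0);
        assert_eq!(off_b, 4); // padded from offset 1 up to alignment 4
    }
}
```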
Benchmarks done on a 12-core Ryzen CPU with 16GB RAM, running Linux.
TL;DR: `dynstack` beats an equivalent `Vec<Box<_>>` across the board, but with heavily task- and allocator-dependent performance boosts.
This benchmark pushes 4x to a `Vec<Box<dyn Trait>>` / `DynStack<dyn Trait>` as quickly as possible, without popping.
Jemalloc (as of Rust 1.32):
push_speed_naive time: [77.472 ns 79.771 ns 82.055 ns]
push_speed_dynstack time: [72.303 ns 74.035 ns 75.572 ns]
Linux system allocator (circa Rust 1.34):
push_speed_naive time: [62.828 ns 63.456 ns 64.181 ns]
push_speed_dynstack time: [37.408 ns 37.696 ns 37.989 ns]
This benchmark pushes 100x to a `Vec<Box<Fn>>` / `DynStack<Fn>` and then runs the whole list sequentially.
Jemalloc (as of Rust 1.32):
push_and_run_naive time: [1.9763 us 1.9779 us 1.9796 us]
push_and_run_dynstack time: [1.5346 us 1.5366 us 1.5387 us]
Linux system allocator (circa Rust 1.34):
push_and_run_naive time: [3.5431 us 3.5534 us 3.5684 us]
push_and_run_dynstack time: [1.8091 us 1.8118 us 1.8148 us]
This benchmark pushes once to a `Vec<Box<Fn>>` / `DynStack<Fn>`, then calls the closure, then pops it from the list. It does this 100x per iteration.
Jemalloc (as of Rust 1.32):
pseudorecursive2_naive time: [1.5299 us 1.5307 us 1.5316 us]
pseudorecursive2_dynstack time: [1.0138 us 1.0159 us 1.0188 us]
Linux system allocator (circa Rust 1.34):
pseudorecursive2_naive time: [1.1447 us 1.1455 us 1.1464 us]
pseudorecursive2_dynstack time: [989.59 ns 991.74 ns 995.08 ns]