Writing Recursively Defined Types in Rust
Mon Apr 22, 2019 · 955 words

Let’s say we’re writing a theoretical compiler, and we want to teach it how to parse some math expressions. Consider this snippet of source code:

(+ (- 20 10) 7)

In this example, we see a math expression. The operator in this case is +, and the operands are another math expression and 7. Internally, the compiler sees these operands as just two more distinct expressions (literals are a type of expression). Since a math expression is an operator and two more expressions, the grammar matches and this parses successfully. To get a better understanding of how this is defined, consider a possible set of grammars:

OPERATOR <-- + | - | / | *
EXPR <-- MATHEXPR | NUMLIT
MATHEXPR <-- OPERATOR EXPR EXPR

As you can see, each math expression could contain two more math expressions. This could theoretically repeat forever, with each new math expression containing two more new math expressions, and so on. Furthermore, as EXPR increases in scope; that is, it gains more possible expression types, MATHEXPR gains more functionality. If you added a function call expression, then a math expression could contain a function call as an operand, and so on1.

To effectively represent the recursive nature of our theoretical grammar, we might try this:

enum Expr {
	Literal(LiteralExpr),
	FunctionCall(FunctionCallExpr),
}

struct MathExpression {
	op: MathOperator,
	lhs: Expr,
	rhs: Expr,
}

enum LiteralExpr {
	Num(usize), // or your favorite numeric type.
}

enum MathOperator {
	Plus,
	Minus,
	Div,
	Mul,
}

However, the compiler takes issue with this definition:

-- snip --
error[E0072]: recursive type `MathExpression` has infinite size
  --> test.rs:7:1
   |
7  | struct MathExpression {
   | ^^^^^^^^^^^^^^^^^^^^^ recursive type has infinite size
8  |     op: MathOperator,
9  |     lhs: Expr,
   |     --------- recursive without indirection
10 |     rhs: Expr,
   |     --------- recursive without indirection
   |
   = help: insert indirection (e.g., a `Box`, `Rc`, or `&`) at some point to make `MathExpression` representable

The issue here is that Rust needs to know how much space on the stack an object will take up. Since we defined MathExpression as two more expressions, and each of those sub-expressions could contain more MathExpressions, we end up with an unbounded object size. The solution to this issue is to add heap indirection of some kind to stamp down the size of an Expr so that Rust can reserve the right amount of memory. In this example, the best solution is to allocate the Expr’s within MathExpression on the heap. This will define the size of a MathExpression, since a pointer to some location on the heap has a concrete size - and as the Rust compiler so graciously tells us, we have a few different options for introducing this indirection.

The first option is Box. The Rust documentation defines Box as “a pointer type for heap allocation”, and that’s exactly what it is: wrapping a value in a Box will heap allocate the value. Note that Box can only have one owner, if you call .clone() it will copy the data within the Box as well. The memory within a Box will be freed when its owner is dropped.

The second listed choice is Rc. Rc stands for “reference-counted”2, and it’s the same as a Box as far as the allocation part is concerned. However, Rc is not limited to a single owner like Box - it permits shared ownership. This means that multiple variables can own the data an Rc points to, with the caveat that it is not mutable3. So, if you have some code like the following:

let x = Rc::new(22);
let y = Rc::clone(&x); // you can also use x.clone()
drop(x);
println!("{}", y); // y is still valid even though x was dropped!

In this example, both x and y point to the same memory. However, even though we tell Rust to explicitly drop x before the end of the scope, y is still valid. This can be attributed to the reference-counting Rc employs: the memory that stores the value 22 will not be freed until the reference-count is zero. Dropping x only decreased the reference-count by one, leaving it at one reference remaining. When y is dropped, the memory will be freed.

As an aside, note that the presented Rc example would work with Box too - you can .clone() a Box with no problems. However, as mentioned, cloning a Box will also clone the value within it, effectively doubling your allocations. Cloning an Rc will simply return a pointer to the same data and increase the reference-count. Because of this ambiguous nature, when cloning Rcs the better syntax is Rc::clone(&from), as it is more clear about what’s actually happening.

There is also Arc, which is the thread-safe atomic version of Rc. This isn’t really relevant to the example here, so I won’t bother going over it - as far as I understand, it works the same as Rc anyway, sans the atomic part.

So, which is the best option for this case? In this situation we don’t really need multiple owners - Box is the right choice. So, our MathExpression definition changes to this:

struct MathExpression {
	op: MathOperator,
	lhs: Box<Expr>,
	rhs: Box<Expr>,
}

And it compiles just fine. Now our theoretical grammar is prefectly represented in Rust, recursion and all.

Footnotes

1: Obviously, there are limits to this - you wouldn’t want an assignment expression as an argument to a math expression. While that kind of use would parse, it would not pass further analysis.

2: Note that reference-counted types have runtime overhead, because they need to keep track of references. Arc even more so, because it is atomic.

3: You actually can mutate the contents of a Rc, if they’re wrapped in a Cell or RefCell. This type of mutability is known as interior mutability, and allows for types like Rc to have mutable data. Arc has the same capability, but you must use Mutex or RwLock instead, as they are thread-safe.


back · Posts · About · Main