a blog

by Gui Andrade

Compile-time coprocessor codegen, with Rust macros


The Nintendo 3DS uses a standard ARM peripheral, the CoreLink DMA engine, for copying memory between DRAM and memory-mapped peripherals.

Unlike most other IO devices on the 3DS, this DMA engine has its own instruction set: the CPU merely uploads a stream of instructions for the peripheral to execute. (Other examples of this on the 3DS are the DSP audio processor and the PICA graphics chip.)

I’d like to compile and run DMA instructions in Rust, in a hopefully ergonomic manner, without any dynamic memory allocation. That imposes one constraint in particular: I need to know the number of instruction bytes at compile time so I can use an appropriately sized array.

Macro Implementation

Anytime I consider complex compile-time processing, macros are usually the solution. In this case, I could choose between generating byte arrays directly or emitting nested packed structs (which I would later have to transmute to some kind of byte pointer). I chose the former, as getting packed struct semantics right can be somewhat tricky.

So my macro needed to do a few things:

  1. Take in a linear sequence of opcodes and their parameters, delimited by semicolons.
  2. Do some bookkeeping to keep track of the output position in the instruction buffer, as well as backward jump offsets for loops.
  3. Emit a sequence of bytes in an appropriately-sized [u8; N] array.

Here’s the final syntax I ended up with:

let program = xdma_compile! {
    MOV(SAR, (src as u32));
    MOV(CCR, (ctrl_big.val));
    MOV(DAR, (dst as u32));
    LP(0, (chunks as u8));
    // ...
};

Constant array size

The simplest part of this implementation is computing the byte array size:

macro_rules! xdmainst_size {
    (GO) => (6);
    (END) => (1);
    (KILL) => (1);
    (FLUSHP) => (2);
    (WFP) => (2);
    // ...
}

These constants are summed together with a one-liner in the xdma_compile macro.

const LEN: usize = 0 $(+ xdmainst_size!($inst_name))+;
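To make the summation concrete, here is a minimal, self-contained sketch of the same idea. The names inst_size, program_len and empty_buffer are hypothetical stand-ins, not the macros from my actual code, and the size table only covers the opcodes listed above.

```rust
// Hypothetical subset of the size table; byte counts match the list above.
macro_rules! inst_size {
    (GO) => (6);
    (END) => (1);
    (KILL) => (1);
    (FLUSHP) => (2);
    (WFP) => (2);
}

// Sums the encoded sizes of a semicolon-delimited list of opcode names.
// The expansion is a plain constant expression: 0usize + (2) + (2) + (1) ...
macro_rules! program_len {
    ( $( $inst_name:ident );+ $(;)? ) => {
        0usize $( + inst_size!($inst_name) )+
    };
}

// Because the sum is a constant expression, it can size an array.
const LEN: usize = program_len!(FLUSHP; WFP; END);

// Returns a zeroed buffer whose length was computed at compile time.
pub fn empty_buffer() -> [u8; LEN] {
    [0u8; LEN]
}
```

Since every arm expands to an integer literal, the whole sum is evaluated by the compiler and no length computation is left for run time.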

Code generation

Code generation for individual instructions, with the simple encoding of the CoreLink ISA, is also relatively easy to do. I use another macro to emit byte arrays of the appropriate length for some given instruction.

macro_rules! xdmainst {
    (END) => ([0x00]);
    (KILL) => ([0x01]);
    (FLUSHP $which:expr) => ([0x35, $which << 3]);
    (WFP $which:expr, periph) => ([0x31, $which << 3]);
    // ...
    (MOV $where:ident, $what:expr) => {{
        enum Reg {
            SAR = 0,
            CCR = 1,
            DAR = 2
        }
        // The 32-bit immediate is encoded little-endian after the register byte
        let b = ($what as u32).to_le_bytes();
        [0xbc, Reg::$where as u8, b[0], b[1], b[2], b[3]]
    }};
}
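As a quick illustration of what these arms produce, here is a cut-down sketch with two small wrapper functions. The encode macro and the mov_dar and flushp helpers are hypothetical names used only for this example; the opcode bytes are the ones shown above.

```rust
// Cut-down, hypothetical copy of the encoder with just three arms.
macro_rules! encode {
    (END) => ([0x00u8]);
    (FLUSHP $which:expr) => ([0x35u8, ($which as u8) << 3]);
    (MOV $where:ident, $what:expr) => {{
        #[allow(dead_code)]
        enum Reg {
            SAR = 0,
            CCR = 1,
            DAR = 2,
        }
        // The 32-bit immediate follows the register byte, little-endian.
        let b = ($what as u32).to_le_bytes();
        [0xbcu8, Reg::$where as u8, b[0], b[1], b[2], b[3]]
    }};
}

// Encodes "MOV DAR, addr" as a 6-byte instruction.
pub fn mov_dar(addr: u32) -> [u8; 6] {
    encode!(MOV DAR, addr)
}

// Encodes "FLUSHP periph" as a 2-byte instruction.
pub fn flushp(periph: u8) -> [u8; 2] {
    encode!(FLUSHP periph)
}
```

Each arm evaluates to a fixed-size array, so the length of every instruction is part of its type and a mismatch with the declared return type is a compile error.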

Then, copying the instruction into its designated slot in the buffer (arr here) is as easy as mutating successive slices (arr_sl).

let arr_sl = &mut arr[..];
$( // This macro repetition expands over every given instruction
    let inst_dat = xdmainst!( $inst_name $($inst_param),* );

    // Copy the encoded bytes into the front of the remaining buffer
    arr_sl[..inst_dat.len()].copy_from_slice(&inst_dat);

    // Update the slice to move onto the next instruction
    let arr_sl = &mut arr_sl[inst_dat.len()..];
)+
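Putting the size table, the encoder and the copy loop together, a miniature version of the whole assembler might look like the sketch below. It is a simplified stand-in for xdma_compile, not the real thing: the macro names (inst_size, encode, assemble) and the demo_program function are hypothetical, only three opcodes are supported, and loops are left out.

```rust
macro_rules! inst_size {
    (END) => (1);
    (FLUSHP) => (2);
    (MOV) => (6);
}

macro_rules! encode {
    (END) => ([0x00u8]);
    (FLUSHP $which:expr) => ([0x35u8, ($which as u8) << 3]);
    (MOV $where:ident, $what:expr) => {{
        #[allow(dead_code)]
        enum Reg {
            SAR = 0,
            CCR = 1,
            DAR = 2,
        }
        let b = ($what as u32).to_le_bytes();
        [0xbcu8, Reg::$where as u8, b[0], b[1], b[2], b[3]]
    }};
}

// Accepts `NAME;` or `NAME(param, ...);` items and returns a [u8; LEN] array.
macro_rules! assemble {
    ( $( $name:ident $( ( $( $param:tt ),* ) )? );+ $(;)? ) => {{
        // Total size, summed at compile time.
        const LEN: usize = 0 $( + inst_size!($name) )+;
        let mut arr = [0u8; LEN];
        let arr_sl = &mut arr[..];
        $(
            let inst_dat = encode!( $name $( $( $param ),* )? );
            // Copy into the front of the remaining buffer; slicing panics
            // rather than overruns if the size table and encoder disagree.
            arr_sl[..inst_dat.len()].copy_from_slice(&inst_dat);
            // Shadow the slice so it starts at the next free byte.
            let arr_sl = &mut arr_sl[inst_dat.len()..];
        )+
        let _ = arr_sl; // the final (empty) remainder is not needed
        arr
    }};
}

// Assembles a three-instruction program into a 9-byte array.
pub fn demo_program(dst: u32) -> [u8; 9] {
    let program = assemble! {
        MOV(DAR, dst);
        FLUSHP(3);
        END
    };
    program
}
```

Parameters are matched as token trees, which is why anything more complex than a single identifier or literal has to be wrapped in parentheses at the call site, as in the syntax shown earlier.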

Aside - Handling the DMALP instruction

The DMA instruction set additionally includes two instructions, LP and LPEND (DMALP and DMALPEND in ARM's mnemonics), which begin and end a loop. This slightly complicates the instruction stream copying, as I have to keep track of the number of bytes between the two instructions to calculate the relative backward jump encoded in the LPEND.

I keep a running byte offset for each of the two loop counters available on the DMA engine, initialize it when I spot an LP, and commit it when I spot the matching LPEND.

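// ...
// One running byte offset per hardware loop counter; None means "no open loop"
let mut loop_rel: [Option<u8>; 2] = [None; 2];
$(
    let inst_dat = { /* ... */ };

    // Every instruction inside an open loop lengthens its backward jump
    loop_rel[0].as_mut().map(|x| *x += xdmainst_size!( $inst_name ));
    loop_rel[1].as_mut().map(|x| *x += xdmainst_size!( $inst_name ));

    // LPEND commits the accumulated offset; LP starts a fresh one
    handle_lpend!( &mut loop_rel; arr_sl; $inst_name $($inst_param),* );
    handle_lp!( &mut loop_rel; $inst_name $($inst_param),* );
    // ...
)+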
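The bookkeeping is easier to follow outside of macro syntax, so here is a hedged sketch of it as an ordinary function. Everything in it is hypothetical: the Inst enum is a simplified model rather than the real instruction set, the order of the steps differs slightly from my macro (it commits before counting the LPEND itself), and it assumes the jump is measured from the LPEND back to the first instruction of the loop body.

```rust
/// Hypothetical, simplified instruction model: only what the bookkeeping needs.
#[derive(Clone, Copy)]
pub enum Inst {
    /// Loop start using loop counter 0 or 1 (assumed 2 bytes).
    Lp(usize),
    /// Loop end for the given loop counter (assumed 2 bytes).
    LpEnd(usize),
    /// Any other instruction, carrying its encoded size in bytes.
    Other(u8),
}

fn size(inst: Inst) -> u8 {
    match inst {
        Inst::Lp(_) | Inst::LpEnd(_) => 2,
        Inst::Other(n) => n,
    }
}

/// For each loop counter, returns the number of bytes between the start of
/// the loop body and its LPEND, or None if that counter's loop never closed.
/// Counter indices other than 0 or 1 cause an out-of-bounds panic.
pub fn backward_jumps(prog: &[Inst]) -> [Option<u8>; 2] {
    let mut loop_rel: [Option<u8>; 2] = [None; 2];
    let mut committed: [Option<u8>; 2] = [None; 2];

    for &inst in prog {
        // Commit first, so the LPEND's own bytes are not counted.
        if let Inst::LpEnd(n) = inst {
            committed[n] = loop_rel[n].take();
        }
        // Every instruction lengthens the jump of each loop still open.
        for open in loop_rel.iter_mut().flatten() {
            *open += size(inst);
        }
        // Open the loop last, so the LP's own bytes are not counted either.
        if let Inst::Lp(n) = inst {
            loop_rel[n] = Some(0);
        }
    }
    committed
}
```

With nested loops, the outer offset keeps growing while the inner loop opens and closes, so both jumps come out right from a single pass over the instruction list.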

Optimized Rust output

The Rust compiler's constant folding does a terrific job of optimizing this control code away. You can see this on Godbolt's Compiler Explorer.


Rust macros can implement readable assemblers, at least for simple architectures, with relatively little macro code. The assembler can even carry state from one instruction to the next, and most of this state-handling code will be optimized away during compilation.

As this method of assembling code never needs to allocate memory and can never overrun a statically allocated buffer, it can be an effective tool for embedded firmware or similar applications where alloc is not an option.