uTensor: A Tale of Two Allocators

Naive architecture of large RAM memory allocator in uTensor

Introduction

One of the principal driving concepts in uTensor is that the various tensor implementations describe where and how the underlying data is accessed. For example, RomTensors read (only) from ROM, RamTensors from RAM, and so on. This makes it easy to support heterogeneous memory/compute architectures in a clean yet scalable fashion and allows TinyML engineers to predict and tune runtime performance with minimal code bloat.

uTensor facilitates the tensor lifecycle by splitting the core dynamic allocator in two: the first handles small, low-variance allocations, presumably in less volatile memory, and the second takes care of relatively large, high-variance allocations in RAM (or upcoming denser/noisier equivalents [1]). Dividing dynamic allocations this way has the further benefit that we can better estimate dynamic memory behavior a priori at model transform time, and it minimizes system heap pollution, since larger TinyML allocations can easily clash with network stacks or IoT management solutions.
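To make the split concrete, here is a minimal setup sketch in the spirit of the uTensor hello-world examples. The allocator template and Context setter names are assumptions recalled from those examples and may differ from the version of the library you are using.

```cpp
// Sketch: registering the two allocators with the uTensor context.
// Class and method names (localCircularArenaAllocator, set_*_allocator)
// are assumed from the public uTensor examples; check your release.
#include "uTensor.h"

using namespace uTensor;

// Small, low-variance blocks: tensor metadata (shape, type, quant params).
static localCircularArenaAllocator<1024> meta_allocator;

// Large, high-variance blocks: tensor data and operator scratch space.
static localCircularArenaAllocator<4096, uint32_t> ram_allocator;

void setup_utensor() {
  Context::get_default_context()->set_metadata_allocator(&meta_allocator);
  Context::get_default_context()->set_ram_data_allocator(&ram_allocator);
}
```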

Detailed description of Memory Allocator Interface

MetaAllocator, small but non-trivial

Originally, we proposed the metadata allocator to minimize the runtime stack pressure that results from managing the various tensor objects' information. As it turns out, the basic structure of a generic tensor (shape, quantization parameters, type info, etc.) takes up a non-trivial amount of storage when you consider how many tensors are often managed in a model. Furthermore, not all of these parameters are read-only and static, especially when dealing with heavily optimized compute kernels and co-accelerators.

Fast forward to our modern interpretation of the metadata allocator: it is geared towards one-shot or short-lived allocations of relatively small, low-variance blocks, on the order of bytes to tens of bytes each. Although the impact is small, deciding which type of allocation to use can lead to slightly lower dynamic RAM usage or slightly faster inference. The default implementation uses the same allocator as the RAM data allocator, just scaled down internally, partly out of sheer laziness on my part and partly for teaching purposes. There is no strong reason this allocator cannot use the default system heap under the hood; we just need to be careful with object alignment. However, I do plan on eventually writing a lightweight dedicated version that is optimized for small block allocations.
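As a thought experiment, such a dedicated metadata allocator could be as simple as a fixed-size block pool over a static arena. The sketch below is hypothetical (the MetaPool name and parameters are made up for illustration), not the uTensor implementation, but it shows why small, low-variance blocks are so cheap to manage: allocation and deallocation are O(1) and fragmentation is impossible by construction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size block pool for small metadata allocations.
// Every block is BLOCK_SIZE bytes, threaded onto an intrusive free list.
template <size_t BLOCK_SIZE = 32, size_t NUM_BLOCKS = 64>
class MetaPool {
 public:
  MetaPool() : free_list_(nullptr) {
    // Push every block onto the free list at construction time.
    for (size_t i = 0; i < NUM_BLOCKS; ++i) {
      Node* n = reinterpret_cast<Node*>(&arena_[i * BLOCK_SIZE]);
      n->next = free_list_;
      free_list_ = n;
    }
  }

  void* allocate(size_t size) {
    // Requests larger than one block don't belong in the metadata pool.
    if (size > BLOCK_SIZE || free_list_ == nullptr) return nullptr;
    Node* n = free_list_;
    free_list_ = n->next;
    return n;
  }

  void deallocate(void* p) {
    if (p == nullptr) return;
    Node* n = static_cast<Node*>(p);
    n->next = free_list_;
    free_list_ = n;
  }

 private:
  struct Node { Node* next; };
  // Keep every block suitably aligned for whatever metadata lands in it.
  alignas(alignof(std::max_align_t)) uint8_t arena_[BLOCK_SIZE * NUM_BLOCKS];
  Node* free_list_;
};
```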

One neat byproduct of making this a separate allocator is that it becomes much easier to support online model update strategies. Generally speaking, weights in a graph are much more likely to change across model updates than the graph structure. This makes sense for two reasons: 1) we can fine-tune our model as new data samples come in after deployment, and 2) if the graph structure doesn't change, then we can treat the weights as managed objects in deployment and update them in persistent memory without having to incur an expensive firmware update campaign. With a separate metadata allocator, where tensor object info is guaranteed to live, we can enable online model updates either with a dirty-tag strategy (update weights speculatively) or a feed-forward fencing strategy (only update weights prior to the current inference point), both of which are cheap to maintain internally in the uTensor context.

RamAllocator, a huge range of allocations

Whereas the metadata allocator is made for small, consistent allocations, blocks allocated in the RAM data allocator are a lot like my cat: massive. To put it into perspective, with a little bit of engineering effort it's possible to fit a decent-sized IoT application, RTOS, IoT management code, network stack, and SSL bits into about 40-80 KB. And unless you are running an aggressive network architecture search strategy [3], a pruning strategy, or challenging tensor compression**/decomposition techniques, chances are your model is at least as big as the rest of your firmware, meaning its intermediate tensors use up a significant amount of RAM.

In long-term, reliable deployments it's important to know exactly how these expensive allocations behave. For example, the uTensor model transform workflow does a best-effort search for a memory allocation plan based on the model plus runtime guarantees, and outputs both this plan and a soft upper bound on the required allocator sizes. Internally, the allocator will attempt to follow this plan, which in the worst case can include moving allocated blocks around. We also factor in extra scratch buffer space for operators that need it. The downside of preallocating a dedicated arena per big application is that a lot of space sits unused when it could otherwise be shared between the apps. However, in practice we've found it helpful to let users extend the size of this arena to fit their budget and then wrap new and delete in a dedicated namespace, for example uTensor::RamAllocation::new.
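One way to read the "wrap new and delete" suggestion is to route allocations for arena-backed objects through class-level operator new/delete overloads. The sketch below is a hypothetical illustration of that idea under my own assumptions (the RamArena and RamAllocated names are invented here, and the trivial bump allocator stands in for the real RAM data allocator); it is not the uTensor API.

```cpp
// C++17 sketch: redirect `new`/`delete` for selected objects into a
// preallocated RAM arena instead of the system heap.
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for the uTensor RAM data allocator: a trivial
// bump allocator over a static arena, sized to the user's budget.
class RamArena {
 public:
  static void* allocate(std::size_t size) {
    // Round up so subsequent allocations stay aligned.
    size = (size + alignof(std::max_align_t) - 1) &
           ~(alignof(std::max_align_t) - 1);
    if (offset_ + size > sizeof(arena_)) return nullptr;
    void* p = &arena_[offset_];
    offset_ += size;
    return p;
  }
  static void deallocate(void*) { /* a real arena would reclaim the block */ }

 private:
  alignas(alignof(std::max_align_t)) static inline uint8_t arena_[8 * 1024] = {};
  static inline std::size_t offset_ = 0;
};

// Mixin: objects deriving from this draw from the arena, so plain
// `new`/`delete` on them never touches the global heap.
struct RamAllocated {
  static void* operator new(std::size_t size) {
    if (void* p = RamArena::allocate(size)) return p;
    throw std::bad_alloc();
  }
  static void operator delete(void* p) noexcept { RamArena::deallocate(p); }
};

// Example: `new ScratchBuffer` now lands in the dedicated arena.
struct ScratchBuffer : RamAllocated {
  float data[256];
};
```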

** I’ve had pretty good luck with custom Golomb-Rice based decompression of weights since it can be streamed.
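For the curious, the streamed decompression mentioned in the footnote looks roughly like the following. This is a generic Rice-decoding sketch, not the actual uTensor tooling, and it assumes one common bit convention: a unary-coded quotient (a run of 1s terminated by a 0) followed by a k-bit remainder, with a fixed k per weight tensor and unsigned values.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal MSB-first bit reader over a compressed weight blob.
class BitReader {
 public:
  BitReader(const uint8_t* data, size_t len) : data_(data), len_(len) {}

  // Returns the next bit, or 0 once the stream is exhausted.
  uint32_t read_bit() {
    if (pos_ >= len_ * 8) return 0;
    uint32_t bit = (data_[pos_ >> 3] >> (7 - (pos_ & 7))) & 1u;
    ++pos_;
    return bit;
  }

 private:
  const uint8_t* data_;
  size_t len_;
  size_t pos_ = 0;
};

// Decode one Rice-coded value with parameter k: value = (q << k) | r.
// Because decoding consumes the stream bit by bit, weights can be
// decompressed on the fly without a full decompression buffer.
uint32_t rice_decode(BitReader& br, uint32_t k) {
  uint32_t q = 0;
  while (br.read_bit() == 1) ++q;  // unary quotient
  uint32_t r = 0;
  for (uint32_t i = 0; i < k; ++i) r = (r << 1) | br.read_bit();
  return (q << k) | r;
}
```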

Citations

[1] Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators

[2] Detailed description of Memory Allocator Interface

[3] SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers
