The previous post covered why eBPF program management is harder than it looks: distribution, access control, and multi-tenancy are all unsolved by the kernel itself. This post is about how bpfman solves them - what the components are, how they fit together, and the decisions behind the design.
The daemonless model
The first thing to understand about bpfman's architecture is that it is not a long-running daemon. It was, once - when it was called bpfd - but that design was rethought.
A long-running privileged process is an attack surface. A process holding CAP_BPF continuously, even while idle, is a risk that grows with time. The redesign takes a different approach: bpfman is launched only when needed - to load a program, unload one, or query state - and exits once it's done.
The result is that CAP_BPF is held by a process for as little time as possible.
The attack surface shrinks to the window in which bpfman is actually doing something.
The single binary
An earlier version of bpfman split functionality between two executables: bpfd (the server) and bpfctl (the client).
The redesign consolidates both into a single binary with subcommands:
- `bpfman system service` - runs the gRPC API server
- `bpfman load file` / `bpfman load image` - load a program from a local file or OCI image
- `bpfman unload` - remove a loaded program
- `bpfman list` - list currently loaded programs
- `bpfman pull` / `bpfman images` - manage the local image cache
One binary to install, one binary to understand, one binary to audit. The gRPC API, which was once used everywhere, has now been relegated to supporting the bpfman-operator only... and the eventual plan is to remove gRPC completely.
What happens when you load a program
Take the simplest case: a caller asks bpfman to load an XDP program from an OCI image.
1. Pull the image. bpfman pulls the OCI image from the specified registry, the same way a container runtime would. The image contains compiled eBPF bytecode and metadata describing the program type, map definitions, and expected attach points.
2. Verify the signature. Before loading anything into the kernel, bpfman verifies the image signature using cosign and the sigstore infrastructure. An unsigned or unverifiable image is rejected. eBPF programs run in the kernel, so the provenance of the bytecode matters.
3. Load into the kernel. The verified bytecode is loaded via the bpf() syscall. The kernel verifier runs at this point - it statically analyses the bytecode and rejects programs it can't prove are safe. This step requires CAP_BPF.
4. Attach via the dispatcher. Rather than attaching directly to the XDP hook, bpfman attaches its own dispatcher program (if one isn't already there) and registers the caller's program with it. The dispatcher calls registered programs in priority order. A second program loading at the same interface adds itself to the chain rather than overwriting what's already there.
5. Persist state. bpfman writes program configuration to its state store and exits after the inactivity timeout. State is not held in memory - it survives the process exiting. bpfman also persists state through the bpffs filesystem: pinning references to programs, maps, and links there keeps them alive in kernel memory after the process that loaded them has exited.
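The five steps above can be sketched as a single pipeline. This is an illustrative model only - every function name here is a hypothetical stub standing in for the work described, not bpfman's real internals:

```rust
// Hypothetical sketch of the load flow. None of these names are
// bpfman's real APIs; each stub stands in for one step above.

fn pull_image(reference: &str) -> Result<Vec<u8>, String> {
    // 1. Pull the OCI image and extract bytecode + metadata (stubbed).
    Ok(format!("bytecode-from:{reference}").into_bytes())
}

fn verify_signature(bytes: &[u8]) -> Result<(), String> {
    // 2. Verify with cosign/sigstore (stubbed): reject unverifiable images.
    if bytes.is_empty() { Err("unsigned image".into()) } else { Ok(()) }
}

fn load_into_kernel(_bytes: &[u8]) -> Result<u64, String> {
    // 3. bpf() syscall; the kernel verifier runs here. Returns a
    //    made-up program id for illustration.
    Ok(42)
}

fn attach_via_dispatcher(prog_id: u64, iface: &str, priority: u32) {
    // 4. Register with the dispatcher rather than attaching directly.
    println!("registered {prog_id} on {iface} at priority {priority}");
}

fn persist_state(prog_id: u64) {
    // 5. Write configuration to the state store, then exit.
    println!("persisted program {prog_id}");
}

fn main() -> Result<(), String> {
    let bytes = pull_image("quay.io/example/xdp-prog:v1")?;
    verify_signature(&bytes)?;
    let id = load_into_kernel(&bytes)?;
    attach_via_dispatcher(id, "eth0", 50);
    persist_state(id);
    Ok(())
}
```

The shape matters more than the stubs: signature verification happens strictly before anything touches the kernel, and persistence is the last step before exit.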
State persistence with sled
The shift to a short-lived process required rethinking state management. A long-running daemon can keep state in memory and reconstruct it from disk on restart. A process that exits by design cannot rely on in-memory state at all.
bpfman uses sled - a fast, embedded key-value database written in Rust - as its state store. Program configuration, attachment metadata, and dispatcher state are all written to sled. When bpfman starts (whether for the first time or after a previous run), it reads from sled to understand what is currently loaded and reconciles against actual kernel state.
This is a stricter model that forces better design: anything that needs to survive a process exit must be explicitly persisted. It also means that daemon crashes or upgrades don't take down your networking or observability stack - the kernel holds the loaded programs, sled holds the configuration, and the next invocation of bpfman picks up where the last one left off.
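The reconcile step can be modelled as a set difference between the desired state read from the store and the state observed in the kernel. A simplified sketch (program names only - real bpfman tracks much richer metadata, and this is not its actual code):

```rust
use std::collections::HashSet;

// Simplified model of reconciliation: compare what the state store
// says should be loaded against what the kernel actually has.
fn reconcile(desired: &HashSet<String>, actual: &HashSet<String>)
    -> (Vec<String>, Vec<String>)
{
    // Programs recorded in the store but missing from the kernel
    // need to be (re)loaded; programs in the kernel but absent from
    // the store are stale and should be removed.
    let to_load: Vec<String> = desired.difference(actual).cloned().collect();
    let to_remove: Vec<String> = actual.difference(desired).cloned().collect();
    (to_load, to_remove)
}

fn main() {
    let desired: HashSet<String> =
        ["xdp-filter", "tc-meter"].iter().map(|s| s.to_string()).collect();
    let actual: HashSet<String> =
        ["tc-meter", "orphan"].iter().map(|s| s.to_string()).collect();

    let (to_load, to_remove) = reconcile(&desired, &actual);
    println!("load: {to_load:?}, remove: {to_remove:?}");
}
```

Because the comparison runs on every invocation, it does not matter whether the previous process exited cleanly, crashed, or was upgraded in between.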
The dispatcher in detail
The multi-tenancy mechanism deserves a closer look. bpfman's approach is based on the protocol defined by libxdp, which introduced the concept of an XDP dispatcher: a meta-program with several stub functions in it. Your XDP program gets turned into a BPF_PROG_TYPE_EXT, and we attach it to one of the stub functions in the dispatcher. What's very clever about this is that when the assembled dispatcher program is loaded into the kernel, any empty stub functions are removed via dead-code elimination so it's performant too.
bpfman extends this protocol to TC (traffic control) hook points, which libxdp didn't cover. The other extension is that bpfman acts as the orchestrator for priority ordering. Callers declare their intent - "I want to run at priority 50 on eth0" - and bpfman manages the dispatcher configuration to make that happen. Programs remain unaware of each other; bpfman handles the coordination.
When the last program is unloaded from a hook point, bpfman removes the dispatcher too, leaving the interface clean.
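A minimal model of the priority chain might look like the following. This is illustrative only - the real dispatcher is a BPF program whose stub functions are replaced via BPF_PROG_TYPE_EXT, and the "proceed" decision is encoded in per-slot chain-call actions, not Rust closures:

```rust
// Toy model of the dispatcher protocol: registered programs run in
// priority order, and the chain continues only while each program
// returns an action listed in its "proceed_on" set.

#[derive(Clone, Copy, PartialEq, Debug)]
enum XdpAction { Pass, Drop }

struct Slot {
    priority: u32,
    // Actions that allow the chain to continue to the next program.
    proceed_on: Vec<XdpAction>,
    prog: fn(&[u8]) -> XdpAction,
}

fn run_dispatcher(slots: &mut Vec<Slot>, packet: &[u8]) -> XdpAction {
    // Lower priority value runs first.
    slots.sort_by_key(|s| s.priority);
    for slot in slots.iter() {
        let action = (slot.prog)(packet);
        if !slot.proceed_on.contains(&action) {
            // Short-circuit: this program's verdict is final.
            return action;
        }
    }
    XdpAction::Pass
}

fn main() {
    let mut slots = vec![
        Slot {
            priority: 50,
            proceed_on: vec![XdpAction::Pass],
            prog: |p| if p.is_empty() { XdpAction::Drop } else { XdpAction::Pass },
        },
        Slot {
            priority: 10,
            proceed_on: vec![XdpAction::Pass],
            prog: |_| XdpAction::Pass,
        },
    ];
    println!("{:?}", run_dispatcher(&mut slots, b"payload"));
}
```

Adding a second program at the same hook point is just another slot in the vector; neither program needs to know the other exists.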
Kubernetes integration
On Kubernetes, bpfman runs as a DaemonSet - one instance per node. eBPF programs are declared as custom resources.
You kubectl apply a BpfApplication the same way you'd apply a Deployment or a ConfigMap.
The bpfman operator watches for those resources and instructs the per-node bpfman-agent container to load them.
The CRDs are the authoritative source of truth in Kubernetes. After a node reboot, the DaemonSet starts, bpfman detects the divergence between desired state (CRDs) and actual state (empty), and reloads everything. Standard Kubernetes reconciliation, applied to eBPF programs.
Evolving the operator: the CLI is the API
You may have noticed the tension in the previous section. bpfman is daemonless, but we have DaemonSets in Kubernetes. The awkward truth is that it's easy to go daemonless at the Linux level, but Kubernetes patterns will continue to force your hand. For example, we can't have a serverless controller yet - i.e., one that only wakes to reconcile a CRD on demand.
As I mentioned before, there is a desire to remove gRPC from bpfman. Therefore, the per-node bpfman-agent DaemonSet will need a way to load programs!
Without gRPC, the CLI becomes our API. This might have felt controversial a few years ago. But with the rise of Agents and Skills, it is now much less so.
If the CLI becomes our API, then what should the controller do?
In our opinion, the future is the Kubernetes Job.
Essentially, we would spawn a Job to perform a bpfman invocation on a node.
This follows our ethos: if you need privileges, keep them for the least amount of time.
Using a Job creates new challenges for how we represent Status in our CRDs, so it's not a simple fix.
Expect to hear more from me about how we solved this in future.
Why OCI
The distribution decision - using OCI images for eBPF programs - is the one I'm most pleased with in retrospect, because it wasn't obvious when we made it.
The alternative was to build a bespoke packaging format: define a schema, build tooling to produce and consume it, build or integrate a registry, handle versioning and signing from scratch. That's a lot of infrastructure for a problem that the container ecosystem had already solved, in production, at scale.
OCI images are content-addressed, signable with cosign, and pullable from any OCI-compliant registry. Every security scanning tool in the ecosystem understands them. Every GitOps pipeline can reference them. By choosing OCI, bpfman programs inherit all of that for free.
The tradeoff is that an eBPF program packaged as an OCI image looks a bit unusual - it's not a container, it has no entrypoint, it just contains bytecode and metadata. That's a small conceptual overhead. The operational benefits are worth it.
The full picture
bpfman's architecture reflects a deliberate choice to minimise the time spent holding privileged capabilities: launch on demand, do the work, persist the state, exit. The kernel holds the loaded programs. sled holds the configuration. CAP_BPF is held for seconds, not indefinitely.
The next post steps back from the architecture and talks about what it was like to build and maintain this as an open source project.