Performance Analysis · Go Binary Encoding

nibble Benchmark Report

Declarative bit-level binary parsing for Go

Generated: —

10× faster than go-bitfield on unmarshal (1M packets)
5 M packets/second sustained throughput
182 ns per-packet unmarshal at 1M scale

Executive Summary

What was measured and what it means

nibble is a Go library that lets you describe binary packet layouts using struct tags (bits:"N"), then marshal and unmarshal bit-packed data declaratively — no hand-written bit arithmetic required.

This report benchmarks nibble against two alternatives: manual bit arithmetic (the theoretical performance ceiling — raw shifts and masks, zero reflection, zero allocations) and go-bitfield (a comparable struct-tag library that provides unmarshal-only parsing). The test packet is a 64-bit game-state struct (8 bytes, 8 fields with widths from 1 to 16 bits).

Key findings: After schema-caching was added to nibble, unmarshal throughput improved 11.6× (from 2102 ns/op to 182 ns/op per packet). nibble is now ~10× faster than go-bitfield and sustains ~5 million packets/second on a single core — sufficient for most production game servers, IoT hubs, and security tooling workloads. The remaining gap versus manual code (~27–40×) is the cost of reflection-based field dispatch.

When to use nibble: correctness-critical protocol work, rapid protocol iteration, debugging (Explain/Diff/Validate APIs), and any workload under ~5 M pkt/s. When to use manual: hot-path code requiring >5 M pkt/s, HFT/raw-networking, and cases where you own the bit-math and never change the protocol.

Parsing Speed: nibble vs manual vs go-bitfield

Unmarshal performance across five dataset sizes — ns per packet (lower is better)

[Charts: unmarshal ns/packet — all three libraries (log scale); nibble vs go-bitfield only (manual excluded for scale)]

Unmarshal · ns per packet

Dataset | nibble (ns) | manual (ns) | go-bitfield (ns) | nibble / manual | nibble / go-bitfield
100     | 212.7       | 5.3         | 2235.0           | 40.1×           | 10.5× faster
1K      | 244.2       | 11.7        | 2065.5           | 20.9×           | 8.5× faster
10K     | 182.5       | 5.2         | 1725.6           | 35.1×           | 9.5× faster
100K    | 184.9       | 5.7         | 1705.1           | 32.4×           | 9.2× faster
1M      | 181.7       | 6.7         | 1824.7           | 27.1×           | 10.0× faster

Encoding Speed: nibble vs manual

Marshal performance — ns per packet (lower is better) · go-bitfield has no Marshal API

[Charts: nibble vs manual marshal; nibble / manual overhead ratio by dataset size]

Marshal · ns per packet

Dataset | nibble (ns) | manual in-place (ns) | nibble / manual
100     | 237.9       | 10.4                 | 22.9×
1K      | 162.6       | 5.8                  | 28.0×
10K     | 157.2       | 6.0                  | 26.2×
100K    | 155.7       | 5.9                  | 26.4×
1M      | 225.8       | 35.0                 | 6.5×

Throughput: Millions of Packets Per Second

Higher is better · measured on Intel i7-10510U @ 1.80 GHz · single core

[Charts: Mpkt/s by dataset size — all libraries; real-world workload coverage (log scale)]

Throughput · millions of packets/second

Dataset | nibble unmarshal | manual unmarshal | go-bitfield | nibble marshal | manual marshal
100     | 4.7              | 187.5            | 0.45        | 4.2            | 95.7
1K      | 4.1              | 85.5             | 0.48        | 6.1            | 173.9
10K     | 5.5              | 191.7            | 0.58        | 6.4            | 167.7
100K    | 5.4              | 174.3            | 0.59        | 6.4            | 170.2
1M      | 5.5              | 149.2            | 0.55        | 4.4            | 28.6

Memory Allocation Profile

Heap allocations per single operation — measured with testing.AllocsPerRun(1000, …)

[Chart: allocs/op per operation type]

⚠️ nibble Marshal / Unmarshal: 2 allocs/op

Each call allocates two small objects on the heap — one for the parsed struct layout and one for the byte slice result. At 5 M pkt/s this is ~10 M allocs/s, increasing GC frequency. Target: 0 allocs/op via object pooling in a future release.

manual Marshal / Unmarshal: 0 allocs/op

Pure stack arithmetic — no heap involvement. ManualMarshalInto writes directly into a caller-supplied buffer. Zero GC pressure regardless of throughput.

📊 Long-running heap: flat post-GC

After 10 × 1M-packet batches the post-GC live heap stays at a constant 305 MiB — confirming nibble allocations are transient and the GC reclaims them fully.

Safety: Where nibble Wins Outright

Performance isn't everything — correctness and developer safety matter more in most codebases

Feature & safety comparison

Scenario | nibble | manual | go-bitfield
Truncated packet input | ✅ ErrInsufficientData | ❌ index panic | ❌ index panic
Field overflow (value > bit-width max) | ✅ ErrFieldOverflow | ❌ silent truncation | ❌ silent truncation
Protocol format change | ✅ 1-line struct edit | ❌ risky bit-math refactor | ✅ 1-line struct edit
Code readability | ✅ Declarative struct tags | ❌ Opaque bit arithmetic | ⚠️ Verbose (no Marshal)
Marshal support | ✅ Yes | ✅ Yes | ❌ No
Explain() debug tool | ✅ Yes — byte/bit breakdown | ❌ No | ❌ No
Validate() before marshal | ✅ Yes | ❌ Manual | ❌ No
Diff() struct comparison | ✅ Yes — field-level diff | ❌ No | ❌ No
Signed integer support | ✅ With sign extension | ✅ Manual sign extension | ⚠️ Only unsigned is safe
Bool field support | ✅ Native bool | ✅ Manual comparison | ❌ Must use uint8

Optimization Progress

Schema caching eliminated repeated reflection on every call

v0.1.0 · before caching
  2102.3 ns unmarshal per packet
  2458.8 ns marshal per packet
  ~0.5 Mpkt/s throughput
  300× overhead vs manual

latest · schema cached
  181.7 ns unmarshal per packet (11.6× faster)
  155.7 ns marshal per packet (15.8× faster)
  ~5.5 Mpkt/s throughput (11× higher)
  27–40× overhead vs manual

[Chart: performance before vs after schema caching]

When Should You Use nibble?

A practical guide to choosing the right approach

Do you need bit-level binary parsing?
↓ Yes                   ↓ No: use standard encoding/binary or protobuf
Is correctness / maintainability more important than raw throughput?
↓ Yes
Use nibble ✅
Safe, declarative, full-featured
↓ No — need maximum throughput
Processing > 5 M pkt/s?
↓ Yes
Use manual ⚡
Write and own the bit-math
↓ No
Use nibble ✅
5 Mpkt/s is plenty
✅ Use nibble
  • Game server backends
  • IoT device hubs
  • CTF / capture tools
  • Protocol prototyping
  • Any workload < 5 M pkt/s
✅ Use nibble (debug & ops)
  • Security packet scanners
  • Protocol debugging
  • New or changing formats
  • Teams without bit-math experts
  • When Explain() / Diff() matter
⚡ Use manual
  • HFT / raw networking
  • Kernel-adjacent packet paths
  • Throughput > 5 M pkt/s needed
  • Stable, never-changing format
  • You have thorough fuzz tests

Raw Benchmark Data

All numbers used in this report

Unmarshal benchmarks

Benchmark             | ns/op (loop)  | ns/pkt | MB/s | allocs/op | B/op
Nibble/Tiny_100       | 22,460        | 212.7  | 35.6 | 200       | 1700
Nibble/Small_1K       | 244,200       | 244.2  | 32.7 | 2000      | 17000
Nibble/Medium_10K     | 1,825,000     | 182.5  | 43.8 | 20000     | 170000
Nibble/Large_100K     | 18,490,000    | 184.9  | 43.3 | 200000    | 1700000
Nibble/XLarge_1M      | 181,700,000   | 181.7  | 44.0 | 2000000   | 17000000
Manual/Tiny_100       | 530           | 5.3    | 1270 | 0         | 0
Manual/Small_1K       | 11,700        | 11.7   | 546  | 0         | 0
Manual/Medium_10K     | 52,000        | 5.2    | 1231 | 0         | 0
Manual/Large_100K     | 570,000       | 5.7    | 1122 | 0         | 0
Manual/XLarge_1M      | 6,700,000     | 6.7    | 954  | 0         | 0
GoBitfield/Tiny_100   | 223,500       | 2235.0 | 3.6  | 100       | 800
GoBitfield/Small_1K   | 2,065,500     | 2065.5 | 3.9  | 1000      | 8000
GoBitfield/Medium_10K | 17,256,000    | 1725.6 | 4.6  | 10000     | 80000
GoBitfield/Large_100K | 170,510,000   | 1705.1 | 4.7  | 100000    | 800000
GoBitfield/XLarge_1M  | 1,824,700,000 | 1824.7 | 4.4  | 1000000   | 8000000
Marshal benchmarks

Benchmark         | ns/op (loop) | ns/pkt | MB/s | allocs/op | B/op
Nibble/Tiny_100   | 23,790       | 237.9  | 33.6 | 200       | 1600
Nibble/Small_1K   | 162,600      | 162.6  | 49.2 | 2000      | 16000
Nibble/Medium_10K | 1,572,000    | 157.2  | 50.9 | 20000     | 160000
Nibble/Large_100K | 15,570,000   | 155.7  | 51.4 | 200000    | 1600000
Nibble/XLarge_1M  | 225,800,000  | 225.8  | 35.4 | 2000000   | 16000000
Manual/Tiny_100   | 1,040        | 10.4   | 770  | 0         | 0
Manual/Small_1K   | 5,800        | 5.8    | 1379 | 0         | 0
Manual/Medium_10K | 60,000       | 6.0    | 1333 | 0         | 0
Manual/Large_100K | 590,000      | 5.9    | 1356 | 0         | 0
Manual/XLarge_1M  | 35,000,000   | 35.0   | 229  | 0         | 0

Reproduce these benchmarks

git clone https://github.com/PavanKumarMS/nibble-benchmark
cd nibble-benchmark
go mod tidy
go test -bench=. -benchmem -benchtime=10s -count=3 ./...
go run cmd/runner/main.go --full --open