Performance Analysis · Go Binary Encoding

nibble Benchmark Report

Declarative bit-level binary parsing for Go

Generated: —

10× faster than go-bitfield on unmarshal (1M packets)
5 M packets/second sustained throughput
182 ns per-packet unmarshal at 1M scale

Executive Summary

What was measured and what it means

nibble is a Go library that lets you describe binary packet layouts using struct tags (bits:"N"), then marshal and unmarshal bit-packed data declaratively — no hand-written bit arithmetic required.

This report benchmarks nibble against two alternatives: manual bit arithmetic (the theoretical performance ceiling — raw shifts and masks, zero reflection, zero allocations) and go-bitfield (a comparable struct-tag library that provides unmarshal-only parsing). The test packet is a 64-bit game-state struct (8 bytes, 8 fields with widths from 1 to 16 bits).

Key findings: After schema-caching was added to nibble, unmarshal throughput improved 11.6× (from 2102 ns/op to 182 ns/op per packet). nibble is now ~10× faster than go-bitfield and sustains ~5 million packets/second on a single core — sufficient for most production game servers, IoT hubs, and security tooling workloads. The remaining gap versus manual code (~27–40×) is the cost of reflection-based field dispatch.

When to use nibble: correctness-critical protocol work, rapid protocol iteration, debugging (Explain/Diff/Validate APIs), and any workload under ~5 M pkt/s. When to use manual: hot-path code requiring >5 M pkt/s, HFT/raw-networking, and cases where you own the bit-math and never change the protocol.

Parsing Speed: nibble vs manual vs go-bitfield

Unmarshal performance across five dataset sizes — ns per packet (lower is better)

[Charts: unmarshal ns/packet — all three libraries (log scale); nibble vs go-bitfield only (manual excluded for scale)]

Unmarshal · ns per packet

Dataset | nibble (ns) | manual (ns) | go-bitfield (ns) | nibble / manual | nibble / go-bitfield
100     | 212.7       | 5.3         | 2235.0           | 40.1×           | 10.5× faster
1K      | 244.2       | 11.7        | 2065.5           | 20.9×           | 8.5× faster
10K     | 182.5       | 5.2         | 1725.6           | 35.1×           | 9.5× faster
100K    | 184.9       | 5.7         | 1705.1           | 32.4×           | 9.2× faster
1M      | 181.7       | 6.7         | 1824.7           | 27.1×           | 10.0× faster

Encoding Speed: nibble vs manual

Marshal performance — ns per packet (lower is better) · go-bitfield has no Marshal API

[Charts: nibble vs manual marshal; nibble / manual overhead ratio by dataset size]

Marshal · ns per packet

Dataset | nibble (ns) | manual in-place (ns) | nibble / manual
100     | 237.9       | 10.4                 | 22.9×
1K      | 162.6       | 5.8                  | 28.0×
10K     | 157.2       | 6.0                  | 26.2×
100K    | 155.7       | 5.9                  | 26.4×
1M      | 225.8       | 35.0                 | 6.5×

Throughput: Millions of Packets Per Second

Higher is better · measured on Intel i7-10510U @ 1.80 GHz · single core

[Charts: Mpkt/s by dataset size — all libraries; real-world workload coverage (log scale)]

Throughput · millions of packets/second

Dataset | nibble unmarshal | manual unmarshal | go-bitfield | nibble marshal | manual marshal
100     | 4.7              | 187.5            | 0.45        | 4.2            | 95.7
1K      | 4.1              | 85.5             | 0.48        | 6.1            | 173.9
10K     | 5.5              | 191.7            | 0.58        | 6.4            | 167.7
100K    | 5.4              | 174.3            | 0.59        | 6.4            | 170.2
1M      | 5.5              | 149.2            | 0.55        | 4.4            | 28.6

Memory Allocation Profile

Heap allocations per single operation — measured with testing.AllocsPerRun(1000, …)

[Chart: allocs/op per operation type]

⚠️ nibble Marshal / Unmarshal: 2 allocs/op

Each call allocates two small objects on the heap — one for the parsed struct layout and one for the byte slice result. At 5 M pkt/s this is ~10 M allocs/s, increasing GC frequency. Target: 0 allocs/op via object pooling in a future release.

manual Marshal / Unmarshal: 0 allocs/op

Pure stack arithmetic — no heap involvement. ManualMarshalInto writes directly into a caller-supplied buffer. Zero GC pressure regardless of throughput.

📊 Long-running heap: flat post-GC

After 10 × 1M-packet batches the post-GC live heap stays at a constant 305 MiB — confirming nibble allocations are transient and the GC reclaims them fully.

Safety: Where nibble Wins Outright

Performance isn't everything — correctness and developer safety matter more in most codebases

Feature & safety comparison

Scenario | nibble | manual | go-bitfield
Truncated packet input | ✅ ErrInsufficientData | ❌ index panic | ❌ index panic
Field overflow (value > bit-width max) | ✅ ErrFieldOverflow | ❌ silent truncation | ❌ silent truncation
Protocol format change | ✅ 1-line struct edit | ❌ risky bit-math refactor | ✅ 1-line struct edit
Code readability | ✅ Declarative struct tags | ❌ Opaque bit arithmetic | ⚠️ Verbose (no Marshal)
Marshal support | ✅ Yes | ✅ Yes | ❌ No
Explain() debug tool | ✅ Yes — byte/bit breakdown | ❌ No | ❌ No
Validate() before marshal | ✅ Yes | ❌ Manual | ❌ No
Diff() struct comparison | ✅ Yes — field-level diff | ❌ No | ❌ No
Signed integer support | ✅ With sign extension | ✅ Manual sign extension | ⚠️ Only unsigned is safe
Bool field support | ✅ Native bool | ✅ Manual comparison | ❌ Must use uint8

Optimization Progress

Schema caching eliminated repeated reflection on every call

v0.1.0 · before caching
  2102.3 ns unmarshal per packet
  2458.8 ns marshal per packet
  ~0.5 Mpkt/s throughput
  300× overhead vs manual

latest · schema cached
  181.7 ns unmarshal per packet (11.6× faster)
  155.7 ns marshal per packet (15.8× faster)
  ~5.5 Mpkt/s throughput (11× higher)
  27–40× overhead vs manual

[Chart: performance before vs after schema caching]

When Should You Use nibble?

A practical guide to choosing the right approach

Do you need bit-level binary parsing?
↓ Yes                   ↓ No: use standard encoding/binary or protobuf
Is correctness / maintainability more important than raw throughput?
↓ Yes
Use nibble ✅
Safe, declarative, full-featured
↓ No — need maximum throughput
Processing > 5 M pkt/s?
↓ Yes
Use manual ⚡
Write and own the bit-math
↓ No
Use nibble ✅
5 Mpkt/s is plenty
✅ Use nibble
  • Game server backends
  • IoT device hubs
  • CTF / capture tools
  • Protocol prototyping
  • Any workload < 5 M pkt/s
✅ Use nibble (debug & ops)
  • Security packet scanners
  • Protocol debugging
  • New or changing formats
  • Teams without bit-math experts
  • When Explain() / Diff() matter
⚡ Use manual
  • HFT / raw networking
  • Kernel-adjacent packet paths
  • Throughput > 5 M pkt/s needed
  • Stable, never-changing format
  • You have thorough fuzz tests

Raw Benchmark Data

All numbers used in this report

Unmarshal benchmarks

Benchmark             | ns/op (loop)  | ns/pkt | MB/s | allocs/op | B/op
Nibble/Tiny_100       | 22,460        | 212.7  | 35.6 | 200       | 1700
Nibble/Small_1K       | 244,200       | 244.2  | 32.7 | 2000      | 17000
Nibble/Medium_10K     | 1,825,000     | 182.5  | 43.8 | 20000     | 170000
Nibble/Large_100K     | 18,490,000    | 184.9  | 43.3 | 200000    | 1700000
Nibble/XLarge_1M      | 181,700,000   | 181.7  | 44.0 | 2000000   | 17000000
Manual/Tiny_100       | 530           | 5.3    | 1270 | 0         | 0
Manual/Small_1K       | 11,700        | 11.7   | 546  | 0         | 0
Manual/Medium_10K     | 52,000        | 5.2    | 1231 | 0         | 0
Manual/Large_100K     | 570,000       | 5.7    | 1122 | 0         | 0
Manual/XLarge_1M      | 6,700,000     | 6.7    | 954  | 0         | 0
GoBitfield/Tiny_100   | 223,500       | 2235.0 | 3.6  | 100       | 800
GoBitfield/Small_1K   | 2,065,500     | 2065.5 | 3.9  | 1000      | 8000
GoBitfield/Medium_10K | 17,256,000    | 1725.6 | 4.6  | 10000     | 80000
GoBitfield/Large_100K | 170,510,000   | 1705.1 | 4.7  | 100000    | 800000
GoBitfield/XLarge_1M  | 1,824,700,000 | 1824.7 | 4.4  | 1000000   | 8000000
Marshal benchmarks

Benchmark         | ns/op (loop) | ns/pkt | MB/s | allocs/op | B/op
Nibble/Tiny_100   | 23,790       | 237.9  | 33.6 | 200       | 1600
Nibble/Small_1K   | 162,600      | 162.6  | 49.2 | 2000      | 16000
Nibble/Medium_10K | 1,572,000    | 157.2  | 50.9 | 20000     | 160000
Nibble/Large_100K | 15,570,000   | 155.7  | 51.4 | 200000    | 1600000
Nibble/XLarge_1M  | 225,800,000  | 225.8  | 35.4 | 2000000   | 16000000
Manual/Tiny_100   | 1,040        | 10.4   | 770  | 0         | 0
Manual/Small_1K   | 5,800        | 5.8    | 1379 | 0         | 0
Manual/Medium_10K | 60,000       | 6.0    | 1333 | 0         | 0
Manual/Large_100K | 590,000      | 5.9    | 1356 | 0         | 0
Manual/XLarge_1M  | 35,000,000   | 35.0   | 229  | 0         | 0

Reproduce these benchmarks

git clone https://github.com/PavanKumarMS/nibble-benchmark
cd nibble-benchmark
go mod tidy
go test -bench=. -benchmem -benchtime=10s -count=3 ./...
go run cmd/runner/main.go --full --open