What 4 Million Files Taught Me: A Diskly Scanner Performance Postmortem

Diskly’s job is to scan a disk with millions of files in tens of seconds, hold the entire directory tree in memory, render it as a treemap, and stay smooth while you hover, drill down, and delete. At this scale there’s one iron law: any code that does “just a tiny bit more per file” gets multiplied by four million. This postmortem covers the most expensive lessons, all with measured numbers.

75s → 2.2s: strings are a luxury

After a scan finishes, there’s a categorization pass: aggregate usage into buckets like “Apps / Developer / System / Documents.” The first version called url.path on every file to build a full path string, then ran a dozen-plus substring scans to decide which category directory it fell under. 4 million files × (one string construction + a dozen scans) = 75 seconds, which users experienced as “the scan froze at 99%.”

The fix came from an observation that sounds worthless once said aloud: location category is a property of the directory, not the file. Everything under /Applications is an “app” — decide once. Restructured:

The scan context ScanCtx resolves the category once on entering a directory, and it’s inherited down the traversal;
Per-file work shrinks to a single “extract the extension from name” — O(short string);
The aggregation buckets are a fixed 8-slot struct on the stack (exactly 64 bytes), CategoryBuckets: accumulation with zero heap allocation, zero dictionary hashing.

75s → 2.2s, 34x. No black magic: the path-dependent work drops from O(file count × path length) to O(directory count), and what remains per file is one short scan of its name.
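The restructuring above can be sketched roughly as follows. This is a minimal illustration, not Diskly’s actual code: the category names beyond the four mentioned, the directoryRules table, the extension lists, and the helpers entering(_:), fileExtension(of:) and categorize(fileNamed:in:) are all assumptions made for the example.

```swift
// Hypothetical category set; only apps/developer/system/documents come from the text.
enum Category: Int {
    case apps, developer, system, documents, media, archives, caches, other
}

/// Eight fixed Int64 slots: 8 x 8 = 64 bytes, no heap allocation, no hashing.
struct CategoryBuckets {
    private var slots: (Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64)
        = (0, 0, 0, 0, 0, 0, 0, 0)

    mutating func add(_ bytes: Int64, to category: Category) {
        withUnsafeMutableBytes(of: &slots) { raw in
            let ints = raw.bindMemory(to: Int64.self)
            ints[category.rawValue] += bytes
        }
    }

    func total(for category: Category) -> Int64 {
        withUnsafeBytes(of: slots) { $0.bindMemory(to: Int64.self)[category.rawValue] }
    }
}

/// Hypothetical directory rules: absolute directory path -> category.
let directoryRules: [String: Category] = [
    "/Applications": .apps,
    "/System": .system,
    "/Library/Developer": .developer,
]

struct ScanCtx {
    /// nil until some ancestor directory matched a rule.
    let category: Category?

    /// Called once per directory: one lookup, otherwise inherit from the parent.
    func entering(_ directoryPath: String) -> ScanCtx {
        ScanCtx(category: directoryRules[directoryPath] ?? category)
    }
}

/// Per-file work: scan the short file name for its extension, nothing else.
func fileExtension(of name: String) -> Substring? {
    guard let dot = name.lastIndex(of: "."), dot != name.startIndex else { return nil }
    return name[name.index(after: dot)...]
}

func categorize(fileNamed name: String, in ctx: ScanCtx) -> Category {
    if let inherited = ctx.category { return inherited }
    switch fileExtension(of: name)?.lowercased() ?? "" {
    case "pdf", "docx", "txt": return .documents
    case "zip", "dmg": return .archives
    default: return .other
    }
}

// Usage: the category is resolved at /Applications and inherited below it.
var buckets = CategoryBuckets()
let root = ScanCtx(category: nil)
let nested = root.entering("/Applications").entering("/Applications/Xcode.app")
buckets.add(4_096, to: categorize(fileNamed: "Info.plist", in: nested))
```

The tuple-backed struct is one way to get a fixed 64-byte layout; a struct with eight named Int64 fields would do the same job with more boilerplate.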

The 6.46TB “2TB disk”: firmlinks and inode deduplication

Scanning the root / reported 6.46TB of usage — on a 2TB disk. This isn’t some abstract bug; it’s the concrete reality of macOS volume layout: since Catalina, the system and data volumes are separate, paths like /Users and /Applications reach the data volume through firmlinks, and /System/Volumes/Data mounts that same data volume again in full. The same files have multiple legitimate paths from the root down, and a naive traversal counts every one.

The correct way to dedupe is to trust inodes, not paths: lstat for the (st_dev, st_ino) pair, into a seen-set. But doing it for everything is too expensive — millions of set insertions are themselves a cost. The actual strategy is layered:

Directories are always deduped — firmlinks and duplicate mount entry points are directories; no cutting corners here;
Files are deduped only when st_nlink > 1: a file with linkCount == 1 has exactly one directory entry, and because every directory that could lead to it has already been deduped, no second path can arrive at it. It’s counted directly. That’s over 99% of cases, at zero extra cost;
Keys are integer pairs, not URL or any other object — integer hashing is fast, and it saves millions of boxings.
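A minimal sketch of the layered strategy, assuming a hypothetical Deduper type and InodeKey key (neither name is from the original code). The decision logic is kept separate from the lstat call so it can be exercised without touching the disk.

```swift
import Foundation

/// Integer-pair key: cheap to hash, no object allocation per entry.
struct InodeKey: Hashable {
    let dev: UInt64
    let ino: UInt64
}

struct Deduper {
    private var seen = Set<InodeKey>()

    /// Returns true when the entry should be counted.
    mutating func shouldCount(dev: UInt64, ino: UInt64,
                              linkCount: UInt64, isDirectory: Bool) -> Bool {
        // Single-link files never touch the set: the common, free path.
        if !isDirectory && linkCount <= 1 { return true }
        // Directories and multi-link files are counted only on first sight.
        return seen.insert(InodeKey(dev: dev, ino: ino)).inserted
    }

    /// Convenience wrapper that reads the identity with lstat (does not follow symlinks).
    mutating func shouldCount(path: String) -> Bool {
        var st = stat()
        guard lstat(path, &st) == 0 else { return false }
        let isDirectory = (st.st_mode & S_IFMT) == S_IFDIR
        return shouldCount(dev: UInt64(truncatingIfNeeded: st.st_dev),
                           ino: UInt64(truncatingIfNeeded: st.st_ino),
                           linkCount: UInt64(truncatingIfNeeded: st.st_nlink),
                           isDirectory: isDirectory)
    }
}
```

In a multi-threaded scanner the seen-set would also need synchronization or per-worker sharding, which this sketch leaves out.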

APFS clones (copy-on-write shared blocks) are a different matter: two cloned files each genuinely “own” all their logical bytes, and the physical sharing is the filesystem’s internal bookkeeping. The scanner doesn’t try to untangle it and counts each clone in full, so totals can exceed physical usage where clones exist. At boundaries like this, honest beats clever.

One sibling detail worth noting: file size uses a three-level fallback, totalFileAllocatedSize ?? fileAllocatedSize ?? fileSize ?? 0 — preferring “actual on-disk footprint” (including block alignment and transparent compression), which is the correct measure for the question “where did my disk go.”
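As a small sketch of that fallback chain: the three properties are real URLResourceValues fields, while the helper names pickSize and onDiskSize are made up for this example.

```swift
import Foundation

/// The fallback itself, kept pure so the ordering is easy to check.
func pickSize(total: Int?, allocated: Int?, logical: Int?) -> Int {
    total ?? allocated ?? logical ?? 0
}

let sizeKeys: Set<URLResourceKey> = [
    .totalFileAllocatedSizeKey, .fileAllocatedSizeKey, .fileSizeKey,
]

/// Reads the three candidates for one URL; returns 0 if nothing is available.
func onDiskSize(of url: URL) -> Int {
    guard let values = try? url.resourceValues(forKeys: sizeKeys) else { return 0 }
    return pickSize(total: values.totalFileAllocatedSize,
                    allocated: values.fileAllocatedSize,
                    logical: values.fileSize)
}
```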

Memory: three unrelated invisible hogs

With millions of nodes resident in memory, one byte saved per node is several MB. Three independent fixes, which together pulled peak usage back into a manageable range:

UUID → ObjectIdentifier: node ids were UUIDs, 16 bytes each. 9 million nodes × 16 bytes = 144MB, call it 150MB of pure overhead. But FileNode is a reference type, and an object’s address is unique for as long as the object lives, so ObjectIdentifier costs no stored bytes. Identifiable, which SwiftUI leans on, doesn’t care what your id is, only that it’s Hashable and stable;
Strip URLs of their hidden cache: the resourceValues prefetched by contentsOfDirectory(includingPropertiesForKeys:) are cached inside the URL object. If you skip removeAllCachedResourceValues() before nodes enter the tree, every URL drags a metadata dictionary to the grave with it, which adds up to multiple GB of extra resident memory. The cache is useless downstream: the fields we need were extracted long ago;
An autoreleasepool per directory: scan workers are long loops that never return. Autoreleased temporaries from Foundation calls normally wait for the thread’s pool to drain, and a long loop never reaches a drain point, so an entire scan’s worth of temporaries piles up. Wrapping each directory’s processing in its own autoreleasepool block releases them promptly.
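The three fixes can be sketched together in one per-directory function. This is an illustration under assumptions: the FileNode shape, the withPool wrapper and scanDirectory are hypothetical, and the wrapper only exists so the sketch also compiles where autoreleasepool is unavailable.

```swift
import Foundation

#if canImport(ObjectiveC)
/// On Apple platforms, drain autoreleased temporaries when the closure ends.
func withPool<T>(_ body: () throws -> T) rethrows -> T {
    try autoreleasepool(invoking: body)
}
#else
/// No Objective-C runtime: there is nothing to drain.
func withPool<T>(_ body: () throws -> T) rethrows -> T { try body() }
#endif

final class FileNode: Identifiable {
    let url: URL
    let size: Int
    var children: [FileNode] = []

    // The object's identity serves as the id: no stored UUID per node.
    var id: ObjectIdentifier { ObjectIdentifier(self) }

    init(url: URL, size: Int) {
        self.url = url
        self.size = size
    }
}

let prefetchKeys: [URLResourceKey] = [.isDirectoryKey, .fileSizeKey]

/// One directory's worth of work, wrapped in its own pool.
func scanDirectory(_ directory: URL) -> [FileNode] {
    return withPool { () -> [FileNode] in
        let entries = (try? FileManager.default.contentsOfDirectory(
            at: directory, includingPropertiesForKeys: prefetchKeys, options: [])) ?? []
        return entries.map { entry -> FileNode in
            var url = entry
            // Extract the fields we need while the prefetched cache is warm...
            let values = try? url.resourceValues(forKeys: Set(prefetchKeys))
            let size = values?.fileSize ?? 0
            // ...then drop the cache so the URL kept by the node stays small.
            url.removeAllCachedResourceValues()
            return FileNode(url: url, size: size)
        }
    }
}
```

One caveat on the id choice: an ObjectIdentifier is only meaningful while the object is alive, so it suits nodes that live as long as the tree does.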

Concurrency: intuition loses to 24% and 40%

“Disk scanning is I/O-bound, so more threads is always better” — sounds airtight. Measured (8 cores, local SSD): 16 threads burned 24% more CPU and 40% more memory than 8 threads, with zero speedup.

Because the premise was wrong. Metadata operations on NVMe are so fast that the bottleneck isn’t the device at all — it’s syscalls and kernel locks on directory structures. Threads beyond the core count add no throughput; they just spin on the shared work queue’s lock, feeding CPU to the scheduler. The final default: min(16, max(4, cores)).

The one exception that holds up is cloud-synced directories: every stat on an iCloud/OneDrive placeholder file is an IPC round trip to the File Provider process, so threads really do spend most of their time waiting, and over-provisioning pays off. That is why workers scale up to 32 only when the user opts into “include cloud directories.” Same parameter, and the right ceiling differs by 2x (16 versus 32) between two workloads: concurrency is never a constant; it’s a function of the workload.
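Written out as code, the policy might look like the sketch below. The function name and parameters are hypothetical, and returning a flat 32 for the cloud case is an assumption: the text only says workers scale up to 32.

```swift
import Foundation

/// Worker count as a function of the workload, not a constant.
func defaultWorkerCount(
    cores: Int = ProcessInfo.processInfo.activeProcessorCount,
    includeCloudDirectories: Bool = false
) -> Int {
    // Cloud placeholders block on IPC, so over-provisioning pays off there.
    // Assumption: go straight to the 32-worker ceiling.
    if includeCloudDirectories { return 32 }
    // Local SSD: syscall and lock bound, so stay near the core count.
    return min(16, max(4, cores))
}
```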

(Cloud directories hide a nastier trap: those placeholder files share the boot volume’s dev_t, so a volume blacklist can’t screen them out — only an absolute-path blacklist works, and metadata reads must never trigger a download. Otherwise one “disk analysis” can devour the user’s entire iCloud bandwidth.)

Finishing and rendering: two traps in the last mile

The COW trap in sorting: once the tree is built, it’s sorted by size and small items get folded into “Other (N items).” If the folding function holds an extra reference to the children array, the subsequent removeLast finds the buffer shared and triggers a full copy-on-write duplication, and all the memory saved by in-place sorting evaporates on the spot. Swift’s value semantics are both a safety net and a performance tax: on million-element arrays, every casual “let me stash a reference” needs a second thought.
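The trap is easy to reproduce in isolation. The sketch below uses a plain [Int] in place of the children array, and the helper names are invented for the demonstration; it compares the array’s buffer address before and after removeLast to see whether a copy happened.

```swift
/// Address of the array's element storage, as a plain integer.
func bufferAddress(_ array: [Int]) -> Int {
    array.withUnsafeBufferPointer { Int(bitPattern: $0.baseAddress) }
}

/// Uniquely referenced: removeLast mutates the existing buffer.
func removeLastKeepsBufferWhenUnique() -> Bool {
    var children = Array(0..<1_000)
    let before = bufferAddress(children)
    children.removeLast()
    return bufferAddress(children) == before
}

/// A stashed second reference forces a full copy on the first mutation.
func removeLastKeepsBufferWhenStashed() -> Bool {
    var children = Array(0..<1_000)
    let before = bufferAddress(children)
    let stash = children              // the casual "let me stash a reference"
    children.removeLast()             // buffer is shared, so it gets copied
    let same = bufferAddress(children) == before
    withExtendedLifetime(stash) {}    // keep the stash alive past the mutation
    return same
}

/// One way to stay in place: mutate through inout and keep no second copy.
func foldTail(_ children: inout [Int], keeping count: Int) -> Int {
    var folded = 0
    while children.count > count {
        folded += children.removeLast()
    }
    return folded
}
```

With a million nodes per array, the difference between the two cases is one full duplicate of the buffer per fold.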

Canvas hover jank: the treemap is drawn with SwiftUI Canvas, and the first version painted structure and highlight in the same layer. Consequence: every one-pixel mouse move → hovered state change → body recompute → the entire Canvas closure reruns — every cell’s Path and fill, plus binary-search truncation measurement for two lines of text per cell (hundreds of cells × ~10 font measurements each = thousands of measurements per frame). The fix cut two ways:

Split into two Canvas layers: StructureCanvas (structure + labels) takes no hovered parameter and gets .equatable(), so SwiftUI skips redrawing entirely while the tree reference is unchanged; HighlightCanvas draws only the two strokes for selection and hover;
All text measurement moved forward to the layout-rebuild phase; computed displayName/displaySize are cached on the node, and the draw path measures nothing.
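A rough sketch of the two-layer split, assuming hypothetical Cell and TreemapLayout types and arbitrary colors and line widths; only the names StructureCanvas and HighlightCanvas come from the text. It is written for the Swift 5 language mode, and stricter concurrency checking may ask for a nonisolated == on the view.

```swift
import SwiftUI

/// One laid-out cell; displayName was measured and truncated during layout.
struct Cell {
    let rect: CGRect
    let color: Color
    let displayName: String
}

/// Reference type so "same layout" is a cheap identity check.
final class TreemapLayout {
    let cells: [Cell]
    init(cells: [Cell]) { self.cells = cells }
}

/// Structure + labels. No hovered parameter, so hover changes cannot dirty it.
struct StructureCanvas: View, Equatable {
    let layout: TreemapLayout

    static func == (lhs: StructureCanvas, rhs: StructureCanvas) -> Bool {
        lhs.layout === rhs.layout
    }

    var body: some View {
        Canvas { context, _ in
            for cell in layout.cells {
                context.fill(Path(cell.rect), with: .color(cell.color))
                // Draws the precomputed string; nothing is measured here.
                context.draw(Text(cell.displayName),
                             at: CGPoint(x: cell.rect.minX + 4, y: cell.rect.minY + 4),
                             anchor: .topLeading)
            }
        }
    }
}

/// Only the two strokes: cheap to rerun on every mouse move.
struct HighlightCanvas: View {
    let selected: CGRect?
    let hovered: CGRect?

    var body: some View {
        Canvas { context, _ in
            if let rect = selected {
                context.stroke(Path(rect), with: .color(.white), lineWidth: 2)
            }
            if let rect = hovered {
                context.stroke(Path(rect), with: .color(.yellow), lineWidth: 1)
            }
        }
        .allowsHitTesting(false)
    }
}

struct TreemapView: View {
    let layout: TreemapLayout
    let selected: CGRect?
    let hovered: CGRect?

    var body: some View {
        ZStack {
            // .equatable() lets SwiftUI skip the body while == returns true.
            StructureCanvas(layout: layout).equatable()
            HighlightCanvas(selected: selected, hovered: hovered)
        }
    }
}
```

Comparing by reference identity is what makes the skip cheap: a rebuilt layout is a new object, and anything else leaves the structure layer untouched.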

Wrap-up

After this whole loop, the methodology fits in one sentence: first count how many times this line of code will run, then decide whether it deserves to exist. Amplified by a few million, one string concatenation is 75 seconds, 16 bytes per node is 150MB, and one “just add more threads” intuition is 24% more CPU burned for nothing. Performance engineering isn’t mysterious. It just refuses to spare intuition’s feelings.