Inside Matrix Desktop's 5-Pass Metal Rendering Pipeline
Matrix Desktop renders film-accurate digital rain as a live macOS desktop wallpaper at 60fps on 5K Retina displays. The entire rendering pipeline runs on the GPU through Metal compute shaders, keeping CPU usage near zero. This article walks through every stage of that pipeline: how cell data flows from CPU simulation to pixel output, how bloom creates the characteristic phosphor glow, how a 256-entry palette LUT drives the color system, and how the app adapts to thermals, battery state, and multi-monitor configurations.
The Big Picture: CPU Simulation to GPU Rendering
The architecture splits cleanly into two halves. The CPU runs a state machine (RainSimulation) at 30 ticks per second that determines which glyphs are visible, how bright each cell is, and where deletion strings are erasing content. The GPU takes that cell data and turns it into pixels through a 5-pass compute pipeline.
Each MatrixRenderer instance owns one simulation, one glyph atlas, and a set of triple-buffered Metal buffers. There is one renderer per connected display, and each renderer selects the GPU attached to its screen.
Cell Data: The CPU-GPU Interface
The simulation produces an array of CellData structs, one per grid cell across three depth layers. Each struct is 8 bytes and laid out to match the Metal shader's expectations exactly:
struct CellData {
var charIndex: UInt16 // Index into glyph atlas
var brightness: Float16 // 0.0 to ~1.5 (HDR for cursors)
var flags: UInt8 // Bit flags: cursor, deletion, etc.
var age: UInt8 // Frames since last glyph change
var prevCharIndex: UInt16 // Previous glyph for cross-dissolve
}
The prevCharIndex field enables the 1-frame cross-dissolve when glyphs change. Instead of a hard cut, the shader blends from the old glyph to the new one over a single frame, giving the rain a smoother, more organic look. Glyph changes are synchronized globally every 3 frames rather than randomly per-cell, matching the cadence of the original film.
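Because the shader reads this buffer as raw bytes, any padding Swift inserted would shift every field the GPU sees. A quick way to sanity-check the layout (a sketch — brightness is held here as the raw bit pattern of a Float16 so the snippet stays portable):

```swift
// Mirror of the shader-facing layout: 2 + 2 + 1 + 1 + 2 = 8 bytes, no padding.
struct CellData {
    var charIndex: UInt16      // Index into glyph atlas
    var brightnessBits: UInt16 // Raw Float16 bit pattern (0.0 to ~1.5)
    var flags: UInt8           // Bit flags: cursor, deletion, etc.
    var age: UInt8             // Frames since last glyph change
    var prevCharIndex: UInt16  // Previous glyph for cross-dissolve
}

// Every field is 1- or 2-byte aligned, so Swift packs them contiguously.
assert(MemoryLayout<CellData>.stride == 8)
assert(MemoryLayout<CellData>.alignment == 2)
```

If a field were reordered or a 4-byte type slipped in, the stride assertion would catch the mismatch before the GPU ever read garbage.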
Three Depth Layers
The simulation manages three independent layers that create the parallax depth effect visible in the 1999 film. Each layer has its own grid of CellData, its own string spawning logic, and its own timing:
| Layer | Scale | Brightness | Speed | Density |
|---|---|---|---|---|
| Background | 60% | 30% | 60% | Sparse |
| Midground | 100% | 100% | 100% | Dense |
| Foreground | 130% | 120% | 130% | Sparse |
The background layer uses smaller, dimmer, slower glyphs. The foreground layer uses oversized, slightly brighter, faster glyphs that pass "in front" of the midground. Together they create the sensation of depth without any actual 3D geometry.
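The table above is pure data, so it can live in a small value type. A sketch (struct and helper names are mine, not the app's) showing how a layer's scale factor determines its effective cell size:

```swift
// Hypothetical per-layer parameters mirroring the table above.
struct LayerParams {
    var scale: Float       // Glyph scale relative to midground
    var brightness: Float  // Brightness multiplier
    var speed: Float       // Fall-speed multiplier
}

let background = LayerParams(scale: 0.6, brightness: 0.3, speed: 0.6)
let midground  = LayerParams(scale: 1.0, brightness: 1.0, speed: 1.0)
let foreground = LayerParams(scale: 1.3, brightness: 1.2, speed: 1.3)

// A layer's grid cell size in pixels is the base cell size times its scale.
func cellSize(base: Float, layer: LayerParams) -> Float {
    base * layer.scale
}
```

With a 32 px base cell, background cells come out around 19.2 px and foreground cells around 41.6 px, which is all the "depth" the effect needs.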
The Glyph Atlas
Before any rendering happens, GlyphAtlas uses Core Text to rasterize 53 glyphs into a single 512x448 Metal texture in r8Unorm format (single-channel, 8 bits). The atlas is arranged as an 8x7 grid and contains:
- 30 half-width katakana, mirrored horizontally (matching the film's convention)
- The kanji character for "day" (日)
- 10 Arabic numerals, some mirrored (2 and 5 horizontal, 3 and 6 vertical)
- Z plus 11 symbols
Subtle stroke weight variation across glyphs produces a hand-drawn quality that avoids the sterile look of uniformly rendered text. The shader samples this atlas by computing UV coordinates from the charIndex stored in each cell's data.
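With a 512x448 atlas laid out as an 8x7 grid, each glyph occupies a 64x64 cell and the UV math reduces to integer division and modulo. A CPU-side sketch of what the shader computes (function name is illustrative):

```swift
// 8 columns x 7 rows of 64x64 glyph cells in a 512x448 texture.
let atlasColumns = 8
let atlasRows = 7

// Top-left UV of a glyph cell, in normalized [0, 1] texture coordinates.
func glyphOrigin(charIndex: Int) -> (u: Float, v: Float) {
    let col = charIndex % atlasColumns   // Which column of the grid
    let row = charIndex / atlasColumns   // Which row of the grid
    return (Float(col) / Float(atlasColumns),
            Float(row) / Float(atlasRows))
}
```

Glyph 0 sits at the atlas's top-left corner; glyph 9 wraps to the second row, second column. The shader adds the pixel's in-cell offset to this origin before sampling.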
Pass 1: Main Composite
The first compute pass runs at full display resolution. Each thread maps to one output pixel. For each of the three depth layers, the shader:
- Computes which grid cell the pixel falls within (accounting for the layer's scale factor)
- Looks up the cell's CellData from the buffer
- Skips empty cells entirely (early-out optimization)
- Samples the glyph atlas texture at the appropriate UV offset for the cell's charIndex
- If the cell is mid-dissolve (age indicates a recent change), blends between prevCharIndex and charIndex
- Multiplies the atlas sample by the cell's brightness value
- Accumulates the result into the output
Cursor cells can have brightness values above 1.0 (up to roughly 1.5), intentionally pushing into HDR range. These hot pixels are what the bloom pass extracts in the next stage, creating the bright phosphor glow at the leading edge of each rain string.
kernel void mainComposite(
texture2d<float, access::write> output [[texture(0)]],
texture2d<float, access::sample> atlas [[texture(1)]],
device const CellData* bgCells [[buffer(0)]],
device const CellData* mgCells [[buffer(1)]],
device const CellData* fgCells [[buffer(2)]],
device const Uniforms& uniforms [[buffer(3)]],
uint2 gid [[thread_position_in_grid]])
{
float3 color = float3(0.0);
// Accumulate each layer at its own scale
color += sampleLayer(atlas, bgCells, gid, uniforms.bgScale,
uniforms.bgBrightness);
color += sampleLayer(atlas, mgCells, gid, uniforms.mgScale,
uniforms.mgBrightness);
color += sampleLayer(atlas, fgCells, gid, uniforms.fgScale,
uniforms.fgBrightness);
output.write(float4(color, 1.0), gid);
}
The output goes to an intermediate full-resolution texture (not directly to the drawable). This texture feeds into both the bloom pipeline and the final composite pass.
Pass 2: Bloom Downsample
The bloom pass extracts the brightest pixels from the scene and wraps them in a soft glow. This is what makes cursor cells and high-brightness glyphs feel like they are emitting light rather than just being painted white.
The downsample kernel runs at quarter resolution (half width, half height). Each thread reads a 4x4 block of pixels from the full-resolution scene texture, averages them (a box filter), and subtracts a brightness threshold. Pixels below the threshold contribute zero. The result is a small texture containing only the bright spots of the scene, already blurred slightly by the averaging.
kernel void bloomDownsample(
texture2d<float, access::read> scene [[texture(0)]],
texture2d<float, access::write> bloom [[texture(1)]],
device const Uniforms& uniforms [[buffer(0)]],
uint2 gid [[thread_position_in_grid]])
{
float3 sum = float3(0.0);
uint2 base = gid * 2;
// 4x4 box filter
for (int y = 0; y < 4; y++) {
for (int x = 0; x < 4; x++) {
sum += scene.read(base + uint2(x, y)).rgb;
}
}
sum /= 16.0;
// Threshold extraction
float lum = dot(sum, float3(0.299, 0.587, 0.114));
float contribution = max(lum - uniforms.bloomThreshold, 0.0);
float3 result = sum * (contribution / max(lum, 0.001));
bloom.write(float4(result, 1.0), gid);
}
Passes 3-4: Separable Gaussian Blur
A direct 2D Gaussian blur at radius 13 would require sampling 27x27 = 729 pixels per thread. That is expensive even on a quarter-resolution texture. The standard trick is a separable filter: split the 2D blur into two 1D passes. A horizontal pass blurs each row, then a vertical pass blurs each column of the already-horizontally-blurred result. This drops the sample count from 729 to 54 (27 + 27) per pixel while producing an identical result.
Both passes use the same Gaussian kernel weights, precomputed on the CPU and uploaded as a buffer. With radius 13, there are 27 taps (13 on each side plus center). The weights are symmetric, so the shader only stores 14 unique values and mirrors them.
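The CPU-side precomputation might look like the following sketch. The sigma value is my assumption — the article doesn't state the app's actual sigma — and the weights are normalized so the full 27-tap kernel sums to 1:

```swift
import Foundation

// Precompute the 14 unique weights of a radius-13 (27-tap) Gaussian kernel.
// The shader mirrors them via weights[abs(i)], so only one side is stored.
func gaussianWeights(radius: Int, sigma: Double) -> [Float] {
    var half = (0...radius).map { i in
        exp(-Double(i * i) / (2 * sigma * sigma))
    }
    // Normalize over the full kernel: center counted once, each side value twice.
    let total = half[0] + 2 * half[1...].reduce(0, +)
    half = half.map { $0 / total }
    return half.map { Float($0) }
}

let weights = gaussianWeights(radius: 13, sigma: 5.0) // sigma is assumed
```

Normalizing on the CPU means the blur passes neither brighten nor darken the bloom texture, no matter what sigma is chosen.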
kernel void bloomBlurH(
texture2d<float, access::read> input [[texture(0)]],
texture2d<float, access::write> output [[texture(1)]],
device const float* weights [[buffer(0)]],
uint2 gid [[thread_position_in_grid]])
{
float3 sum = float3(0.0);
int radius = 13;
for (int i = -radius; i <= radius; i++) {
uint2 coord = uint2(
clamp(int(gid.x) + i, 0, int(input.get_width()) - 1),
gid.y
);
sum += input.read(coord).rgb * weights[abs(i)];
}
output.write(float4(sum, 1.0), gid);
}
Pass 4 is identical except it iterates over the vertical axis. The output is a quarter-resolution bloom texture with soft, wide halos around every bright pixel in the scene.
Pass 5: Final Composite
The last pass combines everything and produces the final pixel output. It runs at full display resolution and writes directly to the MTKView drawable. This pass handles four responsibilities:
1. Scene + Bloom Merge
The shader reads the full-resolution scene from Pass 1 and the blurred bloom texture from Pass 4. The bloom texture is bilinearly upsampled (the hardware sampler handles this since the bloom texture is quarter-res) and added on top of the scene with a configurable intensity multiplier.
2. Palette LUT Color Mapping
Up to this point, the pipeline has been monochrome: cell brightness values modulated by atlas samples. Color is applied in this pass using a 256x1 palette lookup table (LUT) texture.
The shader takes each pixel's luminance and uses it as a U coordinate to sample the LUT. Dark pixels map to the left of the LUT (typically black or deep color), bright pixels map to the right (bright tint or white). This single texture lookup replaces what would otherwise be a complex color grading calculation per pixel.
// Map luminance to palette color
float lum = dot(pixel.rgb, float3(0.299, 0.587, 0.114));
float u = clamp(lum, 0.0, 1.0);
float3 color = paletteLUT.sample(linearSampler, float2(u, 0.5)).rgb;
This is how Matrix Desktop supports 12 presets with radically different color palettes (green, red, blue, purple, amber) without changing a single line of shader code. The CPU generates a different LUT texture for each preset and uploads it. The shader is palette-agnostic.
3. CRT Scanlines
A subtle scanline effect darkens every other row of pixels by a configurable amount. This simulates the horizontal gaps between phosphor rows on a CRT monitor, reinforcing the retro-digital aesthetic. The effect scales with display resolution to maintain consistent visual density on 1080p through 5K screens.
4. Dithering
Finally, the shader adds a small amount of per-pixel noise dithering to break up color banding. This matters most in the dark regions of the image, where 8-bit output would otherwise show visible stepping between nearly-black values.
constexpr sampler linearSampler(coord::normalized, filter::linear);
kernel void finalComposite(
texture2d<float, access::read> scene [[texture(0)]],
texture2d<float, access::sample> bloom [[texture(1)]],
texture2d<float, access::sample> paletteLUT [[texture(2)]],
texture2d<float, access::write> drawable [[texture(3)]],
device const Uniforms& uniforms [[buffer(0)]],
uint2 gid [[thread_position_in_grid]])
{
// Combine scene and bloom
float3 pixel = scene.read(gid).rgb;
float2 bloomUV = float2(gid) / float2(uniforms.resolution);
pixel += bloom.sample(linearSampler, bloomUV).rgb
* uniforms.bloomIntensity;
// Palette LUT color mapping
float lum = dot(pixel, float3(0.299, 0.587, 0.114));
float3 color = paletteLUT.sample(
linearSampler, float2(clamp(lum, 0.0, 1.0), 0.5)
).rgb;
// CRT scanlines
float scanline = 1.0 - uniforms.scanlineStrength
* float(gid.y % 2);
color *= scanline;
// Dithering
float noise = fract(sin(dot(float2(gid), float2(12.9898, 78.233)))
* 43758.5453);
color += (noise - 0.5) / 255.0;
drawable.write(float4(color, 1.0), gid);
}
The Palette LUT System
Each of Matrix Desktop's 12 presets defines a color palette as a series of HSL gradient stops. On the CPU, these stops are interpolated into a 256-entry RGBA array and uploaded as a 256x1 Metal texture with rgba8Unorm format.
The generation process:
- Define 2-5 HSL color stops (e.g., the Classic preset goes from black at 0.0, through deep green at 0.3, to bright green at 0.8, to white-green at 1.0)
- Linearly interpolate between stops in HSL space to produce 256 samples
- Convert each HSL sample to RGBA
- Write the 256 RGBA values into a MTLTexture using replace(region:...)
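The steps above can be sketched in Swift. The HSL-to-RGB helper is the standard conversion, not code from the app, and the interpolation is deliberately naive (no shortest-path hue wrapping):

```swift
// One gradient stop: position in [0, 1] plus HSL components.
struct Stop {
    var position: Double
    var h: Double  // Hue in degrees [0, 360)
    var s: Double  // Saturation [0, 1]
    var l: Double  // Lightness [0, 1]
}

// Standard HSL -> RGB conversion; all outputs in [0, 1].
func hslToRGB(h: Double, s: Double, l: Double) -> (Double, Double, Double) {
    let c = (1 - abs(2 * l - 1)) * s
    let hp = h / 60
    let x = c * (1 - abs(hp.truncatingRemainder(dividingBy: 2) - 1))
    var r1 = 0.0, g1 = 0.0, b1 = 0.0
    switch hp {
    case ..<1: r1 = c; g1 = x
    case ..<2: r1 = x; g1 = c
    case ..<3: g1 = c; b1 = x
    case ..<4: g1 = x; b1 = c
    case ..<5: r1 = x; b1 = c
    default:   r1 = c; b1 = x
    }
    let m = l - c / 2
    return (r1 + m, g1 + m, b1 + m)
}

// Interpolate the stops in HSL space into 256 RGBA8 entries (1024 bytes),
// ready to upload into a 256x1 rgba8Unorm texture.
func makeLUT(stops: [Stop]) -> [UInt8] {
    var bytes: [UInt8] = []
    for i in 0..<256 {
        let t = Double(i) / 255
        var lo = stops.first!, hi = stops.last!
        for pair in zip(stops, stops.dropFirst())
        where pair.0.position <= t && t <= pair.1.position {
            (lo, hi) = pair
        }
        let span = max(hi.position - lo.position, 1e-9)
        let f = min(max((t - lo.position) / span, 0), 1)
        let (r, g, b) = hslToRGB(h: lo.h + (hi.h - lo.h) * f,
                                 s: lo.s + (hi.s - lo.s) * f,
                                 l: lo.l + (hi.l - lo.l) * f)
        bytes += [UInt8(r * 255), UInt8(g * 255), UInt8(b * 255), 255]
    }
    return bytes
}
```

Interpolating in HSL rather than RGB keeps the mid-gradient colors saturated instead of muddying through gray.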
When the user switches presets, only this 256x1 texture is regenerated and swapped. The entire GPU pipeline, all five passes, all shader code, stays identical. This separation of color from geometry is what makes preset switching instantaneous.
Why a texture and not a buffer?
The LUT is sampled with a linear filter, meaning the hardware interpolates between adjacent entries. Using a texture gives this filtering for free. A buffer would require manual interpolation in the shader, adding instructions and complexity for no benefit.
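To make the trade-off concrete, here is roughly the interpolation a buffer-based LUT would push into the shader, sketched in Swift rather than MSL (the u-to-index mapping uses the simple endpoint convention, not exact texel-center addressing):

```swift
// Manual linear filtering over a LUT, as a buffer-based design would require.
// The hardware texture sampler performs this lerp in fixed function.
func sampleLUT(_ lut: [Float], at u: Float) -> Float {
    let x = min(max(u, 0), 1) * Float(lut.count - 1)  // Map [0,1] to indices
    let i = Int(x)
    let j = min(i + 1, lut.count - 1)
    let frac = x - Float(i)
    return lut[i] * (1 - frac) + lut[j] * frac  // Blend the two neighbors
}

let ramp: [Float] = (0..<256).map { Float($0) / 255 } // Identity ramp LUT
```

Every pixel would pay for the clamp, the index math, and the lerp; with a texture, all of that disappears into the sampler hardware.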
Triple Buffering and Frame Pacing
The renderer uses three sets of Metal buffers for cell data and uniforms. A DispatchSemaphore initialized to 3 gates frame submission: the CPU can prepare up to 2 frames ahead of the GPU without blocking, but if all 3 are in flight, the CPU waits.
private let inflightSemaphore = DispatchSemaphore(value: 3)
private var bufferIndex = 0
func draw(in view: MTKView) {
inflightSemaphore.wait()
let buffer = cellDataBuffers[bufferIndex]
let uniforms = uniformBuffers[bufferIndex]
bufferIndex = (bufferIndex + 1) % 3
// Update simulation, write to buffer
simulation.tick()
simulation.copyToBuffer(buffer)
// Encode 5 compute passes...
let commandBuffer = commandQueue.makeCommandBuffer()!
commandBuffer.addCompletedHandler { [weak self] _ in
self?.inflightSemaphore.signal()
}
// Dispatch passes, then present and commit
guard let drawable = view.currentDrawable else {
    inflightSemaphore.signal()  // No drawable this frame; release the slot
    return
}
commandBuffer.present(drawable)
commandBuffer.commit()
}
This pattern ensures the GPU is never starved waiting for data and the CPU is never blocked for more than one frame's worth of work. At 60fps on a 5K display, the total frame budget is 16.6ms. The pipeline consistently lands well under that.
| Stage | Time (5K @ 60fps) |
|---|---|
| CPU simulation | ~0.5ms |
| Pass 1: Main composite | ~2.0ms |
| Pass 2: Bloom downsample | <0.1ms |
| Passes 3-4: Gaussian blur | ~0.6ms |
| Pass 5: Final composite | ~1.0ms |
| Total | ~4.1ms |
That leaves roughly 12ms of headroom per frame, which is important for a desktop wallpaper app. The rendering should never compete with foreground applications for GPU time.
Adaptive Quality
A desktop wallpaper that drains battery or spins fans would be unacceptable. Matrix Desktop monitors system conditions and degrades gracefully:
Thermal Monitoring
The app reads ProcessInfo.thermalState and adjusts rendering based on the current thermal pressure:
- Nominal: Full quality. 60fps, all 5 passes, full resolution bloom.
- Fair: Drop to 30fps. All passes still active.
- Serious: 30fps and skip bloom passes (passes 2-4). The final composite renders the scene without bloom, saving three dispatch calls and their associated texture bandwidth.
- Critical: 15fps, no bloom. Minimum viable rendering to keep the rain visible without contributing to thermal pressure.
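The policy reduces to a pure mapping from conditions to settings. A sketch — the enum mirrors ProcessInfo.ThermalState's four cases so the snippet runs anywhere, and the type names are mine, not the app's:

```swift
// Mirrors ProcessInfo.ThermalState; the app reads the real value from
// ProcessInfo.processInfo.thermalState.
enum ThermalState { case nominal, fair, serious, critical }

struct QualitySettings {
    var fps: Int
    var bloomEnabled: Bool  // When false, passes 2-4 are skipped entirely
}

func quality(for state: ThermalState, onBattery: Bool) -> QualitySettings {
    var q: QualitySettings
    switch state {
    case .nominal:  q = QualitySettings(fps: 60, bloomEnabled: true)
    case .fair:     q = QualitySettings(fps: 30, bloomEnabled: true)
    case .serious:  q = QualitySettings(fps: 30, bloomEnabled: false)
    case .critical: q = QualitySettings(fps: 15, bloomEnabled: false)
    }
    // Battery caps quality regardless of thermal state.
    if onBattery {
        q.fps = min(q.fps, 30)
        q.bloomEnabled = false
    }
    return q
}
```

Keeping the policy a pure function makes it trivial to unit-test every thermal/power combination without touching the renderer.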
Battery Detection
When a MacBook is unplugged, the renderer drops to 30fps and skips bloom regardless of thermal state. The moment AC power returns, full quality resumes. This is transparent to the user and prevents the wallpaper from being a meaningful battery drain.
Sleep, Screen Saver, and Lid Close
The app observes system notifications for sleep, screen saver activation, and display power state. When any of these trigger, rendering pauses completely. No simulation ticks, no GPU work, zero resource usage. Rendering resumes on wake.
Per-Screen GPU Selection
On Macs with multiple GPUs (or external GPUs), each display may be driven by a different GPU. Matrix Desktop creates one MatrixRenderer per connected screen; each renderer reads the screen's CGDirectDisplayID from NSScreen's deviceDescription and asks Metal for the MTLDevice currently driving that display via CGDirectDisplayCopyCurrentMetalDevice.
This means an eGPU-connected external monitor gets its rain rendered on the eGPU, while the built-in Retina display uses the integrated GPU. Buffer copies stay local to each GPU, and no cross-GPU synchronization is needed because the renderers are fully independent.
When displays are hot-plugged (connecting or disconnecting a monitor), the app observes NSApplication.didChangeScreenParametersNotification, tears down renderers for removed screens, and creates new ones for added screens.
Why framebufferOnly Must Be False
By default, MTKView sets its drawable textures to framebufferOnly = true. This Metal optimization hint tells the driver the texture will only be used as a render-pass attachment and for display, letting it store the texture in a layout optimized for that specific access pattern.
Matrix Desktop uses compute shaders, not render passes. The final composite kernel writes to the drawable texture directly via texture2d<float, access::write>. This is a general-purpose texture write, not a framebuffer attachment, and it requires the texture to support arbitrary write access.
let mtkView = MTKView(frame: screen.frame, device: device)
mtkView.framebufferOnly = false // Compute shaders need write access
mtkView.colorPixelFormat = .bgra8Unorm
mtkView.preferredFramesPerSecond = 60
If framebufferOnly is left at its default of true, the compute shader dispatch fails silently or crashes on some GPU architectures. This is a common pitfall when using compute-only rendering pipelines with MTKView.
Performance impact
Setting framebufferOnly = false has a negligible performance cost on Apple Silicon. On older Intel+AMD configurations it may prevent certain display controller optimizations, but the difference is unmeasurable in practice given the pipeline's 12ms of headroom.
Summary
Matrix Desktop's rendering pipeline is built around a few key design decisions: separate CPU simulation from GPU rendering, keep all shader passes in compute (no render passes, no vertex shaders), use a LUT texture for color so presets are just data swaps, and degrade quality automatically based on system conditions. The result is a wallpaper that looks like it belongs on a production-quality VFX compositor but runs at 4ms per frame on a 5K display.
Frequently Asked Questions
How many GPU passes does Matrix Desktop use?
Five. The pipeline runs a main composite pass at full resolution, a bloom downsample at quarter resolution, two separable Gaussian blur passes (horizontal and vertical) at quarter resolution, and a final composite pass at full resolution that combines the scene with bloom, applies palette color mapping, CRT scanlines, and dithering.
What is a palette LUT texture?
A palette LUT (lookup table) is a 256x1 pixel texture that maps brightness values to colors. The shader takes each pixel's luminance and uses it as a coordinate to sample the LUT, converting monochrome brightness into the final color. This is how Matrix Desktop supports 12 different color presets without changing any shader code — only the LUT texture is swapped.
Does Matrix Desktop use the CPU or GPU for rendering?
Both, but the GPU does the heavy lifting. The CPU runs a lightweight state machine at 30 ticks per second to determine which glyphs are visible and how bright each cell is. The GPU handles all pixel rendering through five Metal compute shader passes. CPU usage stays under 1%, while the GPU handles compositing, bloom, blur, and color grading.
What is triple buffering?
Triple buffering uses three sets of Metal buffers that rotate via a counting semaphore. While the GPU renders frame N, the CPU prepares data for frame N+1, and the display shows frame N-1. This prevents the CPU and GPU from ever waiting on each other, eliminating micro-stutters and keeping the pipeline fully saturated at 60 fps.
How does adaptive quality work?
Matrix Desktop monitors thermal state and power source in real time. At nominal temperature on AC power, it renders at 60 fps with all five passes. Under thermal pressure, it progressively drops to 30 fps, then 15 fps, and disables bloom passes entirely. On battery power, it immediately drops to 30 fps. The wallpaper will never cause fan spin or meaningful battery drain.
Does it support multiple monitors?
Yes. Matrix Desktop creates a separate renderer per connected display, each bound to the GPU driving that specific screen. On multi-GPU Macs or eGPU setups, each display renders on its own GPU with no cross-GPU synchronization needed. Displays can be hot-plugged and the app automatically creates or tears down renderers as screens are added or removed.
See the pipeline in action
Download Matrix Desktop and watch 14.7 million pixels of digital rain rendered through 5 Metal compute passes on your desktop.