Skaldtorium
The hall where the world rehearses how it sounds. The audio system in Skaldborn isn't a soundtrack — it's a projection of authoritative simulation state, downstream of the world, never feeding back into it. Below, a few of the building blocks the world will be heard through, playable in your browser.
A Skald rides to market
A Skald, call her Hervör, rides out from her holding to market and back across a single day. The simulation tracks her route, her mount, the surfaces under its hooves, the time of day, the season, and the proximity and state of other entities. The audio surface renders that day to her body.
She begins in countryside. Birdsong, wind in leaves, hooves on dirt — a sparse ambient bed with low music presence. As she nears the marketplace, dirt yields to cobblestone and the hooves change voice; the crowd's bustle bleeds into the foreground; music rises to meet the density. She stations her horse and walks. Cobblestone underfoot. She enters her preferred shop. The door squeaks open, then thuds closed behind her. The floor is hardwood now; her step changes again. She greets the shopkeep. As he speaks, his voice profile fires — pitched blips at conversational rhythm, distinct from her own. She buys salted meat for the winter, exits. The door's thud now means transaction completed rather than threshold crossed.
The return ride begins in peace. The bustle fades, the countryside bed returns. Then it shifts — birds leaving, a tense undercurrent rising under the bed. The simulation knows the wolves are stalking. The audio surface tells her body what her conscious mind has not yet inferred. The pack closes. Music crescendos. Combat begins. The encounter resolves. Three wolves dead. The music does not fade — it stops. Birdsong returns over silence, then over breathing.
She walks to her cliff. The ocean fills the audio frame; no music plays, by design. The frame is wide enough to hold a thought.
The five buses
The audio surface is composed of five independent buses, each consuming a different event stream from the simulation:
- Music — non-diegetic, state-driven adaptive layering. Bed always plays; tension fades in on proximate threat; combat replaces tension on engagement; everything stops on resolution.
- Ambient — diegetic looping environmental beds, selected by composite state (biome × weather × time-of-day × season × interior/exterior × lifecycle). The marketplace at dawn sounds different from the marketplace at dusk because the composite resolves to a different bed.
- Foley — diegetic per-action SFX, surface-tagged. The audio reads the same authoritative tile data the renderer reads. Cobblestone foley fires on cobblestone tiles whether or not the sprite happens to draw correctly.
- Earcon — non-diegetic UX feedback. Item acquired, oath sworn, lifecycle transition. Bypasses world space, never heard by other characters.
- Voice — diegetic character speech rendering. Each entity carries a voice profile (base pitch, timbre, blip set). The shopkeep sounds like the shopkeep across his entire lifetime, and as quoted in others' memories of him.
The five buses share one rule: audio reads from authoritative state and produces sound. Nothing else. No sound the engine emits ever feeds back into simulation; no sound is invented from the visual layer; nothing is heard that the simulation hasn't told the session about. Audio is a projection.
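A minimal sketch of that one-way rule, assuming simulation events arrive as immutable records and each event kind maps to exactly one bus. Only the three music event names come from this page; `SimEvent`, `ROUTES`, `project`, and the other four event kinds are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Bus(Enum):
    MUSIC = "music"
    AMBIENT = "ambient"
    FOLEY = "foley"
    EARCON = "earcon"
    VOICE = "voice"


@dataclass(frozen=True)  # frozen: the audio side cannot mutate simulation events
class SimEvent:
    kind: str
    payload: tuple = ()


ROUTES = {
    "proximate_threat_detected": Bus.MUSIC,
    "combat_engaged": Bus.MUSIC,
    "combat_resolved": Bus.MUSIC,
    "composite_state_changed": Bus.AMBIENT,  # hypothetical event name
    "footstep": Bus.FOLEY,                   # hypothetical event name
    "item_acquired": Bus.EARCON,             # hypothetical event name
    "utterance": Bus.VOICE,                  # hypothetical event name
}


def project(events: list[SimEvent]) -> dict[Bus, list[SimEvent]]:
    """Group events by bus. Reads events, returns cues, writes nothing back."""
    cues: dict[Bus, list[SimEvent]] = {bus: [] for bus in Bus}
    for event in events:
        bus = ROUTES.get(event.kind)
        if bus is not None:  # events the audio layer does not know are ignored
            cues[bus].append(event)
    return cues
```

The function has no handle on the simulation at all, which is the point: the only thing it can do with state is read it.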
Music — vertical layering
Below are 18 unapproved candidates from the current marketplace-bed audition pass. Each carries three stems (bed, tension, combat) composed in the same key, at the same BPM, with the same loop length. They line up because the pipeline guarantees they line up: every stem in a music-track batch shares bpm, key, and loop_length_ms, and manifest emission fails if any stem violates the invariant.
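The invariant can be sketched as a check that runs before a manifest is written. This is an illustration rather than the pipeline's actual code: the field names bpm, key, and loop_length_ms come from the text above, while `Stem` and `check_batch_invariant` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stem:
    name: str
    bpm: int
    key: str
    loop_length_ms: int


def check_batch_invariant(stems: list[Stem]) -> None:
    """Raise ValueError if any stem disagrees with the first on bpm, key, or loop length."""
    if not stems:
        raise ValueError("batch has no stems")
    reference = stems[0]
    for stem in stems[1:]:
        for field in ("bpm", "key", "loop_length_ms"):
            expected = getattr(reference, field)
            actual = getattr(stem, field)
            if actual != expected:
                raise ValueError(
                    f"{stem.name}: {field}={actual!r} differs from "
                    f"{reference.name} ({expected!r})"
                )
```

Failing loudly at emit time means a misaligned stem never reaches a consumer, so the player can layer stems without checking alignment itself.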
Pick a candidate from the dropdown, click Play, then switch between modes — peace (bed only), tension (bed + tension), combat (bed + combat, replacing tension). Stems crossfade over about a second and a half. They never restart; they're playing the whole time, gain-gated by mode.
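Gain-gating can be sketched as a table of target gains per mode plus a ramp toward the current target. The sketch assumes a linear ramp and a fade time of exactly 1.5 seconds; the browser demo may use a different curve. `MODE_TARGETS` and `step_gains` are illustrative names.

```python
FADE_SECONDS = 1.5  # "about a second and a half"; the exact value is an assumption

# Target gain per stem for each mode (1.0 = audible, 0.0 = gated off).
MODE_TARGETS = {
    "peace":   {"bed": 1.0, "tension": 0.0, "combat": 0.0},
    "tension": {"bed": 1.0, "tension": 1.0, "combat": 0.0},
    "combat":  {"bed": 1.0, "tension": 0.0, "combat": 1.0},
}


def step_gains(gains: dict[str, float], mode: str, dt: float) -> dict[str, float]:
    """Move each stem gain toward its mode target; a full 0-to-1 fade takes FADE_SECONDS."""
    max_step = dt / FADE_SECONDS
    targets = MODE_TARGETS[mode]
    out = {}
    for stem, gain in gains.items():
        delta = targets[stem] - gain
        delta = max(-max_step, min(max_step, delta))  # clamp to the ramp rate
        out[stem] = gain + delta
    return out
```

Calling `step_gains` once per audio callback with the elapsed time keeps all three stems running and only ever changes their gain, which is why nothing restarts on a mode switch.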
Marketplace bed — audition variants
Three musical modes (dorian, mixolydian, aeolian) · three voicing palettes (bone-horn ensemble, low fiddle drone, drone choir) · BPM 80–104 · 4 / 8 / 16-bar loops · same recipe, different lever values
All candidates are renders of the same recipe with different lever values: musical mode, voicing palette, density, BPM, loop length, seed. Dorian reads neutral-mysterious; mixolydian reads lighter, festive; aeolian reads darker and tends to pair with higher tension density. The bone-horn palette is brass-bright; the low fiddle drone is bowed and sustained; the drone choir is mixed-timbre and dusky. In production the simulation drives the same playback-mode switches you make by hand here: proximate_threat_detected triggers tension; combat_engaged replaces it with combat; combat_resolved stops the music cleanly so the world reasserts itself.
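The event-driven switching can be sketched as a small transition table. The three event names are the ones given above; the mode label "stopped" and the helpers `next_mode` and `run` are hypothetical, and how music returns to peace after a stop is not described here, so the sketch leaves it out.

```python
# (current mode, event) -> next mode. Pairs not listed leave the mode unchanged.
TRANSITIONS = {
    ("peace", "proximate_threat_detected"): "tension",
    ("peace", "combat_engaged"): "combat",
    ("tension", "combat_engaged"): "combat",
    ("combat", "combat_resolved"): "stopped",
}


def next_mode(mode: str, event: str) -> str:
    """Return the music mode after one simulation event."""
    return TRANSITIONS.get((mode, event), mode)


def run(events: list[str], start: str = "peace") -> list[str]:
    """Replay a list of events and return every mode visited, including the start."""
    mode = start
    history = [mode]
    for event in events:
        mode = next_mode(mode, event)
        history.append(mode)
    return history
```

Replaying Hervör's return ride, `run(["proximate_threat_detected", "combat_engaged", "combat_resolved"])` visits peace, tension, combat, then stopped. Keying the table on the current mode as well as the event keeps a late threat event from pulling combat back down to tension.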
Foley — surface-tagged round-robin
Per-action foley is selected by a tag on the surface the actor interacts with. Not by sprite, not by visual rendering — by the same authoritative tile data the renderer reads. A single footstep is one of a small bank of authored variations played in round-robin order, with a per-emission jitter envelope (pitch ±20 cents, gain ±1 dB) applied client-side so the same footstep never fires twice in a row.
Walking surfaces
Eight variations per surface · runtime jitter ±20¢ pitch, ±1 dB gain · round-robin selection
Click a surface to step.
The variation index, pitch offset, and gain offset for the most recent step show below the buttons. In production each is bounded by the recipe's runtime_jitter envelope: the pipeline pre-bakes the eight variations, and the consumer applies the jitter at emit time. The browser sample runs the same logic, ported from the reference implementation in the consumer examples.
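A minimal sketch of round-robin selection with bounded jitter. It is not the reference implementation mentioned above: the uniform distribution and the names `FoleyBank`, `emit`, `cents_to_rate`, and `db_to_gain` are assumptions, while the bounds (eight variations, ±20 cents, ±1 dB) are the ones quoted on this page.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FoleyBank:
    surface: str
    variations: int = 8        # pre-baked variations per surface
    pitch_cents: float = 20.0  # jitter bound, plus or minus, in cents
    gain_db: float = 1.0       # jitter bound, plus or minus, in dB
    rng: random.Random = field(default_factory=random.Random)
    _next: int = 0

    def emit(self) -> tuple[int, float, float]:
        """Return (variation index, pitch offset in cents, gain offset in dB) for one step."""
        index = self._next
        self._next = (self._next + 1) % self.variations  # round-robin advance
        pitch = self.rng.uniform(-self.pitch_cents, self.pitch_cents)
        gain = self.rng.uniform(-self.gain_db, self.gain_db)
        return index, pitch, gain


def cents_to_rate(cents: float) -> float:
    """Convert a pitch offset in cents to a playback-rate multiplier."""
    return 2.0 ** (cents / 1200.0)


def db_to_gain(db: float) -> float:
    """Convert a gain offset in dB to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)
```

Round-robin guarantees that consecutive steps use different recordings; the jitter then makes the ninth step differ slightly from the first even though both use variation 0.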
Ambient — the marketplace bed
A single looping diegetic bed, selected by composite state (here biome: marketplace, time-of-day: midday, weather: clear). The bed is rendered with a spectral carve below 4 kHz so it leaves room for music to coexist on top without frequency masking. The music_coexistence setting is declared on the bed itself; the consumer applies the carve to the music bus, never the bed bus.
Marketplace · midday · clear
16-second loop · -27.96 LUFS · 48 kHz stereo · spectral carve below 4 kHz
Try playing this alongside one of the music variants above. The bed is short on purpose — long enough to feel populated, short enough to fit a coherent group of vendor calls and crowd footfall, looping cleanly through the manifest's authored crossfade. The marketplace at midnight resolves to a different bed entirely. Same place, different composite key, different sound.
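Composite selection can be sketched as a lookup keyed on a tuple of state fields. For brevity the sketch uses only three of the six fields in the full composite (biome, time of day, weather); the bed IDs and the names `CompositeKey`, `BEDS`, and `select_bed` are hypothetical.

```python
from typing import NamedTuple


class CompositeKey(NamedTuple):
    biome: str
    time_of_day: str
    weather: str


# Hypothetical bed IDs; real IDs would come from the manifest, never be invented.
BEDS = {
    CompositeKey("marketplace", "midday", "clear"): "marketplace_midday_clear",
    CompositeKey("marketplace", "midnight", "clear"): "marketplace_midnight_clear",
}


def select_bed(key: CompositeKey) -> str:
    """Return the bed ID for a composite key, or raise KeyError if none is declared."""
    try:
        return BEDS[key]
    except KeyError:
        raise KeyError(f"no ambient bed declared for {key}") from None
```

Because the key is built entirely from simulation state, the same place at a different hour is simply a different key, and a missing key is an error rather than a silent fallback.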
How this gets made
The audio above is produced by a separate content pipeline I'm building called skald-conductor. It's a deterministic Python service: take a recipe (a YAML document declaring the levers — mode, BPM, density, voicing palette, loop length), pick a seed, and the pipeline composes MIDI patterns, synthesizes them through FluidSynth and an SF2 instrument bank, post-processes the result through Pedalboard's effects chain, encodes to OGG via libvorbis, and emits a manifest.yaml that names every stem, its sha256, its loop length, its LUFS. Same recipe + same seed → byte-equivalent stems, byte-equivalent manifest, every time.
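The determinism claim rests on every random choice flowing from the recipe and the seed. A sketch under stated assumptions: the recipe is shown as a plain dict serialized to canonical JSON (the real recipes are YAML), and `derive_rng` and `compose_pattern` are hypothetical stand-ins for the composition stage, not skald-conductor's API.

```python
import hashlib
import json
import random


def derive_rng(recipe: dict, seed: int) -> random.Random:
    """Seed a private RNG from the canonical recipe text plus the chosen seed."""
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{canonical}|{seed}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def compose_pattern(recipe: dict, seed: int, steps: int = 16) -> bytes:
    """Stand-in for composition: pick one scale degree (0 to 6) per step."""
    rng = derive_rng(recipe, seed)  # no global RNG, no clock, no environment
    return bytes(rng.randrange(7) for _ in range(steps))
```

Two calls with an equal recipe and seed return identical bytes within a given Python version. Byte-equivalence of the final OGG files additionally depends on the synthesis, effects, and encoding stages being deterministic, which this sketch does not cover.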
The contract is the manifest. Consumers (Skaldborn, eventually) read the manifest, fetch the bytes by name, verify the sha256, and play. They never glob the directory; they never invent IDs; they never trust a file on disk over what the manifest declares. Approved batches are content-addressed and append-only — a correction is a new batch, not an edit. The whole thing is a smaller, audio-shaped instance of the same simulation-owns-reality discipline that runs through the rest of Skaldborn's architecture.
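The consumer side of that contract can be sketched as a verify-before-play step. The manifest is reduced here to a mapping from stem name to sha256 hex digest, which is a simplification of the real manifest.yaml; `ManifestError` and `verify_stem` are hypothetical names.

```python
import hashlib


class ManifestError(Exception):
    """Raised when a stem is undeclared or its bytes do not match the manifest."""


def verify_stem(manifest: dict[str, str], name: str, payload: bytes) -> bytes:
    """Return payload only if name is declared and its sha256 matches the manifest."""
    if name not in manifest:
        raise ManifestError(f"{name!r} is not declared in the manifest")
    actual = hashlib.sha256(payload).hexdigest()
    if actual != manifest[name]:
        raise ManifestError(
            f"{name!r}: sha256 {actual} does not match manifest {manifest[name]}"
        )
    return payload
```

A consumer that only ever plays what `verify_stem` returns cannot pick up a stray or edited file from disk, because the manifest, not the directory listing, decides what exists.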
Stack — all open-source libraries: FluidSynth (SF2 sampler), libpd (Pure Data integration for procedural foley synthesis), Pedalboard (audio effects chain), libsndfile + libvorbis (file I/O + OGG encoding), FastAPI (read-only HTTP API exposing approved batches), Pydantic (recipe and manifest schemas), Textual (TUI for the operator's audition pass), SQLite (queue + state). The repo itself is private for now; the framing here describes what it does, not where it lives.
The world has a body
Doc 24 in our internal canon — the audio architecture spec — closes with a list of what the system makes possible. It's worth quoting in shape, because the demos above are the small concrete pieces of it:
- A floorboard becomes a memory.
- A door becomes a threshold whose meaning changes by tick.
- A shopkeep becomes a voice you would recognize across a generation.
- A wolf's snarl becomes information about the world's intent.
- A silence becomes a composition large enough to hold a thought.
- A ruin becomes audible as ruin.
- A candle blow becomes the day's last sentence.
The full system isn't running yet. The buses, the voice profiles, the surface tags, the composite bed selection, the lifecycle drift — all are designed and contracted, and most are queued to land in upcoming development phases. What runs today is the pipeline that produces the audio assets above, and the contract surface other systems will integrate against. The blog precedes the game; this hall precedes the world.
See what Skaldborn is for the broader frame, or the launch post for the architectural discipline that the audio system extends.