Huxley

Voice assistants are walled gardens by choice, not by necessity. Amazon decides what Alexa can do. Apple decides what Siri will refuse to do. Google decides which contacts can call which contacts. None of those choices are technical constraints. They are product decisions made by companies whose incentives do not include "the specific elderly man in Bogotá who needs to call his daughter without first forcing her to install a particular app."

My grandfather is one of those specific people. He is in his nineties, blind, speaks no English, and lives alone in a city where most of his children have emigrated. The only viable interface for him is voice. The available voice products are uniformly hostile to his situation. Alexa will not call his cousin because his cousin will not install the Alexa app to receive the call. The radio station he listens to is not in any skill catalog, so he uses a physical radio whose batteries he cannot replace himself when they run out at midnight. If he asks the assistant to do something it does not know how to do, it tells him "I cannot help with that," and a blind ninety-year-old has no fallback path. The friction is not in any single product flaw. It is in the architecture: the platform owns the skill economy, the orchestration layer, the audio path, and the refusal rules. None of those are tuned for him, and there is no way to tune them yourself.

That is why I built Huxley, an open framework for self-hosted voice agents. The bet is small: keep the model, let everything else be yours.

The shape of it

A Huxley agent is two things: a persona and a set of skills. The persona is configuration. The skills are code. Both are yours.

A persona is a YAML file. It names the agent, sets the language and voice, writes the personality in prose, lists the skills the agent has access to, and composes a small set of named behavioral constraints onto the system prompt. A constraint is a primitive like never_say_no, confirm_destructive, or echo_short_input. Personas pick the constraints they want. The same Huxley process can host a Spanish-speaking companion for an elderly user, a kid-safe tutor for a neighbor's child, or a hands-free assistant for a delivery driver, by swapping the YAML. No code changes, no fork.
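As a rough illustration of what "composing constraints onto the system prompt" could look like, here is a minimal sketch. It models the persona as a plain Python object rather than parsing YAML, and every name in it (`Persona`, `CONSTRAINTS`, `build_system_prompt`, the field names, the constraint wording) is an assumption made for the example, not Huxley's actual schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical prompt text for each named constraint. Only the constraint
# names come from the article; the wording is invented for illustration.
CONSTRAINTS = {
    "never_say_no": "Never refuse outright; always offer an alternative.",
    "confirm_destructive": "Ask for confirmation before any destructive action.",
    "echo_short_input": "Repeat very short requests back before acting on them.",
}


@dataclass
class Persona:
    """Hypothetical in-memory form of a persona file."""
    name: str
    language: str
    personality: str
    skills: list = field(default_factory=list)
    constraints: list = field(default_factory=list)


def build_system_prompt(persona: Persona) -> str:
    """Append the text of each selected constraint to the persona's prose."""
    unknown = [c for c in persona.constraints if c not in CONSTRAINTS]
    if unknown:
        raise ValueError(f"unknown constraints: {unknown}")
    rules = "\n".join(f"- {CONSTRAINTS[c]}" for c in persona.constraints)
    return (
        f"You are {persona.name}. Respond in {persona.language}.\n"
        f"{persona.personality}\n"
        f"Rules:\n{rules}"
    )


# Example persona; all values are made up.
companion = Persona(
    name="Companion",
    language="Spanish",
    personality="Warm, patient, and unhurried.",
    skills=["radio", "telegram"],
    constraints=["never_say_no", "echo_short_input"],
)
prompt = build_system_prompt(companion)
```

Swapping the persona object changes the prompt without touching the function, which is the property the YAML file is meant to provide.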

A skill is a Python package. It exposes one or more tools that the LLM can call, returns a result the model reads aloud, and optionally produces a side effect: start an audio stream, claim the microphone for a full-duplex call, fire a notification. Skills install from PyPI under the huxley-skill-* prefix and are discovered through Python entry points. The framework never imports skill code directly. Anyone who has shipped a Python package to PyPI has shipped a Huxley skill before; they just did not know it.

Today nine first-party skills ship: audiobooks, internet radio with auto-reconnect (the batteries problem dissolves), news, web search, timers, reminders, system controls, Telegram voice calls with full-duplex audio (the call-the-other-person-without-asking-them-to-install-an-app problem dissolves), and a reference stocks skill that exists mostly to be read by the next person writing one. The registry is just a JSON file in a separate repository, curated by pull request. No certification fee, no review process, no platform veto.

The piece nobody talks about

Most open-source voice-agent projects quietly punt on audio coordination. When the agent is reading you a book and you ask a question, what should happen to the book? When a phone call comes in while a timer is ringing, which one gets the speaker? Most stacks answer this by stopping everything and starting the new thing, which produces audio glitches, dropped state, and the unsettling experience of the agent talking over itself.

Huxley has a real focus manager. Channels are named: DIALOG for the conversation, COMMS for calls, CONTENT for media, ALERT for urgent announcements. Each has a priority. A book on CONTENT ducks to a lower volume when a call preempts it on COMMS, then resumes from the exact byte offset when the call ends. A direct conversation on DIALOG preempts both. The framework guarantees exactly one foreground activity at a time, hands every displaced activity a BACKGROUND state with a configurable grace period, and fires an on_patience_expired hook so the agent can narrate the eviction before the audio actually changes. The user always hears one stream at a time, and they always hear why.
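The core rule, one foreground activity chosen by channel priority, can be sketched in a few lines. This is an illustration under stated assumptions, not Huxley's implementation: the class and method names are invented, ALERT's rank is a guess (the article only says that COMMS preempts CONTENT and DIALOG preempts both), and the grace period, ducking, and on_patience_expired hook are left out.

```python
from dataclasses import dataclass
from enum import IntEnum


class Channel(IntEnum):
    """Higher value wins the speaker."""
    CONTENT = 1
    COMMS = 2
    DIALOG = 3
    ALERT = 4  # assumption: the article does not state where ALERT ranks


@dataclass(eq=False)  # compare activities by identity, not by field values
class Activity:
    name: str
    channel: Channel
    state: str = "NONE"


class FocusManager:
    """Hypothetical manager that keeps exactly one activity in FOREGROUND."""

    def __init__(self) -> None:
        self._active = []

    def acquire(self, activity: Activity) -> None:
        self._active.append(activity)
        self._rebalance()

    def release(self, activity: Activity) -> None:
        self._active.remove(activity)
        activity.state = "NONE"
        self._rebalance()

    def _rebalance(self) -> None:
        if not self._active:
            return
        # The highest-priority activity takes the foreground; all others wait.
        top = max(self._active, key=lambda a: a.channel)
        for a in self._active:
            a.state = "FOREGROUND" if a is top else "BACKGROUND"


fm = FocusManager()
book = Activity("audiobook", Channel.CONTENT)
call = Activity("telegram call", Channel.COMMS)

fm.acquire(book)   # the book plays in the foreground
fm.acquire(call)   # the call preempts it; the book moves to BACKGROUND
states_during_call = (book.state, call.state)
fm.release(call)   # the call ends; the book returns to the foreground
```

A fuller version would give each backgrounded activity its grace period and let it save its playback position before being evicted.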

This is borrowed conceptually from Alexa's audio architecture. The borrowing was deliberate. The novel part is not the design. The novel part is that nobody else seems to believe this layer belongs in an open framework.

The bigger argument

Personal voice agents are inevitable. The interesting question is whether they will be products you buy or frameworks you compose. A product is sealed, certified, and incentivized around the median paying customer. A framework is open, accountable to the person running it, and capable of being shaped by the family that uses it. The first model has produced Alexa. Huxley is an attempt to demonstrate that the second is possible.

The technical bet is that the hard parts of a voice agent are not the model. The model is rented from OpenAI today and will be replaceable in two years. The hard parts are the orchestration around it: turn coordination, audio focus, skill plumbing, the named behavioral rules a persona declares. These are software engineering problems with software engineering answers. Huxley is one set of answers. There should be others.

For my grandfather, the practical bet is simpler. huxley-skill-radio plays his station with auto-reconnect, no batteries. huxley-skill-telegram makes voice calls to anyone with Telegram installed, which is everyone he knows. The never_say_no constraint, composed onto his persona, means the agent will never tell him "I cannot help with that" without offering an alternative. None of those required a corporation's permission to ship.

Pre-1.0, in daily use, at huxley.ma-r-s.com. Source on GitHub.