Jun 2026 · ai, voice, tool-calling

Why Voice AI Needs Tools

When I started building Mnemosyne, the first version was mostly a voice interface around a language model. It could listen, respond, and hold a conversation. That was interesting at first because speech-to-speech interaction feels strange when it works.

It became limited quickly.

The issue was not that I expected the model to know live information by magic. I built the program, so I knew what I had connected to it. The issue was that the system could not reach anything. It could generate language, but it could not inspect current state, call my code, save durable memory, schedule anything, search a source, or perform an action outside the conversation.

That made it feel fake in a specific way. It was not useless because it was unintelligent. It was useless because it had no hands.

Text chat can tolerate that limitation longer. The user is already working with text, tabs, search results, editors, and copy-paste. In that setting, a model that only reasons can still be useful.

Voice changes the expectation. When I speak to an assistant, I am usually trying to turn an intention into an action. I am not looking for a long explanation of the action. I want the system to understand the request and do the relevant work. If it can only respond with more speech, the interface feels mismatched.

Tool calls changed Mnemosyne because they changed the boundary of the system. The model was no longer sealed inside generated text. It could route a request into an actual capability. It could call a function, retrieve a result, write something, check something, or pass structured arguments into code I controlled.
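
To make that boundary concrete, here is a minimal sketch of routing a structured call into code I control. It is not Mnemosyne's actual code: the `TOOLS` registry, the `tool` decorator, `save_note`, and the JSON shape (`name` plus `arguments`) are all hypothetical names chosen for illustration.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> a Python function the author controls.
TOOLS: Dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so a model-issued call can reach it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def save_note(text: str) -> dict:
    # Stand-in for durable memory; a real system would write to storage.
    return {"saved": True, "length": len(text)}

def dispatch(call_json: str) -> dict:
    """Route a structured tool call (JSON text) to registered code."""
    call = json.loads(call_json)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**args)}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}
```

The point of the sketch is the shape, not the details: the model emits a name and arguments, and everything after that is ordinary code that can validate, refuse, or fail in a way the assistant can report.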

Tool calling is not automatically intelligence. A bad tool-using assistant is worse than a plain chatbot because it can make mistakes in the world instead of only making mistakes in language. The hard part is the loop around the tool. The model has to understand the request, decide whether a tool is needed, choose the right tool, pass the right arguments, interpret the result, and report the outcome without wasting the user's time.
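
The loop described above can be sketched in a few lines. This is an illustration under assumptions, not a real implementation: `stub_decide` stands in for the model's decision with a keyword check, and `Decision`, `set_timer`, and `run_turn` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Decision:
    tool: Optional[str]                 # None means "no tool needed"
    arguments: dict = field(default_factory=dict)

def stub_decide(request: str) -> Decision:
    # Placeholder for the model; a real system would ask the LLM here.
    if "timer" in request.lower():
        return Decision(tool="set_timer", arguments={"minutes": 5})
    return Decision(tool=None)

def set_timer(minutes: int) -> dict:
    # Hypothetical tool; pretends to schedule something.
    return {"status": "scheduled", "minutes": minutes}

def run_turn(request: str,
             decide: Callable[[str], Decision],
             tools: Dict[str, Callable[..., dict]]) -> str:
    decision = decide(request)          # understand; decide if a tool is needed
    if decision.tool is None:
        return "No tool needed; answering directly."
    fn = tools.get(decision.tool)       # choose the tool
    if fn is None:
        return f"I can't do that: no tool named {decision.tool}."
    try:
        result = fn(**decision.arguments)   # pass the arguments
    except Exception as exc:
        return f"That failed: {exc}"
    # Interpret the result and report the outcome in one short sentence.
    if result.get("status") == "scheduled":
        return f"Timer set for {result['minutes']} minutes."
    return "Done."
```

Every branch returns one short sentence. That is deliberate for voice: each failure mode gets a brief spoken outcome instead of an explanation.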

Voice makes mistakes in that loop more obvious. In text, a user can skim past a bad paragraph. In voice, the assistant wastes time out loud. A spoken interface has to be more direct because the cost of useless output is higher.

That is why I do not think voice AI should be treated as chat with a microphone attached. The microphone is only the input method. The actual product is the system behind it: the tools, the memory, the state, the permissions, and the verification around each action.
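
As one example of what "permissions and verification around each action" can mean, here is a small sketch. It assumes a toy in-memory `STATE` dictionary and hypothetical helpers (`read_setting`, `write_setting`, `guarded_write`); the `confirm` callback stands in for asking the user out loud.

```python
from typing import Callable, Dict, Optional

STATE: Dict[str, str] = {}  # stand-in for whatever the assistant can change

def read_setting(key: str) -> Optional[str]:
    return STATE.get(key)

def write_setting(key: str, value: str) -> None:
    STATE[key] = value

def guarded_write(key: str, value: str,
                  confirm: Callable[[str], bool]) -> str:
    # Permission: a side effect requires an explicit yes from the user.
    if not confirm(f"Set {key} to {value}?"):
        return "Cancelled."
    write_setting(key, value)
    # Verification: read the state back instead of trusting the call.
    if read_setting(key) != value:
        return f"I could not confirm that {key} changed."
    return f"{key} is now {value}."
```

The read-back step is the part that matters: the assistant reports what the state actually is after the action, not what it intended to do.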

Building Mnemosyne made this obvious to me. A voice assistant that can only talk may be a good demo, but it is not a very good assistant. If it speaks as if it is present, it needs access to the present. If it responds as if it can help, it needs tools that let it do real work.
