AI / ML, Full-Stack

PartyHost

Voice-first multiplayer trivia, hosted by an AI avatar

A multiplayer trivia game you play by speaking. An AI avatar runs the show, talking and lip-syncing in real time, while on-device vision works out which player just answered.

Overview

PartyHost is a voice-first multiplayer trivia game hosted by an AI avatar that listens, talks back, and lip-syncs in real time. Players answer out loud instead of tapping. The host speaks through a text-to-speech provider that returns per-word timestamps, which drive the avatar's mouth shapes for accurate lip-sync.

In party mode, computer vision runs entirely in the browser to figure out which of up to five players just spoke. It correlates each face's mouth movement with the microphone's loudness over a short window, so there is no per-user setup or training.

The whole game show is a state machine of more than thirty phases, written as a pure reducer that emits events, which keeps the flow predictable and testable without audio, timers, or React.

Under the hood it is React 19 and TypeScript with a Zustand store, an Express server, and Firebase for auth and storage, plus several AI providers picked by cost and capability: timestamped speech for lip-sync, embeddings for fast answer matching, and a low-cost chat model for generating questions and explanations. A four-tier cache keeps spoken lines fast and cheap, from an in-browser audio buffer down to a live API call. It started as a vanilla JavaScript app, and the move to React was the project's largest refactor.

How it works

  1. The host speaks

     Text-to-speech returns audio along with per-word timestamps, and the AI avatar lip-syncs its mouth shapes to them.

  2. The mic listens

     Browser speech recognition arms just before the host finishes, anchored to the audio clock so it catches the answer cleanly.

  3. Echo is filtered

     Layered suppression drops anything that overlaps the host's own words, so the host never answers itself.

  4. The answer is matched

     The spoken text is compared to the options with embeddings first, falling back to a chat model for trickier phrasings.

  5. The speaker is identified

     In party mode, on-device vision correlates each player's mouth motion with microphone loudness to attribute the answer.

  6. The state machine advances

     A pure reducer moves the game show to its next phase and emits the events that update the UI and the host.
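The answer-matching step can be illustrated with a small sketch. It assumes the spoken answer and each option have already been turned into embedding vectors by some provider; the type names, the `matchAnswer` function, and both thresholds are hypothetical and not taken from PartyHost's code. The idea is to accept the closest option only when it is clearly ahead, and otherwise hand the phrase to the slower chat-model path.

```typescript
// Hypothetical types and thresholds for illustration only.
type Vec = readonly number[];

interface AnswerOption {
  label: string;
  embedding: Vec; // assumed to be precomputed by an embeddings provider
}

type MatchResult =
  | { kind: "match"; label: string; score: number }
  | { kind: "fallback"; reason: string };

// Cosine similarity between two vectors of equal length.
function cosine(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Accept the best option only if it is similar enough AND clearly ahead of
// the runner-up; otherwise signal that the chat-model fallback should decide.
function matchAnswer(
  spoken: Vec,
  options: readonly AnswerOption[],
  minScore = 0.8, // assumed threshold
  minMargin = 0.05 // assumed gap to the second-best option
): MatchResult {
  const ranked = options
    .map((o) => ({ label: o.label, score: cosine(spoken, o.embedding) }))
    .sort((x, y) => y.score - x.score);
  if (ranked.length === 0) return { kind: "fallback", reason: "no options" };
  const best = ranked[0];
  const second = ranked.length > 1 ? ranked[1] : undefined;
  if (best.score < minScore) return { kind: "fallback", reason: "low similarity" };
  if (second !== undefined && best.score - second.score < minMargin) {
    return { kind: "fallback", reason: "ambiguous" };
  }
  return { kind: "match", label: best.label, score: best.score };
}
```

Returning an explicit `fallback` result, rather than guessing, keeps the cheap path honest: the chat model is only consulted when the vectors alone cannot separate the options.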

What it does

  • An AI avatar that talks and lip-syncs in real time, with mouth shapes driven by per-word timestamps from the speech provider
  • Voice-first gameplay where players answer out loud, matched against the options using fast embeddings with a chat-model fallback
  • In-browser computer vision that identifies which of up to five players spoke by correlating mouth motion with microphone loudness, no training required
  • A game show of more than thirty phases built as a pure reducer that emits events, testable without audio, timers, or React
  • Five layers of echo suppression so the host's own voice through the speakers is never mistaken for a player answer
  • A four-tier text-to-speech cache, from an in-browser audio buffer to a live API call, that keeps spoken lines fast and cheap
  • Multiple AI providers chosen by cost and capability: timestamped speech, embeddings for matching, and a low-cost chat model for generation
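As a rough sketch of how per-word timestamps could drive a mouth, the snippet below looks up the word being spoken at the current audio time and derives a simple openness value from it. The `WordTiming` shape, the function names, and the half-sine curve are assumptions made for illustration; a real avatar would more likely map phonemes to distinct mouth shapes.

```typescript
// Hypothetical shape for one timed word; real provider payloads differ.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

// Binary search for the word active at audio time tMs.
// Assumes timings are sorted by startMs and do not overlap.
// Returns undefined during the silence between words.
function activeWord(
  timings: readonly WordTiming[],
  tMs: number
): WordTiming | undefined {
  let lo = 0;
  let hi = timings.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = timings[mid];
    if (tMs < w.startMs) hi = mid - 1;
    else if (tMs >= w.endMs) lo = mid + 1;
    else return w;
  }
  return undefined;
}

// Toy mouth openness in [0, 1]: closed between words, and a half-sine
// that opens and closes once over each word's duration.
function mouthOpenness(timings: readonly WordTiming[], tMs: number): number {
  const w = activeWord(timings, tMs);
  if (w === undefined) return 0;
  const progress = (tMs - w.startMs) / (w.endMs - w.startMs);
  return Math.sin(Math.PI * progress);
}
```

In a render loop, `tMs` would come from the audio playback clock rather than wall-clock time, so the mouth stays aligned with the sound even if frames are dropped.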

Why it's built this way

Zustand over Context or Redux

The voice engine reads and writes game state from outside React, which Zustand supports directly, and selectors stop a score change from re-rendering every screen.

On-device vision for privacy and speed

Camera frames never leave the browser, which matters when several non-account players are on camera, and it keeps the audio-to-video correlation within the sub-100ms budget it needs.
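One way to picture the audio-to-video correlation is a Pearson correlation between each player's mouth-opening series and the microphone loudness over the same window. This is a minimal sketch under that assumption; the `PlayerTrack` shape, the `pickSpeaker` name, and the 0.5 threshold are hypothetical, and the project's actual scoring may differ.

```typescript
// Hypothetical per-player series: one mouth-opening sample per video frame.
interface PlayerTrack {
  id: string;
  mouth: readonly number[];
}

// Pearson correlation of two equally sampled series. Returns 0 when either
// series is (nearly) constant, since correlation is undefined there.
function pearson(x: readonly number[], y: readonly number[]): number {
  const n = Math.min(x.length, y.length);
  if (n < 2) return 0;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i];
    sumY += y[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom < 1e-12 ? 0 : cov / denom;
}

// Attribute the answer to the player whose mouth motion tracks the mic
// loudness best, or to nobody if no one clears the threshold.
function pickSpeaker(
  players: readonly PlayerTrack[],
  loudness: readonly number[],
  minCorrelation = 0.5 // assumed threshold
): string | undefined {
  let bestId: string | undefined;
  let best = minCorrelation;
  for (const p of players) {
    const r = pearson(p.mouth, loudness);
    if (r > best) {
      best = r;
      bestId = p.id;
    }
  }
  return bestId;
}
```

Because the score only compares shapes of two short series, it needs no enrollment step: a player who joins mid-game is scored the same way as everyone else.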

Browser speech over server-side ASR

Built-in recognition is free, on-device, and low latency; its mistakes under speaker bleed are handled by the echo-suppression layers instead of a costly server round trip.
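To make the idea of an echo-suppression layer concrete, here is a sketch of one possible layer: it drops a recognized phrase when it arrives while the host is still speaking (plus a short grace period) and most of its words also appear in the host's line. The function names, the 300 ms grace period, and the 0.6 overlap ratio are assumptions for illustration, not the project's actual layers.

```typescript
// Lowercase a phrase and split it into plain alphanumeric words.
function normalizeWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Fraction of the heard words that also occur in the host's line (0..1).
function hostOverlap(transcript: string, hostLine: string): number {
  const heard = normalizeWords(transcript);
  if (heard.length === 0) return 0;
  const hostWords = new Set(normalizeWords(hostLine));
  const hits = heard.filter((w) => hostWords.has(w)).length;
  return hits / heard.length;
}

// One hypothetical layer: a phrase counts as echo only if it was heard
// before the host finished (plus a grace period) AND mostly repeats the host.
function isLikelyEcho(
  transcript: string,
  hostLine: string,
  heardAtMs: number,
  hostEndedAtMs: number,
  graceMs = 300, // assumed speaker-to-mic tail
  minOverlap = 0.6 // assumed word-overlap ratio
): boolean {
  const duringHost = heardAtMs < hostEndedAtMs + graceMs;
  return duringHost && hostOverlap(transcript, hostLine) >= minOverlap;
}
```

Requiring both conditions matters: a player who repeats a word from the question after the host has finished is still treated as answering, while an early interruption with new words is not discarded either.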

A pure reducer for a complex flow

Modeling more than thirty distinct game states as explicit phases, rather than computed flags, makes the show predictable and unit-testable without audio or timers.
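A heavily reduced sketch of the pattern, with four phases instead of more than thirty, might look like the following. The phase, action, and event names are invented for illustration. The reducer takes a state and an action and returns the next state plus a list of events; it never plays audio or starts a timer itself, which is what makes it testable in isolation.

```typescript
// Hypothetical phases, actions, and events; the real game has many more.
type Phase = "asking" | "listening" | "revealing" | "finished";

interface GameState {
  phase: Phase;
  questionIndex: number;
  totalQuestions: number;
  score: number;
}

type Action =
  | { type: "HOST_FINISHED" }
  | { type: "ANSWER_MATCHED"; correct: boolean }
  | { type: "REVEAL_DONE" };

type GameEvent =
  | { type: "ARM_MIC" }
  | { type: "SPEAK"; line: string }
  | { type: "SHOW_SCORE"; score: number };

interface Step {
  state: GameState;
  events: GameEvent[];
}

// Pure: same inputs always give the same output, and nothing is mutated.
// Side effects are described as events for an outer layer to carry out.
function reduce(state: GameState, action: Action): Step {
  switch (state.phase) {
    case "asking":
      if (action.type === "HOST_FINISHED") {
        return { state: { ...state, phase: "listening" }, events: [{ type: "ARM_MIC" }] };
      }
      break;
    case "listening":
      if (action.type === "ANSWER_MATCHED") {
        const score = state.score + (action.correct ? 1 : 0);
        return {
          state: { ...state, phase: "revealing", score },
          events: [
            { type: "SPEAK", line: action.correct ? "Correct!" : "Not quite." },
            { type: "SHOW_SCORE", score },
          ],
        };
      }
      break;
    case "revealing":
      if (action.type === "REVEAL_DONE") {
        const next = state.questionIndex + 1;
        if (next >= state.totalQuestions) {
          return { state: { ...state, phase: "finished" }, events: [{ type: "SPEAK", line: "That's the game!" }] };
        }
        return {
          state: { ...state, phase: "asking", questionIndex: next },
          events: [{ type: "SPEAK", line: `Question ${next + 1}` }],
        };
      }
      break;
    case "finished":
      break;
  }
  // Actions that do not apply to the current phase are ignored.
  return { state, events: [] };
}
```

A unit test can then feed a sequence of actions through `reduce` and assert on the resulting phases and events, with no microphone, speaker, or component tree involved.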

Built with

React, TypeScript, Zustand, Express, Firebase, Three.js, MediaPipe, OpenAI