System Block Diagram

Figure 1. High-level design of the Highlighting project

Description

The frontend consists of a web-based interface that captures user audio and displays AAC tiles. Audio is streamed to the backend through a WebSocket connection, while tile prediction requests are sent via REST.

The backend includes two main services. The Audio Transcription Service converts incoming audio with FFmpeg and transcribes it using a local Whisper model. The Prediction Service turns each transcript into an embedding, compares it to cached tile vectors, and uses a local LLM to refine the top-ranked tiles. The refined predictions are then returned to the frontend for highlighting.
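The comparison step in the Prediction Service can be illustrated with a minimal sketch. It assumes cosine similarity as the comparison metric and an in-memory dictionary as the tile cache; the source specifies neither. The names `cosine_similarity`, `top_k_tiles`, and `TILE_CACHE`, along with the toy three-dimensional vectors, are hypothetical and stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_tiles(transcript_vec, tile_cache, k=3):
    """Rank cached tiles by similarity to the transcript embedding."""
    scored = [
        (label, cosine_similarity(transcript_vec, vec))
        for label, vec in tile_cache.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical cache: tile label -> precomputed embedding (toy values).
TILE_CACHE = {
    "drink": [0.9, 0.1, 0.0],
    "eat": [0.7, 0.6, 0.1],
    "play": [0.0, 0.2, 0.9],
}

# Hypothetical transcript embedding; a real one would come from the
# embedding model, which the source does not name.
candidates = top_k_tiles([1.0, 0.2, 0.0], TILE_CACHE, k=2)
```

In this sketch, `candidates` holds the two highest-scoring `(label, score)` pairs, which is the kind of shortlist that would then be handed to the local LLM for refinement.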

Optional auxiliary systems, such as Supabase and FastAPI, support logging, analytics, and utility functions.