Class Diagrams
Figure 1. High-level view of the frontend class diagram
Frontend Components
1. Tiles Component
Location: frontend/src/components/AAC/Tiles.tsx
Purpose: Displays the AAC board with tiles and applies highlighting based on predictions.
Key Features:
- Renders tiles in a grid layout
- Supports hierarchical navigation (folders/subtiles)
- Applies opacity-based highlighting to predicted tiles
- Supports multiple highlight methods (opacity, darken)
- Manages tile selection and navigation state
Highlighting Logic:

```js
// Predicted tiles: 100% opacity; pulsing animation and border outline when selected
// Non-predicted tiles: 50% opacity (when predictions exist); no pulse, no outline
// Default: 100% opacity (when no predictions)
const tileOpacity = shouldBeHighlighted ? 100 : 50;
```
State Management:
- Uses `useTilesProvider` for tile data
- Uses `usePredictedTiles` for prediction results
- Uses `useHighlightMethods` for highlight mode selection
- Uses a reducer (`stackReducer`) for the navigation stack
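A simplified sketch of how these hooks come together in the component; the hook return shapes, the `Tile` child component, and the dispatch action are assumptions based on the provider descriptions below, not the exact Tiles.tsx code:

```jsx
import { useReducer } from "react";
// Project hooks and components (import paths illustrative):
// useTilesProvider, usePredictedTiles, useHighlightMethods, stackReducer, Tile

function Tiles() {
  const { tiles } = useTilesProvider();
  const { predictedTiles } = usePredictedTiles();
  const { activeHighlights } = useHighlightMethods();
  const [stack, dispatch] = useReducer(stackReducer, []);

  return (
    <div className="tile-grid">
      {tiles.map((tile) => {
        // 100% opacity when predicted (or when no predictions exist), 50% otherwise
        const shouldBeHighlighted =
          predictedTiles.length === 0 || predictedTiles.includes(tile.text);
        return (
          <Tile
            key={tile.text}
            tile={tile}
            data-opacity={shouldBeHighlighted ? 100 : 50}
            onClick={() => dispatch({ type: "push", tile })}
          />
        );
      })}
    </div>
  );
}
```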
2. State Management Providers
Location: frontend/src/react-state-management/providers/
Key Providers:
- `TranscriptProvider`: Manages transcript state
  - Stores accumulated transcript text
  - Provides `transcript` and `setTranscript`
- `PredictedTilesProvider`: Manages predicted tiles
  - Stores an array of predicted tile words
  - Provides `predictedTiles` and `setPredictedTiles`
- `RecordingControlProvider`: Manages recording state
  - Tracks whether recording is active
  - Provides `isActive` and `setIsActive`
- `useUtteredTiles`: Tracks pressed tiles
  - Maintains a history of clicked tiles
  - Provides `tiles`, `addTile`, `removeLastTile`, `clear`
- `HighlightMethodsProvider`: Manages highlight modes
  - Supports 'opacity' and 'darken' methods
  - Provides the `activeHighlights` set
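These providers follow the standard React context pattern. A minimal sketch of one of them, assuming that pattern (the `useTranscript` hook name is illustrative; the real implementations live under react-state-management/providers/):

```jsx
import { createContext, useContext, useState } from "react";

const TranscriptContext = createContext(null);

// Wraps the app and exposes { transcript, setTranscript } to any descendant.
export function TranscriptProvider({ children }) {
  const [transcript, setTranscript] = useState("");
  return (
    <TranscriptContext.Provider value={{ transcript, setTranscript }}>
      {children}
    </TranscriptContext.Provider>
  );
}

// Consumer hook (name assumed for illustration)
export const useTranscript = () => useContext(TranscriptContext);
```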
Figure 2. High-level view of the backend class diagram
Backend Components
1. Audio Transcription Server
Location: backend/server.mjs
Purpose: Handles real-time audio transcription using local Whisper model.
Architecture:
Client (WebSocket)
↓
Audio Chunks (WebM)
↓
FFmpeg (WebM → PCM)
↓
WAV File Creation
↓
Whisper Model (Local)
↓
Transcription Text
↓
Client (WebSocket)
Key Functions:
- Audio Processing Pipeline (a sketch of `createWavFile` follows this list):

```js
// Receive WebM audio chunks from the client
socket.on("audio-chunk", (data) => {
  ffmpeg.stdin.write(Buffer.from(data));
});

// FFmpeg converts the WebM stream to raw PCM
ffmpeg.stdout.on("data", (chunk) => {
  audioBuffer = Buffer.concat([audioBuffer, chunk]);
});

// Process the accumulated audio every 3-6 seconds
const processAudio = async () => {
  const wavData = createWavFile(pcmChunk);
  const transcribedText = await transcribeAudioLocal(filePath);
  socket.emit("transcript", transcribedText);
};
```
- Whisper Transcription:
  - Model: `Xenova/whisper-small.en`
  - Input: WAV file (16 kHz, mono, PCM)
  - Output: Transcribed text
  - Anti-hallucination parameters: `temperature: 0` (greedy decoding), `no_speech_threshold: 0.6`, `logprob_threshold: -1.0`, `compression_ratio_threshold: 2.4`
- Audio Validation (see the RMS sketch after this list):
  - Checks RMS energy to detect silence
  - Validates a minimum audio duration (1.5 seconds)
  - Filters out hallucinations using pattern matching
  - Removes unwanted markers from the transcription
- Duplicate Detection (sketched after this list):
  - Compares new transcriptions with previous ones
  - Uses similarity scoring to prevent duplicate sends
  - Implements time-based throttling (2 seconds minimum between sends)
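The `createWavFile` helper referenced in the pipeline above just prepends a standard 44-byte RIFF/WAVE header to the raw PCM. A sketch, assuming 16 kHz mono 16-bit audio as described in the settings section:

```js
// Wrap raw PCM (s16le) in a minimal WAV container.
function createWavFile(pcmChunk, sampleRate = 16000, channels = 1) {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * channels * 2; // 2 bytes per 16-bit sample
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmChunk.length, 4); // total size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);            // fmt chunk size
  header.writeUInt16LE(1, 20);             // audio format: PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(channels * 2, 32);  // block align
  header.writeUInt16LE(16, 34);            // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcmChunk.length, 40);
  return Buffer.concat([header, pcmChunk]);
}
```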
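The silence check in Audio Validation reduces to an RMS computation over normalized samples. A minimal sketch; the threshold value here is an assumption, not the one in server.mjs:

```js
// Returns true when a buffer of 16-bit signed PCM is effectively silent.
function isSilent(pcmBuffer, threshold = 0.01) {
  const sampleCount = pcmBuffer.length / 2;
  if (sampleCount === 0) return true;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = pcmBuffer.readInt16LE(i * 2) / 32768; // normalize to [-1, 1)
    sumSquares += sample * sample;
  }
  const rms = Math.sqrt(sumSquares / sampleCount);
  return rms < threshold;
}
```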
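Duplicate Detection can be sketched as a similarity score plus the 2-second throttle. The exact metric in server.mjs is not shown, so the word-overlap (Jaccard-style) comparison below is an assumption:

```js
let lastTranscript = "";
let lastSentAt = 0;

// Decide whether a new transcription should be emitted to the client.
function shouldSend(text, now = Date.now()) {
  if (now - lastSentAt < 2000) return false; // time-based throttle (2 s minimum)
  const current = new Set(text.toLowerCase().split(/\s+/));
  const previous = new Set(lastTranscript.toLowerCase().split(/\s+/));
  const overlap = [...current].filter((w) => previous.has(w)).length;
  const similarity = overlap / Math.max(current.size, previous.size, 1);
  if (similarity > 0.9) return false; // near-duplicate of the last send
  lastTranscript = text;
  lastSentAt = now;
  return true;
}
```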
2. Tile Prediction System
Location: backend/server.mjs (function: predictNextTilesLocalLLM)
Purpose: Predicts relevant tiles based on transcript and pressed tiles.
Architecture:
Input: Transcript + Pressed Tiles
↓
Context Embedding (all-MiniLM-L6-v2)
↓
Vector Search (Cosine Similarity)
↓
Candidate Selection (Top 60)
↓
LLM Prompt Generation (DistilGPT2)
↓
Text Generation
↓
Word Extraction & Filtering
↓
Output: Predicted Tiles (Top 10)
Prediction Modes:
- Transcript Only: Uses only transcribed text
- Tiles Only: Uses only recently pressed tiles
- Both: Combines transcript and pressed tiles
Key Functions:
- Embedding Generation:

```js
// Embed the combined context (transcript + tiles)
const queryEmb = await embedText(combinedContext);

// Embed every tile label once (cached at startup)
const labelEmbeddings = await Promise.all(labels.map((label) => embedText(label)));
```
- Vector Search (helper implementations are sketched after this list):

```js
// Cosine similarity between the query and every tile label
const sims = labelEmbeddings.map((e) => cosineSimilarity(queryEmb, e));

// Keep the top 60 candidates
const topIndices = topNIndices(sims, 60);
```
- LLM-Based Selection:

```js
// Build a prompt from the current context
const prompt = `Recently pressed tiles: ${pressedTiles.join(', ')}
Transcript: "${contextLines}"
Based on the context, select the 10 best next tiles from: ${candidateWords.join(', ')}`;

// Generate with the LLM
const result = await llm(prompt, {
  max_new_tokens: 40,
  temperature: 0,
  do_sample: false
});
```
- Word Filtering (sketched after this list):
  - Excludes common words (pronouns, prepositions, etc.)
  - Prioritizes high-value words (actions, emotions, nouns)
  - Validates words against the candidate list
  - Returns the top 10 predictions
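The vector-search helpers are plain array math. These implementations match the calls above; the versions in server.mjs are assumed to be equivalent:

```js
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Indices of the n highest scores, best first.
function topNIndices(scores, n) {
  return scores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, n)
    .map((item) => item.index);
}
```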
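Word Filtering can be sketched as the following post-processing pass over the generated text. The stopword list is a small illustrative subset, and the priority scoring for high-value words is omitted here; both are assumptions consistent with the description above:

```js
const STOPWORDS = new Set(["i", "you", "the", "a", "an", "to", "of", "in", "on", "and"]);

// Extract up to `limit` valid, deduplicated predictions from LLM output.
function extractPredictedWords(generatedText, candidateWords, limit = 10) {
  const candidates = new Set(candidateWords.map((w) => w.toLowerCase()));
  const seen = new Set();
  const picked = [];
  for (const word of generatedText.toLowerCase().split(/[^a-z']+/)) {
    if (!word || STOPWORDS.has(word) || seen.has(word)) continue;
    if (!candidates.has(word)) continue; // must come from the candidate list
    seen.add(word);
    picked.push(word);
    if (picked.length === limit) break;
  }
  return picked;
}
```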
API Endpoint: POST /api/nextTilePred
Request Body (both fields optional):

```json
{
  "transcript": "string",
  "pressedTiles": ["string"]
}
```

Response:

```json
{
  "predictedTiles": ["word1", "word2", ...],
  "status": "success",
  "context": "transcript text",
  "pressedTiles": ["tile1", "tile2", ...]
}
```
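A minimal client call matching these shapes (the port follows the server configuration below; the example values are illustrative):

```js
const res = await fetch("http://localhost:5000/api/nextTilePred", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    transcript: "I want to eat something",
    pressedTiles: ["I", "want"],
  }),
});
const { predictedTiles } = await res.json();
// e.g. ["eat", "food", "hungry", ...]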
3. Python FastAPI Server
Location: backend/src/main.py
Purpose: Provides REST API endpoints for additional features.
Endpoints:
- `GET /`: Health check
- `GET /health-check`: Health status
- `POST /similarity`: Word similarity suggestions (uses spaCy)
- `POST /tts`: Text-to-speech
- `POST /rekognition`: Image recognition
- `POST /s3/*`: S3 file operations
- `POST /custom_tiles/*`: Custom tile management

Similarity Endpoint (`/similarity`):
- Uses the spaCy `en_core_web_lg` model
- Calculates semantic similarity between words
- Returns the top N similar words from the vocabulary
Data Flow
Audio Transcription Flow
1. User clicks "Start Recording"
↓
2. AudioTranscription component requests microphone access
↓
3. MediaRecorder starts capturing audio (WebM format)
↓
4. Audio chunks sent via WebSocket to backend
↓
5. Backend receives chunks, converts to PCM via FFmpeg
↓
6. Audio buffer accumulates (3-6 second chunks)
↓
7. WAV file created from PCM data
↓
8. Whisper model transcribes audio
↓
9. Transcription validated and cleaned
↓
10. Transcript sent back to client via WebSocket
↓
11. Client updates transcript state
↓
12. Prediction request triggered automatically
Tile Prediction Flow
1. Transcript updated OR tiles pressed
↓
2. AudioTranscription component calls fetchPredictions()
↓
3. HTTP POST to /api/nextTilePred
Request: { transcript, pressedTiles }
↓
4. Backend creates combined context
↓
5. Context embedded using all-MiniLM-L6-v2
↓
6. Vector search finds top 60 candidate tiles
↓
7. LLM (DistilGPT2) selects best 10 from candidates
↓
8. Response: { predictedTiles: [...] }
↓
9. Frontend updates PredictedTilesProvider
↓
10. Tiles component re-renders with highlighting
Highlighting Flow
1. PredictedTilesProvider updated with new predictions
↓
2. Tiles component receives predictedTiles array
↓
3. For each tile:
- Check if tile text matches predicted tiles
- Check if any subtiles match (recursive)
↓
4. Calculate opacity:
- Predicted: 100% opacity
- Non-predicted: 50% opacity (when predictions exist)
- Default: 100% opacity (when no predictions)
↓
5. Apply opacity via CSS data attribute
↓
6. Tiles visually highlighted on screen
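A compact sketch of steps 3-5; the function names and the `data-opacity` attribute are illustrative, not the exact Tiles.tsx code:

```js
// Step 3: recursive match — a folder tile counts as predicted
// if any of its subtiles would match.
function matchesPrediction(tile, predictedTiles) {
  if (predictedTiles.includes(tile.text.toLowerCase())) return true;
  return (tile.subTiles ?? []).some((sub) => matchesPrediction(sub, predictedTiles));
}

// Steps 4-5: opacity is driven by a data attribute, e.g.
// <div data-opacity={tileOpacity(tile, predictedTiles)}>, paired with CSS such as:
//   [data-opacity="50"]  { opacity: 0.5; }
//   [data-opacity="100"] { opacity: 1; }
function tileOpacity(tile, predictedTiles) {
  if (predictedTiles.length === 0) return 100; // default: no predictions
  return matchesPrediction(tile, predictedTiles) ? 100 : 50;
}
```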
Model Details
Whisper Model
Model: Xenova/whisper-small.en
- Size: ~244MB
- Language: English only
- Input: 16kHz mono PCM audio
- Output: Transcribed text
- Performance: ~1-3 seconds per 3-6 second audio chunk
Configuration:

```js
{
  return_timestamps: false,
  language: 'en',
  temperature: 0,               // Greedy decoding
  no_speech_threshold: 0.6,
  logprob_threshold: -1.0,
  compression_ratio_threshold: 2.4
}
```
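For reference, a minimal sketch of loading and calling this model with the `@xenova/transformers` pipeline API; `transcribeAudioLocal` in server.mjs is assumed to do something equivalent:

```js
import { pipeline } from "@xenova/transformers";
import { readFile } from "node:fs/promises";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-small.en"
);

async function transcribeAudioLocal(filePath) {
  // The server's WAV files are 16 kHz mono s16le with a 44-byte header,
  // so decoding to normalized Float32 samples is a simple scale.
  const wav = await readFile(filePath);
  const sampleCount = (wav.length - 44) / 2;
  const audio = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    audio[i] = wav.readInt16LE(44 + i * 2) / 32768;
  }
  // Plus the anti-hallucination options listed in the configuration above.
  const result = await transcriber(audio, { return_timestamps: false });
  return result.text.trim();
}
```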
Embedding Model
Model: Xenova/all-MiniLM-L6-v2
- Size: ~80MB
- Dimensions: 384
- Purpose: Semantic similarity search
- Performance: ~50-100ms per embedding
Usage:
- Embeds conversation context
- Embeds all tile labels (cached on startup)
- Cosine similarity for vector search
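A minimal sketch of `embedText` using the transformers.js feature-extraction pipeline; the startup caching wrapper in server.mjs is assumed:

```js
import { pipeline } from "@xenova/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

async function embedText(text) {
  // Mean-pool token embeddings and L2-normalize to get one 384-dim vector.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data); // plain number[] of length 384
}
```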
LLM Model
Model: Xenova/distilgpt2
- Size: ~350MB
- Purpose: Text generation for tile selection
- Performance: ~200-500ms per generation
Configuration:

```js
{
  max_new_tokens: 40,
  temperature: 0,        // Greedy decoding
  do_sample: false,
  return_full_text: false
}
```
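Instantiating and calling the generator likely looks like the following sketch, again via the transformers.js pipeline API; the exact wrapper in server.mjs is not shown:

```js
import { pipeline } from "@xenova/transformers";

const llm = await pipeline("text-generation", "Xenova/distilgpt2");

const result = await llm(prompt, {
  max_new_tokens: 40,
  temperature: 0,     // greedy decoding
  do_sample: false,
  return_full_text: false,
});
// result is an array like [{ generated_text: "..." }]
const generated = result[0].generated_text;
```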
Configuration
Environment Variables
Backend (`.env` file):

```
OPENAI_API_KEY=sk-proj-...  # For testing (not used in local mode)
```
Server Configuration
Ports:
- Frontend: `http://localhost:3000`
- Backend (Node.js): `http://localhost:5000`
- Backend (Python FastAPI): default port (8000 under uvicorn defaults)

CORS Configuration:

```js
origins: [
  "http://localhost:3000",
  "http://localhost",
  "https://highlighting.vercel.app/"
]
```
Audio Processing Settings
- Sample Rate: 16000 Hz
- Channels: Mono (1)
- Format: PCM 16-bit signed little-endian
- Chunk Duration: 3-6 seconds
- Overlap: 0.5 seconds (to catch boundary words)
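These settings correspond to an FFmpeg invocation along these lines; this is a sketch, and the exact flags in server.mjs may differ:

```js
import { spawn } from "node:child_process";

const ffmpeg = spawn("ffmpeg", [
  "-i", "pipe:0",          // WebM in on stdin
  "-f", "s16le",           // raw PCM out
  "-acodec", "pcm_s16le",  // 16-bit signed little-endian
  "-ar", "16000",          // 16 kHz sample rate
  "-ac", "1",              // mono
  "pipe:1",                // PCM out on stdout
]);
```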