Class Diagrams
Figure 1. High-level view of the frontend class diagram
Frontend Components
1. Tiles Component
Location: frontend/src/components/AAC/Tiles.tsx
Purpose: Displays the AAC board with tiles and applies highlighting based on predictions.
Key Features:
- Renders tiles in a grid layout
- Supports hierarchical navigation (folders/subtiles)
- Applies opacity-based highlighting to predicted tiles
- Supports multiple highlight methods (opacity, darken)
- Manages tile selection and navigation state
Highlighting Logic:

```js
// Predicted tiles: 100% opacity; pulsing animation and border outline when selected
// Non-predicted tiles: 50% opacity (when predictions exist); no pulse, no outline
// Default: 100% opacity (when no predictions)
const tileOpacity = shouldBeHighlighted ? 100 : 50;
```
State Management:
- Uses `useTilesProvider` for tile data
- Uses `usePredictedTiles` for prediction results
- Uses `useHighlightMethods` for highlight mode selection
- Uses a reducer (`stackReducer`) for the navigation stack
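A simplified sketch of how these hooks come together in the component; the hook return shapes, the `Tile` child component, and the dispatch action are assumptions based on the provider descriptions below, not the exact Tiles.tsx code:

```jsx
import { useReducer } from "react";
// Project hooks and components (import paths illustrative):
// useTilesProvider, usePredictedTiles, useHighlightMethods, stackReducer, Tile

function Tiles() {
  const { tiles } = useTilesProvider();
  const { predictedTiles } = usePredictedTiles();
  const { activeHighlights } = useHighlightMethods();
  const [stack, dispatch] = useReducer(stackReducer, []);

  return (
    <div className="tile-grid">
      {tiles.map((tile) => {
        // 100% opacity when predicted (or when no predictions exist), 50% otherwise
        const shouldBeHighlighted =
          predictedTiles.length === 0 || predictedTiles.includes(tile.text);
        return (
          <Tile
            key={tile.text}
            tile={tile}
            data-opacity={shouldBeHighlighted ? 100 : 50}
            onClick={() => dispatch({ type: "push", tile })}
          />
        );
      })}
    </div>
  );
}
```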
2. State Management Providers
Location: frontend/src/react-state-management/providers/
Key Providers:
- `TranscriptProvider`: Manages transcript state
  - Stores accumulated transcript text
  - Provides `transcript` and `setTranscript`
- `PredictedTilesProvider`: Manages predicted tiles
  - Stores an array of predicted tile words
  - Provides `predictedTiles` and `setPredictedTiles`
- `RecordingControlProvider`: Manages recording state
  - Tracks whether recording is active
  - Provides `isActive` and `setIsActive`
- `useUtteredTiles`: Tracks pressed tiles
  - Maintains a history of clicked tiles
  - Provides `tiles`, `addTile`, `removeLastTile`, `clear`
- `HighlightMethodsProvider`: Manages highlight modes
  - Supports 'opacity' and 'darken' methods
  - Provides the `activeHighlights` set
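These providers follow the standard React context pattern. A minimal sketch of one of them, assuming that pattern (the `useTranscript` hook name is illustrative; the real implementations live under react-state-management/providers/):

```jsx
import { createContext, useContext, useState } from "react";

const TranscriptContext = createContext(null);

// Wraps the app and exposes { transcript, setTranscript } to any descendant.
export function TranscriptProvider({ children }) {
  const [transcript, setTranscript] = useState("");
  return (
    <TranscriptContext.Provider value={{ transcript, setTranscript }}>
      {children}
    </TranscriptContext.Provider>
  );
}

// Consumer hook (name assumed for illustration)
export const useTranscript = () => useContext(TranscriptContext);
```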
Figure 2. High-level view of the backend class diagram
Backend Components
1. Audio Transcription Server
Location: backend/server.mjs
Purpose: Handles real-time audio transcription using local Whisper model.
Architecture:
Client (WebSocket)
↓
Audio Chunks (WebM)
↓
FFmpeg (WebM → PCM)
↓
WAV File Creation
↓
Whisper Model (Local)
↓
Transcription Text
↓
Client (WebSocket)
Key Functions:
- Audio Processing Pipeline (a sketch of `createWavFile` follows this list):

```js
// Receive WebM audio chunks from the client
socket.on("audio-chunk", (data) => {
  ffmpeg.stdin.write(Buffer.from(data));
});

// FFmpeg converts the WebM stream to raw PCM
ffmpeg.stdout.on("data", (chunk) => {
  audioBuffer = Buffer.concat([audioBuffer, chunk]);
});

// Process the accumulated audio every 3-6 seconds
const processAudio = async () => {
  const wavData = createWavFile(pcmChunk);
  const transcribedText = await transcribeAudioLocal(filePath);
  socket.emit("transcript", transcribedText);
};
```
- Whisper Transcription:
  - Model: `Xenova/whisper-small.en`
  - Input: WAV file (16 kHz, mono, PCM)
  - Output: Transcribed text
  - Anti-hallucination parameters: `temperature: 0` (greedy decoding), `no_speech_threshold: 0.6`, `logprob_threshold: -1.0`, `compression_ratio_threshold: 2.4`
- Audio Validation (see the RMS sketch after this list):
  - Checks RMS energy to detect silence
  - Validates a minimum audio duration (1.5 seconds)
  - Filters out hallucinations using pattern matching
  - Removes unwanted markers from the transcription
- Duplicate Detection (sketched after this list):
  - Compares new transcriptions with previous ones
  - Uses similarity scoring to prevent duplicate sends
  - Implements time-based throttling (2 seconds minimum between sends)
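The `createWavFile` helper referenced in the pipeline above just prepends a standard 44-byte RIFF/WAVE header to the raw PCM. A sketch, assuming 16 kHz mono 16-bit audio as described in the settings section:

```js
// Wrap raw PCM (s16le) in a minimal WAV container.
function createWavFile(pcmChunk, sampleRate = 16000, channels = 1) {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * channels * 2; // 2 bytes per 16-bit sample
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmChunk.length, 4); // total size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);            // fmt chunk size
  header.writeUInt16LE(1, 20);             // audio format: PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(channels * 2, 32);  // block align
  header.writeUInt16LE(16, 34);            // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcmChunk.length, 40);
  return Buffer.concat([header, pcmChunk]);
}
```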
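The silence check in Audio Validation reduces to an RMS computation over normalized samples. A minimal sketch; the threshold value here is an assumption, not the one in server.mjs:

```js
// Returns true when a buffer of 16-bit signed PCM is effectively silent.
function isSilent(pcmBuffer, threshold = 0.01) {
  const sampleCount = pcmBuffer.length / 2;
  if (sampleCount === 0) return true;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = pcmBuffer.readInt16LE(i * 2) / 32768; // normalize to [-1, 1)
    sumSquares += sample * sample;
  }
  const rms = Math.sqrt(sumSquares / sampleCount);
  return rms < threshold;
}
```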
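Duplicate Detection can be sketched as a similarity score plus the 2-second throttle. The exact metric in server.mjs is not shown, so the word-overlap (Jaccard-style) comparison below is an assumption:

```js
let lastTranscript = "";
let lastSentAt = 0;

// Decide whether a new transcription should be emitted to the client.
function shouldSend(text, now = Date.now()) {
  if (now - lastSentAt < 2000) return false; // time-based throttle (2 s minimum)
  const current = new Set(text.toLowerCase().split(/\s+/));
  const previous = new Set(lastTranscript.toLowerCase().split(/\s+/));
  const overlap = [...current].filter((w) => previous.has(w)).length;
  const similarity = overlap / Math.max(current.size, previous.size, 1);
  if (similarity > 0.9) return false; // near-duplicate of the last send
  lastTranscript = text;
  lastSentAt = now;
  return true;
}
```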
2. Tile Prediction System
Location: backend/server.mjs (function: predictNextTilesLocalLLM)
Purpose: Predicts relevant tiles based on transcript and pressed tiles.
Architecture:
Input: Transcript + Pressed Tiles
↓
Context Embedding (all-MiniLM-L6-v2)
↓
Vector Search (Cosine Similarity)
↓
Candidate Selection (Top 60)
↓
LLM Prompt Generation (DistilGPT2)
↓
Text Generation
↓
Word Extraction & Filtering
↓
Output: Predicted Tiles (Top 10)
Prediction Modes:
- Transcript Only: Uses only transcribed text
- Tiles Only: Uses only recently pressed tiles
- Both: Combines transcript and pressed tiles
Key Functions:
- Embedding Generation:

```js
// Embed the combined context (transcript + tiles)
const queryEmb = await embedText(combinedContext);

// Embed every tile label once (cached at startup)
const labelEmbeddings = await Promise.all(labels.map((label) => embedText(label)));
```
- Vector Search (helper implementations are sketched after this list):

```js
// Cosine similarity between the query and every tile label
const sims = labelEmbeddings.map((e) => cosineSimilarity(queryEmb, e));

// Keep the top 60 candidates
const topIndices = topNIndices(sims, 60);
```
- LLM-Based Selection:

```js
// Build a prompt from the current context
const prompt = `Recently pressed tiles: ${pressedTiles.join(', ')}
Transcript: "${contextLines}"
Based on the context, select the 10 best next tiles from: ${candidateWords.join(', ')}`;

// Generate with the LLM
const result = await llm(prompt, {
  max_new_tokens: 40,
  temperature: 0,
  do_sample: false
});
```
- Word Filtering (sketched after this list):
  - Excludes common words (pronouns, prepositions, etc.)
  - Prioritizes high-value words (actions, emotions, nouns)
  - Validates words against the candidate list
  - Returns the top 10 predictions
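The vector-search helpers are plain array math. These implementations match the calls above; the versions in server.mjs are assumed to be equivalent:

```js
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Indices of the n highest scores, best first.
function topNIndices(scores, n) {
  return scores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, n)
    .map((item) => item.index);
}
```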
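Word Filtering can be sketched as the following post-processing pass over the generated text. The stopword list is a small illustrative subset, and the priority scoring for high-value words is omitted here; both are assumptions consistent with the description above:

```js
const STOPWORDS = new Set(["i", "you", "the", "a", "an", "to", "of", "in", "on", "and"]);

// Extract up to `limit` valid, deduplicated predictions from LLM output.
function extractPredictedWords(generatedText, candidateWords, limit = 10) {
  const candidates = new Set(candidateWords.map((w) => w.toLowerCase()));
  const seen = new Set();
  const picked = [];
  for (const word of generatedText.toLowerCase().split(/[^a-z']+/)) {
    if (!word || STOPWORDS.has(word) || seen.has(word)) continue;
    if (!candidates.has(word)) continue; // must come from the candidate list
    seen.add(word);
    picked.push(word);
    if (picked.length === limit) break;
  }
  return picked;
}
```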
API Endpoint: POST /api/nextTilePred
Request Body (both fields optional):

```json
{
  "transcript": "string",
  "pressedTiles": ["string"]
}
```

Response:

```json
{
  "predictedTiles": ["word1", "word2", ...],
  "status": "success",
  "context": "transcript text",
  "pressedTiles": ["tile1", "tile2", ...]
}
```
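A minimal client call matching these shapes (the port follows the server configuration below; the example values are illustrative):

```js
const res = await fetch("http://localhost:5000/api/nextTilePred", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    transcript: "I want to eat something",
    pressedTiles: ["I", "want"],
  }),
});
const { predictedTiles } = await res.json();
// e.g. ["eat", "food", "hungry", ...]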
3. Python FastAPI Server
Location: backend/src/main.py
Purpose: Provides REST API endpoints for additional features.
Endpoints:
- `GET /`: Health check
- `GET /health-check`: Health status
- `POST /similarity`: Word similarity suggestions (uses spaCy)
- `POST /tts`: Text-to-speech
- `POST /rekognition`: Image recognition
- `POST /s3/*`: S3 file operations
- `POST /custom_tiles/*`: Custom tile management

Similarity Endpoint (`/similarity`):
- Uses the spaCy `en_core_web_lg` model
- Calculates semantic similarity between words
- Returns the top N similar words from the vocabulary
Data Flow
Audio Transcription Flow
1. User clicks "Start Recording"
↓
2. AudioTranscription component requests microphone access
↓
3. MediaRecorder starts capturing audio (WebM format)
↓
4. Audio chunks sent via WebSocket to backend
↓
5. Backend receives chunks, converts to PCM via FFmpeg
↓
6. Audio buffer accumulates (3-6 second chunks)
↓
7. WAV file created from PCM data
↓
8. Whisper model transcribes audio
↓
9. Transcription validated and cleaned
↓
10. Transcript sent back to client via WebSocket
↓
11. Client updates transcript state
↓
12. Prediction request triggered automatically
Tile Prediction Flow
1. Transcript updated OR tiles pressed
↓
2. AudioTranscription component calls fetchPredictions()
↓
3. HTTP POST to /api/nextTilePred
Request: { transcript, pressedTiles }
↓
4. Backend creates combined context
↓
5. Context embedded using all-MiniLM-L6-v2
↓
6. Vector search finds top 60 candidate tiles
↓
7. LLM (DistilGPT2) selects best 10 from candidates
↓
8. Response: { predictedTiles: [...] }
↓
9. Frontend updates PredictedTilesProvider
↓
10. Tiles component re-renders with highlighting
Highlighting Flow
1. PredictedTilesProvider updated with new predictions
↓
2. Tiles component receives predictedTiles array
↓
3. For each tile:
- Check if tile text matches predicted tiles
- Check if any subtiles match (recursive)
↓
4. Calculate opacity:
- Predicted: 100% opacity
- Non-predicted: 50% opacity (when predictions exist)
- Default: 100% opacity (when no predictions)
↓
5. Apply opacity via CSS data attribute
↓
6. Tiles visually highlighted on screen
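A compact sketch of steps 3-5; the function names and the `data-opacity` attribute are illustrative, not the exact Tiles.tsx code:

```js
// Step 3: recursive match — a folder tile counts as predicted
// if any of its subtiles would match.
function matchesPrediction(tile, predictedTiles) {
  if (predictedTiles.includes(tile.text.toLowerCase())) return true;
  return (tile.subTiles ?? []).some((sub) => matchesPrediction(sub, predictedTiles));
}

// Steps 4-5: opacity is driven by a data attribute, e.g.
// <div data-opacity={tileOpacity(tile, predictedTiles)}>, paired with CSS such as:
//   [data-opacity="50"]  { opacity: 0.5; }
//   [data-opacity="100"] { opacity: 1; }
function tileOpacity(tile, predictedTiles) {
  if (predictedTiles.length === 0) return 100; // default: no predictions
  return matchesPrediction(tile, predictedTiles) ? 100 : 50;
}
```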
Model Details
Whisper Model
Model: Xenova/whisper-small.en
- Size: ~244MB
- Language: English only
- Input: 16kHz mono PCM audio
- Output: Transcribed text
- Performance: ~1-3 seconds per 3-6 second audio chunk
Configuration:

```js
{
  return_timestamps: false,
  language: 'en',
  temperature: 0,               // Greedy decoding
  no_speech_threshold: 0.6,
  logprob_threshold: -1.0,
  compression_ratio_threshold: 2.4
}
```
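For reference, a minimal sketch of loading and calling this model with the `@xenova/transformers` pipeline API; `transcribeAudioLocal` in server.mjs is assumed to do something equivalent:

```js
import { pipeline } from "@xenova/transformers";
import { readFile } from "node:fs/promises";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-small.en"
);

async function transcribeAudioLocal(filePath) {
  // The server's WAV files are 16 kHz mono s16le with a 44-byte header,
  // so decoding to normalized Float32 samples is a simple scale.
  const wav = await readFile(filePath);
  const sampleCount = (wav.length - 44) / 2;
  const audio = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    audio[i] = wav.readInt16LE(44 + i * 2) / 32768;
  }
  // Plus the anti-hallucination options listed in the configuration above.
  const result = await transcriber(audio, { return_timestamps: false });
  return result.text.trim();
}
```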
Embedding Model
Model: Xenova/all-MiniLM-L6-v2
- Size: ~80MB
- Dimensions: 384
- Purpose: Semantic similarity search
- Performance: ~50-100ms per embedding
Usage:
- Embeds conversation context
- Embeds all tile labels (cached on startup)
- Cosine similarity for vector search
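A minimal sketch of `embedText` using the transformers.js feature-extraction pipeline; the startup caching wrapper in server.mjs is assumed:

```js
import { pipeline } from "@xenova/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

async function embedText(text) {
  // Mean-pool token embeddings and L2-normalize to get one 384-dim vector.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data); // plain number[] of length 384
}
```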
LLM Model
Model: Xenova/distilgpt2
- Size: ~350MB
- Purpose: Text generation for tile selection
- Performance: ~200-500ms per generation
Configuration:

```js
{
  max_new_tokens: 40,
  temperature: 0,        // Greedy decoding
  do_sample: false,
  return_full_text: false
}
```
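Instantiating and calling the generator likely looks like the following sketch, again via the transformers.js pipeline API; the exact wrapper in server.mjs is not shown:

```js
import { pipeline } from "@xenova/transformers";

const llm = await pipeline("text-generation", "Xenova/distilgpt2");

const result = await llm(prompt, {
  max_new_tokens: 40,
  temperature: 0,     // greedy decoding
  do_sample: false,
  return_full_text: false,
});
// result is an array like [{ generated_text: "..." }]
const generated = result[0].generated_text;
```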
Configuration
Environment Variables
Backend (`.env` file):

```
OPENAI_API_KEY=sk-proj-...  # For testing (not used in local mode)
```
Server Configuration
Ports:
- Frontend: `http://localhost:3000`
- Backend (Node.js): `http://localhost:5000`
- Backend (Python FastAPI): default port (8000 under uvicorn defaults)

CORS Configuration:

```js
origins: [
  "http://localhost:3000",
  "http://localhost",
  "https://highlighting.vercel.app/"
]
```
Audio Processing Settings
- Sample Rate: 16000 Hz
- Channels: Mono (1)
- Format: PCM 16-bit signed little-endian
- Chunk Duration: 3-6 seconds
- Overlap: 0.5 seconds (to catch boundary words)
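These settings correspond to an FFmpeg invocation along these lines; this is a sketch, and the exact flags in server.mjs may differ:

```js
import { spawn } from "node:child_process";

const ffmpeg = spawn("ffmpeg", [
  "-i", "pipe:0",          // WebM in on stdin
  "-f", "s16le",           // raw PCM out
  "-acodec", "pcm_s16le",  // 16-bit signed little-endian
  "-ar", "16000",          // 16 kHz sample rate
  "-ac", "1",              // mono
  "pipe:1",                // PCM out on stdout
]);
```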