A comprehensive multilingual form system with voice interaction, PDF generation, and admin UI. Create schemas visually, fill forms via voice or keyboard in 18+ languages, and generate PDFs with embedded metadata for session resumption.
- Visual Schema Builder - Admin UI for creating forms without code
- Numbered Paragraphs - Limited-length text boxes numbered for easy voice navigation
- Section-Based Forms - Organize related fields together for better UX
- Schema Versioning - Track changes and maintain compatibility
- Voice Interaction - Fill forms naturally in 18+ spoken languages
- Hybrid Input - Seamlessly switch between voice, keyboard, and mouse
- Smart Progress Tracking - "Question 4 of 10 in Section 2 of 12"
- Paragraph Navigation - Say "go to paragraph 4" or "change paragraph 2 to..."
- Section Navigation - Jump between sections via voice or UI controls
- PDF Generation - Professional PDFs with all form data
- Embedded Metadata - Hidden schema_name and session_id for validation
- Upload & Resume - Upload previous PDF to continue where you left off
- Round-Trip Integrity - Data preserved through PDF generation/extraction cycle
- HIPAA-Compliant - Zero PHI storage, stateless backend
- Voice Confirmation - Verbal readback before moving to next field
- Validation - Sanity checking for logical errors
- Multilingual - Converse in native language, form fills in any language
Transform form-filling into a natural conversation while maintaining accuracy, privacy, and user control. Enable organizations to create custom forms easily and allow users to complete them in their native language using voice, keyboard, or both.
- Medical Intake Forms - Patients complete complex forms in their language
- Government Services - Accessible forms for diverse populations
- Job Applications - Multilingual application process
- Any Multi-Page Form - Where voice guidance improves completion rates
- User-First: Users control the pace and method (voice/keyboard/both)
- Privacy: Zero PII/PHI storage on backend, stateless architecture
- Accuracy: Confirmation required, validation enforced
- Accessibility: Works in 18+ spoken languages, voice-only operation possible
- Flexibility: Admin UI allows non-developers to create forms
Status: Planning Phase - Major Refactor
Last Updated: October 25, 2025
Working today:
- Gemini Live API voice integration working
- 18+ spoken languages with real-time switching
- 174-field adult intake form tested and working
- Stateless backend architecture (HIPAA compliant)
- WebSocket real-time communication with <50ms barge-in
- Hybrid voice/keyboard workflow
- Comprehensive test coverage
Planned for the refactor:
- Admin UI for schema creation and management
- PDF generation with embedded metadata
- PDF upload and data extraction for session resumption
- Numbered paragraph fields for voice navigation
- Enhanced section-based navigation
- Session management and validation system
See planning_documents/REFACTOR_IMPLEMENTATION_PLAN.md for detailed roadmap.
- Python 3.12+ with venv support
- Node.js 18+ with npm
- Google Gemini API key (set as environment variable)
# Clone repository (if not already done)
cd polyglot-pdf
# Backend setup
python -m venv .venv
source ./.venv/Scripts/activate # Git Bash on Windows; macOS/Linux: source .venv/bin/activate; PowerShell: .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r backend/requirements.txt
# Frontend setup
npm install -C frontend
Set one of these:
export GOOGLE_API_KEY=your_key_here
# or
export GEMINI_API_KEY=your_key_here
Windows PowerShell:
$env:GOOGLE_API_KEY="your_key_here"
Terminal 1 - Backend:
source ./.venv/Scripts/activate # macOS/Linux: source .venv/bin/activate
uvicorn backend.main:app --reload --port 8000
Backend available at: http://localhost:8000
Terminal 2 - Frontend:
cd frontend
npm run dev
Frontend available at: http://localhost:5173
- Open http://localhost:5173/test-demo-new.html
- Click "Connect to Voice" (grant microphone permission if prompted)
- Select a language from the dropdown (e.g., "Spanish")
- Gemini will greet you and ask for the first field
- Speak naturally: "My ID is 12345" or manually type values
- Watch fields auto-fill with confirmation!
- Try switching sections or changing languages mid-session
The repo-level form-kits/ directory is the single source of truth for every schema/PDF pair. To add or update a kit:
- Create a folder under form-kits/<kit-id>/ containing at least schema.json and the PDF(s) referenced by that schema.
- Run npm run form-kits:prepare (or any frontend command that triggers it automatically) to copy kits into frontend/public/form-kits/ and refresh the manifest at frontend/public/form-kits/index.json.
- Launch the frontend; the schema demo dropdown now shows the new kit, and its kitId is sent to the backend when filling PDFs.
- Ensure the backend sees those kits by leaving the default env vars (FORM_KITS_MODE=local, FORM_KITS_LOCAL_ROOT=./form-kits) or, if serving remotely, configure the remote store vars documented in backend/config.py (FORM_KITS_MODE=gcs, bucket/base URL, optional prefix, timeout).
This workflow keeps dev and prod aligned: the frontend manifest and the backend /api/fill-pdf route both load assets through the same kit IDs, so a single npm run form-kits:prepare run guards against mismatched schemas or missing PDFs.
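The folder requirement in the first step can be checked before running the prepare script. The following is a minimal sketch, not part of the repository: the function name find_incomplete_kits is hypothetical, and because the way a schema references its PDFs is not described here, the check only looks for schema.json and at least one .pdf file in each kit folder.

```python
from pathlib import Path


def find_incomplete_kits(root: Path) -> dict[str, list[str]]:
    """Map each incomplete kit ID to its problems (empty dict means all kits pass)."""
    problems: dict[str, list[str]] = {}
    for kit_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        issues = []
        if not (kit_dir / "schema.json").is_file():
            issues.append("missing schema.json")
        # Assumption: any *.pdf counts; the real check would follow the schema's references.
        if not any(kit_dir.glob("*.pdf")):
            issues.append("no PDF found")
        if issues:
            problems[kit_dir.name] = issues
    return problems


if __name__ == "__main__":
    for kit_id, issues in find_incomplete_kits(Path("form-kits")).items():
        print(f"{kit_id}: {', '.join(issues)}")
```

Run from the repository root, it prints one line per kit that would likely fail when the frontend or backend tries to load it.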
┌───────────────────────────────────────────────┐
│              Frontend (React/TS)              │
│  ┌─────────────────────────────────────────┐  │
│  │ sessionStorage (Form State)             │  │
│  │ - schema                                │  │
│  │ - current_values (with confirmed flags) │  │
│  └─────────────────────────────────────────┘  │
│                       │                       │
│  ┌─────────────────────────────────────────┐  │
│  │ WebSocket Connection                    │  │
│  │ - Audio streaming (binary)              │  │
│  │ - Control messages (JSON)               │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
                        │
┌───────────────────────────────────────────────┐
│           Backend (FastAPI/Python)            │
│  ┌─────────────────────────────────────────┐  │
│  │ FormFillingService                      │  │
│  │ - Stores schema only                    │  │
│  │ - Validates tool calls                  │  │
│  │ - Proxies to Gemini                     │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
                        │
┌───────────────────────────────────────────────┐
│                Gemini Live API                │
│  - Voice interaction                          │
│  - Function calling (5 tools)                 │
│  - Audio responses                            │
└───────────────────────────────────────────────┘
New Components to be Added:
- Admin UI for schema creation
- PDF generation service with metadata embedding
- PDF upload and extraction service
- Session management system
- Schema CRUD API
- Enhanced form renderer with numbered paragraphs
See planning_documents/REFACTOR_IMPLEMENTATION_PLAN.md for detailed architecture plans.
Backend:
- FastAPI 0.117.1 - Modern async web framework
- google-genai 1.38.0 - Gemini Live API integration
- Pydantic - Data validation and settings
- pytest - Testing framework
- PDF Library - TBD (ReportLab, WeasyPrint, or similar)
Frontend:
- React 18+ - UI framework
- TypeScript - Type-safe development
- Vite - Fast build tool
- WebSocket API - Real-time communication
Key Dependencies:
- librosa 0.10.1 - Audio processing
- soundfile 0.12.1 - Audio I/O
- WebSockets 13-15 - Real-time communication
# All tests
python -m pytest backend/tests -v
# Unit tests only
python -m pytest backend/tests/unit -q
# Contract tests
python -m pytest backend/tests/contract -q
cd frontend
npm test
See planning_documents/MANUAL_TESTING_GUIDE.md for detailed testing procedures (to be updated for new features).
- REFACTOR_IMPLEMENTATION_PLAN.md - Detailed phase-by-phase implementation plan with testable deliverables
- PROJECT_CONTEXT.md - Project vision, design principles, and development guidelines
- TECHNICAL_GUIDE.md - Complete technical reference (updated after code changes)
- 0_gemini_live_documentation.md - Gemini Live API reference and best practices
- 5_LOGGING_OVERHAUL_PLAN.md - Logging infrastructure specification
Stateless Architecture:
- Client holds all form state in sessionStorage
- Backend stores only schemas (not user data)
- Better scalability and simpler reconnect logic
- HIPAA compliant by design
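To illustrate what "stores only schemas" can mean in practice, here is a minimal sketch of validating a tool call against the schema without keeping the value anywhere. The schema shape (a "fields" list with "id" and "type"), the call shape, and the name check_update_call are assumptions made for illustration; the project's actual tool definitions are not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallCheck:
    ok: bool
    reason: str = ""


def check_update_call(schema: dict, call: dict) -> ToolCallCheck:
    """Validate a hypothetical field-update tool call using the schema alone.

    Nothing is stored: the verdict is returned to the caller, and the client
    applies accepted values to its own sessionStorage state.
    """
    field_types = {f["id"]: f["type"] for f in schema.get("fields", [])}
    field_id = call.get("field_id")
    if field_id not in field_types:
        return ToolCallCheck(False, f"unknown field: {field_id}")
    value = call.get("value")
    expected = field_types[field_id]
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    if expected == "number" and (
        isinstance(value, bool) or not isinstance(value, (int, float))
    ):
        return ToolCallCheck(False, "expected a number")
    if expected == "text" and not isinstance(value, str):
        return ToolCallCheck(False, "expected text")
    return ToolCallCheck(True)
```

Because the function is pure, a dropped WebSocket can reconnect to any backend instance that has the same schema loaded.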
PDF Metadata System:
- Embedded schema_name identifies form structure
- Embedded session_id validates authenticity
- Allows PDF upload to resume sessions
- Metadata hidden from casual users, extractable programmatically
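The metadata round trip can be sketched independently of the PDF library, which is still to be decided. The example below assumes the two values are serialized as a small JSON string that a PDF library would later place in a document property; the class and function names are hypothetical, and writing the string into an actual PDF is left out.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FormPdfMetadata:
    schema_name: str
    session_id: str


def new_metadata(schema_name: str) -> FormPdfMetadata:
    """Create metadata for a fresh session (a random hex ID is an assumption)."""
    return FormPdfMetadata(schema_name, uuid.uuid4().hex)


def encode_metadata(meta: FormPdfMetadata) -> str:
    """Serialize to the string that would be embedded in the PDF."""
    return json.dumps(asdict(meta), sort_keys=True)


def decode_metadata(raw: str, known_schemas: set[str]) -> FormPdfMetadata:
    """Parse embedded metadata from an uploaded PDF and reject unknown schemas."""
    try:
        data = json.loads(raw)
        meta = FormPdfMetadata(data["schema_name"], data["session_id"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError("PDF metadata is missing or malformed") from exc
    if meta.schema_name not in known_schemas:
        raise ValueError(f"unknown schema: {meta.schema_name}")
    return meta
```

On upload, a ValueError would mean the PDF cannot be used to resume a session and the user should start a new form.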
Numbered Paragraphs:
- Limited-length text fields with visible numbers
- Voice navigation: "go to paragraph 4"
- Easier editing: "change paragraph 2 to..."
- Concatenated on PDF generation for clean output
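The two quoted commands and the concatenation step can be sketched as follows. This is an illustration under stated assumptions: in the running system the model presumably maps speech to tool calls rather than parsing text with regular expressions, and the 200-character limit and function names are invented for the example.

```python
import re

# Hypothetical command grammar covering only the two phrases quoted above.
GOTO = re.compile(r"go to paragraph (\d+)", re.IGNORECASE)
CHANGE = re.compile(r"change paragraph (\d+) to (.+)", re.IGNORECASE | re.DOTALL)


def apply_command(paragraphs: list[str], utterance: str, max_len: int = 200) -> int:
    """Apply one command in place and return the 1-based paragraph now in focus."""
    text = utterance.strip()
    goto = GOTO.fullmatch(text)
    change = CHANGE.fullmatch(text)
    match = goto or change
    if match is None:
        raise ValueError(f"unrecognized command: {text!r}")
    number = int(match.group(1))
    if not 1 <= number <= len(paragraphs):
        raise IndexError(f"no paragraph {number}")
    if change:
        new_text = change.group(2).strip()
        if len(new_text) > max_len:
            raise ValueError(f"paragraph text exceeds {max_len} characters")
        paragraphs[number - 1] = new_text
    return number


def concatenate(paragraphs: list[str]) -> str:
    """Join non-empty paragraphs into the single text block written to the PDF."""
    return " ".join(p.strip() for p in paragraphs if p.strip())
```

Keeping paragraphs as a list until PDF generation is what makes "change paragraph 2" a single-element replacement rather than a text search.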
Section-Based Organization:
- Related fields grouped logically
- Token-efficient for AI processing
- Progress tracking: "Q4 of 10 in Section 2 of 12"
- Reduces user overwhelm
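The progress string can be derived from the section layout alone. A minimal sketch, assuming sections are represented as lists of field IDs (the real schema format may differ) and using the hypothetical name progress_label:

```python
def progress_label(sections: list[list[str]], field_id: str) -> str:
    """Return a spoken-style progress string for the given field."""
    for section_number, fields in enumerate(sections, start=1):
        if field_id in fields:
            question_number = fields.index(field_id) + 1
            return (
                f"Question {question_number} of {len(fields)} "
                f"in Section {section_number} of {len(sections)}"
            )
    raise KeyError(field_id)


sections = [["patient_id", "name"], ["allergies", "medications", "surgeries"]]
print(progress_label(sections, "medications"))
```

The same lookup also identifies the current section, so only that section's fields need to be sent to the model for a given turn.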
Connection Errors:
- Check GOOGLE_API_KEY or GEMINI_API_KEY is set
- Verify backend is running on port 8000
- Check for zombie processes
Microphone Not Working:
- Use HTTPS or localhost (required for getUserMedia)
- Check browser permissions (Settings → Privacy → Microphone)
- Try Chrome or Edge (best MediaRecorder support)
Fields Not Updating:
- Open browser DevTools → Network → WS tab
- Check WebSocket messages flowing
- Verify tool calls in backend logs
- Confirm schema was sent at handshake
State Lost on Refresh:
- Check sessionStorage in DevTools (Application tab)
- Verify webFormState.ts persistence logic
- Clear sessionStorage and reconnect if corrupted
Backend Logs:
# Check tool call interactions
cat .logs/gemini_tool_calls.log
# Watch live
tail -f .logs/gemini_tool_calls.log
Browser Console:
// View stored state
console.log(sessionStorage.getItem('webFormState'))
// Clear state
sessionStorage.clear()
API Documentation:
- Backend API docs: http://localhost:8000/docs
- WebSocket endpoint: ws://localhost:8000/ws/fill
- Read planning_documents/PROJECT_CONTEXT.md first
- Review planning_documents/REFACTOR_IMPLEMENTATION_PLAN.md for the current phase
- Create a feature branch from the current development branch
- Make changes with tests
- Run test suite: pytest backend/tests -v
- Run frontend tests: npm test -C frontend
- Update documentation if needed
- Submit pull request
- Python: Follow PEP 8, use type hints
- TypeScript: Follow project ESLint config
- Add tests for new features (target: 80% backend, 70% frontend coverage)
- Update planning documents for architectural changes
- Never log PII/PHI in backend logs
- Use structured logging with metadata only
Never Do:
- Store user data (PII/PHI) in backend
- Log field values (log IDs, types, counts only)
- Use .innerHTML with user/AI content (XSS risk)
- Create new summary documents (update existing docs in place)
- Modify .venv or example files
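The logging rule above (IDs, types, and counts only) can be enforced by building log records through a helper that never receives a place to put the value. A minimal sketch using the standard logging module; the logger name, event names, and record keys are assumptions, not the project's logging specification.

```python
import json
import logging

logger = logging.getLogger("form_events")


def field_event(event: str, field_id: str, value: object) -> dict:
    """Describe a field update using metadata only; the value itself is dropped."""
    return {
        "event": event,
        "field_id": field_id,
        "value_type": type(value).__name__,
        "value_length": len(value) if isinstance(value, str) else None,
    }


def log_field_event(event: str, field_id: str, value: object) -> None:
    """Emit one structured JSON log line for a field update."""
    logger.info(json.dumps(field_event(event, field_id, value)))
```

Calling log_field_event("field_updated", "patient_name", "Jane Doe") records the field ID, the type str, and the length 8, but not the name.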
Always Do:
- Use latest Gemini model: gemini-live-2.5-flash-preview-native-audio-09-2025
- Create unit tests before making changes
- Maintain stateless backend (client = source of truth)
- Update TECHNICAL_GUIDE.md after code changes
- Use textContent or escapeHtml() for display
Note: Deployment procedures are being updated for the new PDF-based architecture. Current instructions are for the existing voice form filler system.
# Set your GCP project
export VOICE_PROJECT_NAME=your-project-id
# Deploy backend
cd backend
./deploy-backend.sh
Environment Variables (Cloud Run):
- GOOGLE_API_KEY - Gemini API key (required)
- Set via: gcloud run services update voice-form-filler --set-env-vars GOOGLE_API_KEY=xxx
Backend:
- GOOGLE_API_KEY set in Cloud Run
- CORS configured in backend/cors.json for production domains
- Health check endpoint responding
- Structured logging configured
- Session storage configured (file-based for MVP, DB for production)
- PDF generation tested on production OS
Frontend:
- Production backend URL configured (no localhost references)
- Environment-specific configs set
- Bundle optimized and tested (<500KB gzipped target)
- VAD parameters tuned (MIN_SPEECH_DURATION_MS=45ms for fast barge-in)
Testing:
- Test form creation via admin UI
- Test PDF generation with various schemas
- Test PDF upload/extraction round-trip
- Verify voice commands work on production
- Check microphone permissions flow
- Validate WebSocket connection stability
[Add your license here]
- Google Gemini Live API for voice interaction
- FastAPI framework for backend
- React community for frontend tools
For AI Assistants: Read planning_documents/PROJECT_CONTEXT.md and planning_documents/REFACTOR_IMPLEMENTATION_PLAN.md before making changes to this codebase.
For Developers: This project is currently in a major refactor phase. The existing voice form filler is functional, and we're building new PDF and admin UI features around it. See the implementation plan for the roadmap.