A minimal Retrieval-Augmented Generation (RAG) service that answers natural-language questions from a small corpus of FAQ markdown files and exposes the functionality over a local HTTP API.
# 1. Install dependencies
npm install
# 2. Set your OpenAI API key
cp .env.example .env
# edit .env and set OPENAI_API_KEY=sk-...
# 3. Start the server (ingestion runs automatically on startup)
npm startThe server ingests all .md files in ./faqs/ at startup, embeds every chunk via the OpenAI Embeddings API, and then begins accepting requests.
Liveness check.
curl http://localhost:3000/health{ "status": "ok" }Ask a question. The server retrieves the most relevant chunks and generates a grounded answer.
Request body:
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
question |
string | yes | — | non-empty |
top_k |
integer | no | 4 | 1–10 inclusive |
Example:
curl -X POST http://localhost:3000/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I reset my password and set up MFA?"}'Response 200:
{
"answer": "To reset your password, visit the login page and click \"Forgot Password.\" (source: authentication.md). To enable MFA, go to Account Settings → Security → Enable Two-Factor Authentication and scan the QR code with an authenticator app (source: authentication.md).",
"sources": ["authentication.md"]
}Response 400 (bad input):
{ "error": "Invalid input: \"question\" must be a non-empty string." }Response 500 (OpenAI or internal error):
{ "error": "Internal server error. Please try again." }| Variable | Required | Default | Description |
|---|---|---|---|
OPENAI_API_KEY |
yes | — | OpenAI secret key |
PORT |
no | 3000 |
HTTP server port |
FAQ_DIR |
no | ./faqs |
Path to FAQ markdown directory |
COMPLETION_MODEL |
no | gpt-4o-mini |
OpenAI chat completion model |
The server fails fast at startup if OPENAI_API_KEY is not set.
rag-api/
├── faqs/ # FAQ markdown source documents
│ ├── faq_auth.md
│ ├── faq_sso.md
│ └── faq_employee.md
├── rag.js # RAG core: chunking, embedding, retrieval, generation
├── server.js # Express HTTP API wrapper
├── ingest.js # Standalone ingestion smoke-test script
├── package.json
├── .env.example
└── README.md
Chunks are ~200 characters with a 40-character overlap. The overlap prevents meaningful sentences from being split cleanly across chunk boundaries — a common failure mode where the answer straddles two adjacent chunks and neither scores highly enough on its own. Splitting is attempted on sentence boundaries (. ) within the final 60 characters of each window to avoid mid-sentence cuts.
Trade-off: Larger chunks preserve more context per retrieval hit but reduce precision (more noise per chunk). Smaller chunks increase precision but can miss context. 200 chars is on the smaller end; 400–600 chars works better for prose-heavy documents.
All embeddings are stored in a plain JavaScript array in memory. This is appropriate for a corpus of a few hundred chunks.
Trade-off: No persistence — the store is rebuilt from scratch on every startup (one batched OpenAI Embeddings API call). For larger corpora or faster cold starts, persist embeddings to a JSON file, SQLite, or a purpose-built store like Chroma or pgvector.
text-embedding-3-small — OpenAI's smallest and cheapest embedding model. Dimensionality: 1536. Accuracy is sufficient for keyword-heavy FAQ retrieval.
Trade-off: text-embedding-3-large is ~6x more expensive and meaningfully more accurate for semantic queries (paraphrases, ambiguous phrasing). Not necessary here.
Implemented in plain JS — no numpy, no external library. At this scale (< 1000 chunks, 1536-dim vectors) the inner loop completes in < 1ms. For thousands of documents, switch to a pre-built ANN index (FAISS, hnswlib) or a vector database.
All chunks across all files are embedded in a single API call during ingestion. This is cheaper (fewer HTTP round-trips, OpenAI charges per token not per request) and faster than embedding each chunk individually.
gpt-4o-mini by default. It's fast, inexpensive, and follows instructions reliably. The system prompt instructs the model to cite sources inline using a (source: filename.md) convention and to answer strictly from the provided context.
Temperature = 0 for deterministic answers — important for a Q&A system where factual consistency matters.
Sources are derived from the top-k retrieved chunks, not extracted from the LLM's output text. This is more reliable — LLMs sometimes hallucinate or omit citations. The sources array in the response reflects which files actually contributed context, regardless of what the model wrote.
questionmust be a non-empty string. An empty string would result in a degenerate embedding and a meaningless (but expensive) API call.top_kis bounded to [1, 10]. Allowing arbitrarily large values would send enormous context windows to the completion model, inflating cost and latency.
- Fail-fast on missing API key at startup — better to crash immediately than to serve 500s on every request.
- 503 if a request arrives before ingestion completes (possible under slow network conditions).
- OpenAI API errors bubble up as 500 with a generic user-facing message; the real error is logged server-side.
- Caching: Query-level embedding caching would speed up repeated questions, but adds statefulness that's overkill here.
- Re-ranking: A cross-encoder re-ranker (e.g.
ms-marco-MiniLM) would improve precision for ambiguous queries, but requires a separate model inference step. - Streaming: The completion response is collected in full before responding. Streaming would improve perceived latency but complicates the response schema.
- Auth: Out of scope for a local prototype.
# Smoke test ingestion and retrieval
node ingest.js
# Health check
curl http://localhost:3000/health
# Password reset question (should cite authentication.md)
curl -X POST http://localhost:3000/ask \
-H "Content-Type: application/json" \
-d '{"question": "What are the password requirements?"}'
# Cross-document question (should cite multiple files)
curl -X POST http://localhost:3000/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I authenticate API calls and log in with SSO?", "top_k": 4}'
# Validation: empty question → 400
curl -X POST http://localhost:3000/ask \
-H "Content-Type: application/json" \
-d '{"question": ""}'
# Validation: top_k out of range → 400
curl -X POST http://localhost:3000/ask \
-H "Content-Type: application/json" \
-d '{"question": "test", "top_k": 99}'