RAG API

A minimal Retrieval-Augmented Generation (RAG) service that answers natural-language questions from a small corpus of FAQ markdown files and exposes the functionality over a local HTTP API.

Quick Start

# 1. Install dependencies
npm install

# 2. Set your OpenAI API key
cp .env.example .env
# edit .env and set OPENAI_API_KEY=sk-...

# 3. Start the server (ingestion runs automatically on startup)
npm start

At startup the server ingests all .md files in ./faqs/ (or the directory set by FAQ_DIR) and embeds every chunk via the OpenAI Embeddings API. A request that arrives before ingestion completes receives a 503.

API

`GET /health`

Liveness check.

curl http://localhost:3000/health

{ "status": "ok" }

`POST /ask`

Ask a question. The server retrieves the most relevant chunks and generates a grounded answer.

Request body:

| Field | Type | Required | Default | Constraints |
| --- | --- | --- | --- | --- |
| `question` | string | yes | (none) | non-empty |
| `top_k` | integer | no | 4 | 1–10 inclusive |

Example:

curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I reset my password and set up MFA?"}'

Response 200:

{
  "answer": "To reset your password, visit the login page and click \"Forgot Password.\" (source: authentication.md). To enable MFA, go to Account Settings → Security → Enable Two-Factor Authentication and scan the QR code with an authenticator app (source: authentication.md).",
  "sources": ["authentication.md"]
}

Response 400 (bad input):

{ "error": "Invalid input: \"question\" must be a non-empty string." }

Response 500 (OpenAI or internal error):

{ "error": "Internal server error. Please try again." }
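The same request can be made from Node. The following is a minimal client sketch, assuming Node 18+ for the global `fetch`. The function name `askQuestion` and its options object are hypothetical and not part of this repository. The `fetchImpl` option exists only so the function can be exercised without a running server.

```javascript
// Hypothetical client helper for POST /ask (not part of the repo).
async function askQuestion(
  question,
  { baseUrl = "http://localhost:3000", topK, fetchImpl = fetch } = {}
) {
  const payload = { question };
  // Only send top_k when the caller sets it, so the server default (4) applies.
  if (topK !== undefined) payload.top_k = topK;

  const res = await fetchImpl(`${baseUrl}/ask`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });

  const data = await res.json();
  if (!res.ok) {
    // 400 and 500 responses carry an { error } body, as documented above.
    throw new Error(`HTTP ${res.status}: ${data.error}`);
  }
  return data; // { answer, sources }
}
```

Calling `askQuestion("How do I reset my password?", { topK: 3 })` resolves to the `{ answer, sources }` object on success and rejects with the server's error message otherwise.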

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `OPENAI_API_KEY` | yes | (none) | OpenAI secret key |
| `PORT` | no | `3000` | HTTP server port |
| `FAQ_DIR` | no | `./faqs` | Path to FAQ markdown directory |
| `COMPLETION_MODEL` | no | `gpt-4o-mini` | OpenAI chat completion model |

The server fails fast at startup if OPENAI_API_KEY is not set.
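One way to implement the fail-fast check and the defaults listed above is sketched below. `loadConfig` is a hypothetical helper, and the PORT range check is an added assumption. The actual server.js may read its environment differently.

```javascript
// Hypothetical config loader mirroring the variables documented above.
function loadConfig(env = process.env) {
  const apiKey = env.OPENAI_API_KEY;
  if (!apiKey || apiKey.trim() === "") {
    // Fail fast: throwing here stops startup before the server binds a port.
    throw new Error("OPENAI_API_KEY is not set");
  }

  const port = Number.parseInt(env.PORT ?? "3000", 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    // Assumption: an unparsable or out-of-range PORT is also a startup error.
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }

  return {
    apiKey,
    port,
    faqDir: env.FAQ_DIR ?? "./faqs",
    completionModel: env.COMPLETION_MODEL ?? "gpt-4o-mini",
  };
}
```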

Project Structure

rag-api/
├── faqs/                    # FAQ markdown source documents
│   ├── faq_auth.md
│   ├── faq_sso.md
│   └── faq_employee.md
├── rag.js                   # RAG core: chunking, embedding, retrieval, generation
├── server.js                # Express HTTP API wrapper
├── ingest.js                # Standalone ingestion smoke-test script
├── package.json
├── .env.example
└── README.md

Design Decisions & Trade-offs

Chunking strategy

Chunks are ~200 characters with a 40-character overlap. The overlap keeps a sentence that falls on a chunk boundary from being cut in two, a common failure mode in which the answer straddles two adjacent chunks and neither scores highly enough on its own. The splitter looks for a sentence boundary (a period followed by a space) within the final 60 characters of each window to avoid mid-sentence cuts.

Trade-off: Larger chunks preserve more context per retrieval hit but reduce precision (more noise per chunk). Smaller chunks increase precision but can miss context. 200 chars is on the smaller end; 400–600 chars works better for prose-heavy documents.
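The windowing described above could be sketched as follows. This is an illustration under the stated parameters (200-character window, 40-character overlap, 60-character lookback), not the code in rag.js. The name `chunkText` and its parameter names are hypothetical.

```javascript
// Hypothetical chunker: fixed-size windows with overlap, preferring to end
// a chunk on a sentence boundary found near the end of the window.
function chunkText(text, size = 200, overlap = 40, lookback = 60) {
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + size, text.length);

    if (end < text.length) {
      // Search only the tail of the window for the last ". " boundary.
      const tailStart = Math.max(start, end - lookback);
      const boundary = text.lastIndexOf(". ", end - 1);
      if (boundary >= tailStart) end = boundary + 1; // keep the period
    }

    const chunk = text.slice(start, end).trim();
    if (chunk) chunks.push(chunk);
    if (end >= text.length) break;

    // Step back by `overlap` so adjacent chunks share context,
    // while always moving forward by at least one character.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

If no sentence boundary falls inside the lookback region, the window is cut at the hard size limit, so no chunk ever exceeds `size` characters.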

Vector store

All embeddings are stored in a plain JavaScript array in memory. This is appropriate for a corpus of a few hundred chunks.

Trade-off: No persistence — the store is rebuilt from scratch on every startup (one batched OpenAI Embeddings API call). For larger corpora or faster cold starts, persist embeddings to a JSON file, SQLite, or a purpose-built store like Chroma or pgvector.
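For the JSON-file option mentioned above, a minimal sketch might look like this. It assumes each store entry has the shape `{ source, text, embedding }`. That shape and the helper names `saveStore` and `loadStore` are hypothetical. The sketch does not detect changes to the FAQ files, so a real version would also need some form of invalidation.

```javascript
const fs = require("node:fs");

// Hypothetical helpers: persist the in-memory array of
// { source, text, embedding } entries to a JSON file.
function saveStore(filePath, store) {
  fs.writeFileSync(filePath, JSON.stringify(store));
}

// Returns the saved array, or null when no cache file exists yet,
// in which case the caller would fall back to a full re-embed.
function loadStore(filePath) {
  if (!fs.existsSync(filePath)) return null;
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
}
```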

Embedding model

text-embedding-3-small — OpenAI's smallest and cheapest embedding model. Dimensionality: 1536. Accuracy is sufficient for keyword-heavy FAQ retrieval.

Trade-off: text-embedding-3-large is ~6x more expensive and meaningfully more accurate for semantic queries (paraphrases, ambiguous phrasing). Not necessary here.

Cosine similarity

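Implemented in plain JS with no external library. At this scale (< 1000 chunks, 1536-dim vectors) a brute-force scan of the whole store is roughly 1.5 million multiply-adds per query, which takes on the order of a millisecond and is negligible next to the OpenAI round-trips. For corpora far larger than this, switch to a pre-built ANN index (FAISS, hnswlib) or a vector database.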
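A brute-force version of this retrieval step can be sketched in a few lines. The function names and the `{ source, text, embedding }` entry shape are assumptions made for illustration and may not match rag.js.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new RangeError("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat a zero vector as "no match"
}

// Hypothetical retrieval helper: score every entry, return the k best.
function topK(queryEmbedding, store, k = 4) {
  return store
    .map((entry) => ({
      ...entry,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

OpenAI embeddings are returned normalized to unit length, so a plain dot product would rank identically. The full formula is kept here so the function stays correct for arbitrary vectors.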

Batched embeddings at ingest

All chunks across all files are embedded in a single API call during ingestion. OpenAI bills embeddings per token, not per request, so the token cost is the same as embedding each chunk individually. The gain is in latency and reliability: one HTTP round-trip instead of hundreds, and far less exposure to rate limits.
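The batching logic can be sketched independently of the OpenAI client. In the sketch below, `embedBatch` is a hypothetical injected function that takes an array of strings and resolves to one vector per string. The default `maxBatchSize` of 2048 is an assumption about the per-request input limit and should be checked against the current OpenAI documentation. For a corpus of a few hundred chunks the loop runs exactly once, which matches the single-call behavior described above.

```javascript
// Hypothetical batching helper. `embedBatch(texts)` is assumed to resolve
// to an array of vectors in the same order as its input.
async function embedAll(texts, embedBatch, maxBatchSize = 2048) {
  const embeddings = [];
  for (let i = 0; i < texts.length; i += maxBatchSize) {
    const batch = texts.slice(i, i + maxBatchSize);
    const vectors = await embedBatch(batch);
    if (vectors.length !== batch.length) {
      throw new Error(
        `expected ${batch.length} embeddings, received ${vectors.length}`
      );
    }
    embeddings.push(...vectors);
  }
  return embeddings;
}
```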

Completion model

gpt-4o-mini by default. It's fast, inexpensive, and follows instructions reliably. The system prompt instructs the model to cite sources inline using a (source: filename.md) convention and to answer strictly from the provided context.

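Temperature is set to 0 to make answers as repeatable as possible, which matters for a Q&A system where factual consistency is important. The API does not guarantee identical output across calls even at 0, so treat answers as near-deterministic rather than strictly deterministic.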
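A request body along these lines could be assembled as shown below. The prompt wording is illustrative only, because the actual system prompt in rag.js is not reproduced in this README. The instruction to admit when the context lacks an answer is an added assumption. `buildCompletionRequest` and the `{ source, text }` chunk shape are hypothetical names.

```javascript
// Hypothetical prompt builder; the wording below is an example,
// not the system prompt used by this project.
function buildCompletionRequest(question, chunks, model = "gpt-4o-mini") {
  // Label each chunk with its file so the model can cite it inline.
  const context = chunks
    .map((c) => `[${c.source}]\n${c.text}`)
    .join("\n\n");

  return {
    model,
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "Answer strictly from the provided context. " +
          "Cite each fact inline as (source: filename.md). " +
          "If the context does not contain the answer, say so.",
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  };
}
```

The returned object uses the `model`, `temperature`, and `messages` fields of the Chat Completions API and can be passed to whichever client the server uses.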

Source citation

Sources are derived from the top-k retrieved chunks, not extracted from the LLM's output text. This is more reliable — LLMs sometimes hallucinate or omit citations. The sources array in the response reflects which files actually contributed context, regardless of what the model wrote.
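Deriving the list this way takes only a de-duplication over the retrieved chunks. The sketch assumes each chunk carries a `source` filename. `collectSources` is a hypothetical name.

```javascript
// Hypothetical helper: unique source filenames, in retrieval (rank) order.
function collectSources(retrievedChunks) {
  // A Set preserves insertion order, so the best-ranked file comes first.
  return [...new Set(retrievedChunks.map((chunk) => chunk.source))];
}
```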

Input validation

question must be a non-empty string. An empty string would result in a degenerate embedding and a meaningless (but expensive) API call.
top_k is bounded to [1, 10]. Allowing arbitrarily large values would send enormous context windows to the completion model, inflating cost and latency.
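The two rules above could be checked as in the following sketch. `validateAskBody` is a hypothetical helper. The `question` error string matches the documented 400 response, while the `top_k` error string and the choice to treat whitespace-only questions as empty are assumptions.

```javascript
// Hypothetical validator for the POST /ask request body.
function validateAskBody(body) {
  // The default of 4 applies only when top_k is absent (undefined).
  const { question, top_k: topK = 4 } = body ?? {};

  if (typeof question !== "string" || question.trim() === "") {
    return {
      ok: false,
      error: 'Invalid input: "question" must be a non-empty string.',
    };
  }
  if (!Number.isInteger(topK) || topK < 1 || topK > 10) {
    return {
      ok: false,
      // Assumed wording; the README does not show this message.
      error: 'Invalid input: "top_k" must be an integer between 1 and 10.',
    };
  }
  return { ok: true, value: { question: question.trim(), topK } };
}
```

A route handler would respond with status 400 and `{ error }` whenever `ok` is false, before any OpenAI call is made.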

Error handling

Fail-fast on missing API key at startup — better to crash immediately than to serve 500s on every request.
503 if a request arrives before ingestion completes (possible under slow network conditions).
OpenAI API errors bubble up as 500 with a generic user-facing message; the real error is logged server-side.
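The 503 and 500 paths above can be illustrated with a framework-free sketch. `handleAsk` is a hypothetical function that returns a status and a JSON body instead of writing to an Express response. `ready` and `answer` stand in for the server's ingestion flag and RAG pipeline. The 503 message text is an assumption.

```javascript
// Hypothetical request handler core, decoupled from Express for clarity.
async function handleAsk({ ready, question, answer }) {
  if (!ready) {
    // Ingestion has not finished, so there is nothing to search yet.
    return {
      status: 503,
      body: { error: "Service is still starting. Please retry shortly." },
    };
  }
  try {
    return { status: 200, body: await answer(question) };
  } catch (err) {
    // Log the real cause server-side; return only a generic message.
    console.error("POST /ask failed:", err);
    return {
      status: 500,
      body: { error: "Internal server error. Please try again." },
    };
  }
}
```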

What's intentionally omitted (and why)

Caching: Query-level embedding caching would speed up repeated questions, but adds statefulness that's overkill here.
Re-ranking: A cross-encoder re-ranker (e.g. ms-marco-MiniLM) would improve precision for ambiguous queries, but requires a separate model inference step.
Streaming: The completion response is collected in full before responding. Streaming would improve perceived latency but complicates the response schema.
Auth: Out of scope for a local prototype.

Testing

# Smoke test ingestion and retrieval
node ingest.js

# Health check
curl http://localhost:3000/health

# Password reset question (should cite authentication.md)
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the password requirements?"}'

# Cross-document question (should cite multiple files)
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I authenticate API calls and log in with SSO?", "top_k": 4}'

# Validation: empty question → 400
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": ""}'

# Validation: top_k out of range → 400
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "test", "top_k": 99}'
