- You must code in Python and/or C++.
- The "bare-minimum" task can be accomplished with just a CPU. If your GPU is good enough, though, just use that.
- This task requires you to run your code from your machine's OS terminal. On Windows that is PowerShell (Command Prompt also works); on macOS and Linux a common choice is bash. Your OS might use a different terminal from the ones mentioned here, or might have several. It doesn't matter; just use one.
- After completing this task, you will need to make a screen recording showing that your code works, with you explaining how it works. In the recording you MUST run your program from the terminal.
- Create a GitHub repo containing your code and the video. Name the repo something like "JSB_grade_2_interview_problem" so it's identifiable.
- IMPORTANT (READ THE ENTIRE BULLET POINT) For submission you must:
  - Submit a pull request to this repo so that we have access to your username and can find your repo. However, so that others don't copy your work, do not do your work in the public forked repo.
  - Make a private clone (or however you make things private) of the forked repo and do your actual work there.
  - Send me an invite to the private repo (philipamadasun1@gmail.com) so I can gain access.
  - Please don't make me have to remake this repo again. In your README, make sure to provide your email address.
- You may freely use any tool available to you to accomplish this task. The internet, ChatGPT, anything.
- Some of the code above might help you get started, if you choose to use it.
The platform I advise running LLMs from is ollama, as it's the easiest to set up; here is their repo. The ollama repo also provides example scripts that might give you some inspiration on how to solve parts of the problem. There are other platforms such as vllm and llama.cpp that you could try too. You could also use the transformers library with FastAPI or Flask and set up your own API service that way.
For those with less powerful PCs: again, the "bare-minimum" can be done with just a CPU. You can pull a small LLM like gemma:2b or tinyllama (these are about 2GB or smaller) locally with ollama and just use those. For the webUI you may use streamlit, with Flask as a server to retrieve user queries and LLM responses from. I have provided two scripts which use streamlit and Flask to show a simple example of how to get user input to show up on the streamlit webUI. Again, this is just advice; if you can get this done some other way, do that. You don't have to use ollama, streamlit, or Flask.
JSBCAI / Robotics Lab: LLM Engineering Task
This challenge evaluates your ability to build end-to-end LLM systems, including:
- LLM API integration
- WebUI development
- RAG (Retrieval-Augmented Generation)
- Session & state management
- Tooling & evaluation
- Optional multimodal / speech
- Optional performance profiling
You are allowed to use the internet and AI assistants (ChatGPT, Copilot, Gemini, etc.). What matters is your implementation, architecture reasoning, and the video explanation you submit.
You will build a WebUI + backend service that supports:
- Chat Mode: direct conversation with an LLM
- RAG Mode: the LLM answers from supplied documents using retrieval
- (Optional) Tool Mode: LLM outputs structured robot-action JSON
- A working WebUI (Streamlit, React, Flask templates, anything)
- A backend service (Flask/FastAPI/Node)
- Support for streamed generation into the UI
- The ability to switch between Chat and RAG modes
- Configurable model + server settings
- A reproducible RAG pipeline (document parsing → chunking → embeddings → retrieval)
- A session-based conversation memory
- A persistence layer (SQLite or JSONL logs)
- A short recorded video walkthrough explaining your system
- A GitHub repository containing:
  - Source code (backend + UI)
  - A `README.md` describing setup + usage
  - A `requirements.txt` or `environment.yml`
  - A `config.yaml` or `.env`
- A 3-7 minute walkthrough video (screen-recorded):
  - Show the running system
  - Explain architecture
  - Show Chat mode
  - Show RAG mode
  - Show retrieval sources displayed under answers
  - Show session persistence
  - If you implemented extra credit, demonstrate it
- A short write-up (included in README or separate file):
  - What you built
  - What you struggled with
  - What you would improve with more time
You may use AI tools, but your submission must reflect your own structure, engineering, debugging, and decisions.
Your system must include:
- Can be Flask, FastAPI, Node, etc.
- Exposes endpoints for:
  - `/chat`
  - `/rag`
  - `/stream` (stream responses)
  - `/eval` (optional)
  - `/tool` (optional)
- Must load an LLM through:
  - Ollama, or
  - llama.cpp server, or
  - OpenAI-compatible API
- Must support both blocking ("complete") and streaming responses.
- Any framework:
  - Streamlit
  - React frontend + backend
  - Flask/HTML/CSS
  - Gradio (allowed, but less preferred unless styled cleanly)
- Two clearly labeled modes:
  - Chat
  - RAG
- User and LLM messages must be styled differently (colors / bubbles)
- Show model name & mode in interface
- Show streaming token-by-token responses
- Show sources for RAG answers (retrieved chunks)
- Ability to switch modes without losing conversation history
- Ability to filter conversation history by mode
- Show session ID somewhere
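One way to satisfy the mode-switching, history-filtering, and session-ID items above is to tag every turn with the mode it was produced in. The sketch below is only an illustration under that assumption; the `Session` and `Turn` names and their fields are hypothetical and not part of the task.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    mode: str  # "chat" or "rag": the mode that was active when the turn was made
    role: str  # "user" or "assistant"
    text: str


@dataclass
class Session:
    # Hypothetical container; session_id is the value the UI would display.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    mode: str = "chat"
    turns: List[Turn] = field(default_factory=list)

    def switch_mode(self, mode: str) -> None:
        # Only the active mode changes; earlier turns are left untouched.
        self.mode = mode

    def add(self, role: str, text: str) -> None:
        self.turns.append(Turn(self.mode, role, text))

    def history(self, mode: Optional[str] = None) -> List[Turn]:
        # mode=None returns the full history; otherwise filter by mode.
        return [t for t in self.turns if mode is None or t.mode == mode]
```

Because turns are never dropped on a switch, the full history stays available to the LLM prompt while the UI can still show a per-mode view.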
Always store:
- Timestamp
- Mode
- Input prompt
- LLM response
- RAG retrieved chunks
- (Optional) tool outputs
- Session ID
Persistence options:
- `session_logs.sqlite`
- `logs.jsonl`
- Anything reproducible and queryable
Sessions must reload the last N turns at startup.
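As a rough illustration of the SQLite option, the sketch below stores the fields listed above and reloads the last N turns of a session. The table name, column names, and function names are assumptions made for this example, not a required schema.

```python
import json
import sqlite3
import time

# Hypothetical schema covering the fields the task asks to store.
SCHEMA = """
CREATE TABLE IF NOT EXISTS turns (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    ts         REAL NOT NULL,
    mode       TEXT NOT NULL,
    prompt     TEXT NOT NULL,
    response   TEXT NOT NULL,
    chunks     TEXT NOT NULL,  -- JSON list of retrieved chunks (empty in Chat mode)
    tool_out   TEXT            -- optional JSON tool output
)
"""


def open_log(path="session_logs.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_turn(conn, session_id, mode, prompt, response, chunks=(), tool_out=None):
    conn.execute(
        "INSERT INTO turns (session_id, ts, mode, prompt, response, chunks, tool_out) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (session_id, time.time(), mode, prompt, response,
         json.dumps(list(chunks)),
         None if tool_out is None else json.dumps(tool_out)),
    )
    conn.commit()


def last_turns(conn, session_id, n=10):
    # Fetch the newest n rows, then reverse so the oldest turn comes first.
    rows = conn.execute(
        "SELECT mode, prompt, response FROM turns "
        "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, n),
    ).fetchall()
    return rows[::-1]
```

At startup, the backend could call `last_turns(conn, session_id, n)` with `n` taken from the configured number of memory turns.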
Your RAG pipeline must include:
Use blog and/or PDFs.
- Reasonable chunk size (256-512 tokens or ~500 characters)
- Include chunk metadata (doc name, page number)
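A minimal character-based chunker along these lines might look like the following. The 50-character overlap, the function name, and the dictionary keys are assumptions for illustration; the task only asks for a reasonable chunk size plus metadata.

```python
def chunk_text(text, doc_name, page=None, size=500, overlap=50):
    """Split text into ~size-character chunks that carry source metadata."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap  # consecutive chunks share `overlap` characters
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append({"text": piece, "doc": doc_name,
                           "page": page, "start": start})
        if start + size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks
```

Keeping `doc` and `page` on every chunk is what later lets the UI show where a retrieved snippet came from.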
Use a CPU-safe embedding model such as:
- `all-MiniLM-L6-v2` (Sentence Transformers)
- Or Ollama's `mxbai-embed-large` or `nomic-embed-text`

Both run on a MacBook.
Acceptable options:
- FAISS
- Numpy + cosine similarity
- Annoy
- A simple in-memory store
Retrieve top-k chunks and show them in the UI.
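For the simple in-memory option, top-k retrieval by cosine similarity can be sketched in plain Python as below. This is a stand-in to show the idea (NumPy or FAISS would do the same thing faster); the class and method names are hypothetical, and the vectors are assumed to come from whichever embedding model you picked.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    def __init__(self):
        self.items = []  # list of (embedding vector, chunk dict)

    def add(self, vector, chunk):
        self.items.append((list(vector), chunk))

    def top_k(self, query_vector, k=3):
        # Score every stored chunk against the query and keep the k best.
        scored = [(cosine(query_vector, vec), chunk) for vec, chunk in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

The returned `(score, chunk)` pairs carry the chunk metadata, so the UI can list snippets and document names directly from the result.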
Each answer must show:
- Retrieved text snippets
- Document name / source
Include a config.yaml or .env:
model: "tinyllama"
llm_server_url: "http://localhost:11434"
embedding_model: "all-MiniLM-L6-v2"
vector_store_path: "./index/faiss.index"
max_context_tokens: 4096
session_memory_turns: 10

Also provide:
- `run.sh` or `make run`
This should:
- Start the backend
- Start the UI
- Optionally start the local LLM server if needed
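To show how the backend might read the flat `key: value` settings from the config example, here is a small stdlib-only parser. It is a simplified stand-in written for this illustration (a real YAML library handles far more syntax), and the function name is hypothetical.

```python
def parse_flat_config(text):
    """Parse flat `key: value` lines; not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks, comments, and lines without a key
        key, value = line.split(":", 1)  # split once so URLs keep their colons
        value = value.strip().strip('"').strip("'")
        config[key.strip()] = int(value) if value.isdigit() else value
    return config


def load_config(path="config.yaml"):
    with open(path, encoding="utf-8") as handle:
        return parse_flat_config(handle.read())
```

Loading the settings once at startup and passing the resulting dictionary around keeps the model name and server URL out of the code.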
- Must be chunked, SSE (Server-Sent Events), or incremental polling
- UI must show text appearing gradually
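If you go the SSE route, each token can be framed as one event. The generator below is a sketch under that assumption; the `token` field name and the closing `done` event are conventions made up for this example, not part of any standard payload.

```python
import json


def sse_events(tokens):
    """Wrap an iterable of text tokens as Server-Sent Events frames."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the UI knows when to stop listening.
    yield "event: done\ndata: {}\n\n"
```

In Flask, for instance, a generator like this can be returned as `Response(sse_events(tokens), mimetype="text/event-stream")`, and the UI appends each token as it arrives.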
Chat → RAG should keep conversation state. RAG → Chat should preserve the chat messages and continue naturally.
UI must indicate:
- When server is loading
- When LLM server is unreachable
- When LLM returns invalid JSON for tool mode
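For the "unreachable" state, the backend could probe the LLM server before each request and report the result to the UI. The check below is a minimal sketch; the function name is hypothetical, and it assumes the server is local, so any system proxy is bypassed.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: the LLM server is local, so ignore any configured HTTP proxy.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def llm_server_reachable(url, timeout=2.0):
    """Return True if something answers HTTP at url, False otherwise."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status: it is reachable.
        return True
    except (urllib.error.URLError, OSError,
            http.client.HTTPException, ValueError):
        # Refused connection, timeout, broken response, or malformed URL.
        return False
```

The UI can then show a clear "LLM server unreachable" message instead of hanging on a request that will never complete.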
These are optional but valuable.
LLM must output a JSON of the form:
{
"action": "move_to",
"params": {"x": 0.4, "y": 1.1},
"natural_language_explanation": "I'm moving toward the desk."
}

Backend must:
- Validate JSON
- Display parsed actions in UI
- Show errors if malformed
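The validation step could be as small as the sketch below, which checks the three fields from the JSON example. The function name and the `(result, error)` return convention are assumptions chosen for illustration.

```python
import json

# Field names and types taken from the example JSON in this task.
REQUIRED_FIELDS = {
    "action": str,
    "params": dict,
    "natural_language_explanation": str,
}


def parse_tool_output(raw):
    """Return (action_dict, None) on success or (None, error_message)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            return None, f"missing field: {key}"
        if not isinstance(data[key], expected):
            return None, f"field {key} must be of type {expected.__name__}"
    return data, None
```

Returning the error as a string, rather than raising, makes it easy to pass the message straight to the UI.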
Create `--eval` CLI or `/eval` endpoint:
- Ask 5-10 questions about the provided docs
- Use RAG mode internally
- Compare answers with:
  - Keyword overlap, OR
  - Exact expected phrases
- Produce a score like:

RAG Accuracy: 7/10 (70%)
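A keyword-overlap scorer producing that kind of line can be sketched as follows. The 0.5 threshold, the case format, and the `answer_fn` callback (standing in for your RAG pipeline) are all assumptions for this example.

```python
def keyword_hit(answer, keywords, threshold=0.5):
    """True if at least `threshold` of the expected keywords appear in answer."""
    if not keywords:
        return False
    lowered = answer.lower()
    found = sum(1 for kw in keywords if kw.lower() in lowered)
    return found / len(keywords) >= threshold


def run_eval(cases, answer_fn):
    # cases: list of {"question": str, "keywords": [str, ...]}
    # answer_fn: hypothetical hook that runs RAG mode and returns the answer text
    correct = sum(
        1 for case in cases
        if keyword_hit(answer_fn(case["question"]), case["keywords"])
    )
    total = len(cases)
    percent = round(100 * correct / total) if total else 0
    return f"RAG Accuracy: {correct}/{total} ({percent}%)"
```

Substring matching is crude, so it is worth choosing keywords that are unlikely to appear in a wrong answer.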
- Time-To-First-Token
- Total response time
- Token throughput (tokens/s)
- Embedding indexing time
Show metrics in UI or log them.
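The first three metrics can be measured by wrapping whatever generator yields the streamed tokens. The helper below is a sketch: its name and the returned keys are hypothetical, and it counts each yielded chunk as one token, which is only an approximation of the model's true token count.

```python
import time


def profile_stream(tokens, clock=time.perf_counter):
    """Consume a token stream and report TTFT, total time, and throughput."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = clock()  # moment the first token arrived
        count += 1
    total = clock() - start
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": total,
        "tokens": count,
        "tokens_per_s": count / total if total > 0 else 0.0,
    }
```

In a real backend you would yield each token onward to the UI while timing it; the `clock` parameter exists only so the function can be tested with a fake timer.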
- Low-resource STT (`faster-whisper small`)
- Any TTS (even API-based)
- UI button:
  - "Record"
  - "Play answer audio"
If you have a multimodal model:
- Add UI image upload
- Route prompt + image to model
- Show answer inline
If on GPU:
- Run 7B-8B model locally
- Demonstrate faster inference or better quality
- Compare TTFT and throughput vs tiny CPU model
| Category | Points | Description |
|---|---|---|
| Backend implementation | 20 | clean routing, LLM integration, streaming |
| WebUI quality | 20 | clarity, styling, colors, streaming, switching modes |
| RAG correctness | 25 | indexing, retrieval, sources, citations |
| Session memory + persistence | 10 | logs, reload, multi-session |
| Config & reproducibility | 10 | .env, config.yaml, run.sh |
| Video walkthrough | 15 | clarity, explanation, demonstration |
| Extra Credit Tier 1 | +10 | tool mode, eval mode, profiling |
| Extra Credit Tier 2 | +10 | speech/multimodal/big model |
Maximum: 110 points
    User → WebUI → Backend API → LLM Server
                        |             |
                  Vector Store - Embeddings
                        |
                    Documents
This section will be filled by you (the candidate) after implementation:
pip install -r requirements.txt
ollama serve &
ollama pull tinyllama
python backend/main.py
streamlit run ui/app.py
Before submitting, ensure you have: