NLP RAG & Prompt Engineering Project

Description

This project is part of the 24-2 Natural Language Processing course and focuses on building a question-answering system enhanced with Retrieval-Augmented Generation (RAG) and advanced prompt engineering techniques. The system is designed to improve the performance of a Large Language Model (LLM) provided by Upstage (Solar).

Models

Backbone LLM : Upstage solar-1-mini-chat
Document Loader : UpstageLayoutAnalysisLoader
Splitter : RecursiveCharacterTextSplitter
Embeddings : UpstageEmbeddings
- solar-embedding-1-large-passage
- solar-embedding-1-large-query
Preprocessing Tools:
- NLTK
- TF-IDF Vectorizer
API : Wikipedia API
Prompt Engineering : langchain_core.prompts.prompt.PromptTemplate

Datasets

EWHA Academic Policies
Domain-Specific Data:

Business : Collected from Business Math resources
Law : Sourced from U.S. Government Information
Philosophy : Sourced from Philosophy 101
Psychology : Sourced from Cognitive Psychology and Its Implications

Test Data: MMLU-PRO

Methodology

The methods used in this project include:

Data Collection
- Implement a keyword extraction pipeline using NLTK and TF-IDF to enhance query generation.
  - Use the NLTK stop word list to remove unnecessary words from the keyword list.
- Gather domain-specific documents (Business, Psychology, Philosophy, Law) to improve coverage in underperforming areas.
- Retrieve additional documents via the Wikipedia API.
- Attempted web crawling with BeautifulSoup to gather large amounts of information for specific domains.
  - Not used in the final model.
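
The keyword extraction step can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual code: the small `STOP_WORDS` set stands in for the NLTK stop word list, and the hand-written `tfidf_keywords` function stands in for a TF-IDF vectorizer. Both names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for the NLTK English stop word list.
STOP_WORDS = {
    "a", "an", "the", "of", "and", "is", "are", "in", "to",
    "for", "on", "that", "with", "what", "which", "by", "under",
}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results
```

The resulting keywords could then be used as search terms, for example when querying the Wikipedia API for related documents.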
Embeddings
- Use RecursiveCharacterTextSplitter to split documents into chunks before embedding.
- Use UpstageEmbeddings, provided through the Upstage API, for document retrieval and domain classification.
- Embed three kinds of text:
  - the split documents for each domain
  - the query
  - the related keyword lists for each domain
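
To illustrate the splitting step, the sketch below re-implements the general idea behind recursive character splitting in plain Python: try coarse separators first (paragraphs, then lines, then words) and fall back to finer ones only when a piece is still too long. It is a simplified stand-in for RecursiveCharacterTextSplitter, not its actual implementation. The function name `recursive_split` and the chunk size are assumptions, and chunk overlap is omitted.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried from coarsest to finest; the empty string
    means "cut at a fixed width" and is the last resort.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

Each resulting chunk would then be passed to the embedding model, alongside the query and the per-domain keyword lists.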
Context Retrieval
- Calculate the similarity between the query and contexts
- Select the top 7 contexts that best match the query within each domain for the MMLU dataset.
- Select the top 3 contexts that best match the query for the EWHA dataset.
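
A minimal sketch of the top-k selection, assuming cosine similarity as the similarity measure (the README does not name the metric). The vectors here are toy values; in the project they would come from the embedding model. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_contexts(query_vec, context_vecs, contexts, k):
    """Return the k context strings whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(context_vecs, contexts)
    ]
    # Highest similarity first; sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy example: k=7 would be used for MMLU and k=3 for the EWHA dataset.
query = [1.0, 0.0]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
texts = ["a", "b", "c", "d"]
best = top_k_contexts(query, vectors, texts, k=2)
```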
Prompt Engineering
- Introduce Chain-of-Thought (CoT) reasoning and one-shot prompting.
- Develop domain-specific prompts tailored to specific question types.
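
The prompt structure can be sketched as below. The template text, the worked example, and the `build_prompt` helper are all hypothetical illustrations, not the project's actual prompts. Plain `str.format` is used so the sketch runs without dependencies; the `{name}` placeholders follow the same style as PromptTemplate's default format.

```python
import string

# Hypothetical one-shot, Chain-of-Thought template for multiple-choice questions.
COT_ONE_SHOT_TEMPLATE = """You are answering a multiple-choice question in the {domain} domain.
Use the context when it is relevant, and reason step by step before giving the answer.

Example
Question: A store sells a $50 item at a 20% discount. What is the sale price?
Options: (A) $30 (B) $40 (C) $45 (D) $10
Reasoning: 20% of 50 is 10, so the sale price is 50 - 10 = 40.
Answer: (B)

Context:
{context}

Question: {question}
Options: {options}
Reasoning:"""


def build_prompt(domain, context_chunks, question, options):
    """Fill the template with retrieved context and lettered answer options."""
    lettered = " ".join(
        f"({letter}) {option}"
        for letter, option in zip(string.ascii_uppercase, options)
    )
    return COT_ONE_SHOT_TEMPLATE.format(
        domain=domain,
        context="\n".join(context_chunks),
        question=question,
        options=lettered,
    )
```

A domain-specific variant would swap in a worked example and instructions suited to that domain's question types.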

Contributors

Minji Kim

Hyemin Boo

Seoyoung Kim

Jiin Lee
