🗞️ News Intelligence Laboratory — Text Retrieval + Transformer Classification
Purpose
Design an end-to-end Natural Language Processing (NLP) workflow combining information retrieval and news classification using transformer-based models (RoBERTa, DeBERTa, and ModernBERT).
The laboratory has two integrated tasks:
- Task 1: Building a News Retrieval System from the RPP RSS Feed
- Task 2: Fine-tuning Transformer Models for AG News Classification and Evaluation
This lab connects real-time news ingestion, embeddings, vector search, and transformer-based categorization to simulate a modern AI-driven media analysis pipeline.
📘 Repository Name
- Task 1: news-query_RPP-lab
- Task 2: News_Classification-lab
Important: Create two repositories, one for each task.
🧩 Structure
Task 1 — News Retrieval and Embedding System (RPP RSS Feed)
Objective:
Ingest the latest news from RPP Perú (https://rpp.pe/rss), embed them using SentenceTransformers, and build a retrieval system using ChromaDB orchestrated with LangChain.
Steps
0️⃣ Load Data
- Use feedparser to extract the 50 latest news items from the RPP RSS feed.
- Each record should include: title, description, link, published (date).
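A minimal sketch of this step. The field names follow standard RSS conventions (title, description, link, published), which RPP's feed is assumed to use; the feedparser import is kept inside the fetch function so the record-mapping helper has no hard dependency.

```python
def to_record(entry):
    """Map a parsed feed entry (dict-like) to the record schema of Task 1."""
    return {
        "title": entry.get("title", ""),
        "description": entry.get("description", ""),
        "link": entry.get("link", ""),
        "published": entry.get("published", ""),
    }

def fetch_rpp_news(url="https://rpp.pe/rss", limit=50):
    """Download the feed and return the latest `limit` items as records."""
    import feedparser  # local import: only needed when actually fetching
    feed = feedparser.parse(url)
    return [to_record(e) for e in feed.entries[:limit]]
```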
-
1️⃣ Tokenization
- Tokenize a sample article using tiktoken.
- Compute num_tokens and decide whether chunking is needed (based on model context limits).
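The token check above can be sketched as follows. The encoding name ("cl100k_base") and the 512-token context limit are illustrative assumptions, not part of the lab spec; pick values matching your embedding model.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken (import is local so the helper below stays pure)."""
    import tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

def needs_chunking(num_tokens, context_limit=512):
    """Decide whether an article exceeds the model's context window."""
    return num_tokens > context_limit
```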
2️⃣ Embedding
- Generate embeddings using: model_name = "sentence-transformers/all-MiniLM-L6-v2"
- Store embeddings alongside text and metadata.
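A sketch of the embedding step using the model named above. `embed_texts` requires sentence-transformers; `attach_embeddings` is a pure helper (an assumed name, not a library call) that pairs each text with its metadata and vector for storage.

```python
def embed_texts(texts, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode a list of strings into embedding vectors."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()

def attach_embeddings(texts, metadatas, vectors):
    """Bundle each text with its metadata and embedding vector."""
    return [
        {"text": t, "metadata": m, "embedding": v}
        for t, m, v in zip(texts, metadatas, vectors)
    ]
```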
3️⃣ Create or Upsert Chroma Collection
- Use ChromaDB to store documents, metadata, and embeddings.
- Implement a retriever that supports similarity search by keyword or description.
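A minimal ChromaDB sketch for this step. The collection name "rpp_news" is an assumption; `make_ids` derives stable document ids from article links so re-running the notebook upserts rather than duplicates.

```python
def make_ids(records):
    """Use each article's link as a stable unique id (fallback: positional id)."""
    return [r.get("link") or f"doc-{i}" for i, r in enumerate(records)]

def upsert_news(records, embeddings):
    """Store documents, metadata, and embeddings in a Chroma collection."""
    import chromadb
    client = chromadb.Client()  # in-memory; use PersistentClient for disk storage
    collection = client.get_or_create_collection("rpp_news")
    collection.upsert(
        ids=make_ids(records),
        documents=[r["description"] for r in records],
        metadatas=records,
        embeddings=embeddings,
    )
    return collection
```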
4️⃣ Query Results
- Query with a prompt like "Últimas noticias de economía" ("Latest economy news").
- Display results in a pandas DataFrame with columns: title | description | link | date_published
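The query output can be shaped into the required table as below. This assumes each retrieved document's metadata carries the fields stored in the load step; the hypothetical usage comment shows where the metadata would come from.

```python
import pandas as pd

def results_to_dataframe(metadatas):
    """Build the required display table from the metadata of retrieved documents."""
    rows = [
        {
            "title": m.get("title"),
            "description": m.get("description"),
            "link": m.get("link"),
            "date_published": m.get("published"),
        }
        for m in metadatas
    ]
    return pd.DataFrame(rows, columns=["title", "description", "link", "date_published"])

# Hypothetical usage with a Chroma collection from the previous step:
# metadatas = collection.query(
#     query_texts=["Últimas noticias de economía"], n_results=5)["metadatas"][0]
# results_to_dataframe(metadatas)
```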
5️⃣ Orchestrate with LangChain
- Implement an end-to-end pipeline in LangChain that loads RSS → tokenizes → embeds → stores → retrieves.
- Each step should be modular (functions or LangChain chains).
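The modular pipeline can be sketched in plain functions with the same shape as a LangChain chain: each step below would map one-to-one to a `RunnableLambda` piped with `|`. The stub steps are placeholders, not real implementations.

```python
def compose(*steps):
    """Run data through each step in order, like chained LangChain Runnables."""
    def pipeline(data):
        for step in steps:
            data = step(data)
        return data
    return pipeline

# Stub steps standing in for load -> tokenize -> embed -> store -> retrieve:
pipeline = compose(
    lambda url: [f"article from {url}"],      # load RSS (stub)
    lambda docs: [d.split() for d in docs],   # tokenize (stub)
    lambda toks: [(t, [0.0]) for t in toks],  # embed with dummy vectors (stub)
    lambda pairs: {"stored": pairs},          # store (stub)
    lambda db: db["stored"],                  # retrieve (stub)
)
```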
🧮 Deliverables (Task 1)
- Jupyter Notebook
- requirements.txt
- README.md
Task 2 — Transformer News Classification (AG News Dataset)
Objective:
Train and compare transformer-based models (RoBERTa, DeBERTa, ModernBERT) on the AG News dataset.
Dataset
from datasets import load_dataset
dataset = load_dataset("ag_news")
Categories:
0 - World
1 - Sports
2 - Business
3 - Science/Technology
Steps
1️⃣ Data Preparation
- Split the dataset into:
  - 70% training
  - 15% validation
  - 15% test
- Use only train + validation for model tuning.
- Keep the test set untouched for final evaluation.
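AG News ships as train/test only, so one way to obtain 70/15/15 is to pool the examples and re-split with a fixed seed; the pooling approach and the seed value below are assumptions, not lab requirements.

```python
from sklearn.model_selection import train_test_split

def split_70_15_15(indices, labels, seed=42):
    """Stratified 70/15/15 split over example indices, reproducible via `seed`."""
    train_idx, rest_idx, _, rest_y = train_test_split(
        indices, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # Split the remaining 30% evenly into validation and test.
    val_idx, test_idx = train_test_split(
        rest_idx, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return train_idx, val_idx, test_idx
```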
2️⃣ Model Training
Train the three models separately using Hugging Face Transformers.
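A hedged sketch of the per-model training loop. The checkpoint names in the comments (e.g. "roberta-base", "microsoft/deberta-v3-base", "answerdotai/ModernBERT-base") are assumptions; the `compute_metrics` helper is pure and reusable across all three runs.

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Macro F1 for a Hugging Face Trainer's (logits, labels) eval pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

# Hedged per-model sketch (ckpt, train_ds, val_ds are placeholders):
# from transformers import (AutoModelForSequenceClassification,
#                           Trainer, TrainingArguments)
# model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=4)
# trainer = Trainer(model=model,
#                   args=TrainingArguments(output_dir=f"out/{ckpt}"),
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   compute_metrics=compute_metrics)
# trainer.train()
```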
3️⃣ Evaluation
- Plot an F1-score comparison between the models using matplotlib or seaborn.
- Include:
  - Training curves (optional)
  - Bar-chart comparison (RoBERTa vs. DeBERTa vs. ModernBERT)
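A minimal version of the required bar chart. The scores passed in would come from your evaluation runs; nothing here is a real result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt

def plot_f1_comparison(scores):
    """Bar chart of macro F1 per model, with labeled axes."""
    fig, ax = plt.subplots()
    ax.bar(list(scores.keys()), list(scores.values()))
    ax.set_xlabel("Model")
    ax.set_ylabel("F1-score (macro)")
    ax.set_title("AG News test-set F1 by model")
    return fig, ax
```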
4️⃣ Bonus Task — RPP RSS Classification
- Pass the 50 RPP news articles retrieved in Task 1 to an LLM (e.g., the ChatGPT API or another open LLM) and ask it to classify each article into one of the four AG News categories.
- Store the LLM classifications as a "ground-truth-like" reference.
- Pass the same RPP articles through your three trained models.
- Compare F1-scores between the models and the LLM-assigned labels.
- Discuss:
  - Are the model predictions consistent with the LLM's?
  - Which model aligns best with the LLM classification?
  - Hypothesize reasons for discrepancies (e.g., model pretraining domain, context length).
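The comparison step above reduces to scoring each model's RPP predictions against the LLM-assigned reference labels (using the AG News mapping 0 - World, 1 - Sports, 2 - Business, 3 - Science/Tech); a minimal sketch:

```python
from sklearn.metrics import f1_score

def compare_to_llm(llm_labels, model_predictions):
    """Macro F1 of each model's predictions vs. the LLM reference labels."""
    return {
        name: f1_score(llm_labels, preds, average="macro")
        for name, preds in model_predictions.items()
    }
```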
🧮 Deliverables (Task 2)
- /notebooks/agnews_train_eval.ipynb
- /data/rpp_classified.json (optional)
- Graph comparing model performance (F1-scores)
- Markdown summary with interpretation of results
🧮 Rubric (20 pts)
Data & Reproducibility — 4 pts
- Organized repository structure (/src, /data, /notebooks, /outputs).
- Functional Google Colab or Jupyter notebook provided.
- All file paths are relative, no absolute directories.
- A complete and functional requirements.txt or pyproject.toml file included.
- Code runs end-to-end without manual intervention.
Task 1: Retrieval System — 6 pts
- Correct RSS parsing from RPP feed (https://rpp.pe/rss).
- Proper tokenization and token-count verification using tiktoken.
- Generation of embeddings with sentence-transformers/all-MiniLM-L6-v2.
- Creation and management of a ChromaDB collection (store + upsert + retrieval).
- LangChain orchestration connecting all steps (load → tokenize → embed → store → query).
- Clear output table displaying: title | description | link | date_published.
Task 2: Transformer Models (AG News) — 6 pts
- AG News dataset properly loaded and split into 70/15/15 (train/validation/test).
- Fine-tuning of RoBERTa, DeBERTa, and ModernBERT models.
- Models trained only on train + validation; test set reserved for final evaluation.
- F1-score (macro or weighted) computed for each model.
- Test set used only once for final comparison.
- Discussion of model behavior and observed differences.
Visualization & Comparison — 2 pts
- At least one F1-score comparison chart (bar plot or table).
- Proper axis labeling and legend.
- Markdown discussion or brief interpretation of which model performs best and why.
Bonus Task (LLM Classification) — +3 pts
- Use of an LLM (e.g., ChatGPT or an open-source equivalent) to classify the 50 RPP news items into AG News categories: 0 - World, 1 - Sports, 2 - Business, 3 - Science/Tech.
- Comparison of model predictions vs. LLM classifications using F1-score.
- Analytical discussion on:
- Consistency between models and LLM.
- Possible reasons for divergences (e.g., domain differences, context length, embeddings).
- Visualization of comparative F1-scores (optional but recommended).
Penalties (−0.5 each)
- Missing or incomplete README.md.
- Missing requirements.txt or incorrect dependencies.
- Non-reproducible results (unavailable data, missing random seeds, or broken scripts).
- Incomplete or unclear result documentation.
🛠️ Technical Requirements
- Python 3.10+
- Packages: feedparser, tiktoken, sentence-transformers, chromadb, langchain, datasets, transformers, torch, matplotlib, pandas, seaborn, scikit-learn
- Runnable in Google Colab
🔁 Recommended Workflow
- Task 1:
  - Parse RSS → inspect text length → embed → store → query → display results
  - Document examples (the 5 most recent retrieved items)
- Task 2:
  - Load AG News → split data → train models → compare F1 → visualize
- Bonus:
  - Classify RPP news via LLM → test with your 3 models → compare outcomes
  - Discuss interpretability and model alignment
📤 Submission
Submit your GitHub repository URLs in the following Google Sheet:
👉 [Submission Excel – Repository & Dashboard Links]
Deadline: October 23