
Homework_4 #158

@jeanpool1415

Description


🗞️ News Intelligence Laboratory — Text Retrieval + Transformer Classification

Purpose

Design an end-to-end Natural Language Processing (NLP) workflow combining information retrieval and news classification using transformer-based models (RoBERTa, DeBERTa, and ModernBERT).

The laboratory has two integrated tasks:

  1. Task 1: Building a News Retrieval System from RPP RSS Feed

  2. Task 2: Fine-tuning Transformer Models for AG News Classification and Evaluation

This lab connects real-time news ingestion, embeddings, vector search, and transformer-based categorization to simulate a modern AI-driven media analysis pipeline.


📘 Repository Name

Task 1: news-query_RPP-lab


Task 2: News_Classification-lab


Important: create two separate repositories, one per task.

🧩 Structure

Task 1 — News Retrieval and Embedding System (RPP RSS Feed)

Objective:
Ingest the latest news from RPP Perú (https://rpp.pe/rss), embed them using SentenceTransformers, and build a retrieval system using ChromaDB orchestrated with LangChain.

Steps

0️⃣ Load Data

  • Use feedparser to extract the 50 most recent news items from the RPP RSS feed.

  • Each record should include:

    • title, description, link, published (date).

1️⃣ Tokenization

  • Tokenize a sample article using tiktoken.

  • Compute num_tokens and decide if chunking is needed (based on model context limits).

2️⃣ Embedding

  • Generate embeddings using:

    model_name = "sentence-transformers/all-MiniLM-L6-v2"
  • Store embeddings alongside text and metadata.

3️⃣ Create or Upsert Chroma Collection

  • Use ChromaDB to store documents, metadata, and embeddings.

  • Implement a retriever that supports:

    • Similarity search by keyword or description.

4️⃣ Query Results

  • Query with a prompt like “Últimas noticias de economía” (“latest economy news”).

  • Display results in a pandas DataFrame with columns:
    title | description | link | date_published
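Flattening a Chroma query result into the requested table might look like this (the result layout, lists of lists keyed by `documents` and `metadatas`, is Chroma's query return shape):

```python
import pandas as pd

def results_to_dataframe(res: dict) -> pd.DataFrame:
    """Turn a Chroma query result into the title|description|link|date table."""
    rows = [
        {
            "title": meta.get("title", ""),
            "description": doc,
            "link": meta.get("link", ""),
            "date_published": meta.get("published", ""),
        }
        # [0] selects the results of the first (only) query
        for doc, meta in zip(res["documents"][0], res["metadatas"][0])
    ]
    return pd.DataFrame(rows, columns=["title", "description", "link",
                                       "date_published"])
```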

5️⃣ Orchestrate with LangChain

  • Implement an end-to-end pipeline in LangChain that:

    • Loads RSS → Tokenizes → Embeds → Stores → Retrieves.

  • Each step should be modular (functions or LangChain chains).

🧮 Deliverables (Task 1)

  • Jupyter Notebook

  • requirements.txt

  • README.md


Task 2 — Transformer News Classification (AG News Dataset)

Objective:
Train and compare transformer-based models (RoBERTa, DeBERTa, ModernBERT) on the AG News dataset.

Dataset

from datasets import load_dataset

dataset = load_dataset("ag_news")

Categories:

  • 0 - World
  • 1 - Sports
  • 2 - Business
  • 3 - Science/Technology

Steps

1️⃣ Data Preparation

  • Split the dataset into:

    • 70% training

    • 15% validation

    • 15% test

  • Use only train + validation for model tuning.

  • Keep test data untouched for final evaluation.

2️⃣ Model Training
Train the three models (RoBERTa, DeBERTa, ModernBERT) separately using Hugging Face Transformers.

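A compressed `Trainer` sketch for one model. The checkpoint names are assumptions to verify on the Hugging Face Hub, and the hyperparameters are placeholders, not tuned values:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Checkpoint names are assumptions; confirm them on the Hugging Face Hub.
CHECKPOINTS = {
    "RoBERTa": "roberta-base",
    "DeBERTa": "microsoft/deberta-v3-base",
    "ModernBERT": "answerdotai/ModernBERT-base",
}

def fine_tune(checkpoint: str, train_ds, val_ds, output_dir: str) -> Trainer:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True)

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                               num_labels=4)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=2,            # placeholder
                             per_device_train_batch_size=16)  # placeholder
    trainer = Trainer(model=model, args=args,
                      data_collator=DataCollatorWithPadding(tokenizer),
                      train_dataset=train_ds.map(tokenize, batched=True),
                      eval_dataset=val_ds.map(tokenize, batched=True))
    trainer.train()
    return trainer
```

Calling `fine_tune` once per entry in `CHECKPOINTS` keeps the three runs identical except for the checkpoint, which makes the F1 comparison fair.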

3️⃣ Evaluation

  • Plot F1-score comparison between models using matplotlib or seaborn.

  • Include:

    • Training curves (optional)

    • Bar chart comparison (RoBERTa vs DeBERTa vs ModernBERT)
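The comparison chart can be as simple as the sketch below (the Agg backend makes it render headlessly, e.g. in Colab scripts; the scores dict is whatever your evaluation produced):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

def plot_f1_comparison(scores: dict, path: str = "f1_comparison.png") -> str:
    """Bar chart of macro F1 per model, saved to disk."""
    fig, ax = plt.subplots()
    ax.bar(list(scores), list(scores.values()))
    ax.set_xlabel("Model")
    ax.set_ylabel("Macro F1 (test set)")
    ax.set_title("AG News: model comparison")
    ax.set_ylim(0, 1)
    fig.savefig(path)
    plt.close(fig)
    return path
```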

4️⃣ Bonus Task — RPP RSS Classification

  • Pass the 50 RPP news articles retrieved in Task 1 to an LLM (e.g., ChatGPT API or other open LLM)
    → Ask it to classify each article into one of the four AG News categories.

  • Store LLM classifications as “ground-truth-like” reference.

  • Pass the same RPP articles through your three trained models.

  • Compare F1-scores between models vs LLM-assigned labels.

  • Discuss:

    • Are model predictions consistent with the LLM?

    • Which model aligns best with the LLM classification?

    • Hypothesize reasons for discrepancies (e.g., model pretraining domain, context length, etc.).
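For the bonus comparison, scikit-learn's `f1_score` against the LLM-assigned labels is enough (the function and argument names here are illustrative):

```python
from sklearn.metrics import f1_score

def compare_to_llm(llm_labels, model_preds: dict) -> dict:
    """Macro F1 of each fine-tuned model against the LLM-assigned labels."""
    return {name: f1_score(llm_labels, preds, average="macro")
            for name, preds in model_preds.items()}
```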

🧮 Deliverables (Task 2)

  • /notebooks/agnews_train_eval.ipynb

  • /data/rpp_classified.json (optional)

  • Graph comparing model performance (F1-scores)

  • Markdown summary with interpretation of results


🧮 Rubric (20 pts)


Data & Reproducibility — 4 pts

  • Organized repository structure (/src, /data, /notebooks, /outputs).
  • Functional Google Colab or Jupyter notebook provided.
  • All file paths are relative, no absolute directories.
  • A complete and functional requirements.txt or pyproject.toml file included.
  • Code runs end-to-end without manual intervention.

Task 1: Retrieval System — 6 pts

  • Correct RSS parsing from RPP feed (https://rpp.pe/rss).
  • Proper tokenization and token count verification using tiktoken.
  • Generation of embeddings with sentence-transformers/all-MiniLM-L6-v2.
  • Creation and management of a ChromaDB collection (store + upsert + retrieval).
  • LangChain orchestration connecting all steps (load → tokenize → embed → store → query).
  • Clear output table displaying:
    title | description | link | date_published.

Task 2: Transformer Models (AG News) — 6 pts

  • AG News dataset properly loaded and split into 70/15/15 (train/validation/test).
  • Fine-tuning of RoBERTa, DeBERTa, and ModernBERT models.
  • Models trained only on train + validation; test set reserved for final evaluation.
  • F1-score (macro or weighted) computed for each model.
  • Test set used only once for final comparison.
  • Discussion of model behavior and observed differences.

Visualization & Comparison — 2 pts

  • At least one F1-score comparison chart (bar plot or table).
  • Proper axis labeling and legend.
  • Markdown discussion or brief interpretation of which model performs best and why.

Bonus Task (LLM Classification) — +3 pts

  • Use of an LLM (e.g., ChatGPT or open-source equivalent) to classify 50 RPP news items into AG News categories:
    0 - World, 1 - Sports, 2 - Business, 3 - Science/Tech.
  • Comparison of model predictions vs. LLM classifications using F1-score.
  • Analytical discussion on:
    • Consistency between models and LLM.
    • Possible reasons for divergences (e.g., domain differences, context length, embeddings).
  • Visualization of comparative F1-scores (optional but recommended).

Penalties (−0.5 each)

  • Missing or incomplete README.md.
  • Missing requirements.txt or incorrect dependencies.
  • Non-reproducible results (unavailable data, missing random seeds, or broken scripts).
  • Incomplete or unclear result documentation.

🛠️ Technical Requirements

  • Python 3.10+

  • Packages:

    feedparser tiktoken sentence-transformers chromadb langchain datasets transformers torch matplotlib pandas seaborn scikit-learn
  • Runnable in Google Colab


🔁 Recommended Workflow

  1. Task 1:

    • Parse RSS → Inspect text length → Embed → Store → Query → Display results

    • Document examples (5 most recent retrieved items)

  2. Task 2:

    • Load AG News → Split data → Train models → Compare F1 → Visualize

  3. Bonus:

    • Classify RPP news via LLM → Test with your 3 models → Compare outcomes

    • Discuss interpretability and model alignment


📤 Submission

Submit:

  • GitHub repository URL

in the following Google Sheet:
👉 [Submission Excel – Repository & Dashboard Links]

Deadline: October 23
