Commit 5735f3f

Merge pull request #42 from Negiiiin/wikipedia_capabilities
Added Wikipedia capabilities vs. generated capabilities
2 parents 2f34c46 + 27589fa commit 5735f3f

7 files changed: 1867 additions, 0 deletions

README.md

Lines changed: 43 additions & 0 deletions
@@ -88,3 +88,46 @@ python -m src.agentic_capability_generator
# Generate tasks for each capability
python -m src.agentic_task_generator
```

### Wikipedia-Based Analysis Tools

Tools for extracting, processing, and matching mathematical capabilities from Wikipedia. All prompts are centralized in `wikipedia/prompts.py`.

#### Wikipedia Glossary Scraper

Scrapes Wikipedia's "Glossary of areas of mathematics", extracts capability descriptions, and generates summaries with LLM-powered categorization.

```bash
cd wikipedia
python wikipedia_scraper.py
```

Outputs JSON files to `wikipedia/pages/` containing `capability_name`, `description`, `summary`, `area`, `url`, and `timestamp`.
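
As a rough illustration of how the scraper output could be consumed, the sketch below loads every JSON file in a pages directory and reports files that lack one of the six documented fields. The helper name `load_wikipedia_pages` and the one-object-per-file layout are assumptions; only the field names come from the README.

```python
import json
from pathlib import Path

# Field names as listed in the README. The layout of one JSON object
# per file is an assumption made for this sketch.
EXPECTED_FIELDS = ("capability_name", "description", "summary", "area", "url", "timestamp")


def load_wikipedia_pages(pages_dir):
    """Load every *.json file in pages_dir.

    Returns (records, problems), where problems maps a file name to the
    list of expected fields that file is missing.
    """
    records, problems = [], {}
    for path in sorted(Path(pages_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        if not isinstance(record, dict):
            # Not a single JSON object, so every field counts as missing.
            problems[path.name] = list(EXPECTED_FIELDS)
            continue
        missing = [name for name in EXPECTED_FIELDS if name not in record]
        if missing:
            problems[path.name] = missing
        else:
            records.append(record)
    return records, problems
```

Calling `load_wikipedia_pages("wikipedia/pages")` before running the matcher would surface incomplete pages early instead of failing inside a prompt.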

#### Wikipedia-Generated Capability Matcher

Matches Wikipedia capabilities with generated capabilities using LLM-based similarity analysis. Supports bidirectional matching.

Configure `wikipedia/cfg/wiki_vs_generated.yaml`:
- `data_cfg.wikipedia_pages_dir`: Wikipedia pages directory
- `data_cfg.generated_dir`: Generated capabilities directory
- `processing_cfg.match_direction`: `generated_to_wikipedia` or `wikipedia_to_generated`

```bash
cd wikipedia
python wiki_vs_generated.py
```
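
One way to read the two `match_direction` values: the first collection named supplies the items to be matched, and the second supplies the candidate list shown to the LLM. The sketch below encodes that reading. `plan_matching` is a hypothetical helper, not a function from `wiki_vs_generated.py`.

```python
VALID_DIRECTIONS = ("generated_to_wikipedia", "wikipedia_to_generated")


def plan_matching(match_direction, wikipedia_caps, generated_caps):
    """Return (queries, candidates) for the configured direction.

    Each query is matched against the full candidate list, mirroring the
    single-capability-plus-list shape of the prompts in prompts.py.
    """
    if match_direction == "generated_to_wikipedia":
        return generated_caps, wikipedia_caps
    if match_direction == "wikipedia_to_generated":
        return wikipedia_caps, generated_caps
    raise ValueError(
        f"match_direction must be one of {VALID_DIRECTIONS}, got {match_direction!r}"
    )
```

Failing fast on an unknown direction avoids a long LLM run that silently matches in the wrong direction.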

#### Dataset Question Categorizer

Categorizes questions from GSM8K or MATH datasets into mathematical areas using generated or Wikipedia taxonomies. Supports checkpoint-based resume.

Configure `wikipedia/cfg/static_vs_generated.yaml`:
- `data_cfg.dataset_name`: `gsm8k` or `math`
- `data_cfg.dataset_path`: Dataset file (GSM8K) or directory (MATH)
- `categorization_cfg.extraction_method`: `generated` or `wikipedia`

```bash
cd wikipedia
python static_vs_generated.py
```
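
Checkpoint-based resume can be sketched as follows, under several assumptions: results live in one JSON file keyed by a string question ID, and `categorize` is any callable that maps question text to an area name. The function name and file layout are hypothetical. Only the idea of saving every N questions comes from the configuration (`save_every_n`).

```python
import json
from pathlib import Path


def categorize_with_checkpoints(questions, categorize, checkpoint_path, save_every_n=20):
    """Categorize questions, saving progress so an interrupted run can resume.

    questions maps a string ID to question text. IDs must be strings
    because they become JSON object keys in the checkpoint file.
    """
    checkpoint_path = Path(checkpoint_path)
    results = {}
    if checkpoint_path.exists():
        # Resume: reuse everything a previous run already categorized.
        results = json.loads(checkpoint_path.read_text(encoding="utf-8"))

    processed_since_save = 0
    for question_id, text in questions.items():
        if question_id in results:
            continue  # handled in an earlier run
        results[question_id] = categorize(text)
        processed_since_save += 1
        if processed_since_save >= save_every_n:
            checkpoint_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
            processed_since_save = 0

    # Final save covers the tail that did not fill a whole batch.
    checkpoint_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

With this shape, at most `save_every_n - 1` LLM calls are repeated after a crash.
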
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# Configuration for question categorization using generated capabilities extraction
data_cfg:
  # Path to the generated capabilities directory containing capabilities
  generated_dir: /projects/DeepLesion/projects/automated_capability_evaluation/artifacts/capabilities_gpt-claude-math/math

  # Dataset selection
  # Supported dataset_name values: "gsm8k", "math"
  dataset_name: gsm8k
  # For gsm8k: path to combined JSONL; For math: root directory with JSON files (recursive)
  dataset_path: /projects/DeepLesion/projects/automated_capability_evaluation/static_datasets/math/gsm8k-main/test.jsonl

  # Path to the Wikipedia pages directory (not used in generated mode)
  wikipedia_dir: /projects/DeepLesion/projects/automated_capability_evaluation/wikipedia/pages

categorization_cfg:
  # Method to use for extracting areas and capabilities
  # Options: "generated" (extract from capability.json files) or "wikipedia" (use predefined Wikipedia categorization)
  extraction_method: "generated"

llm_cfg:
  # LLM model name for categorization
  # model_name: "Qwen2.5-14B-Instruct"
  model_name: "Qwen2.5-7B-Instruct"
  # LLM model provider
  model_provider: "local"


output_cfg:
  # Directory to save the categorization results
  results_dir: /projects/DeepLesion/projects/automated_capability_evaluation/results/GSM8K
  # Name of the output file
  output_filename: gsm8k_vs_generated.json

processing_cfg:
  # Save checkpoint every N questions
  save_every_n: 20

defaults:
  - _self_

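The `dataset_path` comment implies two loading strategies. A minimal sketch of that branch is shown below. `load_questions` is a hypothetical helper, and the field names are assumptions based on the public releases of the datasets (`question` for GSM8K JSONL records, `problem` for MATH JSON files), not on code in this commit.

```python
import json
from pathlib import Path


def load_questions(dataset_name, dataset_path):
    """Return a list of question strings for the configured dataset."""
    path = Path(dataset_path)
    if dataset_name == "gsm8k":
        # One JSON object per line; blank lines are skipped.
        with path.open(encoding="utf-8") as handle:
            return [json.loads(line)["question"] for line in handle if line.strip()]
    if dataset_name == "math":
        # One JSON file per problem, searched recursively under the root.
        return [
            json.loads(file.read_text(encoding="utf-8"))["problem"]
            for file in sorted(path.rglob("*.json"))
        ]
    raise ValueError(f"Unsupported dataset_name: {dataset_name!r}")
```

Sorting the MATH files keeps the question order stable between runs, which matters if checkpoints are keyed by position.
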
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Configuration for Wikipedia-Generated Matcher
# This configuration matches Wikipedia capabilities with generated capabilities using a local LLM (see llm_cfg.model_name)

data_cfg:
  # Path to the Wikipedia pages directory containing .json files
  wikipedia_pages_dir: /projects/DeepLesion/projects/automated_capability_evaluation/wikipedia/pages

  # Path to the generated capabilities directory containing capability.json files
  generated_dir: /projects/DeepLesion/projects/automated_capability_evaluation/artifacts/capabilities_gpt-claude-math/math

llm_cfg:
  # LLM model name for similarity checking
  # model_name: "Qwen2.5-14B-Instruct"
  model_name: "Qwen2.5-7B-Instruct"
  # LLM model provider (local for vec-inf)
  model_provider: "local"
  qos: "a100_arashaf"
  partition: "a100"
  # Time limit for local LLM (format: HH:MM:SS)
  time: "72:00:00"

output_cfg:
  # Directory to save the matching results
  results_dir: /projects/DeepLesion/projects/automated_capability_evaluation/wiki/results
  # Name of the output JSON file
  output_filename: wikipedia_generated_matching_results.json

processing_cfg:
  # Logging level for the matching process
  log_level: "info"
  # Whether to save intermediate results
  save_intermediate: false
  # Matching direction: 'generated_to_wikipedia' or 'wikipedia_to_generated'
  match_direction: generated_to_wikipedia

defaults:
  - _self_
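
The `time` value uses `HH:MM:SS` with an hour count above 23, so clock-time parsers such as `datetime.strptime` with `%H` would reject it. If the limit ever needs to be validated or converted, a small parser along these lines may be enough. `parse_time_limit` is a hypothetical helper, and whether the project parses this field at all is not stated in the commit.

```python
def parse_time_limit(value):
    """Convert an 'HH:MM:SS' duration (hours may exceed 23) to seconds."""
    parts = value.split(":")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"Expected HH:MM:SS, got {value!r}")
    hours, minutes, seconds = (int(part) for part in parts)
    if minutes > 59 or seconds > 59:
        raise ValueError(f"Minutes and seconds must be below 60, got {value!r}")
    return hours * 3600 + minutes * 60 + seconds
```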

wikipedia/prompts.py

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
"""Centralized prompts for all Wikipedia-related scripts."""


# System prompts
SYSTEM_PROMPT_MATH_CAPABILITIES = "You are an expert in mathematical capabilities."
SYSTEM_PROMPT_MATH_TAXONOMIST = (
    "You are an expert mathematical taxonomist. "
    "Your task is to map a single math problem to the most appropriate area from a provided list. "
    "Output must be EXACTLY one of the given area names, or 'none' if no reasonable match exists. "
    "Do not include explanations or extra words."
)
SYSTEM_PROMPT_CAPABILITY_EVALUATION = """You are an expert in mathematics and capability evaluation. Your task is to create concise, informative summaries of mathematical concepts and capabilities.

Given a detailed description of a mathematical concept or capability, provide a clear, concise summary that captures the essential meaning and scope. The summary should be:
- Informative and accurate
- Concise ONLY ONE SENTENCE
- Written in clear, accessible language
- Focused on the core concept and its applications

Examples of good summaries:
- "Capability focusing on field theory, including solving problems related to field extensions, minimal polynomials, and degrees of extensions."
- "Capability that involves solving problems in ring theory including identification of ring properties and operations, testing the structure of rings."
- "Capability that asks the model to simplify algebraic expressions by reducing them to their simplest form. Involves collecting like terms and basic algebraic manipulations."

Provide only the summary, without any additional commentary or formatting."""
SYSTEM_PROMPT_CATEGORIZATION = """You are an expert in mathematics and capability evaluation. Your task is to categorize mathematical concepts and capabilities into one of 10 predefined mathematical areas.

Given a description of a mathematical concept or capability, determine which of the following 10 mathematical areas it best belongs to:

1. Algebra and Functions
2. Arithmetic and Number Theory
3. Calculus and Analysis
4. Differential Equations and Dynamical Systems
5. Discrete Mathematics and Combinatorics
6. Geometry and Spatial Reasoning
7. Linear Algebra and Matrix Theory
8. Mathematical Logic and Set Theory
9. Mathematical Modeling and Applications
10. Probability and Statistics

Return ONLY the exact area name from the list above, nothing else."""


# User prompts - functions that generate user prompts
def get_wikipedia_to_generated_prompt(wikipedia_cap_name: str, wikipedia_cap_description: str, capabilities_list: str) -> str:
    """Generate prompt for matching Wikipedia capability to generated capabilities."""
    return f"""You are an expert in mathematical capabilities. Determine which generated capability best matches the given Wikipedia capability.

Wikipedia Capability:
Name: {wikipedia_cap_name}
Description: {wikipedia_cap_description}

Available Generated Capabilities:
{capabilities_list}

Instructions:
- Compare the Wikipedia capability with each available capability.
- Return the exact capability name if ANY of the following is true:
  * The Wikipedia capability and the available capability describe the same concept, OR
  * The Wikipedia capability is a SUBSET/PART of the available capability (i.e., the available capability includes the Wikipedia capability as one of its components or subskills), OR
  * The available capability is a broader superset that clearly contains the Wikipedia capability.
- Prefer the most specific matching capability when multiple candidates qualify.
- Return "none" only if no capability clearly contains or equals the Wikipedia capability.

Answer with only the capability name or "none":"""


def get_generated_to_wikipedia_prompt(generated_cap_name: str, generated_cap_description: str, capabilities_list: str) -> str:
    """Generate prompt for matching generated capability to Wikipedia capabilities."""
    return f"""You are an expert in mathematical capabilities. Find the Wikipedia capability that most closely matches the generated capability.

Generated Capability:
Name: {generated_cap_name}
Description: {generated_cap_description}

Available Wikipedia Capabilities:
{capabilities_list}

Instructions:
- Compare the generated capability with each available Wikipedia capability.
- Return the exact Wikipedia capability name if ANY of the following is true:
  * The generated capability and the Wikipedia capability describe the same concept, OR
  * The generated capability is a SUBSET/PART of the Wikipedia capability (i.e., the Wikipedia capability includes the generated capability as one of its components or subskills), OR
  * The Wikipedia capability is a broader superset that clearly contains the generated capability.
- Prefer the most specific matching capability when multiple candidates qualify.
- Return "none" only if no Wikipedia capability clearly contains or equals the generated capability.

Answer with only the Wikipedia capability name or "none":"""


def get_area_categorization_prompt(area_bullets: str, question: str) -> str:
    """Generate prompt for categorizing a question into a mathematical area."""
    return f"""Available mathematical areas (choose exactly one):
{area_bullets}

Problem:
{question}

Instructions:
- Return ONLY the exact area name from the list above
- Prefer the closest match even if imperfect; avoid 'none' unless clearly unrelated
- Do not add punctuation or extra text

Answer:"""


def get_capability_summary_prompt(description: str) -> str:
    """Generate prompt for summarizing a mathematical capability."""
    return f"Please provide a concise summary of this mathematical concept:\n\n{description}"


def get_capability_categorization_prompt(description: str) -> str:
    """Generate prompt for categorizing a mathematical capability."""
    return f"Please categorize this mathematical concept into one of the 10 areas listed above:\n\n{description}"

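
The categorization prompts ask for exactly one area name, but model replies often arrive with quotes, a trailing period, or an echoed list number. The sketch below shows one way a caller might map a raw reply back onto the allowed list. `normalize_area_reply` is a hypothetical helper that does not appear in this commit. The area names are copied from `SYSTEM_PROMPT_CATEGORIZATION`.

```python
import re

# Area names copied from SYSTEM_PROMPT_CATEGORIZATION.
MATH_AREAS = [
    "Algebra and Functions",
    "Arithmetic and Number Theory",
    "Calculus and Analysis",
    "Differential Equations and Dynamical Systems",
    "Discrete Mathematics and Combinatorics",
    "Geometry and Spatial Reasoning",
    "Linear Algebra and Matrix Theory",
    "Mathematical Logic and Set Theory",
    "Mathematical Modeling and Applications",
    "Probability and Statistics",
]


def normalize_area_reply(reply, allowed_areas=MATH_AREAS):
    """Map a raw model reply onto one of allowed_areas, or 'none'."""
    cleaned = reply.strip()
    # Drop an echoed list number such as "3. " or "3) ".
    cleaned = re.sub(r"^\d+[.)]\s*", "", cleaned)
    # Drop surrounding quotes, backticks, whitespace, and trailing periods.
    cleaned = cleaned.strip(" \t\n\"'`.")
    lookup = {area.casefold(): area for area in allowed_areas}
    return lookup.get(cleaned.casefold(), "none")
```

Anything that does not match an allowed name after cleanup falls back to `"none"`, the same sentinel the prompts use for "no reasonable match".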
