GlotWeb is an advanced web indexing system specifically designed to address the digital resource gap for minority languages. Our system:
- Identifies web content in 402+ languages through multi-source aggregation
- Validates linguistic accuracy using GlotLID language identification
- Filters content to ensure quality while minimizing religious bias
- Compiles 169,155+ verified web links (47% in languages absent from major datasets)
✅ Covers languages missing from FLORES-200, MADLAD-400, and Glot500
✅ Open-source pipeline with reproducible results
✅ Interactive demo showcasing language resources (click the image below to access)
This documentation walks through GlotWeb's 4-step pipeline:
- Search Setup: Configure and run web searches
- Seed Generation: Filter initial results
- Crawling: Expand and validate links
- Cleaning: Deduplicate and finalize outputs
- Follow the steps sequentially (1 → 4)
- Each section includes:
- Purpose explanation
- Configuration options
- Execution commands
- Expected outputs
- Requires basic Python/Docker knowledge
Tip: For quick setup, clone the repository and use the provided configuration files as templates.
Ready to begin? Proceed to Step 1: Set up SearXNG and perform search.
```shell
export PORT=8080
docker pull searxng/searxng
docker run --rm \
  -d -p ${PORT}:8080 \
  -v "${PWD}/searxng:/etc/searxng" \
  -e "BASE_URL=http://localhost:${PORT}/" \
  -e "INSTANCE_NAME=my-instance" \
  searxng/searxng
```
Add JSON output format:
In the ${PWD}/searxng directory there is a file named settings.yml. In that file, enable the JSON output format in the SearXNG configuration under the search.formats key, like this:
```yaml
search:
  formats:
    - html
    - json
```
Modify uwsgi.ini:
In the ${PWD}/searxng directory there is a file named uwsgi.ini. In that file, increase buffer-size. The default is 8k; raising it to 9k sometimes helps with 'Internal Error 500' responses.
Default value:

```ini
buffer-size = 8192
```

Change to:

```ini
buffer-size = 9216
```
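With JSON output enabled, the instance can be queried programmatically. A minimal sketch of building a SearXNG JSON query URL and pulling links out of a response; the helper names here are illustrative, not part of the GlotWeb repository:

```python
from urllib.parse import urlencode, urljoin

def build_search_url(host, query, engines=None):
    """Build a SearXNG JSON-API search URL (assumes format=json is enabled)."""
    params = {"q": query, "format": "json"}
    if engines:
        params["engines"] = ",".join(engines)
    return urljoin(host, "/search") + "?" + urlencode(params)

def extract_links(payload):
    """Pull result URLs out of a decoded SearXNG JSON response."""
    return [r["url"] for r in payload.get("results", []) if "url" in r]

url = build_search_url("http://localhost:8080", "hello", ["bing", "qwant"])
# A real payload would come from an HTTP GET against `url`:
sample = {"results": [{"url": "https://example.org", "title": "Example"}]}
print(extract_links(sample))
```

In practice you would fetch `url` with an HTTP client and pass the decoded JSON body to `extract_links`.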
This is an object-oriented Python script that leverages the Searx API to perform searches and save the results to JSON files. The script is configurable using a YAML configuration file called 'search_config.yaml'.
- Uses SearxSearchWrapper for querying multiple search engines.
- Handles retries for failed requests.
- Configurable search parameters through a YAML file.
- Configurable input file, search range, output directory, and other parameters.
- Automatically saves results in a structured JSON format.
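The retry behaviour can be sketched roughly as follows; this is a simplified stand-in for the script's logic, and the function names are illustrative:

```python
import time

def search_with_retries(do_search, query, max_retries=3, retry_wait_time=2):
    """Retry a flaky search call, waiting between attempts; None on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return do_search(query)
        except Exception as exc:
            if attempt == max_retries:
                print(f"giving up on {query!r}: {exc}")
                return None
            time.sleep(retry_wait_time)

# A stub that fails twice, then succeeds, mimicking a flaky engine:
calls = {"n": 0}
def flaky(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("engine timeout")
    return [{"url": "https://example.org", "query": query}]

print(search_with_retries(flaky, "hello", retry_wait_time=0))
```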
```yaml
searx_host: "http://127.0.0.1:8080"  # Searx instance URL
engines:                             # Search engines to be used
  - "bing"
  - "yahoo"
  - "qwant"
  - "duckduckgo"
num_results: 50                  # Number of results to fetch for each query
max_retries: 3                   # Maximum number of retries for failed requests
retry_wait_time: 2               # Wait time (in seconds) between retries
output_file_prefix: "results"    # Prefix for output file names
output_directory: "search_dump"  # Directory to save output files
input_file: "input.txt"          # Path to the input file containing search queries
start_index: 0                   # Start index for queries to process
end_index: 10                    # End index for queries to process
```
The input file should be a tab-separated file where each line contains an ISO code and a sentence for search:
```
ISO_CODE_1	Search query 1
ISO_CODE_2	Search query 2
aa	Itiyobbiyah agattinoona sittal xayyossa yangalen qaadoodih baari gablusaanamah angicille le.
aai	Baise orot ta'ita'imon matah toniwa'an bar hinanutitiy gewas hinawowab.
aak	O xewanɨŋo na'nɨ re rɨnɨŋɨnigɨnɨ, "A'mɨna' sea'yɨ e imo'nɨŋa' wonɨrɨnɨ."
```
The ISO code in the input file may be in either the two-letter or the three-letter format.
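Reading and slicing that input file might look like this (a sketch assuming tab-separated fields as shown above; the names are illustrative):

```python
def load_queries(lines, start_index=0, end_index=None):
    """Parse 'ISO<TAB>sentence' lines and slice the requested range."""
    queries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        iso, _, sentence = line.partition("\t")
        queries.append((iso, sentence))
    return queries[start_index:end_index]

sample = [
    "aa\tItiyobbiyah agattinoona sittal ...",
    "aai\tBaise orot ...",
]
print(load_queries(sample, 0, 1))
```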
Run the script from the root of the repository:

```shell
python pipeline/search_service.py
```
The search results will be saved in the specified output directory (e.g., search_dump) as JSON files named according to the specified prefix and index range, e.g., results_0-10.json.
You can easily adjust the following parameters in the search_config.yaml file:
- Search engines: Add or remove engines in the engines list.
- Search range: Modify start_index and end_index to control which lines in the input file are processed.
- Output directory: Change output_directory to save results in a different location.
This script filters the web search dump based on domain restrictions, scrapes web pages, and performs language identification with a FastText model, for which we chose GlotLID. The processed data is stored in JSON format, categorized by predicted language.
Ensure you have the following Python packages installed:
```shell
pip install fasttext trafilatura urllib3 tqdm pyyaml
```
Configuration files are already provided in the repository and should be adjusted to your preferences. Example:
```yaml
model_path: "path/to/fasttext/model"
domain_file: "path/to/domain_filter.txt"
json_filename: "path/to/input.json"
iso_list_file: "path/to/iso_list.json"
output_directory: "path/to/output"
```
Execute the script from the repository root:

```shell
python pipeline/language_filter.py
```
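The per-page decision the filter makes can be sketched as below. The real script loads a fasttext model (GlotLID); here the prediction call is stubbed out, and the confidence threshold is the same kind of cut-off used in the Step 3 config:

```python
def classify(predict, text, minimum_confidence=0.7):
    """Return the predicted language code, or None if confidence is too low.

    `predict` stands in for a fasttext-style LID call, which returns labels
    like '__label__bpy_Beng' together with a confidence score.
    """
    label, confidence = predict(text)
    if confidence < minimum_confidence:
        return None
    return label.removeprefix("__label__")

# Stubbed predictions (a real GlotLID call would replace these):
print(classify(lambda t: ("__label__bpy_Beng", 0.93), "..."))
print(classify(lambda t: ("__label__eng_Latn", 0.40), "..."))
```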
This step takes the filtered seed URLs from Step 2 and performs deep crawling to discover additional web pages in the target languages. It includes:
- Web crawling from seed URLs
- Language detection using FastText (GlotLID)
- Domain filtering
- Parallel processing for efficiency
- Comprehensive logging and metadata collection
Ensure you have the following Python packages installed:
```shell
pip install fasttext beautifulsoup4 requests trafilatura tqdm pyyaml urllib3
```
The script uses config.yaml with these key parameters:
```yaml
seed_crawler:
  max_pages: 100                 # Maximum pages to crawl per language
  max_time: 3600                 # Maximum crawling time in seconds
  crawl_delay: 1                 # Delay between requests
  to_visit_growth_factor: 50     # Threshold for detecting circular links
  max_workers: 4                 # Threads for parallel processing

url_settings:
  request_timeout: 10            # Timeout for web requests
  max_url_length: 65000          # Maximum URL length to consider

language_detector:
  model_path: "path/to/model"    # Path to FastText model
  minimum_confidence: 0.7        # Minimum language confidence score
  desired_language: "bpy_Beng"   # Target language code
  save_text: False               # Whether to save scraped text

output:
  directory: "output"            # Output directory
  output_file_name: "{language}_filtered.json"  # Output filename pattern

batch_processing:
  enabled: False                 # Enable batch mode
  input_labels: []               # List of language codes for batch mode
  cooldown_between_languages: 60 # Cool-down between languages (seconds)
```
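Stripped of threading, language detection, and logging, the core crawl loop is essentially a bounded breadth-first search. A minimal sketch, in which fetch_links stands in for the real page fetcher/parser:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, fetch_links, max_pages=100, blocked_domains=frozenset()):
    """Bounded BFS from seed URLs, skipping blocked domains."""
    to_visit, visited = deque(seeds), []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited or urlparse(url).netloc in blocked_domains:
            continue
        visited.append(url)
        for link in fetch_links(url):   # real code: HTTP GET + HTML parsing
            if link not in visited:
                to_visit.append(link)
    return visited

# An in-memory link graph standing in for the web:
graph = {
    "https://a.example/": ["https://a.example/p1", "https://spam.example/"],
    "https://a.example/p1": ["https://a.example/"],
    "https://spam.example/": [],
}
print(crawl(["https://a.example/"], lambda u: graph.get(u, []),
            blocked_domains={"spam.example"}))
```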
Input JSON files from Step 2 (named as [LANGUAGECODE_SCRIPT].json)
Each JSON file should contain entries with:
- link: URL string
- lid_confidence: confidence score (float)
- predicted_lid: language code
For each processed language, the script generates:
[LANGUAGE_CODE]_filtered.json - Filtered URLs with metadata
meta_data/[LANGUAGE_CODE]_meta_data.json - Crawling statistics including:
Seed URLs used
All discovered links
Filtered links
Unique new links
Rejected links
Single Language Processing
```shell
python pipeline/seed_crawler.py
```
Configure desired_language in config.yaml first.
Enable batch mode in config.yaml:
```yaml
batch_processing:
  enabled: True
  input_labels: ["syl_Sylo", "bpy_Beng", "akh_Latn"]  # Your target languages
```
Then run the batch variant:

```shell
python pipeline/seed_crawler_beta.py
```
- Crawling behavior: adjust max_pages and max_time to control crawling scope; modify crawl_delay to crawl more or less aggressively
- Language Detection: Change minimum_confidence for stricter/looser filtering
- Set save_text: True to store scraped content
- Performance: increase max_workers for faster processing (requires more CPU); adjust cooldown_between_languages for batch processing
- Change output directory and filename patterns
- Metadata collection is always enabled
- The script automatically skips domains listed in your domain filter file
- Progress bars are enabled by default (can be disabled in config)
- Comprehensive logging helps troubleshoot issues
This script performs final domain filtering on crawled results to exclude unwanted domains from both the main output and metadata files.
- Loads crawled data and metadata JSON files
- Applies domain filtering using the configured domain blocklist
- Updates all metadata statistics after filtering
- Handles both single-language and batch processing modes
- Ensures final outputs comply with domain restrictions
- Maintains consistency between data files and their metadata
- Prepares clean data for subsequent deduplication steps
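The filtering itself reduces to a membership test on each link's domain; a sketch with illustrative names:

```python
from urllib.parse import urlparse

def filter_links(entries, blocked_domains):
    """Drop entries whose URL's host appears in the domain blocklist."""
    return [e for e in entries
            if urlparse(e["link"]).netloc not in blocked_domains]

entries = [
    {"link": "https://good.example/page", "predicted_lid": "bpy_Beng"},
    {"link": "https://bad.example/page", "predicted_lid": "bpy_Beng"},
]
print(filter_links(entries, {"bad.example"}))
```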
Configure the domain_file path in config.yaml and run:

```shell
python result_filtering/final_domain_filter.py
```
Uses these key config parameters:
```yaml
domain_file: "path/to/domain_filter.txt"  # List of domains to exclude
output:
  directory: "output"                     # Where to find/save files
```
Updates both:
- [LANGUAGE]_filtered.json - with domain-filtered results
- meta_data/[LANGUAGE]_meta_data.json - with filtered statistics
Transforms crawled language data into a structured format suitable for GlotWeb visualization, enriching it with metadata and linguistic information.
- Extracts language metadata (speaker counts, language family)
- Checks inclusion in major multilingual datasets (MADLAD-400, Flores, Glot500)
- Organizes URLs by domain with site categorization
- Handles both single-language and batch processing
- Creates standardized format for GlotWeb frontend
- Enriches raw data with valuable linguistic metadata
- Provides domain-level organization of web resources
- Generates compatibility flags for popular multilingual datasets
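The domain-level organization can be sketched as follows (the output shape is illustrative, not the exact GlotWeb schema):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(urls):
    """Organize a flat URL list into {domain: [urls]}."""
    sites = defaultdict(list)
    for url in urls:
        sites[urlparse(url).netloc].append(url)
    return dict(sites)

urls = ["https://a.example/1", "https://a.example/2", "https://b.example/"]
print(group_by_domain(urls))
```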
```yaml
output:
  formated_directory: "formatted_output"           # Output directory
  formated_file_name: "{language}_formatted.json"  # Output filename pattern
```
```shell
python result_filtering/format_for_glotweb.py
```
Filters out domains that explicitly block Common Crawl's CCBot in their robots.txt file, ensuring compliance with website crawling policies.
- Checks each domain's robots.txt for CCBot restrictions
- Removes entire domains if they block CCBot
- Preserves all other metadata while filtering
- Handles both single-language and batch processing
- Ensures ethical web scraping compliance
- Prevents potential legal issues
- Maintains good web citizenship by respecting robots.txt
- Filters before final dataset compilation
```yaml
output:
  formated_directory: "formatted_output"  # Input directory (from Step 4.2)
  cleaned_directory: "cleaned_output"     # Output directory for filtered data
```
```shell
python result_filtering/robots_compliance_filter.py
```
- Loads formatted JSON from Step 4.2
- For each domain:
- Fetches robots.txt
- Checks for CCBot restrictions
- Saves cleaned version with compliant domains only
- Maintains same structure as input
- Only contains domains that allow CCBot
- Saved as [LANGUAGE].json in cleaned directory
- If robots.txt is inaccessible, assumes crawling is allowed
- Only checks for explicit CCBot blocks (not general User-agent: *)
- Processes domains sequentially with 5-second timeout
- Preserves all non-URL metadata (speaker counts, language family etc.)
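A CCBot check can be written with the standard library's robots.txt parser. Note one difference from the behaviour described above: robotparser also applies general User-agent: * rules to CCBot, whereas the script only reacts to explicit CCBot blocks. Fetching robots.txt is omitted here:

```python
from urllib.robotparser import RobotFileParser

def ccbot_allowed(robots_txt, url):
    """True if the given robots.txt body does not block CCBot for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)

blocking = "User-agent: CCBot\nDisallow: /\n"
permissive = "User-agent: *\nAllow: /\n"
print(ccbot_allowed(blocking, "https://example.org/page"))    # CCBot blocked
print(ccbot_allowed(permissive, "https://example.org/page"))  # crawling allowed
```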
Performs final data cleaning through URL normalization and deduplication to create a polished dataset.
- HTTP/HTTPS Merging (http_merge_2.py):
  - Combines duplicate sites with different protocols (http/https)
  - Standardizes www/non-www variants
  - Preserves all unique links
- Hash Fragment Removal (remove_all_hash.py):
  - Removes URL fragments (#section)
  - Deduplicates URLs that differ only by fragment
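Both cleaning steps boil down to URL normalization. A sketch assuming https wins over http, the www. prefix is dropped, and fragments are removed while query strings are kept:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonical key: https scheme, no 'www.' prefix, no fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if netloc.startswith("www."):
        netloc = netloc[4:]
    return urlunsplit(("https", netloc, path, query, ""))

def deduplicate(urls):
    """Order-preserving dedup on the normalized form."""
    seen, out = set(), []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

urls = [
    "http://www.example.org/page?x=1#top",
    "https://example.org/page?x=1",
    "https://example.org/other#s2",
]
print(deduplicate(urls))
```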
```yaml
output:
  robots_filtered: "output/robots_filtered"  # Input from Step 4.3
  http_merged: "output/http_merged"          # Intermediate output
  deduplication: "output/deduplication"      # Final output
```
```shell
# Run protocol merging first
python result_filtering/http_merge_2.py
# Then run hash removal
python result_filtering/remove_all_hash.py
```
- Protocol-agnostic site merging
- Consistent URL normalization
- Fragment removal while preserving query parameters
- Order-preserving deduplication
Cleaned JSON files with:
- Unified site entries
- Normalized URLs
- No duplicate content
✅ Processed datasets in output/cleaned_output/[LANG].json containing:
- Verified web links
- Language metadata (speakers, family)
- Domain categorization
- Compatibility flags (FLORES/MADLAD/Glot500)
✅ Metadata reports in output/meta_data/ with:
- Crawling statistics
- Domain distributions
- Filtering metrics
We urgently need native speakers and linguists to validate results:
- Explore your language in the GlotWeb Demo
- Check 10-20 random links for:
- Actual language content (not machine translation)
- Cultural/educational value
- Flag as religious if content is from religious scriptures
- Correct language/dialect labeling
- Report issues via:
- GitHub Issues
- Email: [email protected]
| Impact Area | Community Role |
|---|---|
| Data Quality | Remove spam/misclassified content |
| Language Preservation | Identify valuable resources |
| NLP Development | Improve training data for LLMs |
- Speakers: join us in language auditing.
- Researchers: use the data with citation (BibTeX forthcoming).