An automated SEO internal linking assistant that crawls a domain, extracts readable content, computes TF-IDF + SBERT hybrid similarity, and recommends high-quality internal links with suggested anchor texts. Runs as an interactive Streamlit app with progress feedback, adjustable thresholds, and CSV export.
Live Demo: https://huggingface.co/spaces/joshi-deepak08/Seo-internal-linker
- Overview
- Features
- Folder Structure
- How to Run Locally
- Architecture & Design Decisions
- Approach
- Pipeline Design
- Challenges & Trade-Offs
## Overview

This project is an Internal Link Suggestion Tool for SEO teams, content writers, and technical SEOs.
Given:

- a Domain (root URL), e.g. `https://www.example.com`
- a Test URL on that same domain, e.g. a blog post you want to optimize
…the tool:
- Crawls the domain (using sitemaps + internal links)
- Extracts the main content from each page using readability heuristics
- Filters pages by language & content length
- Builds TF-IDF and SBERT (SentenceTransformer MiniLM) representations
- Computes a hybrid similarity score between the Test URL and all other pages
- Extracts keyphrases from the Test URL and chooses anchor → target pairs
- Outputs a ranked table of internal link suggestions with:
  - suggested anchor phrase
  - target URL
  - similarity scores (TF-IDF, SBERT, combined)
  - explanation / reason
All of this is wrapped in a Streamlit dashboard with interactive controls and CSV download.
## Features

- Full-domain crawling via sitemaps + BFS internal link discovery
- Dual similarity engine: TF-IDF (lexical) + SBERT (semantic)
- Readability-based content extraction using `readability` + BeautifulSoup
- Language filtering using `langdetect`
- Configurable thresholds & limits:
  - max pages to crawl
  - minimum content length
  - min TF-IDF & SBERT similarity
  - number of suggestions per source page
- Keyphrase-based anchor selection from the Test URL
- Explainable output:
  - anchor text
  - target URL
  - similarity scores
  - reasoning text
- CSV export of internal link suggestions
- Progress bars for crawling, extraction, and embedding steps
- Hugging Face Space deployment (Dockerized environment)
## Folder Structure

```
SEO-interlinking-tool/
│
├── app.py / main.py     # Streamlit app with full pipeline (UI + logic)
├── requirements.txt     # Python dependencies
├── README.md            # Project documentation
└── assets/ (optional)   # Screenshots such as seo_interlinking.png
```
In the Hugging Face Space, `app.py` is the Streamlit entrypoint and contains the main app logic.
## How to Run Locally

```bash
git clone https://github.com/JoshiDeepak08/SEO-interlinking-tool.git
cd SEO-interlinking-tool

python -m venv venv
source venv/bin/activate   # macOS / Linux
# or
venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

If your main file is `app.py`:

```bash
streamlit run app.py
```

(If the main file is named `main.py`, change the command accordingly.)

The app will start at `http://localhost:8501`.
## Architecture & Design Decisions

Core components:

- Streamlit for the UI → quick iteration, sliders, expanders, progress bars, CSV download
- Requests + BeautifulSoup + Readability for robust web crawling & content extraction
- langdetect to filter pages by language
- scikit-learn `TfidfVectorizer` / `CountVectorizer` for TF-IDF representations
- SentenceTransformers / SBERT (`all-MiniLM-L6-v2`) for semantic embeddings
- Hybrid scoring function: `score = α * TFIDF + (1 - α) * SBERT`
- NumPy + pandas for similarity computation and tabular output
- A retrying HTTP session for resilient crawling (with backoff & error handling)
## Approach

### Why a hybrid similarity score?

- TF-IDF captures exact keyword matches and frequency
- SBERT captures semantic similarity beyond exact words
- Combining them via a tunable `ALPHA` parameter gives you:
  - control over "strict keyword" matching vs "semantic looseness"
  - better SEO relevance than purely semantic matching
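A minimal sketch of the hybrid scoring step, assuming the documents are already cleaned; the function name and `ALPHA` default are illustrative, not the exact app code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

ALPHA = 0.5  # weight between lexical (TF-IDF) and semantic (SBERT) similarity

def hybrid_scores(test_doc: str, corpus: list[str], alpha: float = ALPHA) -> np.ndarray:
    # Lexical similarity: TF-IDF cosine between the test page and every corpus page.
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform([test_doc] + corpus)
    sim_tfidf = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    # Semantic similarity: SBERT embeddings, cosine via normalized dot product.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([test_doc] + corpus, normalize_embeddings=True)
    sim_bert = emb[1:] @ emb[0]

    # Hybrid score: alpha * TFIDF + (1 - alpha) * SBERT, matching the formula above.
    return alpha * sim_tfidf + (1 - alpha) * sim_bert
```

With `alpha` near 1 the tool behaves like a strict keyword matcher; near 0 it becomes a purely semantic recommender.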
### Readability-based content extraction

Instead of naive raw-HTML scraping, content is cleaned to:

- reduce navigation/boilerplate noise
- focus on body text, headings, and meaningful content
- improve the quality of embeddings and similarity scores
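A sketch of this extraction step, assuming the `readability-lxml` package; the helper name and return shape are illustrative:

```python
import requests
from bs4 import BeautifulSoup
from readability import Document  # readability-lxml

def extract_main_text(url: str) -> tuple[str, str]:
    html = requests.get(url, timeout=10).text
    doc = Document(html)          # scores DOM nodes, keeps the "article" part
    summary_html = doc.summary()  # cleaned HTML of the main content block
    soup = BeautifulSoup(summary_html, "html.parser")
    # Drop remaining tags; keep readable body text only.
    text = soup.get_text(separator=" ", strip=True)
    return doc.title(), text
```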
## Pipeline Design

1. Input: the user provides:
   - Domain (root URL)
   - Test URL (page to optimize)
2. Crawling strategy (see the crawling sketch after this list):
   - Try sitemaps first (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
   - If weak or missing, fall back to BFS crawling via `<a>` tags
3. Filtering:
   - Keep only same-site URLs (optionally allow subdomains)
   - Filter by language (e.g., only `en`)
   - Filter by minimum content length
4. Model building:
   - Build TF-IDF / CountVectorizer representations of the cleaned docs
   - Encode documents with SBERT (`all-MiniLM-L6-v2`)
5. Source page representation:
   - Use the Test URL's full extracted text
   - Compute its TF-IDF vector + SBERT embedding
6. Keyphrase extraction (a sketch follows the flowchart below):
   - Extract token n-grams from the Test URL that appear in the TF-IDF vocabulary
   - Rank by frequency and filter out stopwords/boring phrases
7. Similarity scoring + anchor selection:
   - Compute TF-IDF cosine similarity + SBERT cosine similarity
   - Combine them into the hybrid score
   - Remove the Test URL itself from the candidates
   - Enforce the similarity thresholds
   - Pick the best-matching anchor phrase for each candidate target
8. Output:
   - Produce a ranked list with anchors, target URLs, and explanations
   - Display it in a Streamlit dataframe + allow CSV download
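A sketch of the sitemap-first discovery with BFS fallback (step 2); helper name and limits are illustrative, and the XML parse assumes `lxml` is installed:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def discover_urls(root: str, max_pages: int = 200) -> list[str]:
    host = urlparse(root).netloc
    # 1) Try common sitemap locations first.
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        resp = requests.get(urljoin(root, path), timeout=10)
        if resp.ok and "<loc>" in resp.text:
            soup = BeautifulSoup(resp.text, "xml")
            urls = [loc.text.strip() for loc in soup.find_all("loc")]
            urls = [u for u in urls if urlparse(u).netloc == host]
            if urls:
                return urls[:max_pages]
    # 2) Fallback: BFS over internal <a> links starting from the root.
    seen, queue = {root}, deque([root])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urljoin(page, a["href"]).split("#")[0]
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return list(seen)
```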
```mermaid
flowchart TD
    A[User Inputs: Domain and Test URL] --> B[HTTP Session with Retry Logic]
    B --> C[Crawl Domain: Sitemaps and Internal Links]
    C --> D[Fetch HTML Pages]
    D --> E[Readability Extraction: Title, Headings, Meta, Text]
    E --> F[Language Filter and Length Filter]
    F --> G[Document Corpus Built]
    G --> H[TF-IDF or Count Vectorizer]
    G --> I[SBERT Embeddings]
    H --> J[TF-IDF Similarity Scores]
    I --> K[SBERT Similarity Scores]
    J --> L[Hybrid Scoring with ALPHA Weight]
    K --> L
    L --> M[Keyphrase Extraction from Test URL]
    M --> N[Anchor Phrase Selection]
    N --> O[Final Suggestions Table]
    O --> P[Streamlit UI Output with CSV Export]
```
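To make step 6 concrete, here is one way the keyphrase extraction could look: pull 1–3-word n-grams from the Test URL's text, keep only those that also appear in the corpus-wide TF-IDF vocabulary, and rank by frequency. The function name, frequency cutoff, and n-gram range are illustrative assumptions, not the exact app code:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def candidate_anchors(test_text: str, vocab: set[str], top_k: int = 20) -> list[str]:
    # Tokenize the test page into 1-3-gram counts with English stopwords removed.
    cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    counts = cv.fit_transform([test_text])
    freq = Counter(dict(zip(cv.get_feature_names_out(), counts.toarray().ravel())))
    # Keep phrases that also exist in the corpus TF-IDF vocabulary and repeat at least once.
    ranked = [(p, n) for p, n in freq.most_common() if p in vocab and n > 1]
    return [p for p, _ in ranked[:top_k]]
```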
## Challenges & Trade-Offs

Crawl coverage:

- Some sites have perfect sitemaps → fast discovery
- Others need BFS crawling → slower & noisier

➡ Trade-off: hybrid approach (try sitemaps first, fall back to crawling).
Content filtering:

- Very aggressive filtering might discard useful pages
- Too loose filtering includes thin/irrelevant content

➡ Exposed `MIN_LEN` and the language whitelist as UI settings.
Similarity thresholds:

- High thresholds → only very strong matches; risk of no suggestions
- Low thresholds → many weak / irrelevant links

➡ User-controllable sliders for `MIN_SIM_TFIDF`, `MIN_SIM_BERT`, and `ALPHA` (the TF-IDF vs SBERT weight), as sketched below.
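A sketch of how such thresholds can be exposed in the Streamlit sidebar; labels and default values are illustrative:

```python
import streamlit as st

# Sidebar sliders let users trade precision against recall at runtime.
min_sim_tfidf = st.sidebar.slider("MIN_SIM_TFIDF", 0.0, 1.0, 0.15, step=0.01)
min_sim_bert = st.sidebar.slider("MIN_SIM_BERT", 0.0, 1.0, 0.35, step=0.01)
alpha = st.sidebar.slider("ALPHA (TF-IDF vs SBERT weight)", 0.0, 1.0, 0.5, step=0.05)
```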
Embedding performance:

- SBERT embeddings can be heavy on CPU
- A GPU (if available) gives a big speed boost

➡ Dynamic device detection (`cuda` if available, else `cpu`) and batch processing, as sketched below.
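A minimal sketch of the device-detection and batching idea; the batch size and sample corpus are illustrative:

```python
import torch
from sentence_transformers import SentenceTransformer

docs = ["cleaned text of page one...", "cleaned text of page two..."]  # corpus texts

# Pick the best available device at runtime.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Batched encoding bounds memory on CPU and keeps the GPU busy when present.
embeddings = model.encode(docs, batch_size=32, show_progress_bar=True,
                          normalize_embeddings=True)
```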
Anchor diversity:

- The same phrase should not be used for many URLs

➡ The tool tracks used anchors & targets to maintain diversity; a sketch of the idea follows.
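One way to implement this de-duplication: greedily walk candidates in score order and skip anchors or targets already used. Function and field names here are illustrative assumptions:

```python
def diversify(candidates: list[dict], per_anchor_limit: int = 1) -> list[dict]:
    used_anchors: dict[str, int] = {}
    used_targets: set[str] = set()
    picked = []
    # Highest-scoring suggestions get first claim on an anchor/target pair.
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if c["target"] in used_targets:
            continue  # each target page receives at most one suggestion
        if used_anchors.get(c["anchor"], 0) >= per_anchor_limit:
            continue  # this phrase already links somewhere else
        used_anchors[c["anchor"]] = used_anchors.get(c["anchor"], 0) + 1
        used_targets.add(c["target"])
        picked.append(c)
    return picked
```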