An automated SEO internal linking assistant that crawls a domain, extracts readable content, computes TF-IDF + SBERT hybrid similarity, and recommends high-quality internal links with suggested anchor texts. Runs as an interactive Streamlit app with progress feedback, adjustable thresholds, and CSV export.
Live Demo: https://huggingface.co/spaces/joshi-deepak08/Seo-internal-linker
- Overview
- Features
- Folder Structure
- How to Run Locally
- Architecture & Design Decisions
- Approach
- Pipeline Design
- Challenges & Trade-Offs
## Overview

This project is an Internal Link Suggestion Tool for SEO teams, content writers, and technical SEOs.
Given:

- a Domain (root URL), e.g. `https://www.example.com`
- a Test URL on that same domain, e.g. a blog post you want to optimize
…the tool:
- Crawls the domain (using sitemaps + internal links)
- Extracts the main content from each page using readability heuristics
- Filters pages by language & content length
- Builds TF-IDF and SBERT (SentenceTransformer MiniLM) representations
- Computes a hybrid similarity score between the Test URL and all other pages
- Extracts keyphrases from the Test URL and chooses anchor → target pairs
- Outputs a ranked table of internal link suggestions with:
  - suggested anchor phrase
  - target URL
  - similarity scores (TF-IDF, SBERT, combined)
  - explanation / reason
All of this is wrapped in a Streamlit dashboard with interactive controls and CSV download.
## Features

- Full-domain crawling via sitemaps + BFS internal link discovery
- Dual similarity engine: TF-IDF (lexical) + SBERT (semantic)
- Readability-based content extraction using `readability` + BeautifulSoup
- Language filtering using `langdetect`
- Configurable thresholds & limits:
  - max pages to crawl
  - minimum content length
  - min TF-IDF & SBERT similarity
  - number of suggestions per source page
- Keyphrase-based anchor selection from the Test URL
- Explainable output:
  - anchor text
  - target URL
  - similarity scores
  - reasoning text
- CSV export of internal link suggestions
- Progress bars for crawling, extraction, and embedding steps
- Hugging Face Space deployment (Dockerized environment)
## Folder Structure

```
SEO-interlinking-tool/
│
├── app.py / main.py     # Streamlit app with full pipeline (UI + logic)
├── requirements.txt     # Python dependencies
├── README.md            # Project documentation
└── assets/ (optional)   # Screenshots such as seo_interlinking.png
```
In the Hugging Face Space, `app.py` is the Streamlit entrypoint and contains the main app logic.
## How to Run Locally

```bash
git clone https://github.com/JoshiDeepak08/SEO-interlinking-tool.git
cd SEO-interlinking-tool

python -m venv venv
source venv/bin/activate   # macOS / Linux
# or
venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

If your main file is `app.py`:

```bash
streamlit run app.py
```

(If the main file is named `main.py`, change the command accordingly.)

The app will start at `http://localhost:8501`.
## Architecture & Design Decisions

Core components:

- Streamlit for the UI → quick iteration, sliders, expanders, progress bars, CSV download
- Requests + BeautifulSoup + Readability for robust web crawling & content extraction
- langdetect to filter pages by language
- scikit-learn `TfidfVectorizer` / `CountVectorizer` for TF-IDF representations
- SentenceTransformers / SBERT (`all-MiniLM-L6-v2`) for semantic embeddings
- Hybrid scoring function: `score = α * TFIDF + (1 - α) * SBERT`
- NumPy + pandas for similarity computation and tabular output
- A retrying HTTP session for resilient crawling (with backoff & error handling)
## Approach

### Why a hybrid similarity score?

- TF-IDF captures exact keyword matches and frequency
- SBERT captures semantic similarity beyond exact words
- Combining them via a tunable `ALPHA` parameter gives you:
  - control over "strict keyword" matching vs "semantic looseness"
  - better SEO relevance than purely semantic matching
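A minimal sketch of the hybrid scoring step, assuming the documents are already cleaned; the function name and `ALPHA` default are illustrative, not the exact app code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

ALPHA = 0.5  # weight between lexical (TF-IDF) and semantic (SBERT) similarity

def hybrid_scores(test_doc: str, corpus: list[str], alpha: float = ALPHA) -> np.ndarray:
    # Lexical similarity: TF-IDF cosine between the test page and every corpus page.
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform([test_doc] + corpus)
    sim_tfidf = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    # Semantic similarity: SBERT embeddings, cosine via normalized dot product.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([test_doc] + corpus, normalize_embeddings=True)
    sim_bert = emb[1:] @ emb[0]

    # Hybrid score: alpha * TFIDF + (1 - alpha) * SBERT, matching the formula above.
    return alpha * sim_tfidf + (1 - alpha) * sim_bert
```

With `alpha` near 1 the tool behaves like a strict keyword matcher; near 0 it becomes a purely semantic recommender.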
### Readability-based content extraction

Instead of naive raw-HTML scraping, content is cleaned to:

- reduce navigation/boilerplate noise
- focus on body text, headings, and meaningful content
- improve the quality of embeddings and similarity scores
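A sketch of this extraction step, assuming the `readability-lxml` package; the helper name and return shape are illustrative:

```python
import requests
from bs4 import BeautifulSoup
from readability import Document  # readability-lxml

def extract_main_text(url: str) -> tuple[str, str]:
    html = requests.get(url, timeout=10).text
    doc = Document(html)          # scores DOM nodes, keeps the "article" part
    summary_html = doc.summary()  # cleaned HTML of the main content block
    soup = BeautifulSoup(summary_html, "html.parser")
    # Drop remaining tags; keep readable body text only.
    text = soup.get_text(separator=" ", strip=True)
    return doc.title(), text
```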
## Pipeline Design

1. Input: the user provides:
   - Domain (root URL)
   - Test URL (page to optimize)
2. Crawling strategy (see the crawling sketch after this list):
   - Try sitemaps first (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
   - If weak or missing, fall back to BFS crawling via `<a>` tags
3. Filtering:
   - Keep only same-site URLs (optionally allow subdomains)
   - Filter by language (e.g., only `en`)
   - Filter by minimum content length
4. Model building:
   - Build TF-IDF / CountVectorizer representations of the cleaned docs
   - Encode documents with SBERT (`all-MiniLM-L6-v2`)
5. Source page representation:
   - Use the Test URL's full extracted text
   - Compute its TF-IDF vector + SBERT embedding
6. Keyphrase extraction (a sketch follows the flowchart below):
   - Extract token n-grams from the Test URL that appear in the TF-IDF vocabulary
   - Rank by frequency and filter out stopwords/boring phrases
7. Similarity scoring + anchor selection:
   - Compute TF-IDF cosine similarity + SBERT cosine similarity
   - Combine them into the hybrid score
   - Remove the Test URL itself from the candidates
   - Enforce the similarity thresholds
   - Pick the best-matching anchor phrase for each candidate target
8. Output:
   - Produce a ranked list with anchors, target URLs, and explanations
   - Display it in a Streamlit dataframe + allow CSV download
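A sketch of the sitemap-first discovery with BFS fallback (step 2); helper name and limits are illustrative, and the XML parse assumes `lxml` is installed:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def discover_urls(root: str, max_pages: int = 200) -> list[str]:
    host = urlparse(root).netloc
    # 1) Try common sitemap locations first.
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        resp = requests.get(urljoin(root, path), timeout=10)
        if resp.ok and "<loc>" in resp.text:
            soup = BeautifulSoup(resp.text, "xml")
            urls = [loc.text.strip() for loc in soup.find_all("loc")]
            urls = [u for u in urls if urlparse(u).netloc == host]
            if urls:
                return urls[:max_pages]
    # 2) Fallback: BFS over internal <a> links starting from the root.
    seen, queue = {root}, deque([root])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urljoin(page, a["href"]).split("#")[0]
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return list(seen)
```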
```mermaid
flowchart TD
    A[User Inputs: Domain and Test URL] --> B[HTTP Session with Retry Logic]
    B --> C[Crawl Domain: Sitemaps and Internal Links]
    C --> D[Fetch HTML Pages]
    D --> E[Readability Extraction: Title, Headings, Meta, Text]
    E --> F[Language Filter and Length Filter]
    F --> G[Document Corpus Built]
    G --> H[TF-IDF or Count Vectorizer]
    G --> I[SBERT Embeddings]
    H --> J[TF-IDF Similarity Scores]
    I --> K[SBERT Similarity Scores]
    J --> L[Hybrid Scoring with ALPHA Weight]
    K --> L
    L --> M[Keyphrase Extraction from Test URL]
    M --> N[Anchor Phrase Selection]
    N --> O[Final Suggestions Table]
    O --> P[Streamlit UI Output with CSV Export]
```
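To make step 6 concrete, here is one way the keyphrase extraction could look: pull 1–3-word n-grams from the Test URL's text, keep only those that also appear in the corpus-wide TF-IDF vocabulary, and rank by frequency. The function name, frequency cutoff, and n-gram range are illustrative assumptions, not the exact app code:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def candidate_anchors(test_text: str, vocab: set[str], top_k: int = 20) -> list[str]:
    # Tokenize the test page into 1-3-gram counts with English stopwords removed.
    cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    counts = cv.fit_transform([test_text])
    freq = Counter(dict(zip(cv.get_feature_names_out(), counts.toarray().ravel())))
    # Keep phrases that also exist in the corpus TF-IDF vocabulary and repeat at least once.
    ranked = [(p, n) for p, n in freq.most_common() if p in vocab and n > 1]
    return [p for p, _ in ranked[:top_k]]
```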
## Challenges & Trade-Offs

Crawl coverage:

- Some sites have perfect sitemaps → fast discovery
- Others need BFS crawling → slower & noisier

➡ Trade-off: hybrid approach (try sitemaps first, fall back to crawling).
Content filtering:

- Very aggressive filtering might discard useful pages
- Too loose filtering includes thin/irrelevant content

➡ Exposed `MIN_LEN` and the language whitelist as UI settings.
Similarity thresholds:

- High thresholds → only very strong matches; risk of no suggestions
- Low thresholds → many weak / irrelevant links

➡ User-controllable sliders for `MIN_SIM_TFIDF`, `MIN_SIM_BERT`, and `ALPHA` (the TF-IDF vs SBERT weight), as sketched below.
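A sketch of how such thresholds can be exposed in the Streamlit sidebar; labels and default values are illustrative:

```python
import streamlit as st

# Sidebar sliders let users trade precision against recall at runtime.
min_sim_tfidf = st.sidebar.slider("MIN_SIM_TFIDF", 0.0, 1.0, 0.15, step=0.01)
min_sim_bert = st.sidebar.slider("MIN_SIM_BERT", 0.0, 1.0, 0.35, step=0.01)
alpha = st.sidebar.slider("ALPHA (TF-IDF vs SBERT weight)", 0.0, 1.0, 0.5, step=0.05)
```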
Embedding performance:

- SBERT embeddings can be heavy on CPU
- A GPU (if available) gives a big speed boost

➡ Dynamic device detection (`cuda` if available, else `cpu`) and batch processing, as sketched below.
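A minimal sketch of the device-detection and batching idea; the batch size and sample corpus are illustrative:

```python
import torch
from sentence_transformers import SentenceTransformer

docs = ["cleaned text of page one...", "cleaned text of page two..."]  # corpus texts

# Pick the best available device at runtime.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Batched encoding bounds memory on CPU and keeps the GPU busy when present.
embeddings = model.encode(docs, batch_size=32, show_progress_bar=True,
                          normalize_embeddings=True)
```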
Anchor diversity:

- The same phrase should not be used for many URLs

➡ The tool tracks used anchors & targets to maintain diversity; a sketch of the idea follows.
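One way to implement this de-duplication: greedily walk candidates in score order and skip anchors or targets already used. Function and field names here are illustrative assumptions:

```python
def diversify(candidates: list[dict], per_anchor_limit: int = 1) -> list[dict]:
    used_anchors: dict[str, int] = {}
    used_targets: set[str] = set()
    picked = []
    # Highest-scoring suggestions get first claim on an anchor/target pair.
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if c["target"] in used_targets:
            continue  # each target page receives at most one suggestion
        if used_anchors.get(c["anchor"], 0) >= per_anchor_limit:
            continue  # this phrase already links somewhere else
        used_anchors[c["anchor"]] = used_anchors.get(c["anchor"], 0) + 1
        used_targets.add(c["target"])
        picked.append(c)
    return picked
```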