Bug: GitbookLoader fails to process nested sitemaps #30629

Closed
langchain-ai/langchain-community
#13
@andrasfe

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code prints 0, meaning the loader returns no documents.

from langchain_community.document_loaders import GitbookLoader

loader = GitbookLoader(
    web_page="https://platform-docs.opentargets.org/",
    load_all_paths=True,
    sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
)
docs = loader.load()
print(len(docs))  # Returns 0 instead of expected documents

Expected Behavior

The loader should recognize the sitemap index file, follow its links to the child sitemaps, and extract the content-page URLs from them, returning all available documents.

Actual Behavior

The loader processes only the top-level sitemap file, fails to recognize it as a sitemap index, and returns 0 documents.

Root Cause

  • The code doesn't distinguish between sitemap index files (<sitemapindex>) and regular sitemap files (<urlset>).
  • It doesn't use a proper XML parser for sitemap files.
  • There is no recursive processing of nested sitemaps.
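
To illustrate the three points above, here is a minimal sketch of recursive sitemap handling using only the standard library. It is not GitbookLoader's actual code or the eventual fix; the names `collect_page_urls`, `fetch`, and `FAKE_SITE` are hypothetical, and the `example.com` URLs are placeholders served from an in-memory dict so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def _entry_locs(root: ET.Element, entry_name: str) -> List[str]:
    """Return the <loc> text of each direct <entry_name> child of root."""
    locs = []
    for entry in root:
        if _local_name(entry.tag) != entry_name:
            continue
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
                break
    return locs


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],  # hypothetical: returns the raw XML for a URL
    seen: Optional[Set[str]] = None,
    max_depth: int = 5,
) -> List[str]:
    """Collect content-page URLs, following sitemap indexes recursively."""
    seen = set() if seen is None else seen
    # Guard against cycles and unbounded nesting.
    if sitemap_url in seen or max_depth < 0:
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    kind = _local_name(root.tag)
    if kind == "sitemapindex":
        # An index lists child sitemaps, not pages: recurse into each one.
        urls: List[str] = []
        for child_sitemap in _entry_locs(root, "sitemap"):
            urls.extend(collect_page_urls(child_sitemap, fetch, seen, max_depth - 1))
        return urls
    if kind == "urlset":
        # A regular sitemap lists the content pages directly.
        return _entry_locs(root, "url")
    return []


# In-memory stand-in for a site with a nested sitemap (placeholder URLs).
FAKE_SITE = {
    "https://example.com/sitemap.xml": b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>""",
    "https://example.com/sitemap-pages.xml": b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/getting-started</loc></url>
</urlset>""",
}

urls = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(urls)
```

Against a real site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`. Passing the fetcher in as a parameter keeps the parsing logic testable offline, and the `seen` set plus `max_depth` prevent infinite recursion if a sitemap index references itself.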

Error Message and Stack Trace (if applicable)

The code above prints 0 (no documents are loaded).

Description

I am trying to load a GitBook site with GitbookLoader. The loader fails to extract documents when the target site uses nested sitemaps (a sitemap index pointing to child sitemaps). When it encounters a sitemap index file, the loader treats it as a regular content page instead of recursively exploring the referenced sitemaps, so zero documents are returned.

PS: This is a continuation of issue #30473. The prior PR fixed the example from that issue but not the nested-sitemap case.

System Info

System Information

OS: Linux
OS Version: #59-Ubuntu SMP PREEMPT_DYNAMIC Sat Mar 15 17:40:59 UTC 2025
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]

Package Information

langchain_core: 0.3.49
langchain: 0.3.22
langchain_community: 0.3.20
langsmith: 0.3.22
langchain_anthropic: 0.3.10
langchain_openai: 0.3.11
langchain_text_splitters: 0.3.7

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
anthropic<1,>=0.49.0: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-core<1.0.0,>=0.3.49: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.68.2: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.16
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.1
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 8.3.5
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

Labels: 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
