Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
The following code prints 0.
from langchain_community.document_loaders import GitbookLoader
loader = GitbookLoader(
web_page="https://platform-docs.opentargets.org/",
load_all_paths=True,
sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
)
docs = loader.load()
print(len(docs)) # Returns 0 instead of expected documents
Expected Behavior
The loader should process the sitemap index file, follow links to child sitemaps, and extract URLs to content pages, returning all available documents.
Actual Behavior
The loader processes only the top-level sitemap file, fails to recognize it as a sitemap index, and returns 0 documents.
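To make "recognize it as a sitemap index" concrete, here is a minimal sketch of how the two document types can be told apart by their root element. It uses only the standard library. `sitemap_kind` and the sample XML strings are hypothetical names made up for this illustration; they are not part of GitbookLoader.

```python
import xml.etree.ElementTree as ET


def sitemap_kind(xml_text: str) -> str:
    """Return 'index', 'urlset', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree qualifies tags with their namespace, e.g.
    # '{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex',
    # so keep only the part after the closing brace.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        return "index"
    if local_name == "urlset":
        return "urlset"
    return "unknown"


# Hypothetical sample documents (example.com is a placeholder).
INDEX_XML = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
    "</sitemapindex>"
)
URLSET_XML = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/intro</loc></url>"
    "</urlset>"
)

print(sitemap_kind(INDEX_XML))
print(sitemap_kind(URLSET_XML))
```

A loader that checks this before extracting `<loc>` entries can decide whether those entries are content pages or further sitemaps.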
Root Cause
- The code doesn't distinguish between sitemap index files (<sitemapindex>) and regular sitemap files (<urlset>).
- It doesn't use a proper XML parser for sitemap files.
- There is no recursive processing for nested sitemaps.
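The three points above could be addressed along the following lines. This is a rough sketch under stated assumptions, not the loader's actual implementation or the eventual fix: `collect_page_urls`, the injected `fetch` callable, and the in-memory `FAKE_SITE` mapping are all hypothetical and exist only so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# Namespace defined by the sitemaps.org protocol, in ElementTree's "{uri}" form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return content-page URLs, following sitemap indexes recursively.

    `fetch` is a hypothetical callable that maps a URL to its XML text.
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # Each <sitemap><loc> points at a child sitemap, not a content page.
        urls: List[str] = []
        for loc in root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
            if loc.text:
                urls.extend(collect_page_urls(loc.text.strip(), fetch, seen))
        return urls

    # Regular sitemap: each <url><loc> is a content page.
    return [
        loc.text.strip()
        for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
        if loc.text
    ]


# Hypothetical in-memory site standing in for HTTP responses.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/intro</loc></url>"
        "<url><loc>https://example.com/faq</loc></url>"
        "</urlset>"
    ),
}

pages = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

Passing the fetcher in as a parameter keeps the traversal testable offline. Against a live site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, assuming the server returns well-formed XML.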
Error Message and Stack Trace (if applicable)
The code above prints 0.
Description
I am trying to load GitBook documentation with GitbookLoader. The loader fails to extract documents when the target site uses nested sitemaps (a sitemap index pointing to child sitemaps). When it encounters a sitemap index file, it attempts to process it as a regular content page rather than recursively exploring the referenced sitemaps, so zero documents are returned.
PS: This is a continuation of issue 30473. The prior PR fixed the example given there but not the recursive sitemap case.
System Info
System Information
OS: Linux
OS Version: #59-Ubuntu SMP PREEMPT_DYNAMIC Sat Mar 15 17:40:59 UTC 2025
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
Package Information
langchain_core: 0.3.49
langchain: 0.3.22
langchain_community: 0.3.20
langsmith: 0.3.22
langchain_anthropic: 0.3.10
langchain_openai: 0.3.11
langchain_text_splitters: 0.3.7
Optional packages not installed
langserve
Other Dependencies
aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
anthropic<1,>=0.49.0: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-core<1.0.0,>=0.3.49: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.68.2: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.16
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.1
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 8.3.5
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0