Merged
27 changes: 11 additions & 16 deletions src/docs.json
@@ -1273,8 +1273,8 @@
"group": "OpenAI",
"pages": [
"oss/javascript/integrations/providers/openai",
- "/oss/javascript/integrations/chat/openai",
- "/oss/javascript/integrations/text_embedding/openai"
+ "oss/javascript/integrations/chat/openai",
+ "oss/javascript/integrations/text_embedding/openai"
]
},
{
@@ -1288,24 +1288,24 @@
"group": "Google",
"pages": [
"oss/javascript/integrations/providers/google",
- "/oss/javascript/integrations/chat/google_generative_ai",
- "/oss/javascript/integrations/text_embedding/google_generativeai"
+ "oss/javascript/integrations/chat/google_generative_ai",
+ "oss/javascript/integrations/text_embedding/google_generativeai"
]
},
{
"group": "AWS",
"pages": [
"oss/javascript/integrations/providers/aws",
- "/oss/javascript/integrations/chat/bedrock",
- "/oss/javascript/integrations/text_embedding/bedrock"
+ "oss/javascript/integrations/chat/bedrock",
+ "oss/javascript/integrations/text_embedding/bedrock"
]
},
{
"group": "Microsoft",
"pages": [
"oss/javascript/integrations/providers/microsoft",
- "/oss/javascript/integrations/chat/azure",
- "/oss/javascript/integrations/text_embedding/azure_openai"
+ "oss/javascript/integrations/chat/azure",
+ "oss/javascript/integrations/text_embedding/azure_openai"
]
},
"oss/javascript/integrations/providers/all_providers"
@@ -1329,16 +1329,11 @@
"icon": "database",
"pages": [
"oss/javascript/integrations/retrievers/index",
"oss/javascript/integrations/splitters/index",
"oss/javascript/integrations/text_embedding/index",
"oss/javascript/integrations/vectorstores/index",
{
"group": "Document loaders",
"pages": [
"oss/javascript/integrations/document_loaders/index",
"oss/javascript/integrations/document_loaders/file_loaders/index",
"oss/javascript/integrations/document_loaders/web_loaders/index"
]
}
"oss/javascript/integrations/document_loaders/index",
"oss/javascript/integrations/stores/index"
]
}
]
145 changes: 145 additions & 0 deletions src/oss/integrations/splitters/character_text_splitter.mdx
@@ -0,0 +1,145 @@
---
title: Splitting by character
---

Character-based splitting is the simplest approach to text splitting. It divides text using a specified character sequence (default: `"\n\n"`), with chunk length measured by the number of characters.

**Key points**:
1. **How text is split**: by a given character separator.
2. **How chunk size is measured**: by character count.
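
:::python
The two key points above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper name `split_by_character` is hypothetical, and the sketch leaves out chunk overlap and regex separators.

```python
def split_by_character(text: str, separator: str = "\n\n", chunk_size: int = 1000) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    # Step 1: split on the separator, dropping empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        # Step 2: chunk size is measured with len(), i.e. by character count.
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece
            # that a separator-only splitter cannot break down further.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma\n\ndelta epsilon zeta"
print(split_by_character(sample, chunk_size=20))
# ['alpha beta\n\ngamma', 'delta epsilon zeta']
```

Note that in this sketch a piece longer than `chunk_size` is emitted whole, because the only place the text can be cut is at the separator.
:::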

:::python
You can choose between:
- `.split_text` — returns plain string chunks.
- `.create_documents` — returns LangChain @[Document] objects, useful when metadata needs to be preserved for downstream tasks.
:::
:::js
You can choose between:
- `.splitText` — returns plain string chunks.
- `.createDocuments` — returns LangChain @[Document] objects, useful when metadata needs to be preserved for downstream tasks.
:::

:::python
```python
%pip install -qU langchain-text-splitters
```
:::
:::js
<CodeGroup>
```bash npm
npm install @langchain/textsplitters
```

```bash pnpm
pnpm install @langchain/textsplitters
```

```bash yarn
yarn add @langchain/textsplitters
```

```bash bun
bun add @langchain/textsplitters
```
</CodeGroup>
:::

:::python
```python
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

# Split on blank lines; chunk size and overlap are measured in characters
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```
```output
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

> **Review comment (Collaborator):** Ideally, we'd have output/examples that are non-political and not American-centric to make our docs more accessible to a wider audience, but I think stateOfTheUnion was an existing example already. Just flagging - we can update this example later
>
> **Reply (Member Author):** yeah these are brought over
:::
:::js
```ts
import { CharacterTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "fs";

// Example: read a long document
const stateOfTheUnion = readFileSync("state_of_the_union.txt", "utf8");

const splitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
// createDocuments takes an array of strings and returns a Promise
const texts = await splitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);
```
```output
Document {
pageContent: 'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
}
```
:::

:::python
Use `.create_documents` to propagate metadata associated with each document to the output chunks:

```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
```
```output
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}
```
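
As a rough mental model of the propagation shown above, each chunk receives a copy of the metadata of the text it came from. The sketch below is a plain-Python illustration under that assumption; the `Chunk` class and `create_chunks` helper are hypothetical stand-ins, not the library's `Document` or `create_documents`.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk with attached metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas=None, separator="\n\n"):
    """Split each text on `separator` and copy that text's metadata onto every chunk."""
    metadatas = metadatas or [{} for _ in texts]
    chunks = []
    for text, meta in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # dict(meta) gives every chunk its own copy of the metadata
                chunks.append(Chunk(page_content=piece, metadata=dict(meta)))
    return chunks


docs = create_chunks(
    ["first\n\nsecond", "third"],
    metadatas=[{"document": 1}, {"document": 2}],
)
for d in docs:
    print(d.metadata, d.page_content)
```

Copying the metadata per chunk means a later change to one chunk's metadata does not leak into its siblings.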
:::
:::js
Use `.createDocuments` to propagate metadata associated with each document to the output chunks:

```ts
const metadatas = [{ document: 1 }, { document: 2 }];
// Pass the raw strings first and the per-document metadata second
const documents = await splitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);
```
```output
Document {
pageContent: 'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.',
metadata: {'document': 1}
}
```
:::

:::python
Use `.split_text` to obtain the string content directly:

```python
text_splitter.split_text(state_of_the_union)[0]
```
```output
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```
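
The splitter configured earlier also sets `chunk_overlap`, which lets consecutive chunks share trailing text. The sketch below shows one way such overlap can work at the level of separator-delimited pieces. It is an illustration under simplifying assumptions, and `merge_with_overlap` is a hypothetical helper, not a library function.

```python
def merge_with_overlap(pieces, separator, chunk_size, chunk_overlap):
    """Greedily merge pieces into chunks, carrying trailing pieces into the next chunk."""

    def joined_len(items):
        return len(separator.join(items))

    chunks = []
    window = []  # pieces that make up the chunk currently being built
    for piece in pieces:
        if window and joined_len(window + [piece]) > chunk_size:
            chunks.append(separator.join(window))
            # Drop leading pieces until the remainder fits the overlap budget
            # and still leaves room for the incoming piece.
            while window and (
                joined_len(window) > chunk_overlap
                or joined_len(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


pieces = ["aaaa", "bbbb", "cccc", "dddd"]
print(merge_with_overlap(pieces, " ", chunk_size=9, chunk_overlap=4))
# ['aaaa bbbb', 'bbbb cccc', 'cccc dddd']
print(merge_with_overlap(pieces, " ", chunk_size=9, chunk_overlap=0))
# ['aaaa bbbb', 'cccc dddd']
```

With an overlap of 4 characters, the last piece of each chunk reappears at the start of the next one; with an overlap of 0, the chunks are disjoint.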

:::
:::js
Use `.splitText` to obtain the string content directly:

```ts
// splitText returns a Promise, so await it before indexing
(await splitter.splitText(stateOfTheUnion))[0];
```
```output
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```
:::