add README with detailed usage and import instructions for dgraph-import #77
Status: Open. shiva-istari wants to merge 2 commits into `main` from `shiva/import` (+129 −0).
# Dgraph Import

## Overview

The `dgraph import` command, introduced in **v25.0.0**, is designed to unify and simplify bulk and live data loading into Dgraph. Previously, users had to choose between `dgraph bulk` and `dgraph live`. With `dgraph import`, you now have a single command for both workflows, eliminating manual steps and reducing operational complexity.

> **Note:**
> The original intent was to support both bulk and live loading, but **live loader mode is not yet supported**. Only bulk/snapshot import is available.

## How Data Is Imported

When you run `dgraph import`, the tool first runs the bulk loader using your provided RDF/JSON and schema files. This generates the snapshot data in the form of `p` directories (BadgerDB files) for each group. After the bulk loader completes, `dgraph import` connects to the Alpha endpoint, puts the cluster into drain mode, and **streams the contents of the generated `p` directories directly to the running cluster using gRPC bidirectional streaming**. Once the import is complete, the cluster exits drain mode and resumes normal operation.

If you already have a snapshot directory (from a previous bulk load), you can use the `--snapshot-dir` flag to skip the bulk loading phase and stream the snapshot data directly to the cluster.

This means you no longer need to stop Alpha nodes or manually manage files; `dgraph import` handles everything automatically.

## Command Syntax

```
dgraph import [flags]
```

### Essential Flags

| Flag | Description |
|------|-------------|
| `--files, -f` | Path to RDF/JSON data files (e.g., `data.rdf`, `data.json`) |
| `--schema, -s` | Path to DQL schema file |
| `--graphql-schema, -g` | Path to GraphQL schema file |
| `--format` | File format: `rdf` or `json` |
| `--snapshot-dir, -p` | Path to an existing snapshot output directory for direct import |
| `--drop-all` | Drop all existing cluster data before import (enables bulk loader) |
| `--drop-all-confirm` | Confirmation flag for the `--drop-all` operation |
| `--conn-str, -c` | Dgraph connection string (e.g., `dgraph://localhost:9080`) |
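As a hypothetical pre-flight check (not part of the `dgraph` CLI itself), a wrapper script can verify that the value passed to `--conn-str` matches the `dgraph://host:port` form before attempting an import:

```shell
#!/bin/sh
# Hypothetical helper: verify a connection string looks like the
# dgraph://host:port form expected by --conn-str before importing.
check_conn_str() {
  case "$1" in
    dgraph://*:[0-9]*) echo "ok" ;;
    *) echo "invalid: expected dgraph://host:port" ;;
  esac
}

check_conn_str "dgraph://localhost:9080"   # ok
check_conn_str "localhost:9080"            # invalid: expected dgraph://host:port
```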
## Quick Start

### Bulk Import with Data and Schema

```
dgraph import --files data.rdf --schema schema.dql \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

Loads data from `data.rdf`, drops existing cluster data, runs the bulk loader to generate a snapshot, and streams it to the cluster.

### Import from Existing Snapshot

```
dgraph import --snapshot-dir ./out --conn-str dgraph://localhost:9080
```

Directly streams snapshot data (the output of a previous bulk load) into the cluster, without running the bulk loader again.

## Snapshot Directory Structure

The bulk loader generates an `out` directory with per-group subdirectories:

```
out/
├── 0/
│   └── p/   # BadgerDB files for group 0
├── 1/
│   └── p/   # BadgerDB files for group 1
└── N/
    └── p/   # BadgerDB files for group N
```

When using `--snapshot-dir`, provide the path to the `out` directory. The import tool automatically locates the `p` directories within each group folder.

**Important:** Do not specify the `p` directory directly.
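A quick, hypothetical sanity check (again, not part of the `dgraph` CLI) can confirm that a directory has the expected `out/<group>/p` layout before you pass it to `--snapshot-dir`. This sketch builds a sample layout like the one the bulk loader produces and counts the group-level `p` directories:

```shell
#!/bin/sh
# Hypothetical pre-flight check for --snapshot-dir: the directory
# should contain per-group p/ subdirectories (out/<group>/p).
# Build a sample layout resembling the bulk loader's output.
tmp=$(mktemp -d)
mkdir -p "$tmp/out/0/p" "$tmp/out/1/p"

# Count p directories exactly two levels down; --snapshot-dir should
# point at out/, never at an individual p directory.
groups=$(find "$tmp/out" -mindepth 2 -maxdepth 2 -type d -name p | wc -l | tr -d ' ')
echo "found $groups group p directories"   # found 2 group p directories

rm -r "$tmp"
```

Running the same `find` against your real `out` directory should report one `p` directory per group; a count of zero usually means you pointed at a `p` directory itself or at the wrong path.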
## How It Works

1. **Drop-All Mode**: With `--drop-all` and `--drop-all-confirm`, the bulk loader generates a snapshot from the provided data and schema files.
2. **Snapshot Streaming**: The snapshot (the contents of the `p` directories) is streamed to the cluster via gRPC, copying all data directly into the running cluster.
3. **Consistency**: The cluster enters drain mode during the import. On error, all data is dropped for safety.

## Import Examples

**RDF with DQL schema:**

```
dgraph import --files data.rdf --schema schema.dql \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

**JSON with GraphQL schema:**

```
dgraph import --files data.json --schema schema.dql \
  --graphql-schema schema.graphql --format json \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

**Existing snapshot:**

```
dgraph import --snapshot-dir ./out --conn-str dgraph://localhost:9080
```

## Benchmark Import

For testing with large datasets, Dgraph provides sample 1-million-record datasets.

**Download the benchmark files** (use `-O` so `wget` saves them without the `?raw=true` query string in the filename):

```
wget -O 1million.rdf.gz "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.rdf.gz?raw=true"
wget -O 1million.schema "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.schema?raw=true"
```

**Run the benchmark import:**

```
dgraph import --files 1million.rdf.gz --schema 1million.schema \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

## Important Notes

- When the `--drop-all` and `--drop-all-confirm` flags are set, **all existing data in the cluster will be dropped** before the import begins.
- Both `--drop-all` and `--drop-all-confirm` are required for bulk loading; the command aborts without them.
- Live loader mode is not supported; only snapshot/bulk import is available.
- Ensure sufficient disk space for snapshot generation.
- The connection string must use the gRPC format: `dgraph://localhost:9080`.
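Since snapshot generation is disk-hungry, a hypothetical pre-import guard (the threshold and variable names are illustrative, not part of `dgraph`) might look like:

```shell
#!/bin/sh
# Hypothetical pre-import guard: check that the working directory has
# at least MIN_GB gigabytes free before snapshot generation starts.
MIN_GB=10

# df -Pk prints available space in 1 KB blocks (POSIX output format);
# the fourth field of the second line is the available space.
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -lt "$MIN_GB" ]; then
  echo "warning: only ${avail_gb}GB free, want at least ${MIN_GB}GB"
else
  echo "disk check passed: ${avail_gb}GB available"
fi
```

The actual space required depends on your dataset size; the bulk loader's working files plus the generated `p` directories can be several times the size of the input RDF/JSON.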
Missing some context and explanation about how this differs from or complements `dgraph bulk` and `dgraph live`.
I don't know the details myself, but it should be something like:
"""The `dgraph import` command was introduced in v25.0 (?) to simplify the import of data generated by `dgraph bulk`."""
Then a small explanation of the improvement:
"""The previous `dgraph bulk` process consisted of stopping Alpha, copying the `p` files to the right places, restarting Alpha ..."""
`dgraph import` dramatically simplifies those steps: you don't have to stop the Alpha nodes and don't have to deal with copying files. Simply invoke `dgraph import --snapshot-dir ...`, and explain the command.
For the second use case (importing RDF): is it a replacement for `dgraph live` or an alternative? We need to identify in which cases I would still use `dgraph live` (network constraints?). Is there a performance difference?
So I would describe the snapshot case first (bulk-load output importer) and the RDF/JSON file case second.
Pushed new changes with additional context and motivation for the import tool in the README.