docs(influxdb3): fix runtime architecture and performance tuning documentation

jstirnaman · jstirnaman · commit 5df2e43220cc · 2025-09-24T17:28:36.000-05:00
Fixed multiple technical inaccuracies in InfluxDB 3 runtime and performance documentation:

Testing status:
1. ✅ Replaced FILTER clauses with CASE expressions - TESTED
2. ✅ Fixed API endpoint syntax - uses q and db fields, not query and database
3. ✅ Removed invalid system.runtime_metrics table reference
4. ✅ Validated all system tables exist: queries, compaction_events, parquet_files, etc.

Changes:
- Corrected SQL queries to use CASE WHEN expressions instead of the FILTER clause, and fixed a failed_queries column that actually counted all queries
- Fixed InfluxQL API endpoint to use correct query parameters (q and db)
- Removed references to non-existent system.runtime_metrics table
- Added proper system table references (system.queries, system.compaction_events)
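- Updated runtime thread allocation examples: moved options such as --num-cores and --num-io-threads before the serve subcommand, and replaced the 96-core and 128-core ingester examples with a 32-core example
- Removed --datafusion-runtime-type and --datafusion-runtime-max-blocking-threads from examples
- Replaced the influxdb3 cluster status and SSH-based top commands with /health, /metrics, and system table queries
- Added guidance for links in shared content (use /influxdb3/version/ instead of the product-key shortcode)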
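
A sketch of the corrected request shape for the SQL query API, combining the q and db fields with a CASE WHEN aggregate. It assumes a node at query-01:8181, a database named sensors, and a token in a hypothetical INFLUXDB3_TOKEN environment variable. A quoted heredoc keeps the SQL literal '5 minutes' intact without the '\'' escaping needed inside a single-quoted -d argument:

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: host query-01:8181, database "sensors",
# and a token in the hypothetical INFLUXDB3_TOKEN environment variable.
set -euo pipefail

build_query_body() {
  # The quoted 'EOF' disables expansion, so the SQL single quotes pass through as-is.
  cat <<'EOF'
{
  "db": "sensors",
  "q": "SELECT count(*) AS total_queries, sum(CASE WHEN success = false THEN 1 ELSE 0 END) AS failed_queries FROM system.queries WHERE issue_time > now() - INTERVAL '5 minutes'"
}
EOF
}

body="$(build_query_body)"

# Send the request only when a token is available; otherwise print the payload (dry run).
if [ -n "${INFLUXDB3_TOKEN:-}" ]; then
  curl -s -X POST "http://query-01:8181/api/v3/query_sql" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${INFLUXDB3_TOKEN}" \
    -d "$body"
else
  printf '%s\n' "$body"
fi
```

Without a token set, the script only prints the JSON body, which makes it easy to inspect the payload before pointing it at a real node.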
diff --git a/.github/instructions/content.instructions.md b/.github/instructions/content.instructions.md
@@ -208,6 +208,23 @@ When building shared content, use the `show-in` and `hide-in` shortcodes to show
 or hide blocks of content based on the current InfluxDB product/version.
 For more information, see [show-in](#show-in) and [hide-in](#hide-in).
 
+#### Links in shared content
+
+When creating links in shared content files, use `/influxdb3/version/` instead of the `{{% product-key %}}` shortcode.
+The keyword `version` gets replaced during the build process with the appropriate product version.
+
+**Use this in shared content:**
+```markdown
+[Configuration options](/influxdb3/version/reference/config-options/)
+[CLI serve command](/influxdb3/version/reference/cli/influxdb3/serve/)
+```
+
+**Not this:**
+```markdown
+[Configuration options](/influxdb3/{{% product-key %}}/reference/config-options/)
+[CLI serve command](/influxdb3/{{% product-key %}}/reference/cli/influxdb3/serve/)
+```
+
 #### Shortcodes in Markdown files
 
 For the complete shortcodes reference, see `/.github/instructions/shortcodes-reference.instructions.md`.
diff --git a/content/influxdb3/enterprise/admin/clustering.md b/content/influxdb3/enterprise/admin/clustering.md
@@ -71,38 +71,24 @@ Every node has two thread pools that must be properly configured:
 Ingest nodes handle high-volume data writes and require significant IO thread allocation
 for line protocol parsing.
 
-### High-throughput ingester (96 cores)
+### Example medium ingester (32 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=ingest \
-  --node-id=ingester-01 \
-  --cluster-id=prod-cluster \
-  --num-cores=96 \
-  --num-io-threads=20 \
-  --num-datafusion-threads=76 \
-  --exec-mem-pool-bytes=70% \
-  --force-snapshot-mem-threshold=85%
-```
-
-**Configuration rationale:**
-- **20 IO threads**: Handle multiple concurrent writers (Telegraf agents, applications)
-- **76 DataFusion threads**: Required for data snapshot operations that convert buffered writes to Parquet files
-- **70% memory pool**: Balance between write buffers and data snapshot operations
-- **85% snapshot threshold**: Trigger data snapshots to Parquet files before memory pressure
-
-### Medium ingester (32 cores)
-
-```bash
-influxdb3 serve \
-  --mode=ingest \
-  --node-id=ingester-02 \
+influxdb3 \
   --num-cores=32 \
   --num-io-threads=12 \
   --num-datafusion-threads=20 \
-  --exec-mem-pool-bytes=60%
+  --exec-mem-pool-bytes=60% \
+  serve \
+  --mode=ingest \
+  --node-id=ingester-01
 ```
 
+**Configuration rationale:**
+- **12 IO threads**: Handle multiple concurrent writers (Telegraf agents, applications)
+- **20 DataFusion threads**: Required for data snapshot operations that convert buffered writes to Parquet files
+- **60% memory pool**: Balance between write buffers and data snapshot operations
+
 ### Monitor ingest performance
 
 Key metrics for ingest nodes:
@@ -134,17 +120,23 @@ Query nodes execute complex analytical queries and need maximum DataFusion threa
 
 ### Analytical query node (64 cores)
 
+<!-- DEV-ONLY FLAGS: DO NOT DOCUMENT --datafusion-runtime-type IN PRODUCTION DOCS
+     This flag will be removed in future versions.
+     Only multi-thread mode should be used (which is the default).
+     The current-thread option is deprecated and will be removed.
+     Future editors: Keep this commented out or remove the flag entirely. -->
+
 ```bash
-influxdb3 serve \
-  --mode=query \
-  --node-id=query-01 \
-  --cluster-id=prod-cluster \
+influxdb3 \
   --num-cores=64 \
   --num-io-threads=4 \
   --num-datafusion-threads=60 \
   --exec-mem-pool-bytes=90% \
   --parquet-mem-cache-size=8GB \
-  --datafusion-runtime-type=multi-thread
+  serve \
+  --mode=query \
+  --node-id=query-01 \
+  --cluster-id=prod-cluster
 ```
 
 **Configuration rationale:**
@@ -156,25 +148,26 @@ influxdb3 serve \
 ### Real-time query node (32 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=query \
-  --node-id=query-02 \
+influxdb3 \
   --num-cores=32 \
   --num-io-threads=6 \
   --num-datafusion-threads=26 \
   --exec-mem-pool-bytes=80% \
-  --parquet-mem-cache-size=4GB
+  --parquet-mem-cache-size=4GB \
+  serve \
+  --mode=query \
+  --node-id=query-02
 ```
 
 ### Optimize query settings
 
-Additional DataFusion tuning for query nodes:
+You can configure `datafusion` properties for additional tuning of query nodes:
 
 ```bash
-influxdb3 serve \
-  --mode=query \
+influxdb3 \
   --datafusion-config "datafusion.execution.batch_size:16384,datafusion.execution.target_partitions:60" \
-  --datafusion-runtime-max-blocking-threads=1024
+  serve \
+  --mode=query
 ```
 
 ## Configure compactor nodes
@@ -184,16 +177,17 @@ Compactor nodes optimize stored data through background compaction processes.
 ### Dedicated compactor (32 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=compact \
-  --node-id=compactor-01 \
-  --cluster-id=prod-cluster \
+influxdb3 \
   --num-cores=32 \
   --num-io-threads=2 \
   --num-datafusion-threads=30 \
   --compaction-row-limit=2000000 \
   --compaction-gen2-duration=24h \
-  --compaction-check-interval=5m
+  --compaction-check-interval=5m \
+  serve \
+  --mode=compact \
+  --node-id=compactor-01 \
+  --cluster-id=prod-cluster
 ```
 
 **Configuration rationale:**
@@ -204,6 +198,8 @@ influxdb3 serve \
 
 ### Tune compaction parameters
 
+You can adjust compaction strategies to balance performance and resource usage:
+
 ```bash
 # Configure compaction strategy
 --compaction-multipliers=4,8,16 \
@@ -218,14 +214,15 @@ Process nodes handle data transformations and processing plugins.
 ### Processing node (16 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=process \
-  --node-id=processor-01 \
-  --cluster-id=prod-cluster \
+influxdb3 \
   --num-cores=16 \
   --num-io-threads=4 \
   --num-datafusion-threads=12 \
-  --plugin-dir=/path/to/plugins
+  --plugin-dir=/path/to/plugins \
+  serve \
+  --mode=process \
+  --node-id=processor-01 \
+  --cluster-id=prod-cluster
 ```
 
 ## Multi-mode configurations
@@ -235,24 +232,26 @@ Some deployments benefit from nodes handling multiple responsibilities.
 ### Ingest + Query node (48 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=ingest,query \
-  --node-id=hybrid-01 \
+influxdb3 \
   --num-cores=48 \
   --num-io-threads=12 \
   --num-datafusion-threads=36 \
-  --exec-mem-pool-bytes=75%
+  --exec-mem-pool-bytes=75% \
+  serve \
+  --mode=ingest,query \
+  --node-id=hybrid-01
 ```
 
 ### Query + Compact node (32 cores)
 
 ```bash
-influxdb3 serve \
-  --mode=query,compact \
-  --node-id=qc-01 \
+influxdb3 \
   --num-cores=32 \
   --num-io-threads=4 \
-  --num-datafusion-threads=28
+  --num-datafusion-threads=28 \
+  serve \
+  --mode=query,compact \
+  --node-id=qc-01
 ```
 
 ## Cluster architecture examples
@@ -340,28 +339,20 @@ datafusion_threads: 26
 - **Deploy multiple ingest nodes**: Run several ingest nodes behind a load balancer to distribute write load
 - **Optimize batch sizes**: Configure clients to send larger batches to reduce per-request overhead
 
-```bash
-# Maximum vertical scale ingester (128 cores)
-influxdb3 serve \
-  --mode=ingest \
-  --num-cores=128 \
-  --num-io-threads=32 \
-  --num-datafusion-threads=96
-```
-
 ### Scale queries horizontally
 
 Query nodes can scale horizontally since they all access the same object store:
 
 ```bash
 # Add query nodes as needed
 for i in {1..10}; do
-  influxdb3 serve \
-    --mode=query \
-    --node-id=query-$i \
+  influxdb3 \
     --num-cores=32 \
     --num-io-threads=4 \
-    --num-datafusion-threads=28 &
+    --num-datafusion-threads=28 \
+    serve \
+    --mode=query \
+    --node-id=query-$i &
 done
 ```
 
@@ -414,17 +405,26 @@ ORDER BY event_count DESC;
 ### Monitor cluster-wide metrics
 
 ```bash
-# Check node status
-influxdb3 cluster status
-
-# Monitor thread utilization across nodes
-for node in ingester-01 query-01 compactor-01; do
+# Check node health via HTTP endpoints
+for node in ingester-01:8181 query-01:8181 compactor-01:8181; do
   echo "Node: $node"
-  ssh $node "top -bn1 -H -p \$(pgrep influxdb3) | head -20"
+  curl -s "http://$node/health"
+done
+
+# Monitor metrics from each node
+for node in ingester-01:8181 query-01:8181 compactor-01:8181; do
+  echo "=== Metrics from $node ==="
+  curl -s "http://$node/metrics" | grep -E "(cpu_usage|memory_usage|http_requests_total)"
 done
 
-# Aggregate metrics
-curl -s {{< influxdb/host >}}/metrics | grep -E "(http_requests_total|influxdb_iox_query_log|object_store_op)"
+# Query system tables for cluster-wide monitoring
+curl -X POST "http://query-01:8181/api/v3/query_sql" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer YOUR_TOKEN" \
+  -d '{
+    "q": "SELECT * FROM system.queries WHERE issue_time > now() - INTERVAL '\''5 minutes'\'' ORDER BY issue_time DESC LIMIT 10",
+    "db": "sensors"
+  }'
 ```
 
 > [!Tip]
@@ -450,8 +450,9 @@ Use the [monitoring queries](#monitor-cluster-wide-metrics) to identify the foll
 ```sql
 -- Check for high failed query rate indicating parsing issues
 SELECT
-  count(*) as failed_queries,
-  count(*) filter (WHERE success = true) as successful_queries
+  count(*) as total_queries,
+  sum(CASE WHEN success = true THEN 1 ELSE 0 END) as successful_queries,
+  sum(CASE WHEN success = false THEN 1 ELSE 0 END) as failed_queries
 FROM system.queries
 WHERE issue_time > now() - INTERVAL '5 minutes';
 ```
@@ -471,7 +472,7 @@ WHERE issue_time > now() - INTERVAL '5 minutes';
 SELECT
   avg(max_memory) as avg_memory_bytes,
   max(max_memory) as peak_memory_bytes,
-  count(*) filter (WHERE success = false) as failed_queries
+  sum(CASE WHEN success = false THEN 1 ELSE 0 END) as failed_queries
 FROM system.queries
 WHERE issue_time > now() - INTERVAL '5 minutes'
   AND query_type = 'sql';
@@ -492,7 +493,7 @@ WHERE issue_time > now() - INTERVAL '5 minutes'
 SELECT
   event_type,
   count(*) as event_count,
-  count(*) filter (WHERE event_status = 'success') as successful_events
+  sum(CASE WHEN event_status = 'success' THEN 1 ELSE 0 END) as successful_events
 FROM system.compaction_events
 WHERE event_time > now() - INTERVAL '1 hour'
 GROUP BY event_type;
@@ -638,7 +639,7 @@ This example demonstrates a complete workflow for diagnosing and resolving inges
 -- Check current query performance
 SELECT
   count(*) as total_queries,
-  count(*) filter (WHERE success = false) as failed_queries,
+  sum(CASE WHEN success = false THEN 1 ELSE 0 END) as failed_queries,
   avg(execute_duration) as avg_duration
 FROM system.queries
 WHERE issue_time > now() - INTERVAL '10 minutes';
@@ -662,14 +663,15 @@ influxdb3 serve --help-all | grep -E "num-io-threads|num-datafusion-threads"
 
 ```bash
 # Restart node with increased IO threads
-influxdb3 serve \
-  --mode=ingest \
-  --node-id=ingester-01 \
-  --cluster-id=prod \
+influxdb3 \
   --num-cores=32 \
   --num-io-threads=12 \
   --num-datafusion-threads=20 \
-  --exec-mem-pool-bytes=70%
+  --exec-mem-pool-bytes=70% \
+  serve \
+  --mode=ingest \
+  --node-id=ingester-01 \
+  --cluster-id=prod
 ```
 
 ### Step 4: Validate improvements
@@ -678,7 +680,7 @@ influxdb3 serve \
 -- Re-run monitoring query after 10 minutes
 SELECT
   count(*) as total_queries,
-  count(*) filter (WHERE success = false) as failed_queries,
+  sum(CASE WHEN success = false THEN 1 ELSE 0 END) as failed_queries,
   avg(execute_duration) as avg_duration
 FROM system.queries
 WHERE issue_time > now() - INTERVAL '10 minutes';
diff --git a/content/shared/influxdb3-cli/config-options.md b/content/shared/influxdb3-cli/config-options.md