Skip to content

Latest commit

 

History

History
453 lines (343 loc) · 11.6 KB

File metadata and controls

453 lines (343 loc) · 11.6 KB

Performance Guide

Choosing the Right I/O Mode

XL provides two distinct I/O implementations with different performance characteristics. Understanding when to use each is critical for optimal performance.

Decision Matrix

Dataset Size Styling Needs Read Mode Write Mode Memory Performance
< 10k rows Any In-memory In-memory ~10MB Excellent
10k-50k rows Full styling In-memory In-memory ~25-50MB Good
50k-100k rows Full styling In-memory In-memory ~50-100MB Fair
100k+ rows Minimal Streaming (sequential) Streaming ~10-50MB (SST dependent) Excellent
100k+ rows Full styling In-memory for now In-memory for now O(n) See roadmap P7.5

I/O Modes Explained

In-Memory Mode (Default)

Use For: Small‑to‑medium workbooks (<100k rows) or any workload that needs full styling and metadata.

Characteristics:

  • Builds full XML trees in memory.
  • Full feature support (SST, styles, merged cells, RichText).
  • Deterministic output (byte‑identical structure on re‑write).
  • Compact XML with DEFLATED compression by default; pretty‑printed STORED output is available for debugging via WriterConfig.debug.

API:

import com.tjclp.xl.io.ExcelIO
import cats.effect.IO

val excel = ExcelIO.instance[IO]

// Write
excel.write(workbook, path).unsafeRunSync()

// Read
excel.read(path).map { wb =>
  // Process workbook
}.unsafeRunSync()

Memory Profile (approximate):

10k rows:   ~10MB
50k rows:   ~50MB
100k rows:  ~100MB
500k rows:  ~500MB (not recommended)

Performance:

  • Write: ~100k rows/sec
  • Read: ~55k rows/sec

Streaming Write Mode

Use For: Large data generation (100k+ rows) when you can live with minimal styling and inline strings.

Characteristics:

  • O(1) constant memory for worksheet data (~10MB regardless of row count).
  • fs2‑data‑xml event streaming; no intermediate XML trees.
  • Limitations: No SST support (inline strings only), minimal styles (no rich formatting or merged cells).

API:

import com.tjclp.xl.io.Excel
import com.tjclp.xl.io.RowData
import fs2.Stream

val excel = com.tjclp.xl.io.Excel.forIO

// Generate 1M rows
Stream.range(1, 1_000_001)
  .map(i => RowData(i, Map(
    0 -> CellValue.Text(s"Row $i"),
    1 -> CellValue.Number(BigDecimal(i))
  )))
  .through(excel.writeStream(path, "Data"))
  .compile.drain
  .unsafeRunSync()

// Memory: ~10MB constant!

Memory Profile:

100k rows:  ~10MB (constant)
1M rows:    ~10MB (constant)
10M rows:   ~10MB (constant)

Performance:

  • Write: ~88k rows/sec in benchmarks.
  • In practice, ~4–5x faster and vastly more memory‑efficient than typical in‑memory POI usage for large datasets.

Tradeoffs:

  • ✅ Constant memory for worksheet data.
  • ✅ Excellent throughput at high row counts.
  • ❌ No SST (larger files if many duplicate strings).
  • ❌ Minimal styles only (no rich formatting at scale).

Streaming Read Mode

Use For: Sequential processing of large worksheets (filtering, aggregation, ETL) where you don’t need full styling information.

Characteristics:

  • O(1) memory for worksheet data by streaming XML via fs2.io.readInputStream + fs2‑data‑xml.
  • SharedStrings (if present) are parsed once up front, so memory use scales with the number of unique strings, not total rows.
  • Emits Stream[F, RowData] with 1‑based row indices and 0‑based column indices.

API:

import com.tjclp.xl.io.Excel
import cats.effect.IO
import cats.effect.unsafe.implicits.global

val excel = Excel.forIO

excel.readStream(path)
  .filter(_.rowIndex > 1) // skip header
  .evalMap(row => IO.println(row))
  .compile.drain
  .unsafeRunSync()

Tradeoffs:

  • ✅ True streaming (no materialization of full sheets).
  • ✅ Great for ETL and analytics pipelines.
  • ❌ Does not expose full formatting/metadata; for that, use the in‑memory APIs.

Performance Optimization Techniques

1. Batch Operations (30-50% faster)

Bad (N intermediate sheets):

(1 to 10000).foldLeft(sheet) { (s, i) =>
  s.put(ref"A$i", CellValue.Number(i))
}
// Creates 10,000 intermediate Sheet instances

Good (single putAll):

val cells = (1 to 10000).map(i => Cell(ref"A$i", CellValue.Number(i)))
sheet.put(cells)
// Creates 1 Sheet instance

Best (batch put with codecs):

sheet.put(
  (1 to 10000).map(i => ref"A$i" -> i)*
)
// Type-safe, auto-format inference, single allocation

2. Use Patch Composition (Deferred Execution)

Bad (eager evaluation):

var s = sheet
for i <- 1 to 1000 do
  s = s.put(ref"A$i", s"Row $i")  // 1000 Sheet instances
  s = s.merge(ref"A$i:B$i")       // 1000 more Sheet instances

Good (deferred with Patch):

import com.tjclp.xl.dsl.*

val patch = (1 to 1000).foldLeft(Patch.empty: Patch) { (p, i) =>
  p ++ (ref"A$i" := s"Row $i") ++ ref"A$i:B$i".merge
}
sheet.put(patch)  // Execute once at end

Impact: Reduces intermediate allocations by 50-70%


3. Prefer Batch Sheet.put for Type Safety + Auto-Formatting

Manual (verbose, error‑prone):

import com.tjclp.xl.*

val dateStyle = CellStyle.default.numFmt(NumFmt.Date)
val decimalStyle = CellStyle.default.numFmt(NumFmt.Number)

sheet
  .put(ref"A1", CellValue.Text("Revenue"))
  .put(ref"B1", CellValue.DateTime(LocalDate.of(2025, 11, 10)))
  .withCellStyle(ref"B1", dateStyle)    // Must manually set format
  .put(ref"C1", CellValue.Number(BigDecimal("1000000.50")))
  .withCellStyle(ref"C1", decimalStyle) // Must manually set format

Batch put with codecs (clean, automatic):

import com.tjclp.xl.*
import com.tjclp.xl.codec.syntax.*

sheet.put(
  ref"A1" -> "Revenue",                         // String
  ref"B1" -> LocalDate.of(2025, 11, 10),        // Auto: date format
  ref"C1" -> BigDecimal("1000000.50")           // Auto: number format
)

Benefits:

  • Auto‑inferred number formats via CellCodec.
  • Type‑safe (compile error if unsupported type).
  • Single bulk update instead of many tiny ones.
  • Cleaner code.

4. Stream from Source (Don't Materialize First)

Bad (materializes in memory first):

val allRows = database.fetchAll()  // Load 1M rows into memory
val cells = allRows.map(row => /* convert to cells */)
sheet.put(cells)  // Then write

Good (stream directly):

database.stream()  // fs2.Stream[IO, Row]
  .map(row => RowData(/* convert */))
  .through(excel.writeStream(path, "Data"))
  .compile.drain
  .unsafeRunSync()

Impact: Constant memory regardless of source size


5. Use Iterator for Lazy Range Operations

Current (after lazy optimizations):

// Already optimized! range.cells returns Iterator
sheet.fillBy(range"A1:Z100") { (col, row) =>
  CellValue.Text(s"${col.toLetter}${row.index1}")
}
// No intermediate Vector allocation

Before optimization (old code):

val newCells = range.cells.map(...).toVector  // ❌ Unnecessary
sheet.put(newCells)

Impact: 20-30% memory reduction for large ranges (already fixed in latest version)


Performance Benchmarks

Actual Measurements (100k rows)

XL In-Memory:

  • Write: ~1.1s @ ~100MB memory
  • Read: ~1.8s @ ~100MB memory

XL Streaming:

  • Write: ~1.1s @ ~10MB memory (constant)
  • Read: ~1.8s @ ~10MB memory (constant, P6.6 complete)

Apache POI SXSSF (streaming):

  • Write: ~5s @ ~800MB memory
  • Read: ~8s @ ~1GB memory

Improvement:

  • XL write: 4.5x faster, 80x less memory
  • XL read: 4.4x faster, 80x less memory

Common Performance Pitfalls

Pitfall 1: Fold with Put Instead of PutAll

// ❌ Bad: O(n) sheet copies
val result = data.foldLeft(sheet) { (s, item) =>
  s.put(ref"A${item.id}", item.name)
}

// ✅ Good: Single sheet copy
val cells = data.map(item => Cell(ref"A${item.id}", CellValue.Text(item.name)))
sheet.put(cells)

Impact: 10-20x faster for large datasets

Pitfall 2: Calling usedRange Repeatedly

// ❌ Bad: Recomputes 4 times (was even worse before optimization)
for _ <- 1 to 4 do
  println(sheet.usedRange)  // Traverses all cells each time

// ✅ Good: Compute once
val range = sheet.usedRange
for _ <- 1 to 4 do
  println(range)

Note: usedRange is now single-pass (75% faster after lazy optimizations), but still recomputes on every call.

Pitfall 3: Using the Wrong Mode for Feature Needs

// ✅ Good: Streaming read for sequential ETL (constant memory)
excel.readStream(largeFile)

// ✅ Good: In-memory read when you need styles/metadata
excel.read(largeFile)

Status: Streaming read is fixed (P6.6); choose based on feature needs, not memory concerns.


When to Optimize

Don't Optimize (Good Enough)

  • < 1000 rows: Any approach works fine
  • Simple scripts: Clarity > performance
  • Prototyping: Optimize later

Consider Optimizing

  • 10k-100k rows: Use batching (putAll, putMixed)
  • Complex styling: Use Patch composition
  • Repeated operations: Cache usedRange, style registries

Must Optimize

  • 100k+ rows: Use streaming write
  • Limited memory: Constant-memory critical
  • Production ETL: Performance matters

Memory Budget Calculator

Rule of Thumb: 1 cell ≈ 100 bytes in memory (JVM overhead)

10k cells:    ~1 MB
100k cells:   ~10 MB
1M cells:     ~100 MB
10M cells:    ~1 GB (use streaming!)

Factors that increase memory:

  • Rich styles: +50 bytes per styled cell
  • Merged cells: +200 bytes per merge
  • Rich text: +100 bytes per additional run
  • Formulas: +150 bytes per formula cell

Performance Roadmap

Current (As of 2025-11-20)

  • ✅ Streaming write: O(1) memory, ~88k rows/sec
  • ✅ Streaming read: O(1) memory via fs2.io.readInputStream (+ SST materialized once)
  • ✅ Lazy optimizations: 30-75% speedup (memoized canonicalKey, single-pass usedRange)

P6.6 (complete)

  • ✅ Fix streaming read to O(1) memory
  • ✅ Add memory regression tests

P6.7 (1 day)

  • ✅ DEFLATED compression by default (5-10x smaller files)
  • ✅ Configurable compression mode

P7.5 (3-4 weeks)

  • ✅ Two-phase streaming writer with full SST/styles
  • ✅ O(1) memory for writes with rich formatting
  • ✅ Disk-backed SST for very large datasets

Future (Post-1.0)

  • ⬜ Parallel XML serialization (4-8 threads)
  • ⬜ JMH benchmark suite vs POI
  • ⬜ Automatic query optimization (builder pattern)

Quick Reference

Small Workbooks (<10k rows)

// Use simple in-memory API
val sheet = Sheet("Data")
  .put(ref"A1", "Title")
  .put(ref"B1", 100)

ExcelIO.instance.write[IO](Workbook(Vector(sheet)), path).unsafeRunSync()

Medium Workbooks (10k-100k rows)

// Use batching
import com.tjclp.xl.codec.syntax.*

val cells = (1 to 100_000).map(i =>
  ref"A$i" -> s"Row $i"
)
val sheet = Sheet("Data").put(cells*)

ExcelIO.instance.write[IO](Workbook(Vector(sheet)), path).unsafeRunSync()

Large Workbooks (100k+ rows, minimal styling)

// Use streaming write
import com.tjclp.xl.io.Excel
import fs2.Stream

Stream.range(1, 1_000_001)
  .map(i => RowData(i, Map(0 -> CellValue.Text(s"Row $i"))))
  .through(Excel.forIO.writeStream(path, "Data"))
  .compile.drain
  .unsafeRunSync()

Reading Any Size (Streaming)

// Use streaming read for constant memory (O(1))
Excel.forIO.readStream(path)
  .evalMap(row => IO.println(row))
  .compile.drain
  .unsafeRunSync()

Related Documentation