perf: implement batch processing
arshad-yaseen committed Jan 9, 2025
1 parent 22f3781 commit 701eec8
Showing 6 changed files with 196 additions and 78 deletions.
5 changes: 1 addition & 4 deletions CHANGELOG.md
@@ -1,11 +1,8 @@


## [0.1.4](https://github.com/arshad-yaseen/pdf-to-images-browser/compare/0.1.3...0.1.4) (2025-01-07)


### 🔧 Maintenance

* improve bundle ([5ee399a](https://github.com/arshad-yaseen/pdf-to-images-browser/commit/5ee399a316f06992ca74614e2cd3c275430514f7))
- improve bundle ([5ee399a](https://github.com/arshad-yaseen/pdf-to-images-browser/commit/5ee399a316f06992ca74614e2cd3c275430514f7))

## [0.1.3](https://github.com/arshad-yaseen/pdf-to-images-browser/compare/0.1.2...0.1.3) (2025-01-07)

35 changes: 28 additions & 7 deletions README.md
@@ -16,6 +16,7 @@ A lightweight, browser-based library for converting PDF files to images with eas
- 📦 Multiple output formats (PNG/JPEG) and types (base64, buffer, blob, dataURL)
- ⚡ Convert specific pages or page ranges
- 🛡️ Robust error handling and TypeScript support
- 🧠 Memory efficient with batch processing and cleanup

[Demo](https://pdf-to-images-browser.arshadyaseen.com/)

@@ -83,13 +84,16 @@ The PDF document to convert. Accepts:

Optional configuration object with the following properties:

| Option | Type | Default | Description |
| ----------- | --------------------------------------------- | ----------- | ---------------------------------- |
| `format` | `'png' \| 'jpg'` | `'png'` | Output image format |
| `scale` | `number` | `1.0` | Scale factor for the output images |
| `pages` | `PDFPageSelection` | `'all'` | Which pages to convert |
| `output` | `'buffer' \| 'base64' \| 'blob' \| 'dataurl'` | `'base64'` | Output format |
| `docParams` | `PDFDocumentParams` | `undefined` | Additional PDF.js parameters |
| Option | Type | Default | Description |
| ------------ | --------------------------------------------- | ----------- | ------------------------------------ |
| `format` | `'png' \| 'jpg'` | `'png'` | Output image format |
| `scale` | `number` | `1.0` | Scale factor for the output images |
| `pages` | `PDFPageSelection` | `'all'` | Which pages to convert |
| `output` | `'buffer' \| 'base64' \| 'blob' \| 'dataurl'` | `'base64'` | Output format |
| `docParams` | `PDFDocumentParams` | `undefined` | Additional PDF.js parameters |
| `batchSize` | `number` | `5` | Number of pages to process per batch |
| `batchDelay` | `number` | `100` | Delay in ms between batches |
| `onProgress` | `function` | `undefined` | Progress callback function |

### Page Selection Options

@@ -180,6 +184,23 @@ const bufferImages = await pdfToImages(pdfFile, {
});
```

### Using Batch Processing

```typescript
// Process 3 pages at a time with progress updates
const images = await pdfToImages(pdfFile, {
  batchSize: 3,
  batchDelay: 50,
  onProgress: ({completed, total, batch}) => {
    console.log(`Processed ${completed} of ${total} pages`);
    // Handle new batch of images
    batch.forEach(image => {
      // Process each image in the batch
    });
  },
});
```

## Error Handling

The library throws specific errors that you can catch and handle:
102 changes: 43 additions & 59 deletions src/core.ts
@@ -1,24 +1,16 @@
import {getDocument} from 'pdfjs-dist';
import type {
DocumentInitParameters,
PDFDocumentProxy,
} from 'pdfjs-dist/types/src/display/api';
import type {DocumentInitParameters} from 'pdfjs-dist/types/src/display/api';

import {
CanvasRenderingError,
InvalidOutputOptionError,
InvalidPagesOptionError,
} from './errors';
import {InvalidPagesOptionError} from './errors';
import type {PDFSource, PDFToImagesOptions, PDFToImagesResult} from './types';
import {
configurePDFToImagesParameters,
convertPDFBase64ToBuffer,
extractBase64FromDataURL,
generatePDFPageRange,
renderPDFPageToImage,
} from './utils';

/**
* Converts a PDF document to an array of images.
* Converts a PDF document to an array of images with improved performance.
* @param source - The PDF source to convert.
* @param options - Optional configuration options for the conversion.
* @returns A promise that resolves to an array of images.
@@ -38,6 +30,8 @@ async function processPDF(
documentParams: DocumentInitParameters,
options: PDFToImagesOptions,
): Promise<(string | Blob | ArrayBuffer)[]> {
const {batchSize = 5, batchDelay = 100, onProgress} = options;

const pdfDoc = await getDocument(documentParams).promise;
const numPages = pdfDoc.numPages;
const pages = options.pages || 'all';
@@ -62,55 +56,45 @@ async function processPDF(
throw new InvalidPagesOptionError();
}

const images = [];
for (const pageNumber of pageNumbers) {
const image = await renderPageToImage(pdfDoc, pageNumber, options);
images.push(image);
// Yield to event loop to prevent UI blocking
await new Promise(resolve => setTimeout(resolve, 0));
}
const allImages: (string | Blob | ArrayBuffer)[] = [];
const totalPages = pageNumbers.length;

return images;
}
// Process pages in batches
for (let i = 0; i < pageNumbers.length; i += batchSize) {
const batchPageNumbers = pageNumbers.slice(i, i + batchSize);
const batchPromises = batchPageNumbers.map(pageNumber =>
renderPDFPageToImage(pdfDoc, pageNumber, options),
);

async function renderPageToImage(
pdfDoc: PDFDocumentProxy,
pageNumber: number,
options: PDFToImagesOptions,
): Promise<string | Blob | ArrayBuffer> {
const {scale = 1.0, format = 'png', output = 'base64'} = options;

const page = await pdfDoc.getPage(pageNumber);
const viewport = page.getViewport({scale});

const canvas = document.createElement('canvas');
const context = canvas.getContext('2d') as CanvasRenderingContext2D;

canvas.height = viewport.height;
canvas.width = viewport.width;

const renderContext = {canvasContext: context, viewport};

await page.render(renderContext).promise;

const mimeType = format === 'jpg' ? 'image/jpeg' : 'image/png';
const dataURL = canvas.toDataURL(mimeType);

switch (output) {
case 'dataurl':
return dataURL;
case 'base64':
return extractBase64FromDataURL(dataURL);
case 'buffer':
return convertPDFBase64ToBuffer(extractBase64FromDataURL(dataURL));
case 'blob':
return new Promise<Blob>((resolve, reject) => {
canvas.toBlob(
blob => (blob ? resolve(blob) : reject(new CanvasRenderingError())),
mimeType,
);
// Process batch concurrently
const batchResults = await Promise.all(batchPromises);

// Clean up previous batch's canvases to free memory
if (typeof window !== 'undefined') {
batchPromises.length = 0;
await new Promise(resolve => setTimeout(resolve, 0)); // Yield to GC (just a hint)
}

allImages.push(...batchResults);

// Report progress if callback provided
if (onProgress) {
onProgress({
completed: Math.min(i + batchSize, totalPages),
total: totalPages,
batch: batchResults,
});
default:
throw new InvalidOutputOptionError();
}

batchResults.length = 0;

// Prevent UI blocking between batches
if (i + batchSize < pageNumbers.length) {
await new Promise(resolve => setTimeout(resolve, batchDelay));
}
}

pageNumbers.length = 0;

return allImages;
}
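
The batch loop in `processPDF` can be read in isolation as the following DOM-free sketch. `processInBatches` and `BatchProgress` are hypothetical names used only for illustration and are not part of the library; the `batchSize` guard is also an addition, since the loop as written in the diff would never advance if `batchSize` were `0`.

```typescript
// Hypothetical helper (not part of pdf-to-images-browser): the same
// slice / Promise.all / progress / delay pattern as the loop in processPDF.
type BatchProgress<R> = {completed: number; total: number; batch: R[]};

async function processInBatches<T, R>(
  items: readonly T[],
  worker: (item: T) => Promise<R>,
  batchSize = 5,
  batchDelay = 100,
  onProgress?: (progress: BatchProgress<R>) => void,
): Promise<R[]> {
  // Assumption: reject sizes that would stall the loop (i += 0 never ends).
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }

  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    // Items within one batch run concurrently; batches run one after another.
    const batch = await Promise.all(items.slice(i, i + batchSize).map(worker));
    results.push(...batch);

    // The last batch may be short, so clamp the completed count to the total.
    onProgress?.({
      completed: Math.min(i + batchSize, items.length),
      total: items.length,
      batch,
    });

    // Pause only between batches, not after the final one.
    if (i + batchSize < items.length) {
      await new Promise<void>(resolve => setTimeout(resolve, batchDelay));
    }
  }
  return results;
}
```

Because `Promise.all` preserves input order, the returned array follows the order of `items` even when the workers inside a batch finish out of order.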
26 changes: 25 additions & 1 deletion src/types.ts
@@ -1,9 +1,33 @@
import type {DocumentInitParameters} from 'pdfjs-dist/types/src/display/api';

/**
* Configuration for batch processing of PDF pages
*/
export interface BatchProcessingConfig {
/**
* Number of pages to process in each batch
* @default 5
*/
batchSize?: number;
/**
* Callback for progress updates
*/
onProgress?: (progress: {
completed: number;
total: number;
batch: (string | Blob | ArrayBuffer)[];
}) => void;
/**
* Time in ms to wait between batches to prevent UI blocking
* @default 100
*/
batchDelay?: number;
}
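
One detail of the `onProgress` contract is worth illustrating: in `src/core.ts` the loop sets `batchResults.length = 0` right after the callback returns, so a handler that keeps a reference to `batch` and reads it later may find it empty. The sketch below copies the items during the call. `createProgressCollector` and `Progress` are hypothetical names, and the generic parameter stands in for `string | Blob | ArrayBuffer`.

```typescript
// Hypothetical consumer of the onProgress payload (illustration only).
type Progress<T> = {completed: number; total: number; batch: T[]};

function createProgressCollector<T>() {
  const collected: T[] = [];
  let percent = 0;

  const onProgress = ({completed, total, batch}: Progress<T>): void => {
    // Copy the items now; the caller may clear `batch` after this returns.
    collected.push(...batch);
    // Guard against division by zero for an empty page selection.
    percent = total === 0 ? 100 : Math.round((completed / total) * 100);
  };

  return {
    onProgress,
    getImages: (): readonly T[] => collected,
    getPercent: (): number => percent,
  };
}
```

A handler that defers its read, for example inside a state-updater function that runs later, could apply the same idea by taking a copy such as `[...batch]` before scheduling the update.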

/**
* Configuration options for converting PDF to images.
*/
export type PDFToImagesOptions = {
export type PDFToImagesOptions = BatchProcessingConfig & {
/**
* Output image format - either PNG or JPEG
* @default 'png'
80 changes: 79 additions & 1 deletion src/utils.ts
@@ -1,6 +1,10 @@
import type {DocumentInitParameters} from 'pdfjs-dist/types/src/display/api';
import type {
DocumentInitParameters,
PDFDocumentProxy,
} from 'pdfjs-dist/types/src/display/api';

import {DEFAULT_PDF_TO_IMAGES_OPTIONS} from './constants';
import {CanvasRenderingError, InvalidOutputOptionError} from './errors';
import type {PDFSource, PDFToImagesOptions} from './types';

export function extractBase64FromDataURL(dataURL: string): string {
@@ -52,3 +56,77 @@ export function configurePDFToImagesParameters(

return {documentParams, opts};
}

export async function renderPDFPageToImage(
pdfDoc: PDFDocumentProxy,
pageNumber: number,
options: PDFToImagesOptions,
): Promise<string | Blob | ArrayBuffer> {
const {scale = 1.0, format = 'png', output = 'base64'} = options;

const page = await pdfDoc.getPage(pageNumber);
const viewport = page.getViewport({scale});

// Create canvas
const canvas = document.createElement('canvas');
const context = canvas.getContext('2d', {
alpha: false, // Optimize for non-transparent images
}) as CanvasRenderingContext2D;

canvas.height = viewport.height;
canvas.width = viewport.width;

// Render PDF page
const renderContext = {
canvasContext: context,
viewport,
enableWebGL: true, // Enable WebGL rendering if available
};

await page.render(renderContext).promise;

// Convert to desired format
const mimeType = format === 'jpg' ? 'image/jpeg' : 'image/png';
const result = await processCanvasOutput(canvas, mimeType, output);

// Clean up
canvas.width = 0;
canvas.height = 0;

// Help browser GC the canvas
if (typeof window !== 'undefined') {
await new Promise(resolve => setTimeout(resolve, 0));
}

return result;
}

export async function processCanvasOutput(
canvas: HTMLCanvasElement,
mimeType: string,
output: PDFToImagesOptions['output'],
): Promise<string | Blob | ArrayBuffer> {
try {
switch (output) {
case 'dataurl':
return canvas.toDataURL(mimeType);
case 'base64':
return extractBase64FromDataURL(canvas.toDataURL(mimeType));
case 'buffer': {
const base64 = extractBase64FromDataURL(canvas.toDataURL(mimeType));
return convertPDFBase64ToBuffer(base64);
}
case 'blob':
return new Promise<Blob>((resolve, reject) => {
canvas.toBlob(
blob => (blob ? resolve(blob) : reject(new CanvasRenderingError())),
mimeType,
);
});
default:
throw new InvalidOutputOptionError();
}
} catch (error) {
throw new CanvasRenderingError();
}
}
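
In `processCanvasOutput` the `default` branch throws `InvalidOutputOptionError` inside the `try`, so the surrounding `catch` replaces it with `CanvasRenderingError` and the caller cannot tell an unsupported `output` value from a canvas failure. A minimal sketch of one way to keep the two apart is shown below. The error classes here are local stand-ins for the ones in `./errors`, and `encodeOutput` with its `toDataURL` callback is a hypothetical, DOM-free simplification rather than the library's function.

```typescript
// Local stand-ins for the library's error classes (assumed shape).
class CanvasRenderingError extends Error {
  constructor(message = 'Canvas rendering failed') {
    super(message);
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class InvalidOutputOptionError extends Error {
  constructor(message = 'Invalid output option') {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical simplification: `toDataURL` stands in for canvas.toDataURL.
function encodeOutput(output: string, toDataURL: () => string): string {
  try {
    switch (output) {
      case 'dataurl':
        return toDataURL();
      case 'base64':
        // Drop the "data:<mime>;base64," prefix.
        return toDataURL().split(',')[1] ?? '';
      default:
        throw new InvalidOutputOptionError(`Unsupported output: ${output}`);
    }
  } catch (error) {
    // Re-throw option errors untouched; wrap everything else.
    if (error instanceof InvalidOutputOptionError) throw error;
    throw new CanvasRenderingError();
  }
}
```

An alternative with the same effect would be to validate `output` before entering the `try`, which keeps the `catch` focused on canvas export failures only.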
26 changes: 20 additions & 6 deletions tests/ui/app/components/demo.tsx
@@ -8,6 +8,10 @@ export default function Demo() {
const [images, setImages] = useState<string[]>([]);
const [loading, setLoading] = useState(false);
const [error, setError] = useState<string | null>(null);
const [progress, setProgress] = useState<{
completed: number;
total: number;
} | null>(null);

const handleFileChange = async (
event: React.ChangeEvent<HTMLInputElement>,
@@ -18,18 +22,26 @@ export default function Demo() {
try {
setLoading(true);
setError(null);
setImages([]);
setProgress(null);

const imageResults = await pdfToImages(file, {
await pdfToImages(file, {
format: 'png',
output: 'dataurl',
scale: 2,
scale: 1.5,
batchSize: 3, // Process 3 pages at a time
batchDelay: 50, // 50ms delay between batches
onProgress: ({completed, total, batch}) => {
setProgress({completed, total});
// Append new batch of images
setImages(prev => [...prev, ...(batch as string[])]);
},
});

setImages(imageResults as string[]);
} catch (err) {
setError(err instanceof Error ? err.message : 'Failed to convert PDF');
} finally {
setLoading(false);
setProgress(null);
}
};

@@ -51,15 +63,17 @@ export default function Demo() {

{loading && (
<div className="text-gray-600 dark:text-gray-400">
Converting PDF to images...
{progress
? `Converting pages ${progress.completed} of ${progress.total}...`
: 'Preparing PDF conversion...'}
</div>
)}

{error && <div className="text-red-500 dark:text-red-400">{error}</div>}
</div>

{images.length > 0 && (
<div className="grid gap-4 sm:grid-cols-1 md:grid-cols-2">
<div className="grid gap-4 sm:grid-cols-1 md:grid-cols-2 lg:grid-cols-3">
{images.map((image, index) => (
<div
key={index}