Commit 6337bf9

add code samples to readme (#26)
* add code samples to readme * updated with param names
1 parent 235ce0d commit 6337bf9

1 file changed

Lines changed: 42 additions & 0 deletions

README.md

@@ -8,6 +8,24 @@ This is the client library backing the [Dataflux Dataset for Pytorch](https://gi

The fast list component of this client leverages Python multiprocessing to parallelize the listing of files within a GCS bucket. It does this by implementing a work-stealing algorithm, in which each worker in the list operation can steal work from its siblings once it has finished all of its currently slated listing work. This parallelization yields real-world listing speeds up to 10 times faster than sequential listing. Note that parallelization is limited by the machine on which the client runs, and optimal performance is typically found with a worker count that is 1:1 with the available cores. Benchmarking has demonstrated that the larger the object count, the better Dataflux performs when compared to a linear listing.

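To make the work-stealing idea concrete, here is a toy sketch. It is not the Dataflux implementation: it uses threads instead of multiprocessing, a fixed set of tasks instead of dynamically discovered listing ranges, and the function name `run_work_stealing` is hypothetical. Each worker drains its own queue first and only then takes work from a sibling.

```python
import threading
from collections import deque


def run_work_stealing(task_lists):
    """Toy work-stealing loop (hypothetical helper, not part of dataflux_core).

    task_lists holds one list of tasks per worker. Tasks must not be None.
    Returns a list of (worker_id, task) pairs in completion order.
    """
    queues = [deque(tasks) for tasks in task_lists]
    locks = [threading.Lock() for _ in queues]
    results = []
    results_lock = threading.Lock()

    def take_task(worker_id):
        # Prefer the front of our own queue.
        with locks[worker_id]:
            if queues[worker_id]:
                return queues[worker_id].popleft()
        # Own queue is empty: steal from the back of a sibling's queue.
        for sibling, queue in enumerate(queues):
            if sibling == worker_id:
                continue
            with locks[sibling]:
                if queue:
                    return queue.pop()
        return None  # No work left anywhere.

    def worker(worker_id):
        while True:
            task = take_task(worker_id)
            if task is None:
                return
            # Stand-in for the real work (for example, listing one range).
            with results_lock:
                results.append((worker_id, task))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(len(queues))
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because an idle worker takes tasks from busy siblings, an uneven initial split of work does not leave workers waiting while others are still busy.
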

### Example Code
```python
from dataflux_core import fast_list

# Number of parallel listing workers.
number_of_workers = 20
project = "MyProject"
bucket = "TargetBucket"
# Only objects whose names start with this prefix are listed.
target_folder_prefix = "folder1/"

print("Fast list operation starting...")
list_result = fast_list.ListingController(
    max_parallelism=number_of_workers,
    project=project,
    bucket=bucket,
    prefix=target_folder_prefix,
).run()
```
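
Since optimal performance is typically found with one worker per available core, the worker count can be derived from the machine instead of being hard-coded. The following is a minimal sketch; `suggested_worker_count` is a hypothetical helper, not part of `dataflux_core`, and capping a requested value at the core count is an assumption made for illustration.

```python
import os


def suggested_worker_count(requested=None):
    """Return a worker count of at most one worker per available core.

    Hypothetical helper: os.cpu_count() may return None, so fall back to 1.
    """
    available = os.cpu_count() or 1
    if requested is None:
        return available
    # Keep the count within [1, available].
    return max(1, min(requested, available))
```

The result could then be passed as `max_parallelism` in place of the fixed `number_of_workers` value.
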
#### Storage Class

By default, fast list will only list objects of STANDARD class in GCS buckets. This can be overridden by passing a list of storage class names (as strings) to include when running the Listing Controller. Note that this default behavior was chosen to avoid the costs associated with downloading objects of non-standard GCS storage classes. Details on GCS storage classes can be found in the [Storage Class Documentation](https://cloud.google.com/storage/docs/storage-classes).
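
The effect of the storage class setting can be illustrated with a small standalone filter. This sketch does not call `dataflux_core`, and the function and constant names are hypothetical; it only shows the selection rule described above, assuming each object is represented as a `(name, storage_class)` pair.

```python
# Hypothetical default mirroring the behavior described above.
DEFAULT_ALLOWED_CLASSES = ("STANDARD",)


def filter_by_storage_class(objects, allowed_classes=DEFAULT_ALLOWED_CLASSES):
    """Keep the names of objects whose storage class is in allowed_classes.

    objects is an iterable of (name, storage_class) pairs (an assumed shape).
    """
    allowed = set(allowed_classes)
    return [name for name, storage_class in objects if storage_class in allowed]
```

Passing a longer list, such as `["STANDARD", "NEARLINE"]`, widens the set of objects that are kept.
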
@@ -25,6 +43,30 @@ By default, fast list will only list objects of STANDARD class in GCS buckets. T

The compose download component of the client uses the results of the fast list to efficiently download the files necessary for a machine learning workload. When downloading files from remote stores, small file size often bottlenecks the speed at which files can be downloaded. To avoid this bottleneck, compose download leverages the ability of GCS buckets to concatenate small files into larger composed files in GCS prior to downloading. This greatly improves download performance, particularly on datasets with very large numbers of small files.

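One way to picture the composition step is as a batching problem: group small objects until a size limit is reached, then compose each group into one object. The sketch below is an illustration only, not the library's algorithm. `plan_compose_batches` is a hypothetical name, the `(name, size)` input shape is an assumption, and the default of 32 components per batch reflects the GCS limit on source objects in a single compose request.

```python
def plan_compose_batches(objects, max_compose_bytes, max_components=32):
    """Greedily group (name, size) pairs into batches for composition.

    A batch is closed when adding the next object would exceed
    max_compose_bytes or the batch already holds max_components objects.
    An object larger than the limit ends up alone in its own batch.
    """
    batches = []
    current, current_size = [], 0
    for name, size in objects:
        too_big = current_size + size > max_compose_bytes
        too_many = len(current) >= max_components
        if current and (too_big or too_many):
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With a limit of 0, every object lands in its own batch, so nothing is composed.
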

### Example Code
```python
from dataflux_core import download

# The maximum size in bytes of a composite download object.
# If this value is set to 0, no composition will occur.
max_compose_bytes = 10000000
project = "MyProject"
bucket = "TargetBucket"

download_params = download.DataFluxDownloadOptimizationParams(
    max_compose_bytes
)

print("Download operation starting...")
download_result = download.dataflux_download(
    project_name=project,
    bucket_name=bucket,
    # The objects parameter takes list_result, the value returned by fast list in the previous code example.
    objects=list_result,
    dataflux_download_optimization_params=download_params,
)
```
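
After a download, it is often convenient to look up contents by object name. The sketch below rests on assumptions that the text above does not confirm: that the listing is a sequence of `(name, size)` pairs and that the downloaded contents are a sequence of `bytes` in the same order. `pair_names_with_contents` is a hypothetical helper.

```python
def pair_names_with_contents(listing, contents):
    """Map each object name to its downloaded bytes.

    Assumes listing is a sequence of (name, size) pairs and contents is a
    sequence of bytes in the same order (both are assumptions).
    """
    if len(listing) != len(contents):
        raise ValueError("listing and contents must have the same length")
    return {name: data for (name, _size), data in zip(listing, contents)}
```

If these assumptions hold for your version of the library, `pair_names_with_contents(list_result, download_result)` would give a name-to-bytes dictionary.
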
#### Multiple Download Options

Looking at the [download code](dataflux_core/download.py), you will notice three distinct download functions. The default function used in the dataflux-pytorch client is `dataflux_download`. The other functions serve to improve performance for specific use cases.
