
Hardware Provisioning

Takenori Sato edited this page Jan 13, 2016 · 4 revisions

Let's follow the Hardware Provisioning guide in the Spark documentation, and check each item from HyperStore's point of view.

Storage Systems

The first question to ask is whether Spark and Cloudian HyperStore share the same cluster.

| Cluster Sharing | Data Locality | Hadoop FileSystem | Notes |
|-----------------|---------------|-------------------|-------|
| YES | node | hsfs | managed chunk size |
| YES | rack | s3a | server-side encryption, erasure coding |
| NO  | none | s3a | running Spark on Hadoop (e.g. HDP, CDH) |

Depending on your use cases, there is a trade-off between performance and features, so you may need to consider what to compromise and what to achieve.
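To make the table concrete, here is a minimal Python sketch of how the deployment mode maps to the filesystem scheme used in input paths. The bucket and key names are hypothetical; only the `hsfs`/`s3a` schemes come from the table above.

```python
# Sketch: choose the Hadoop FileSystem scheme based on how Spark is
# deployed relative to the HyperStore cluster. Bucket/key names below
# are hypothetical examples.

def object_path(bucket, key, shared_cluster=True, node_locality=True):
    """Build the input path to hand to Spark (e.g. sc.textFile)."""
    if shared_cluster and node_locality:
        scheme = "hsfs"  # co-located with node locality: read directly from HyperStore
    else:
        scheme = "s3a"   # rack locality or separate cluster: go through the S3 API
    return f"{scheme}://{bucket}/{key}"

print(object_path("logs", "2016/01/access.log"))
print(object_path("logs", "2016/01/access.log",
                  shared_cluster=False, node_locality=False))
```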

Local Disks

Depending on the job, Spark temporarily writes intermediate files to local disk, so it is a good idea to have dedicated disks for Spark. According to the doc:

> 4-8 disks per node, configured without RAID

If it is affordable, please follow this suggestion.

But in a Cloudian HyperStore cluster, all the disk slots are usually occupied by HyperStore-related services. Since this is highly dependent on the job, you can leave it as the default setting (Spark writes to /tmp).
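If dedicated disks do become available, they can be handed to Spark through the standard `spark.local.dir` property, which takes a comma-separated list of directories that Spark spreads its temporary files across. A minimal sketch, assuming hypothetical mount points:

```python
# Sketch: point Spark's shuffle/spill space at dedicated disks.
# "spark.local.dir" is a standard Spark property; the mount points
# below are hypothetical examples.

dedicated_disks = ["/mnt/spark1", "/mnt/spark2", "/mnt/spark3", "/mnt/spark4"]

conf = {
    # Comma-separated list; Spark round-robins temp files across the entries.
    "spark.local.dir": ",".join(dedicated_disks),
}
print(conf["spark.local.dir"])
```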

Memory

You can set how much memory a Spark worker uses. If you assign 4GB to a worker and each executor uses 1GB, then up to 4 executors can run at the same time. Thus, the amount of memory determines parallelism as well.

So, simply add more RAM to meet your requirements. But as the doc suggests, reserve at least 25% of it for the OS. For example, if you add 32GB, assign only 24GB to Spark, and reserve 8GB for the OS.

In Cloudian HyperStore, most of the RAM is used by the OS as file cache, typically for the SSTables managed by Cassandra. So you may see some performance degradation if you assign too much memory to Spark.
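The sizing arithmetic above can be sketched in a few lines; the 4GB/1GB and 32GB figures are the ones from the text.

```python
# Sketch of the memory sizing arithmetic from the text.
# All sizes are whole gigabytes.

# Executors per worker, bounded by memory alone:
worker_memory = 4    # GB assigned to one Spark worker
executor_memory = 1  # GB per executor
max_executors = worker_memory // executor_memory
print(max_executors)  # 4

# Reserving 25% of added RAM for the OS (file cache for SSTables):
total_ram = 32
os_reserve = total_ram // 4          # 8 GB for the OS
spark_share = total_ram - os_reserve  # 24 GB for Spark
print(spark_share)  # 24
```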

CPU cores

Like memory, you can set how many cores a Spark worker uses. So if you assign only 2 cores in the example above, then just 2 of the 4 executors can launch, not all 4. Thus, memory and cores have to be set consistently.

Simply add more cores for Spark, and assign them.
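Putting memory and cores together: the number of executors a worker can launch is bounded by whichever resource runs out first. A small sketch of that constraint:

```python
# Sketch: executors per worker are bounded by BOTH memory and cores.
# Defaults of 1 GB / 1 core per executor match the example in the text.

def max_executors(worker_mem_gb, worker_cores, exec_mem_gb=1, exec_cores=1):
    """Executors that fit, limited by the scarcer of memory and cores."""
    by_memory = worker_mem_gb // exec_mem_gb
    by_cores = worker_cores // exec_cores
    return min(by_memory, by_cores)

# 4 GB worker allows 4 executors by memory, but 2 cores allow only 2:
print(max_executors(4, 2))  # 2
# With 4 cores, memory and cores agree:
print(max_executors(4, 4))  # 4
```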

Network

In a Cloudian HyperStore cluster, there are usually two networks. One is the frontend network, used by the S3 and Admin services; the other is the backend network, used by HyperStore, Cassandra, and Redis. We recommend 1Gbps for the frontend, and 10Gbps or more for the backend.

Note that hsfs retrieves objects directly from HyperStore, so it uses the 10Gbps backend network, and it is also designed to minimize network transfers. By contrast, s3a retrieves objects through S3 on the 1Gbps frontend network.

So if the network can become a bottleneck, hsfs is highly recommended.
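A rough back-of-the-envelope comparison shows why the link speed matters. This sketch ignores protocol overhead and assumes the full link is available to the job:

```python
# Sketch: idealized time to move a dataset over each network,
# ignoring protocol overhead and contention.

def transfer_seconds(dataset_gb, link_gbps):
    """Gigabytes * 8 bits/byte, divided by link speed in Gbps."""
    return dataset_gb * 8 / link_gbps

dataset_gb = 100
print(transfer_seconds(dataset_gb, 1))   # 1 Gbps frontend (s3a path)
print(transfer_seconds(dataset_gb, 10))  # 10 Gbps backend (hsfs path)
```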
