Transferring Files

Transfer files to/from birdstore

  • Download folders: `scp -r <username>@birdstore.dk.ucsd.edu:<server_data_dir> <local_data_dir>`
  • Download files: `scp <username>@birdstore.dk.ucsd.edu:<server_file_path> <local_data_dir>/`
  • Upload folders: `scp -r <local_data_dir>/ <username>@birdstore.dk.ucsd.edu:<server_data_dir>`
  • Upload files: `scp <local_file_path> <username>@birdstore.dk.ucsd.edu:<server_data_dir>/`
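For example, downloading one folder might look like this (the username and paths below are placeholders):

```bash
# Hypothetical example: copy a data folder from birdstore to the local machine
scp -r alice@birdstore.dk.ucsd.edu:/data/MD585 ~/data/
```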

Transfer files to/from AWS S3

Method 1 (recommended): Use CrossFTP.

Method 2: Using the command-line tool

  1. Install the AWS command-line tool: https://aws.amazon.com/cli/
  2. Run `aws configure` and enter the Access Key ID and Secret Access Key from the credentials file (datauser_credentials.csv). Set the region to `us-west-1`.
  3. To download the images of a stack, run `aws s3 cp --recursive s3://mousebrainatlas-data/CSHL_data_processed/MD585/MD585_prep2_lossless_jpeg <local_folder>`.
  4. To upload, run `aws s3 cp <local_filepath> s3://mousebrainatlas-data/<s3_filepath>`.
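Put together, a command-line session might look like the following sketch (the upload paths are hypothetical):

```bash
# One-time setup: keys from datauser_credentials.csv, region us-west-1
aws configure

# Download the images of a stack (example from step 3 above)
aws s3 cp --recursive s3://mousebrainatlas-data/CSHL_data_processed/MD585/MD585_prep2_lossless_jpeg ./MD585_prep2_lossless_jpeg

# Upload a local file back to the bucket (hypothetical paths)
aws s3 cp ./results/MD585_scores.hdf s3://mousebrainatlas-data/CSHL_data_processed/MD585/MD585_scores.hdf
```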

Method 3: Using the web console (this method cannot download whole folders)

  1. Go to https://mousebrainatlas.signin.aws.amazon.com/console
  2. Log in as username: datauser, password:
  3. Choose "S3".
  4. The data are in the bucket called "mousebrainatlas-data".
  5. Click on a file, then click "Download".


Using AWS and CfnCluster

  • The nodewatcher running on the compute nodes is too aggressive in terminating idle compute nodes. Set the minimum fleet size to the desired size in order to keep the fleet alive for long durations.
  • The most efficient pipeline is to download a subset of the data from S3 to each compute node's /scratch, process the subset, upload the results to S3, then delete the local data and results (see the sketch after this list). The granularity of this pipeline should depend on the compute node's local storage size; if storage is very small, do this for every file. Compared with writing simultaneously to the shared NFS, this pipeline avoids write contention by writing to local scratch, and avoids the latency of reading from shared NFS by reading from local scratch as well.
  • Each notebook cell should be self-contained: it works by itself if the IPython notebook is restarted or the cluster is rebooted.
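A minimal sketch of that per-node pipeline, assuming a `files.txt` listing the S3 keys assigned to this node and a hypothetical `process_one.sh` processing script:

```bash
#!/bin/bash
set -e
BUCKET=s3://mousebrainatlas-data
mkdir -p /scratch/input /scratch/output

while read -r key; do
  aws s3 cp "$BUCKET/$key" /scratch/input/                    # download subset to local scratch
  ./process_one.sh /scratch/input /scratch/output             # process locally (hypothetical script)
  aws s3 cp /scratch/output/ "$BUCKET/results/" --recursive   # upload results to S3
  rm -rf /scratch/input/* /scratch/output/*                   # free scratch before the next subset
done < files.txt
```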

Understanding cfnCluster

https://github.com/awslabs/cfncluster/blob/master/docs/source/autoscaling.rst
http://cfncluster.readthedocs.io/en/latest/processes.html

Install cfnCluster

Set up an admin node; it can be a local machine or an AWS EC2 instance.

Install cfncluster on the admin node (https://github.com/awslabs/cfncluster), or run `sudo pip install cfncluster`. The current version is cfncluster-1.3.1 (as of 3/16/2017).

Reference: https://cfncluster.readthedocs.io/en/latest/getting_started.html
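The typical lifecycle from the admin node is roughly the following (a sketch; `mycluster` is a placeholder cluster name):

```bash
sudo pip install cfncluster     # install the CLI
cfncluster configure            # interactive; writes ~/.cfncluster/config
cfncluster create mycluster     # launch master node + compute fleet
cfncluster status mycluster     # check the stack status
cfncluster delete mycluster     # tear everything down when finished
```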

Create custom AMI for cfncluster nodes

http://cfncluster.readthedocs.io/en/latest/ami_customization.html

Create an EC2 instance using a Community AMI. An AMI with the required packages already installed, CFN_AMI11 (ami-62194a02), is now available in Community AMIs.

Run `cfncluster configure`, or edit the configuration file directly (http://cfncluster.readthedocs.io/en/latest/configuration.html):

```
custom_ami = ami-XXXXXXX
base_os = ubuntu14.04
compute_instance_type = m4.4xlarge
master_instance_type = m4.2xlarge
ebs_settings = custom
master_root_volume_size = 30
compute_root_volume_size = 30
volume_size = 50
vpc_id = ...                  # see instance description
master_subnet_id = ...        # see instance description
aws_access_key_id = ...       # use an access key, not IAM
aws_secret_access_key = ...
aws_region_name = ...
key_name = ...
```

The master node must be an on-demand instance; compute nodes can be spot instances.

The cluster name must satisfy the regular expression pattern `[a-zA-Z][-a-zA-Z0-9]*`.

Example output of `cfncluster create`:

```
Output:"MasterPublicIP"="52.53.116.181"
Output:"MasterPrivateIP"="172.31.21.42"
Output:"GangliaPublicURL"="http://52.53.116.181/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.21.42/ganglia/"
```

Timing

  • Creating the master node takes about 10 minutes.
  • Creating a compute node takes about 6 minutes.

Then access the master with `ssh -i aws/YuncongKey.pem ubuntu@<MasterPublicIP>`. You must specify `custom_ami` or `base_os`; otherwise you cannot SSH to either the master or the compute nodes.

Security groups: you must enable the default VPC and "AllowSSH22". Enable "Allow5000" for the Flask server and "Allow8888" for the Jupyter notebook.
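The same rules can also be added from the command line; a sketch, assuming an existing security group (`sg-XXXXXXXX` is a placeholder ID):

```bash
# Open SSH, the Flask port, and the Jupyter port on the security group
aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXX --protocol tcp --port 22   --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXX --protocol tcp --port 5000 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXX --protocol tcp --port 8888 --cidr 0.0.0.0/0
```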

Monitor Cluster

  • EC2 Console
  • Autoscaling Group Console
  • CloudFormation Console
  • Ganglia

Access Key

DO NOT put any file containing an access key in a GitHub repo; AWS will detect it and deactivate the key automatically.

Jupyter Notebook

Access from a browser at `https://<master node ip>:8888`.
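This assumes a notebook server is listening on that port on the master; a typical launch might be:

```bash
# On the master node (a sketch; add --certfile/--keyfile if serving over HTTPS)
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```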

Custom Bootstrap Actions

http://cfncluster.readthedocs.io/en/latest/pre_post_install.html

The bootstrap script on S3 must be publicly readable; otherwise `cfncluster create` will return a 403.
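One way to upload a bootstrap script and make it publicly readable is the `--acl` flag of `aws s3 cp` (the key name below is illustrative):

```bash
aws s3 cp set_env.sh s3://ucsd-mousebrainatlas-scripts/set_env.sh --acl public-read
```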

S3

Use this bucket policy to make objects public by default:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MakeItPublic",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::mousebrainatlas-data/*"
    }
  ]
}
```

Build Customized AMI

http://cfncluster.readthedocs.io/en/latest/ami_customization.html

Use a standalone instance; do not use the cluster's master or compute nodes.

The base CfnCluster AMI is often updated with new releases. This AMI has all of the components required for CfnCluster to function installed and configured. If you wish to customize an AMI for CfnCluster, you must start with this as the base.

  1. Find the AMI corresponding to the region you will use in the list at https://github.com/awslabs/cfncluster/blob/master/amis.txt.
  2. In the EC2 Console, choose "Launch Instance".
  3. Navigate to "Community AMIs" and enter the AMI ID for your region into the search box.
  4. Select the AMI, choose your instance type and properties, and launch your instance.
  5. Log into the instance using the ec2-user account and your SSH key.
  6. Customize the instance as required.
  7. Prepare the instance for AMI creation: `sudo /usr/local/sbin/ami_cleanup.sh`
  8. Stop the instance.
  9. Create a new AMI from the instance.
  10. Enter the AMI ID in the `custom_ami` field of your cluster configuration.

Expand Shared EBS

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-expand-volume.html#recognize-expanded-volume-linux

  1. Stop the instance.
  2. Modify the volume in the console.
  3. Restart the instance.
  4. `lsblk` shows the new size, but `df -h` still shows the old size; run `sudo resize2fs /dev/xvda1` to grow the filesystem.
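On the instance, the sequence looks like this (assuming an ext4 root filesystem on `/dev/xvda1`):

```bash
lsblk                      # the block device already reports the new size
df -h /                    # the filesystem still reports the old size
sudo resize2fs /dev/xvda1  # grow the ext filesystem to fill the volume
df -h /                    # should now reflect the new size
```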

NFS

Use a larger master instance type (m4.2xlarge); more memory on the NFS server gives a performance improvement. Runtime measured for the Align step of Global Align:

| Instance Type | Runtime      |
|---------------|--------------|
| t2.micro      | 2687 seconds |
| m4.2xlarge    | 671 seconds  |

Set the `async` option for NFS: edit `/etc/exports`, change `sync` to `async`, then restart the NFS server with `sudo service nfs-kernel-server restart`.
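A sketch of the `async` change, assuming the export line in `/etc/exports` contains the `sync` option:

```bash
sudo sed -i 's/\bsync\b/async/' /etc/exports   # flip sync to async
sudo service nfs-kernel-server restart         # restart the NFS server
```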

Ganglia

```bash
sudo apt-get install libapache2-mod-php7.0 php7.0-xml
sudo /etc/init.d/apache2 restart
```

Reference: http://blog.vuksan.com/2016/05/03/ganglia-webfrontend-ubuntu-1604-install-issue

Sun Grid Engine

Beginner Tutorial

Add user ubuntu to the list of grid managers:

```bash
sudo -i                                   # change to the super user
export SGE_ROOT=/opt/sge                  # set the $SGE_ROOT environment variable
/opt/sge/bin/lx-amd64/qconf -am ubuntu    # add ubuntu as a manager
```

Another simple way to run commands with admin permission is as the `sgeadmin` user:

```bash
sudo -u sgeadmin -i qconf -de ip-XXXXX.compute.internal

# Convenient alias:
alias sudosgeadmin="sudo -u sgeadmin -i"
```

Remove execution host from gridengine

  • first, you need to disable the host from queue to avoid any jobs to be allocated to this host qmod -d [email protected]
  • wait for jobs to be finished execution on this host, then kill the execution script qconf -ke thishost.com
  • remove it from the cluster, this opens an editor, just remove the lines referring to this host qconf -mq all.q
  • remove it from allhosts group, this also opens an editor, remove lines referring to this host qconf -mhgrp @allhosts
  • remove it from execution host list qconf -de thishost
  • I normally go to the host and delete the sge scripts as well
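The sequence above, consolidated (`HOST` is a placeholder hostname):

```bash
HOST=ip-172-31-0-1.us-west-1.compute.internal   # hypothetical execution host
qmod -d "all.q@$HOST"     # stop new jobs from being scheduled on the host
qconf -ke "$HOST"         # kill the execution daemon once running jobs finish
qconf -mq all.q           # editor opens: remove lines referring to the host
qconf -mhgrp @allhosts    # editor opens: remove the host from the group
qconf -de "$HOST"         # delete the host from the execution host list
```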

If you are still stuck deleting a host, grep for its hostname in `/opt/sge/default/spool` and remove the matching strings.

Reference: https://resbook.wordpress.com/2011/03/21/remove-execution-host-from-gridengine/

Performance Tuning

  • Set the minimum memory requirement for scheduling a job on a node to 5 GB: run `qconf -mc` and change the `0` under `mem_free` to `5G` (a non-interactive sketch follows this list).
  • Change the SGE scheduling interval: run `qconf -msconf` and change `schedule_interval` to `0:0:15`.
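If you prefer a non-interactive route for the `mem_free` change, `qconf` can reload the complex list from a file (a sketch; verify the column layout before editing):

```bash
qconf -sc > complexes.txt   # dump the current complex definitions
# edit complexes.txt by hand: change the default for mem_free from 0 to 5G
qconf -Mc complexes.txt     # reload the modified definitions
```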

Parallel environment

```bash
qsub -pe mpi %(jobs_per_node)d -V -l mem_free=60G -o %(stdout_log)s -e %(stderr_log)s %(script)s
```

(The `%(...)` fields are Python string-formatting placeholders filled in by the submitting code.)
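Filled in with concrete values, an invocation might look like this (the log paths and script name are hypothetical):

```bash
qsub -pe mpi 8 -V -l mem_free=60G -o /shared/logs/align.out -e /shared/logs/align.err /shared/scripts/align.sh
```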

Startup Script

https://ucsd-mousebrainatlas-scripts.s3.amazonaws.com/set_env.sh

```bash
echo "export RAW_DATA_DIR='/shared/data/CSHL_data'
export DATA_DIR='/shared/data/CSHL_data_processed'
export VOLUME_ROOTDIR='/shared/data/CSHL_volumes2'
export SCOREMAP_VIZ_ROOTDIR='/shared/data/CSHL_lossless_scoremaps_Sat16ClassFinetuned_v2'
export SVM_ROOTDIR='/shared/data/CSHL_patch_features_Sat16ClassFinetuned_v2_classifiers/'
export PATCH_FEATURES_ROOTDIR='/shared/data/CSHL_patch_features_Sat16ClassFinetuned_v2'
export SPARSE_SCORES_ROOTDIR='/shared/data/CSHL_patch_Sat16ClassFinetuned_v2_predictions'
export SCOREMAPS_ROOTDIR='/shared/data/CSHL_lossless_scoremaps_Sat16ClassFinetuned_v2'
export HESSIAN_ROOTDIR='/shared/data/CSHL_hessians/'
export REPO_DIR='/shared/MouseBrainAtlas'
export LABELING_DIR='/shared/CSHL_data_labelings_losslessAlignCropped'" >> /home/ubuntu/.bashrc
```
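New shells pick these variables up automatically; an already-open shell can load them with:

```bash
source /home/ubuntu/.bashrc
```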

GPU instance

Create custom GPU instance AMI for cfncluster nodes