This repository was archived by the owner on Oct 12, 2023. It is now read-only.

Commit 20c86f1
Feature/custom package (#272)

* Added custom package script
* Added feature custom download
* Fixed typo
* Fixed directory for installation
* Fixed full folder directory
* Add dependencies and fix pattern
* Fix pattern not found
* Added repo
* Switching to devtools
* Fixing devtools install with directory
* Fix in for merger.R
* Working cluster custom packages
* Removed printed statements
* Working on custom docs
* Custom packages sample docs
* Fixed typo in azure files typo
* Fixed typos based on PR
1 parent d02599d commit 20c86f1

File tree

12 files changed: +194 −27 lines changed

R/cluster.R

Lines changed: 4 additions & 0 deletions
```diff
@@ -151,6 +151,10 @@ makeCluster <-
         "wget https://raw.githubusercontent.com/Azure/doAzureParallel/",
         "master/inst/startup/install_bioconductor.R"
       ),
+      paste0(
+        "wget https://raw.githubusercontent.com/Azure/doAzureParallel/",
+        "master/inst/startup/install_custom.R"
+      ),
       "chmod u+x install_bioconductor.R",
       installAndStartContainerCommand
     )
```

R/commandLineUtilities.R

Lines changed: 1 addition & 0 deletions
```diff
@@ -123,6 +123,7 @@ dockerRunCommand <-
   dockerOptions <-
     paste(
       dockerOptions,
+      "-e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR",
       "-e AZ_BATCH_TASK_ID=$AZ_BATCH_TASK_ID",
       "-e AZ_BATCH_JOB_ID=$AZ_BATCH_JOB_ID",
       "-e AZ_BATCH_TASK_WORKING_DIR=$AZ_BATCH_TASK_WORKING_DIR",
```
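The new flag forwards the node's shared directory into the container alongside the other Batch variables. A minimal sketch of how such an option string accumulates via repeated `paste()` (the variable names here are illustrative, not the package's exact internals):

```r
# Sketch: accumulating docker "-e" options by repeated paste(), the way
# dockerRunCommand extends dockerOptions (illustrative, not exact source).
dockerOptions <- "--rm"
dockerOptions <- paste(
  dockerOptions,
  "-e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR",
  "-e AZ_BATCH_TASK_ID=$AZ_BATCH_TASK_ID",
  "-e AZ_BATCH_JOB_ID=$AZ_BATCH_JOB_ID"
)

# The options land between "docker run" and the image name.
command <- paste("docker run", dockerOptions, "rocker/tidyverse:latest")
```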

docs/20-package-management.md

Lines changed: 48 additions & 23 deletions
````diff
@@ -38,29 +38,37 @@ You can install packages by specifying the package(s) in your JSON pool configur
 }
 ```
 
+## Installing Packages per-*foreach* Loop
+
+You can also install CRAN packages by using the **.packages** option in the *foreach* loop, and github/bioconductor packages by using the **github** and **bioconductor** options in the *foreach* loop. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration in the loop is run on your Azure cluster.
+
+### Installing a Github Package
+
+doAzureParallel supports github packages with the **github** option.
+
+Please do not use "https://github.com/" as a prefix for the github package name above.
+
 ## Installing packages from a private GitHub repository
 
-Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property. If this property is blank only public repositories can be used. If a token is added then public and the private github repo can be used together.
+Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property in the credentials file. If this property is blank, only public repositories can be used. If a token is added, public and private github repos can be used together.
 
 When the cluster is created the token is passed in as an environment variable called GITHUB\_PAT on start-up which lasts the life of the cluster and is looked up whenever devtools::install_github is called.
 
+Credentials file for the github authentication token:
+``` json
+{
+  ...
+  "githubAuthenticationToken": "",
+  ...
+}
+```
+
+Cluster file:
 ```json
 {
   {
-    "name": <your pool name>,
-    "vmSize": <your pool VM size name>,
-    "maxTasksPerNode": <num tasks to allocate to each node>,
-    "poolSize": {
-      "dedicatedNodes": {
-        "min": 2,
-        "max": 2
-      },
-      "lowPriorityNodes": {
-        "min": 1,
-        "max": 10
-      },
-      "autoscaleFormula": "QUEUE"
-    },
+    ...
     "rPackages": {
       "cran": [],
       "github": ["<project/some_private_repository>"],
````

````diff
@@ -71,10 +79,18 @@ When the cluster is created the token is passed in as an environment variable ca
 }
 ```
 
-_More information regarding github authentication tokens can be found [here](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)_
+_More information regarding github authentication tokens can be found [here](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)_
 
-## Installing Packages per-*foreach* Loop
-You can also install cran packages by using the **.packages** option in the *foreach* loop. You can also install github/bioconductor packages by using the **github** and **bioconductor" option in the *foreach* loop. Instead of installing packages during pool creation, packages (and its dependencies) can be installed before each iteration in the loop is run on your Azure cluster.
+### Installing Multiple Packages
+By using character vectors of the packages:
+
+```R
+number_of_iterations <- 10
+results <- foreach(i = 1:number_of_iterations,
+                   .packages = c('package_1', 'package_2'),
+                   github = c('Azure/rAzureBatch', 'Azure/doAzureParallel'),
+                   bioconductor = c('IRanges', 'Biobase')) %dopar% { ... }
+```
 
 To install a single cran package:
 ```R
````

````diff
@@ -94,7 +110,6 @@ number_of_iterations <- 10
 results <- foreach(i = 1:number_of_iterations, github='azure/rAzureBatch') %dopar% { ... }
 ```
 
-Please do not use "https://github.com/" as prefix for the github package name above.
 
 To install multiple github packages:
 ```R
````

````diff
@@ -114,7 +129,7 @@ number_of_iterations <- 10
 results <- foreach(i = 1:number_of_iterations, bioconductor=c('package_1', 'package_2')) %dopar% { ... }
 ```
 
-## Installing Packages from BioConductor
+## Installing a BioConductor Package
 The default deployment of R used in the cluster (see [Customizing the cluster](./30-customize-cluster.md) for more information) includes the Bioconductor installer by default. Simply add packages to the cluster by adding packages in the array.
 
 ```json
````

````diff
@@ -134,17 +149,27 @@ The default deployment of R used in the cluster (see [Customizing the cluster](.
     },
     "autoscaleFormula": "QUEUE"
   },
+  "containerImage": "rocker/tidyverse:latest",
   "rPackages": {
     "cran": [],
     "github": [],
     "bioconductor": ["IRanges"]
   },
-  "commandLine": []
+  "commandLine": [],
+  "subnetId": ""
 }
 }
 ```
 
-Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Biocondunctor is installed.
+Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Bioconductor is installed.
+
+## Installing Custom Packages
+doAzureParallel supports custom package installation in the cluster. Custom package installation at the per-*foreach* loop level is not supported.
+
+Steps for installing custom packages can be found [here](../samples/package_management/custom/README.md).
+
+Note: If the package requires compilation, such as apt-get installations, users will be required to build their own containers.
 
-## Uninstalling packages
+## Uninstalling a Package
 Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool.
````
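The docs above say the private-repo token surfaces on each node as the `GITHUB_PAT` environment variable and is looked up when `devtools::install_github` runs. A minimal sketch of that lookup (the variable name comes from the docs; the helper function is illustrative, not devtools' actual implementation):

```r
# Illustrative helper: read the GITHUB_PAT environment variable that the
# cluster sets at start-up. NULL means no token, so only public
# repositories are reachable. This mimics, but is not, devtools' lookup.
github_pat <- function() {
  pat <- Sys.getenv("GITHUB_PAT")
  if (!nzchar(pat)) return(NULL)
  pat
}

Sys.setenv(GITHUB_PAT = "ghp_hypothetical_token")  # hypothetical token value
token <- github_pat()
```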

inst/startup/install_custom.R

Lines changed: 49 additions & 0 deletions
```R
args <- commandArgs(trailingOnly = TRUE)

sharedPackageDirectory <- file.path(
  Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
  "R",
  "packages")

tempDir <- file.path(
  Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"),
  "tmp")

.libPaths(c(sharedPackageDirectory, .libPaths()))

pattern <- NULL
if (length(args) > 1) {
  if (!is.null(args[2])) {
    pattern <- args[2]
  }
}

devtoolsPackage <- "devtools"
if (!require(devtoolsPackage, character.only = TRUE)) {
  install.packages(devtoolsPackage)
  require(devtoolsPackage, character.only = TRUE)
}

packageDirs <- list.files(
  path = tempDir,
  full.names = TRUE,
  recursive = FALSE)

for (i in 1:length(packageDirs)) {
  print("Package Directories")
  print(packageDirs[i])

  devtools::install(packageDirs[i],
                    args = c(
                      paste0(
                        "--library=",
                        "'",
                        sharedPackageDirectory,
                        "'")))

  print("Package Directories Completed")
}

unlink(
  tempDir,
  recursive = TRUE)
```

inst/startup/merger.R

Lines changed: 3 additions & 1 deletion
```diff
@@ -18,7 +18,9 @@ batchJobPreparationDirectory <-
   Sys.getenv("AZ_BATCH_JOB_PREP_WORKING_DIR")
 batchTaskWorkingDirectory <- Sys.getenv("AZ_BATCH_TASK_WORKING_DIR")
 taskPackageDirectory <- paste0(batchTaskWorkingDirectory)
-clusterPackageDirectory <- paste0(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR", "/R/packages"))
+clusterPackageDirectory <- file.path(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
+                                     "R",
+                                     "packages")
 
 libPaths <- c(
   taskPackageDirectory,
```
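The merger.R change above is a genuine fix, not a style change: the second argument to `Sys.getenv` is a *default value* returned when the variable is unset, not a suffix, so the old `paste0(...)` call never appended `/R/packages` to the shared directory. A small sketch of the difference:

```r
# Demonstrates why the old merger.R line was a bug: Sys.getenv's second
# argument is a fallback default, not a suffix to append.
Sys.setenv(AZ_BATCH_NODE_SHARED_DIR = "/mnt/batch/tasks/shared")

# Old form: "/R/packages" is silently ignored because the variable is set.
old <- paste0(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR", "/R/packages"))

# New form: file.path joins the components as intended.
new <- file.path(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"), "R", "packages")

old  # "/mnt/batch/tasks/shared"
new  # "/mnt/batch/tasks/shared/R/packages"
```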

samples/azure_files/azure_files_cluster.json

Lines changed: 4 additions & 1 deletion
```diff
@@ -20,5 +20,8 @@
   },
   "commandLine": [
     "mkdir /mnt/batch/tasks/shared/data",
-    "mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>==,dir_mode=0777,file_mode=0777,sec=ntlmssp"]
+    "mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
+    "wget https://raw.githubusercontent.com/Azure/doAzureParallel/feature/custom-package/inst/startup/install_custom.R",
+    "docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
+  ]
 }
```

samples/azure_files/readme.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -12,4 +12,4 @@ This samples shows how to update the cluster configuration to create a new mount
 
 For large data sets or large traffic applications be sure to review the Azure Files [scalability and performance targets](https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#scalability-targets-for-blobs-queues-tables-and-files).
 
-For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distrubuted data](../../docs/21-distributing-data.md) docs.
+For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distributing data](../../docs/21-distributing-data.md) docs.
```

samples/package_management/bioconductor.r renamed to samples/package_management/bioconductor/bioconductor.r

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
-#Please see documentation at docs/20-package-management.md for more details on packagement management.
+#Please see documentation at docs/20-package-management.md for more details on package management.
 
 # import the doAzureParallel library and its dependencies
 library(doAzureParallel)
```
samples/package_management/custom/README.md

Lines changed: 32 additions & 0 deletions

## Installing Custom Packages
doAzureParallel supports custom package installation in the cluster. Custom packages are R packages that cannot be hosted on Github or built into a docker image. The recommended approach for custom packages is building them from source and uploading them to an Azure File Share.

Note: If the package requires compilation, such as apt-get installations, users will be required to build their own containers.

### Building Package from Source in RStudio
1. Open *RStudio*
2. Go to *Build* on the navigation bar
3. Go to *Build From Source*

### Uploading Custom Package to Azure Files
Detailed steps for uploading files to Azure Files in the Portal can be found
[here](https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-portal)

### Notes
1) In order to build the custom packages' dependencies, we need to untar the R packages and build them within their directories. By default, we will build custom packages in the *$AZ_BATCH_NODE_STARTUP_DIR/tmp* directory.
2) By default, the custom package cluster configuration file will install any packages that are a *.tar.gz file in the file share. If users want to specify R packages, they must change this line in the cluster configuration file.

Finds files that end with *.tar.gz in the current Azure File Share directory:
``` json
{
  ...
  "commandLine": [
    ...
    "mkdir $AZ_BATCH_NODE_STARTUP_DIR/tmp | for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
    ...
  ]
}
```
3) For more information on using Azure Files on Batch, follow our other [sample](./azure_files/readme.md) of using Azure Files.
4) Replace your Storage Account name, endpoint and key in the cluster configuration file.
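The commandLine step above shells out to `tar` to extract each archive from the file share before the startup script builds it. The same extract-then-build layout can be sketched in R with `utils::untar` (all paths here are illustrative stand-ins for the Azure mounts, and the dummy `customR` package exists only for this example):

```r
# Sketch: replicate the commandLine untar step in R (paths illustrative).
# Every *.tar.gz in the share directory is extracted into a tmp directory,
# where install_custom.R would later build each package with devtools.
shareDir <- file.path(tempdir(), "data")        # stand-in for the File Share mount
tmpDir   <- file.path(tempdir(), "startup-tmp") # stand-in for $AZ_BATCH_NODE_STARTUP_DIR/tmp
dir.create(shareDir, recursive = TRUE, showWarnings = FALSE)
dir.create(tmpDir, recursive = TRUE, showWarnings = FALSE)

# Create a dummy source-package archive to stand in for an uploaded package.
pkgSrc <- file.path(shareDir, "customR")
dir.create(pkgSrc, showWarnings = FALSE)
writeLines("Package: customR", file.path(pkgSrc, "DESCRIPTION"))
old <- setwd(shareDir)
tar("customR.tar.gz", files = "customR", compression = "gzip", tar = "internal")
setwd(old)

# Extract every *.tar.gz found in the share, as the for-loop does with tar -xvf.
tarballs <- list.files(shareDir, pattern = "\\.tar\\.gz$", full.names = TRUE)
for (tb in tarballs) {
  untar(tb, exdir = tmpDir)
}
packageDirs <- list.files(tmpDir, full.names = TRUE)
```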
Lines changed: 24 additions & 0 deletions
```R
#Please see documentation at docs/20-package-management.md for more details on package management.

# import the doAzureParallel library and its dependencies
library(doAzureParallel)

# set your credentials
doAzureParallel::setCredentials("credentials.json")

# Create your cluster if it does not already exist
cluster <- doAzureParallel::makeCluster("custom_packages_cluster.json")

# register your parallel backend
doAzureParallel::registerDoAzureParallel(cluster)

# check that your workers are up
doAzureParallel::getDoParWorkers()

summary <- foreach(i = 1:1, .packages = c("customR")) %dopar% {
  sessionInfo()
  # Method from customR
  hello()
}

summary
```
custom_packages_cluster.json

Lines changed: 27 additions & 0 deletions

```json
{
  "name": "custom-package-pool",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": {
      "min": 2,
      "max": 2
    },
    "lowPriorityNodes": {
      "min": 0,
      "max": 0
    },
    "autoscaleFormula": "QUEUE"
  },
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": [
    "mkdir /mnt/batch/tasks/shared/data",
    "mount -t cifs //<Account Name>.file.core.windows.net/<File Share> /mnt/batch/tasks/shared/data -o vers=3.0,username=<Account Name>,password=<Account Key>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
    "mkdir $AZ_BATCH_NODE_STARTUP_DIR/tmp | for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
    "docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
  ]
}
```
