Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3840] Builtin scripts parameter consolidation #2228

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

e-strauss
Copy link
Contributor

@e-strauss e-strauss commented Feb 13, 2025

This PR consolidates the parameters of the builtin DML scripts and introduces a formatting sheme for the DML builtin parameters. Additionally, the changes were also applied to the Python API using the auto generator script, where I added a small fix, because some builtins were not parsed correctly.

The changes also resulted in a couple changes in the test script for the builtin scripts. For the discovery of the usages of the changed builtins, I had to manually "grep" the ./src/test/scripts folder, so that I dont break any tests, but also the ./script folder to catch the usages in tutorials, perftest, etc. I hope, I've catched everything there, but I can not guaranty it, since the github test just run on test script folder.

I applied the following formatting convention:

  • camelCase, no "_"
  • matrix / vector arguments must be capitalized
  • for scalars, the first character must be lower case

allowed:
X, Xtest, ScaleFactor,
tol, maxIter

not allowed:
X_test, max_iter

Furtheremore, I used the following python script / notebook for the discovery of all unique parameter names:
https://gist.github.com/e-strauss/ccec59f504ed3882a2141d859ff681ec

In the following, I want to give a quick overview over the consolidations, that I applied.
In total, I found 62 matches for consolidation and reduced the total number of unique parameters to 340, which was previously at 402.

I changed the spelling of the following 76 parameters names:
https://gist.github.com/e-strauss/3a5f27160178024c079ed37072c54cfa

Through either consolidation or applying the naming convention, I removed the following 138 parameter names:
https://gist.github.com/e-strauss/3a5f27160178024c079ed37072c54cfa

A list of all current parameters can be found here:
https://gist.github.com/e-strauss/8aad1143b016d7bda28f3884038d62d3

Copy link

codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 21.64%. Comparing base (1609470) to head (2b29972).
Report is 3 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #2228       +/-   ##
=============================================
- Coverage     72.47%   21.64%   -50.83%     
+ Complexity    45440    14140    -31300     
=============================================
  Files          1469     1469               
  Lines        170873   170873               
  Branches      33315    33315               
=============================================
- Hits         123836    36993    -86843     
- Misses        37622   130547    +92925     
+ Partials       9415     3333     -6082     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This commit consolidates the parameters of the builtin DML scripts and introduces a formatting sheme for the DML builtin parameters. Additionally, the changes were also applied to the Python API using the auto generator script, where I added a small fix, because some builtins were not parsed correctly.

The changes in the builtin scripts arguments:

order_rand -> Order
search_iterations -> iter
batch_size -> batchSize
n_bins -> n
ytest -> Ytest
max_iter -> maxIter
max_func_invoc, max_iterations -> maxIter
include_mean -> includeMean
icpt -> intercept
maxi -> maxIteration
maxii -> maxInnerIteration
cMask -> ctypes
is_verbose -> verbose
k_max -> maxK
kmax -> iter
(start/end)_stepsize -> (start/end)Stepsize
(start/end)_vicinity -> (start/end)Vicinity
sim_seed -> seed
k_value -> k
CL_T -> ctypes
trans_continuous -> tranCont
select_k -> selectK
k_min -> minK
k_max -> maxK
select_feature -> selectFeature
feature_max -> maxFeatures
feature_importance -> featureImportance
predict_con_tg -> predictCont
START_SELECTED -> initSelectFeature
frequency_threshold -> frequencyThreshold
distance_threshold -> distanceThreshold
is_verbose -> verbose
thresh -> threshold
reduced_dims -> reducedDims
max_iter -> maxIter
is_verbose -> verbose
print_iter -> printIter
min_leaf -> minLeaf
min_split -> minSplit
num_trees -> numTrees
max_depth -> maxDepth
max_features -> maxFeatures
max_values -> maxValues
max_dataratio -> maxDataRatio
sample_frac -> sampleFrac
feature_frac -> featureFrac
window_size -> windowSize
sample_percent -> sampleFrac
is_verbose -> verbose
alpha -> lr
batch_size -> batchSize
learning_rate -> lr
out_activation -> outActivation
loss_fcn -> lossFn
validation_split -> validationSplit
types -> ctypes
sml_type -> smlType
num_trees -> numTrees
learning_rate -> lr
max_depth -> maxDepth
lambda -> reg
icpt -> intercept
maxi, moi, iter -> maxIter
mii -> maxInnerIter
epsilon -> tol
maxIterations -> maxIter
maxii -> maxInnerIter
eps -> tol
max_iter, iterations -> maxIter
is_verbose -> verbose
avg_sample_size_per_centroid -> avgSampleSizePerCentroid
space_decomp -> spaceDecomp
n_components -> nComponents
init_param -> initParams
reg_covar -> reg
rnk -> rank
eps -> tolerance
maxi -> maxIter
dth, thr -> threshold
maxi -> maxIter
dth, thr -> threshold
maxi -> maxIter
@e-strauss e-strauss force-pushed the DML_parameter_consolidation branch from d60724e to 2b29972 Compare February 14, 2025 11:42
@e-strauss
Copy link
Contributor Author

I just squashed the commits and removed the changes in the github action configs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

1 participant