Fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker #4846

Open

doronbl wants to merge 3 commits into aws:default from doronbl:default

doronbl commented Jun 19, 2025

Issue #, if available:

Description of changes:
Added a notebook showing how to fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker.
Testing done:
Yes

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

[x] I have verified that my PR does not contain any new notebook/s which demonstrate a SageMaker functionality already showcased by another existing notebook in the repository
[x] I have read the CONTRIBUTING doc and adhered to the guidelines regarding folder placement, notebook naming convention and example notebook best practices
[x] I have updated the necessary documentation, including the README of the appropriate folder as well as the index.rst file
[x] I have tested my notebook(s) and ensured it runs end-to-end
[x] I have linted my notebook(s) and code using python3 -m black -l 100 {path}/{notebook-name}.ipynb

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

doronbl added 3 commits

June 19, 2025 11:05

Initial version (fd7edeb)
Initial version of Fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker

Linter code modifications (2aec78e)
Black linter code modifications

Follow Contributing Guidelines (e74a142)
Follow Contributing Guidelines
Including section adding and notebook renaming

review-notebook-app bot commented Jun 19, 2025

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

sagemaker-bot reviewed

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py

@@ -0,0 +1,153 @@
+              # inference.py - Custom inference script for embedding operations
+              import json
+              import torch

Collaborator

sagemaker-bot Jun 19, 2025

Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.

The code uses broad library imports instead of importing only the specific classes it requires, which consumes unnecessary memory and obscures actual library usage, making the code harder to maintain. To optimize performance and improve code clarity, use targeted imports with the 'from library import specific_class' syntax. Learn more: https://docs.python.org/3/tutorial/modules.html
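
As a rough illustration of the import style this recommendation describes, the sketch below uses only standard-library modules. Which names inference.py actually needs from torch is not visible in this excerpt, so none are assumed here; `roundtrip` is a hypothetical helper.

```python
# Targeted imports: the names this module depends on are visible at the top.
from json import dumps, loads
from logging import getLogger

logger = getLogger(__name__)


def roundtrip(payload: dict) -> dict:
    """Serialize and re-parse a payload using only the imported names."""
    return loads(dumps(payload))


result = roundtrip({"operation": "encode"})
logger.info("Round-tripped %d key(s)", len(result))
```

In Python, `from module import name` still executes and loads the whole module, so the benefit of this style is mostly in making dependencies explicit.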

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py

+                          "text2_embedding": embeddings[1].tolist(),
+                      }
+                  def output_fn(self, prediction, accept="application/json"):

Collaborator

sagemaker-bot Jun 19, 2025

This code relies on a client-controlled input (e.g., cookies, URL parameters, or headers) to determine user roles, which is vulnerable to manipulation. An attacker could potentially elevate their privileges by tampering with these inputs. To fix this, enforce role-based checks using server-side session data or an external authentication service. Avoid relying on any user-controlled data for role validation. Learn more about authorization vulnerabilities from OWASP: https://owasp.org/Top10/A01_2021-Broken_Access_Control/
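
No role logic is visible in this excerpt; the client-controlled value at the flagged line is the `accept` argument. One way to avoid trusting it blindly is to check it against a server-side allow-list, sketched below. This is a minimal sketch, not the PR's implementation: `SUPPORTED_ACCEPT_TYPES` is a hypothetical name, and the function is written at module level rather than as the method shown in the diff.

```python
import json

# Hypothetical allow-list; the response types the real handler supports
# are not shown in this excerpt.
SUPPORTED_ACCEPT_TYPES = {"application/json"}


def output_fn(prediction, accept="application/json"):
    """Serialize the prediction, rejecting response types outside the allow-list."""
    if accept not in SUPPORTED_ACCEPT_TYPES:
        raise ValueError(f"Unsupported accept type: {accept!r}")
    return json.dumps(prediction)
```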

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py


			return self.model

		def input_fn(self, request_body, content_type="application/json"):

Collaborator

sagemaker-bot Jun 19, 2025

This code relies on a client-controlled input (e.g., cookies, URL parameters, or headers) to determine user roles, which is vulnerable to manipulation. An attacker could potentially elevate their privileges by tampering with these inputs. To fix this, enforce role-based checks using server-side session data or an external authentication service. Avoid relying on any user-controlled data for role validation. Learn more about authorization vulnerabilities from OWASP: https://owasp.org/Top10/A01_2021-Broken_Access_Control/
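
For `input_fn`, the client controls both the content type and the request body. A hedged sketch of validating both before any further processing is shown below. The "encode" default for the "operation" field appears elsewhere in this diff; the "similarity" entry and the `ALLOWED_OPERATIONS` name are assumptions for illustration only.

```python
import json

# "encode" is the default seen in the diff; "similarity" is a hypothetical second operation.
ALLOWED_OPERATIONS = {"encode", "similarity"}


def input_fn(request_body, content_type="application/json"):
    """Parse the request, accepting only JSON objects with a known operation."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type!r}")
    data = json.loads(request_body)
    if not isinstance(data, dict):
        raise ValueError("Request body must be a JSON object")
    operation = data.get("operation", "encode")
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"Unsupported operation: {operation!r}")
    return data
```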

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py

+                      Parse input data for inference.
+                      """
+                      if content_type == "application/json":
+                          logger.info(f"Inference request: {request_body}")

Collaborator

sagemaker-bot Jun 19, 2025

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
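
A minimal sketch of the sanitization this comment asks for, assuming the request body can be rendered with `str()`. The helper name `sanitize_for_log` and the 200-character cap are hypothetical choices, not part of the PR.

```python
import logging

logger = logging.getLogger(__name__)

MAX_LOGGED_CHARS = 200  # arbitrary cap, chosen for illustration


def sanitize_for_log(value, limit=MAX_LOGGED_CHARS):
    """Escape CR/LF so a value cannot start a forged log line, then truncate it."""
    text = str(value).replace("\r", "\\r").replace("\n", "\\n")
    if len(text) > limit:
        text = text[:limit] + "...(truncated)"
    return text


forged = 'ok"}\nINFO fake entry'
safe = sanitize_for_log(forged)
# Lazy %-style formatting keeps the sanitized value out of the format string itself.
logger.info("Inference request: %s", safe)
```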

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py

+                      Perform inference based on the operation type.
+                      """
+                      operation = data.get("operation", "encode")
+                      logger.info(f"Prediction input: {data}")

Collaborator

sagemaker-bot Jun 19, 2025

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
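
An alternative to logging the whole payload is to log only a summary of it. The sketch below assumes the request is a dict; the "texts" field name and the `summarize_request` helper are hypothetical, while the "operation" field with its "encode" default comes from the diff above.

```python
import logging

logger = logging.getLogger(__name__)


def summarize_request(data: dict) -> str:
    """Build a one-line summary: the operation name and the number of inputs."""
    operation = data.get("operation", "encode")
    texts = data.get("texts", [])  # hypothetical field name
    count = len(texts) if isinstance(texts, (list, tuple)) else 1
    # repr() escapes control characters such as newlines in the operation name.
    return f"operation={str(operation)!r} num_texts={count}"


summary = summarize_request(
    {"operation": "encode\nINFO forged", "texts": ["def f(): pass", "x = 1"]}
)
logger.info("Prediction input: %s", summary)
```

Logging a summary also keeps submitted source code out of the logs, which may matter for a code-embedding endpoint.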

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py



		def predict_fn(data, model):
			logger.info(f"predict_fn data: {data}")

Collaborator

sagemaker-bot Jun 19, 2025

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
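
Sanitization can also be applied in one place instead of at every call site. The sketch below attaches a `logging.Filter` to a handler so that every formatted message has CR/LF escaped before it is written. The class name, the logger name "inference_sketch", and the in-memory stream are assumptions for the sake of a self-contained example.

```python
import io
import logging


class NewlineEscapingFilter(logging.Filter):
    """Escape CR/LF in the fully formatted message of each record."""

    def filter(self, record):
        message = record.getMessage()
        record.msg = message.replace("\r", "\\r").replace("\n", "\\n")
        record.args = None  # the message is already formatted
        return True


stream = io.StringIO()  # stands in for the real log destination
handler = logging.StreamHandler(stream)
handler.addFilter(NewlineEscapingFilter())

logger = logging.getLogger("inference_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("predict_fn data: %s", "a\nINFO forged")
logged = stream.getvalue()
```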

build_and_train_models/sm-finetune_qwen_for_code_embedding/inference_code/inference.py

+                      """
+                      Load the fine-tuned model from the model directory.
+                      """
+                      logger.info(f"Loading model from: {model_dir}")

Collaborator

sagemaker-bot Jun 19, 2025

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
