Fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker #4846

Open
wants to merge 3 commits into
base: default
@doronbl doronbl commented Jun 19, 2025

Issue #, if available:

Description of changes:
Added a notebook showing how to fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker.
Testing done:
Yes

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • [x] I have verified that my PR does not contain any new notebook/s which demonstrate a SageMaker functionality already showcased by another existing notebook in the repository
  • [x] I have read the CONTRIBUTING doc and adhered to the guidelines regarding folder placement, notebook naming convention and example notebook best practices
  • [x] I have updated the necessary documentation, including the README of the appropriate folder as well as the index.rst file
  • [x] I have tested my notebook(s) and ensured it runs end-to-end
  • [x] I have linted my notebook(s) and code using python3 -m black -l 100 {path}/{notebook-name}.ipynb

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

doronbl added 3 commits June 19, 2025 11:05
Initial version of Fine-tune Qwen3-Embedding for code embeddings using Amazon SageMaker
Black linter code modifications
Follow Contributing Guidelines
Including section adding and notebook renaming
@@ -0,0 +1,153 @@
# inference.py - Custom inference script for embedding operations
import json
import torch

Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.

The code uses broad library imports instead of importing only the specific classes it requires, which consumes unnecessary memory and makes the code harder to maintain by obscuring which parts of the library are actually used. To optimize performance and improve code clarity, use targeted imports with the 'from library import specific_class' syntax. Learn more: https://docs.python.org/3/tutorial/modules.html

"text2_embedding": embeddings[1].tolist(),
}

def output_fn(self, prediction, accept="application/json"):

This code relies on a client-controlled input (e.g., cookies, URL parameters, or headers) to determine user roles, which is vulnerable to manipulation. An attacker could potentially elevate their privileges by tampering with these inputs. To fix this, enforce role-based checks using server-side session data or an external authentication service. Avoid relying on any user-controlled data for role validation. Learn more about authorization vulnerabilities from OWASP: https://owasp.org/Top10/A01_2021-Broken_Access_Control/


return self.model

def input_fn(self, request_body, content_type="application/json"):

This code relies on a client-controlled input (e.g., cookies, URL parameters, or headers) to determine user roles, which is vulnerable to manipulation. An attacker could potentially elevate their privileges by tampering with these inputs. To fix this, enforce role-based checks using server-side session data or an external authentication service. Avoid relying on any user-controlled data for role validation. Learn more about authorization vulnerabilities from OWASP: https://owasp.org/Top10/A01_2021-Broken_Access_Control/
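
The quoted lines show no role check; what the handler does branch on is the client-supplied content_type and, further down, a client-supplied "operation" field. As a rough sketch of treating both as untrusted, the values can be checked against server-side allowlists before any work is done. The names parse_request, SUPPORTED_CONTENT_TYPES, and ALLOWED_OPERATIONS are hypothetical, and "similarity" is an assumed operation name; only "encode" appears in the PR.

```python
import json

# Server-side allowlists. "similarity" is an assumed name for illustration only.
SUPPORTED_CONTENT_TYPES = {"application/json"}
ALLOWED_OPERATIONS = {"encode", "similarity"}


def parse_request(request_body, content_type="application/json"):
    """Parse a request body, rejecting anything outside the allowlists."""
    if content_type not in SUPPORTED_CONTENT_TYPES:
        raise ValueError(f"Unsupported content type: {content_type!r}")

    data = json.loads(request_body)  # raises a ValueError subclass on bad JSON
    if not isinstance(data, dict):
        raise ValueError("Request body must be a JSON object")

    # Default mirrors the PR's data.get("operation", "encode").
    operation = data.get("operation", "encode")
    if not isinstance(operation, str) or operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"Unsupported operation: {operation!r}")
    return data
```

An input_fn could delegate to a helper like this so that unsupported content types and operations fail early with a clear error rather than reaching the model.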

Parse input data for inference.
"""
if content_type == "application/json":
    logger.info(f"Inference request: {request_body}")

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
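
One possible way to act on this recommendation for the raw request body is to escape line breaks and cap the length before logging. The helpers sanitize_for_log and log_request are hypothetical names, not part of the PR, and the 200-character cap is an arbitrary assumption.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_for_log(value, max_len=200):
    """Return a single-line, length-capped string that is safer to log."""
    if isinstance(value, (bytes, bytearray)):
        text = value.decode("utf-8", errors="replace")
    else:
        text = str(value)
    # Escape CR/LF so untrusted input cannot start a forged log entry.
    text = text.replace("\r", "\\r").replace("\n", "\\n")
    if len(text) > max_len:
        text = text[:max_len] + "...[truncated]"
    return text


def log_request(request_body):
    # %-style arguments defer formatting until the record is actually emitted.
    logger.info("Inference request: %s", sanitize_for_log(request_body))
```

Capping the length also keeps large embedding requests from flooding the endpoint logs.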

Perform inference based on the operation type.
"""
operation = data.get("operation", "encode")
logger.info(f"Prediction input: {data}")

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
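
For the parsed payload, an alternative to sanitizing the whole dictionary is to log only a short summary of it. The sketch below assumes the payload is the JSON-decoded dict that the PR reads "operation" from; summarize_payload and log_prediction_input are hypothetical helpers.

```python
import logging

logger = logging.getLogger(__name__)


def summarize_payload(data):
    """Describe a request payload without echoing its contents."""
    if not isinstance(data, dict):
        return {"type": type(data).__name__}
    operation = data.get("operation", "encode")
    return {
        "type": "dict",
        "num_keys": len(data),
        # repr() escapes control characters such as newlines; cap the length too.
        "operation": repr(operation)[:64],
    }


def log_prediction_input(data):
    logger.info("Prediction input: %s", summarize_payload(data))
```

This keeps user-supplied text out of the log line entirely, which avoids both injection and accidental logging of sensitive input.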



def predict_fn(data, model):
    logger.info(f"predict_fn data: {data}")

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html

"""
Load the fine-tuned model from the model directory.
"""
logger.info(f"Loading model from: {model_dir}")

Log Injection occurs when unsafe input is directly written to log files without proper sanitization. This can allow attackers to manipulate log entries, potentially leading to security issues like log forging or cross-site scripting. To prevent this, always sanitize user input before logging by removing or encoding newline characters, using string encoding functions, and leveraging built-in sanitization features of logging libraries when available. Learn more - https://cwe.mitre.org/data/definitions/117.html
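
Since the same finding recurs at several call sites, a central option is a logging.Filter that escapes line breaks in every record, so individual logger.info calls need no changes. This is a minimal sketch; the class name NewlineEscapingFilter and the logger name "inference" are assumptions.

```python
import logging


class NewlineEscapingFilter(logging.Filter):
    """Escape CR/LF in the fully formatted message of each log record."""

    def filter(self, record):
        message = record.getMessage()  # merges %-style args into the message
        record.msg = message.replace("\r", "\\r").replace("\n", "\\n")
        record.args = ()  # the message is already formatted
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("inference")
logger.addFilter(NewlineEscapingFilter())
```

A filter attached to a logger only sees records logged directly on that logger. Attaching the filter to a handler instead covers every logger whose records reach that handler.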
