[AQUA] Track md logs for error logging #1219

Open
wants to merge 10 commits into
base: main

Conversation

agrimk
Member

@agrimk agrimk commented Jun 30, 2025

Description

This PR adds support for reading the predict and access logs of a Model Deployment (when available) and passing errors from those logs on to telemetry.

The following changes were made as part of this requirement:

  • Added a 'get_tail_logs' method to the logging module that returns the logs for a given log type as a list. The log type can be predict or access.
  • The 'tail_logs' method in the model_deployment module uses 'get_tail_logs' and adds checks for cases where logs are not configured.
  • 'get_deployment_status' calls 'tail_logs' for both the access and predict logs if a model deployment fails. The last log from the returned list is recorded in telemetry.
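
The flow described in the list above could look roughly like the following. This is only a sketch under assumptions: latest_log_message, fetch_logs and fake_fetch are hypothetical names that do not appear in the PR, and reading index 0 simply follows the diff in this PR (predict_logs[0]); how the returned list is ordered is not stated here.

```python
from typing import Callable, Dict, List, Optional

LogRecord = Dict[str, str]


def latest_log_message(
    fetch_logs: Callable[[str], List[LogRecord]],
    log_type: str,
) -> Optional[str]:
    """Return the message of the first record for log_type, or None.

    fetch_logs is a hypothetical stand-in for the PR's log lookup. It may
    raise, or return an empty list, when logging is not configured.
    """
    if log_type not in ("access", "predict"):
        raise ValueError(f"Unsupported log type: {log_type!r}")
    try:
        records = fetch_logs(log_type)
    except Exception:
        # Logs not configured or unreachable: the caller can still
        # report the deployment failure without a log message.
        return None
    if not records:
        return None
    return records[0].get("message")


def fake_fetch(log_type: str) -> List[LogRecord]:
    # Hypothetical data: only the predict log is configured.
    if log_type == "predict":
        return [{"message": "unable to start container"}]
    raise RuntimeError("access log is not configured")


print(latest_log_message(fake_fetch, "predict"))
print(latest_log_message(fake_fetch, "access"))
```

Returning None instead of raising keeps the status check itself from failing when a deployment has no logs attached.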

Jira

https://jira.oci.oraclecorp.com/browse/ODSC-73585

Sample error JSON from logs

{
     "datetime": 1751029517799,
     "category": "aqua/service/deployment/status/FAILED",
     "action": " Errors occurred while bootstrapping the Model Deployment  Start Container Error  unable to start container  Error response from daemon  OCI runtime create failed  runc create failed  unable to start container process  error during container init  error running hook  0  error running hook  exit status 1  stdout    stderr  Auto detected mode as  legacy  nvidia container cli  requirement error  unsatisfied condition  cuda  12 6  please update your driver to a newer version  or use an earlier cuda container  unknown",
     "value": "6ihbnzka&model_mistralai/Mistral-7B-Instruct-v0.3&status= Errors occurred while bootstrapping the Model Deployment  Start Container Error  unable to start container  Error response from daemon  OCI runtime create failed  runc create failed  unable to start container process  error during container init  error running hook  0  error running hook  exit status 1  stdout    stderr  Auto detected mode as  legacy  nvidia container cli  requirement error  unsatisfied condition  cuda  12 6  please update your driver to a newer version  or use an earlier cuda container  unknown Starting containerStarting container",
     "count": 1,
     "region": "us-ashburn-1",
     "authenticationType": "resource"
   },

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jun 30, 2025

📌 Cov diff with main:

Coverage-23%

📌 Overall coverage:

Coverage-18.75%

@mrDzurb mrDzurb changed the title Track md logs for error logging [AQUA] Track md logs for error logging Jun 30, 2025

github-actions bot commented Jul 1, 2025

📌 Cov diff with main:

Coverage-36%

📌 Overall coverage:

Coverage-58.17%

@@ -728,6 +729,44 @@ def update(

        return self._update_from_oci_model(response)

    def tail_logs(
Member

Why can't we use the existing logs() method here?

    def logs(self, log_type: str = None) -> ConsolidatedLog:
        """Gets the access or predict logs.

        Parameters
        ----------
        log_type: (str, optional). Defaults to None.
            The log type. Can be "access", "predict" or None.

        Returns
        -------
        ConsolidatedLog
            The ConsolidatedLog object containing the logs.
        """

Member

+1. I see that show_logs() within model_deployment.py also covers the logic described below.

@@ -1300,24 +1304,52 @@ def get_deployment_status(
                 max_wait_time=DEFAULT_WAIT_TIME,
                 poll_interval=DEFAULT_POLL_INTERVAL,
             )
-        except Exception:
+        except Exception as e:
Member

Could you add more test cases to cover this logic?

Member

@VipulMascarenhas VipulMascarenhas left a comment

suggesting minor changes


         if infrastructure.private_endpoint_id:
             if not hasattr(
                 oci.data_science.models.InstanceConfiguration, "private_endpoint_id"
             ):
                 # TODO: add oci version with private endpoint support.
-                raise EnvironmentError(
+                raise OSError(
Member

just curious, did ruff suggest this change?

Member Author

Yes


if predict_logs and len(predict_logs) > 0:
    status += predict_logs[0]["message"]
status = re.sub(r"[^a-zA-Z0-9]", " ", status)
Member

nit: any reason why we're removing the non-alphanumeric characters here?

Member Author

Without removing the non-alphanumeric characters, I was getting a Bad Request error when calling the head_object endpoint.
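
For reference, a small sketch of what the substitution from the diff does to a message. The regular expression is the one in this PR; the sanitize_status wrapper and the sample string are made up for illustration.

```python
import re


def sanitize_status(status: str) -> str:
    """Replace every character that is not an ASCII letter or digit with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", status)


# Hypothetical log message, shortened from the kind of error shown above.
raw = "Error: unable to start container (cuda>=12.6)"
clean = sanitize_status(raw)

print(repr(clean))
# Each character is replaced one-for-one, so the length does not change.
print(len(raw) == len(clean))
```

Because the replacement is one-for-one, punctuation next to a space turns into a run of spaces, which matches the double spaces visible in the sample error JSON.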

status = access_logs[0]["message"]

if predict_logs and len(predict_logs) > 0:
    status += predict_logs[0]["message"]
Member

@VipulMascarenhas VipulMascarenhas Jul 3, 2025

I'm wondering whether we need to append both the predict and access logs here. For AQUA, I think the UI passes the same OCIDs for the predict and access logs, so we'd be duplicating the content. Instead, would it make sense to check the access logs first, and only add the predict logs if the access logs are empty? Better still, if we know that both logs are the same, we can just look at the access logs.
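
One possible shape for the fallback suggested in this comment, as a sketch only. pick_log_message is a hypothetical helper, and the record layout (a list of dicts with a "message" key) is assumed from the diff.

```python
from typing import Dict, List, Optional

LogRecords = Optional[List[Dict[str, str]]]


def pick_log_message(access_logs: LogRecords, predict_logs: LogRecords) -> str:
    """Prefer the access log; fall back to the predict log only if it is empty."""
    for logs in (access_logs, predict_logs):
        if logs:
            # Index 0 follows the diff in this PR.
            return logs[0].get("message", "")
    return ""


same = [{"message": "exit status 1"}]
# When both log types point at the same log, the message is reported once.
print(pick_log_message(same, same))
# With no access records, the predict log is used instead.
print(pick_log_message([], [{"message": "runc create failed"}]))
```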

telemetry_kwargs = {
    "ocid": ocid,
    "model_name": model_name,
    "status": error_str + " " + status,
Member

Let's split this into two fields: "status" can hold error_str, and a new "log_message" field can hold the status content from the logs. This way we can separately identify the error messages coming from work requests and those coming from the deployment logs.
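
A sketch of the split proposed here, assuming the same keys as in the diff plus the suggested "log_message" field. build_telemetry_kwargs and all sample values are hypothetical.

```python
from typing import Dict


def build_telemetry_kwargs(
    ocid: str, model_name: str, error_str: str, log_message: str
) -> Dict[str, str]:
    """Keep the work-request error and the deployment log text in separate fields."""
    return {
        "ocid": ocid,
        "model_name": model_name,
        "status": error_str,          # error reported by the work request
        "log_message": log_message,   # text taken from the deployment logs
    }


kwargs = build_telemetry_kwargs(
    ocid="example-ocid",
    model_name="example-model",
    error_str="Errors occurred while bootstrapping the Model Deployment",
    log_message="Starting container",
)
print(sorted(kwargs))
```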

@@ -2369,6 +2369,7 @@ def test_validate_multimodel_deployment_feasibility_positive_single(
        )

    def test_get_deployment_status(self):
        model_deployment = copy.deepcopy(TestDataset.model_deployment_object[0])
Member

can you add a few unit tests to cover the status coming from the logs?
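
The kind of tests being asked for might look like the sketch below. It does not exercise the real get_deployment_status; status_from_logs is a hypothetical stand-in that mirrors the concatenate-then-sanitize logic from the diff, and the guard around the access log is an assumption.

```python
import re
import unittest
from typing import Dict, List, Optional

LogRecords = Optional[List[Dict[str, str]]]


def status_from_logs(access_logs: LogRecords, predict_logs: LogRecords) -> str:
    """Hypothetical stand-in: join the first access and predict messages, then sanitize."""
    status = ""
    if access_logs:
        status = access_logs[0]["message"]
    if predict_logs:
        status += predict_logs[0]["message"]
    return re.sub(r"[^a-zA-Z0-9]", " ", status)


class StatusFromLogsTest(unittest.TestCase):
    def test_no_logs_gives_empty_status(self):
        self.assertEqual(status_from_logs([], None), "")

    def test_access_log_only(self):
        logs = [{"message": "exit status 1"}]
        self.assertEqual(status_from_logs(logs, []), "exit status 1")

    def test_both_logs_are_joined_and_sanitized(self):
        result = status_from_logs([{"message": "a:b"}], [{"message": "c.d"}])
        self.assertEqual(result, "a bc d")


if __name__ == "__main__":
    unittest.main()
```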


github-actions bot commented Jul 5, 2025

📌 Cov diff with main:

Coverage-67%

📌 Overall coverage:

Coverage-58.21%

Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.

3 participants