[AQUA] Track md logs for error logging #1219

agrimk · 2025-06-30T16:28:00Z

Description

In this PR, we add support for watching the predict and access logs (if available) of a Model Deployment and passing errors from those logs on to telemetry.

The following changes were made as part of this requirement:

We introduced a 'get_tail_logs' method in the logging module that returns the logs for a given log type as a list. The log type can be predict or access.
We use 'get_tail_logs' in the 'tail_logs' method of the model_deployment module and added checks for cases where logs are not configured.
We call 'tail_logs' in 'get_deployment_status' for both access and predict logs if a model deployment fails. We take the last log from the returned list and record it in telemetry.
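The flow described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `last_log_messages` and the `fetch_tail` callable are hypothetical stand-ins for the tail-logs call, log entries are assumed to be dicts with a `message` key, and index 0 is assumed to hold the most recent entry.

```python
from typing import Callable, Dict, List, Optional, Sequence

LogEntry = Dict[str, str]


def last_log_messages(
    fetch_tail: Callable[[str], Optional[List[LogEntry]]],
    log_types: Sequence[str] = ("access", "predict"),
) -> str:
    """Join the most recent message from each configured log type.

    `fetch_tail` is a hypothetical stand-in for a tail-logs call. It is
    assumed to return entries with the most recent first, or None / []
    when that log type is not configured.
    """
    parts = []
    for log_type in log_types:
        entries = fetch_tail(log_type) or []
        if entries:
            parts.append(entries[0].get("message", ""))
    return " ".join(part for part in parts if part)


def fake_fetch(log_type: str) -> Optional[List[LogEntry]]:
    """Fake fetcher for illustration: only the access log is configured."""
    data = {"access": [{"message": "Start Container Error"}], "predict": None}
    return data[log_type]


print(last_log_messages(fake_fetch))
```

Treating an unconfigured log as an empty list keeps the failure path from raising while the deployment status is being reported.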

Jira

https://jira.oci.oraclecorp.com/browse/ODSC-73585

Sample error JSON from logs

{
     "datetime": 1751029517799,
     "category": "aqua/service/deployment/status/FAILED",
     "action": " Errors occurred while bootstrapping the Model Deployment  Start Container Error  unable to start container  Error response from daemon  OCI runtime create failed  runc create failed  unable to start container process  error during container init  error running hook  0  error running hook  exit status 1  stdout    stderr  Auto detected mode as  legacy  nvidia container cli  requirement error  unsatisfied condition  cuda  12 6  please update your driver to a newer version  or use an earlier cuda container  unknown",
     "value": "6ihbnzka&model_mistralai/Mistral-7B-Instruct-v0.3&status= Errors occurred while bootstrapping the Model Deployment  Start Container Error  unable to start container  Error response from daemon  OCI runtime create failed  runc create failed  unable to start container process  error during container init  error running hook  0  error running hook  exit status 1  stdout    stderr  Auto detected mode as  legacy  nvidia container cli  requirement error  unsatisfied condition  cuda  12 6  please update your driver to a newer version  or use an earlier cuda container  unknown Starting containerStarting container",
     "count": 1,
     "region": "us-ashburn-1",
     "authenticationType": "resource"
   },

github-actions · 2025-06-30T17:15:46Z

📌 Cov diff with main:

📌 Overall coverage:

github-actions · 2025-06-30T18:03:40Z

📌 Cov diff with main:

📌 Overall coverage:

github-actions · 2025-07-01T07:19:20Z

📌 Cov diff with main:

📌 Overall coverage:

mrDzurb · 2025-07-03T00:14:39Z

ads/model/deployment/model_deployment.py

@@ -728,6 +729,44 @@ def update(

        return self._update_from_oci_model(response)

+    def tail_logs(


Why can't we use -

def logs(self, log_type: str = None) -> ConsolidatedLog:
    """Gets the access or predict logs.

    Parameters
    ----------
    log_type: (str, optional). Defaults to None.
        The log type. Can be "access", "predict" or None.

    Returns
    -------
    ConsolidatedLog
        The ConsolidatedLog object containing the logs.
    """

+1. I see that show_logs() within model_deployment.py also covers the logic described below.

mrDzurb · 2025-07-03T00:16:19Z

ads/aqua/modeldeployment/deployment.py

@@ -1300,24 +1304,52 @@ def get_deployment_status(
                max_wait_time=DEFAULT_WAIT_TIME,
                poll_interval=DEFAULT_POLL_INTERVAL,
            )
-        except Exception:
+        except Exception as e:


Could you add more test cases to cover this logic?

VipulMascarenhas

suggesting minor changes

VipulMascarenhas · 2025-07-03T03:35:15Z

ads/model/deployment/model_deployment.py

@@ -728,6 +729,44 @@ def update(

        return self._update_from_oci_model(response)

+    def tail_logs(


+1. I see that show_logs() within model_deployment.py also covers the logic described below.

VipulMascarenhas · 2025-07-03T03:36:47Z

ads/model/deployment/model_deployment.py


        if infrastructure.private_endpoint_id:
            if not hasattr(
                oci.data_science.models.InstanceConfiguration, "private_endpoint_id"
            ):
                # TODO: add oci version with private endpoint support.
-                raise EnvironmentError(
+                raise OSError(


just curious, did ruff suggest this change?
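For context on the question above: in Python 3.3 and later, `EnvironmentError` is an alias of `OSError`, so the swap does not change behavior. Ruff's pyupgrade-derived rule UP024 flags such aliases, though whether ruff prompted the change here is not confirmed in this thread. A small check, using a placeholder error message that is not taken from the PR:

```python
# In Python 3.3+, EnvironmentError and IOError are the same class as OSError.
assert EnvironmentError is OSError
assert IOError is OSError


def raise_os_error() -> None:
    # Placeholder message for illustration only.
    raise OSError("private endpoint is not supported by this oci version")


try:
    raise_os_error()
except EnvironmentError as exc:
    # Still caught, because both names refer to one class object.
    caught = type(exc).__name__

print(caught)
```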

VipulMascarenhas · 2025-07-03T03:39:46Z

ads/aqua/modeldeployment/deployment.py

+
+            if predict_logs and len(predict_logs) > 0:
+                status += predict_logs[0]["message"]
+            status = re.sub(r"[^a-zA-Z0-9]", " ", status)


nit: any reason why we're removing the non-alphanumeric characters here?

Without removing the non-alphanumeric characters, I was getting a bad request error when calling the head_object endpoint.
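To make the effect of that substitution concrete, here is a small sketch. The `sanitize_status` name and the raw input string are assumptions for illustration; the input is a guess at what the original log text looked like before sanitizing, based on the sample JSON in the description.

```python
import re


def sanitize_status(status: str) -> str:
    """Replace every non-alphanumeric character with a single space.

    Same pattern as in the diff. Note that spaces themselves are
    "replaced" by spaces, so runs of punctuation become runs of spaces.
    """
    return re.sub(r"[^a-zA-Z0-9]", " ", status)


# Hypothetical raw message, reconstructed for illustration.
raw = "unsatisfied condition: cuda>=12.6, please update your driver"
clean = sanitize_status(raw)
print(clean)
```

This explains the double spaces and the split version number ("12 6") in the sample telemetry payload. If those matter downstream, collapsing whitespace afterwards would be one option.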

VipulMascarenhas · 2025-07-03T03:43:54Z

ads/aqua/modeldeployment/deployment.py

+                status = access_logs[0]["message"]
+
+            if predict_logs and len(predict_logs) > 0:
+                status += predict_logs[0]["message"]


I'm wondering whether we need to append both predict and access logs here. For AQUA, I think the UI passes the same OCIDs for predict and access logs, so we'll be duplicating the content. Instead, would it make sense to check the access logs first, and only add predict logs if those are empty? Better still, if we know that both logs are the same, we can just look at access.
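One way the suggestion above could look, as a sketch only: `pick_log_message` is a hypothetical helper, and the inputs are assumed to be lists of dicts with a `message` key, as in the diff.

```python
from typing import Dict, List, Optional

LogEntry = Dict[str, str]


def pick_log_message(
    access_logs: Optional[List[LogEntry]],
    predict_logs: Optional[List[LogEntry]],
) -> str:
    """Prefer the access log; fall back to predict only if access is empty."""
    for logs in (access_logs, predict_logs):
        if logs:
            return logs[0]["message"]
    return ""


same = [{"message": "Start Container Error"}]
# When both logs share one OCID the content is identical, so it is
# reported once instead of being concatenated twice.
print(pick_log_message(same, same))
```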

VipulMascarenhas · 2025-07-03T03:45:46Z

ads/aqua/modeldeployment/deployment.py

+                telemetry_kwargs = {
+                    "ocid": ocid,
+                    "model_name": model_name,
+                    "status": error_str + " " + status,


Let's split this into two fields: "status" can be error_str, and a new "log_message" field can hold the status content from the logs. This way we can separately identify the error messages coming from work requests and those coming from the deployment logs.
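A sketch of the proposed split, assuming the field names from the comment above. `build_telemetry_kwargs` is a hypothetical helper, not a function in the PR, and the argument values are placeholders.

```python
from typing import Dict


def build_telemetry_kwargs(
    ocid: str, model_name: str, error_str: str, log_message: str = ""
) -> Dict[str, str]:
    """Keep the work-request error and the deployment-log text separate."""
    kwargs = {
        "ocid": ocid,
        "model_name": model_name,
        "status": error_str,  # error reported by the work request
    }
    if log_message:
        # Added only when the logs actually produced something.
        kwargs["log_message"] = log_message
    return kwargs


example = build_telemetry_kwargs(
    ocid="placeholder-ocid",
    model_name="placeholder-model",
    error_str="Errors occurred while bootstrapping the Model Deployment",
    log_message="Start Container Error",
)
print(example)
```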

VipulMascarenhas · 2025-07-03T03:51:50Z

tests/unitary/with_extras/aqua/test_deployment.py

@@ -2369,6 +2369,7 @@ def test_validate_multimodel_deployment_feasibility_positive_single(
        )

    def test_get_deployment_status(self):
+        model_deployment = copy.deepcopy(TestDataset.model_deployment_object[0])


can you add a few unit tests to cover the status coming from the logs?
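Tests along these lines might cover the log-derived status. This is a sketch under assumptions: `status_from_logs` is a hypothetical helper standing in for the logic under review, and the mock's `tail_logs(log_type)` call shape is assumed, since the real signature is not shown in this thread.

```python
import unittest
from unittest.mock import MagicMock


def status_from_logs(deployment) -> str:
    """Hypothetical helper: first available message, access before predict."""
    for log_type in ("access", "predict"):
        logs = deployment.tail_logs(log_type)  # assumed call shape
        if logs:
            return logs[0]["message"]
    return ""


def make_deployment(logs_by_type):
    """Build a mock deployment whose tail_logs returns canned entries."""
    deployment = MagicMock()
    deployment.tail_logs.side_effect = lambda log_type: logs_by_type.get(log_type)
    return deployment


class TestStatusFromLogs(unittest.TestCase):
    def test_access_log_is_preferred(self):
        deployment = make_deployment(
            {"access": [{"message": "from access"}], "predict": [{"message": "x"}]}
        )
        self.assertEqual(status_from_logs(deployment), "from access")

    def test_falls_back_to_predict_log(self):
        deployment = make_deployment({"predict": [{"message": "from predict"}]})
        self.assertEqual(status_from_logs(deployment), "from predict")

    def test_no_logs_configured(self):
        self.assertEqual(status_from_logs(make_deployment({})), "")
```

The cases can be run with `python -m unittest` against the file that contains them.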

github-actions · 2025-07-05T16:50:46Z

📌 Cov diff with main:

📌 Overall coverage:

agrimk added 3 commits June 26, 2025 22:53

watching md predict/access logs for error

0a2725c

watching logs and pushing them to telemetry

c9e40b9

watching logs and pushing them to telemetry

4e97417

agrimk requested review from darenr, mayoor, mrDzurb, VipulMascarenhas, qiuosier and ahosler as code owners June 30, 2025 16:28

oracle-contributor-agreement bot added the OCA Verified label Jun 30, 2025

agrimk and others added 2 commits June 30, 2025 22:15

Merge branch 'main' into track_md_logs_for_error_logging

011a369

replaced async call with sync call

06ccb9f

agrimk added 2 commits June 30, 2025 23:04

added deployment object in get_deployment_status method

b802894

merge from master

95e5a04

mrDzurb changed the title ~~Track md logs for error logging~~ [AQUA] Track md logs for error logging Jun 30, 2025

agrimk and others added 2 commits July 1, 2025 12:14

fixed unit test of get_deployment_status

b60cab7

Merge branch 'main' into track_md_logs_for_error_logging

ba43d62

mrDzurb reviewed Jul 3, 2025


VipulMascarenhas reviewed Jul 3, 2025


added test cases and PR review comments

83dc9fc
