
[ENHANCEMENT] Feedback behaviour improvements #6440

Open
cdvv7788 opened this issue Feb 18, 2025 · 9 comments
Labels: cannot reproduce, enhancement

Comments

@cdvv7788

cdvv7788 commented Feb 18, 2025

Is your feature request related to a problem? Please describe.
I have been following https://docs.arize.com/phoenix/tracing/how-to-interact-with-traces/capture-feedback
My issue is that I don't understand what behaviour should be expected from the feedback feature.
I have a trace with several spans, and I want to add feedback to any of those spans. The API suggests this is possible, since it accepts span_id as a parameter. However, what I have found is:

  • Once I send feedback, it shows up on the trace too
  • If I send new feedback to the SAME span, it overrides the trace feedback but leaves the span untouched (the span is not updated)
  • If I send feedback to other spans in the same trace, it is basically ignored: the span is not annotated, nor is the trace updated.

Describe the solution you'd like
I would like to understand what the expected behaviour is here. I would prefer to avoid implementing a solution for this on my end; it makes more sense for this to live in Phoenix. Either I am doing something wrong, or we need to improve the docs/feature.

I would like several things, but that obviously depends on the direction the project is going.

  • If I send feedback to a span, it can show up on the trace, but if several spans have feedback, they should all be considered.
  • If, for example, I am measuring thumbs up, I would like to keep track of how many there are. At the moment, however, the entry is being overridden by the newest message (which kind of makes sense given that the endpoint creates or updates). There should be an explicit endpoint for creation only, with updates handled separately, so we are not forced into a single observation per span/trace.
  • If the intention is that we have a separate system tracking the score and we update the final score directly, that's ok, but we need to be able to do it directly on the trace, which isn't currently possible (I get a 422 if I don't send the span_id field). I am forced to update the annotation through a span, which doesn't actually update the span but does update the trace annotation (and it has to be the original span that was annotated, or the request is just ignored).
  • In the case where users delete their feedback, there is no way to remove the annotations via the REST API. If I could save information about the annotation in its metadata (who created it and when) and then retrieve that information via the API, I could filter out and remove the specific annotations that need to go.

Describe alternatives you've considered
Implementing this myself. I can keep track of the trace_ids and keep the scores in my system instead of sending them to Phoenix, then attach them using the trace_id if I need to. Again, the hard blocker here is the lack of consistency in the annotations API: if I don't keep track of the exact span I used for the initial feedback, I have no way to update the annotation via the REST API.
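Concretely, the fallback would look something like this (a minimal sketch of keeping the raw reactions on my side keyed by trace_id; the structure and names are illustrative, not anything Phoenix provides):

from collections import defaultdict

# Sketch of the client-side fallback: keep individual reactions keyed by
# trace_id in my own system instead of sending each one to Phoenix.
scores_by_trace: dict[str, list[float]] = defaultdict(list)

def record_feedback(trace_id: str, score: float) -> None:
    """Store an individual reaction (e.g. +1 thumbs up, -1 thumbs down) against a trace."""
    scores_by_trace[trace_id].append(score)

def reactions_for(trace_id: str) -> list[float]:
    """Later, look the reactions up by trace_id and attach them wherever needed."""
    return scores_by_trace[trace_id]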

TLDR;

The annotations API needs some love. The current documentation covers only a narrow edge case of the feedback feature, and the API could be improved across the board.
If I am doing something wrong or my expectations are simply unreasonable, more context in the documentation would also help.

@cdvv7788 added the enhancement and triage labels Feb 18, 2025
github-project-automation bot moved this to 📘 Todo in phoenix Feb 18, 2025
@axiomofjoy
Contributor

@cdvv7788 Definitely agree that we can improve annotations, and your feedback is welcome and appreciated.

Is your feature request related to a problem? Please describe. I have been following https://docs.arize.com/phoenix/tracing/how-to-interact-with-traces/capture-feedback My issue is that I don't understand what behaviour should be expected from the feedback feature. I have a trace with several spans, and I want to add feedback to any of those spans. The API suggests this is possible, since it accepts span_id as a parameter. However, what I have found is:

  • Once I send feedback, it shows up on the trace too

Currently, annotating the root span of a trace will make an annotation appear on both the root span and the trace itself. Annotations on non-root spans will appear on the span only.

  • If I send new feedback to the SAME span, it overrides the trace feedback but leaves the span untouched (the span is not updated)
  • If I send feedback to other spans in the same trace, it is basically ignored: the span is not annotated, nor is the trace updated.

Does the span feedback appear on the right-hand side for you when you select a particular span in the span tree or in the Spans tab?


Describe the solution you'd like I would like to understand what the expected behaviour is here. I would prefer to avoid implementing a solution for this on my end; it makes more sense for this to live in Phoenix. Either I am doing something wrong, or we need to improve the docs/feature.

I would like several things, but that obviously depends on the direction the project is going.

  • If I send feedback to a span, it can show up on the trace, but if several spans have feedback, they should all be considered.

It sounds like you would like all span annotations to show up as top-level annotations on the trace?

  • If, for example, I am measuring thumbs up, I would like to keep track of how many there are. At the moment, however, the entry is being overridden by the newest message (which kind of makes sense given that the endpoint creates or updates). There should be an explicit endpoint for creation only, with updates handled separately, so we are not forced into a single observation per span/trace.

Our data model currently assumes that there is at most one annotation for a particular name per span. We may need to revisit that assumption. Can you help me understand your use-case that allows for multiple thumbs up for a single span? It sounds like multiple users are interacting with a single output from an LLM in your application.
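In the meantime, one possible workaround under the current one-annotation-per-name constraint is to give each user's feedback a distinct annotation name, so different users don't overwrite each other. A minimal sketch (the endpoint and body shape follow the /v1/span_annotations examples in this thread; the naming scheme and base URL are illustrative, not an official pattern):

import requests

PHOENIX_URL = "http://localhost:6006"  # adjust to where your Phoenix instance runs


def annotate_span_per_user(span_id: str, user_id: str, label: str, score: float) -> None:
    """Post one annotation per (span, user) by encoding the user in the annotation name.

    Since at most one annotation per name is kept per span, a per-user name
    prevents one user's thumbs up from overwriting another's.
    """
    payload = {
        "span_id": span_id,
        "name": f"feedback_{user_id}",  # illustrative per-user naming scheme
        "annotator_kind": "HUMAN",
        "result": {"label": label, "score": score},
        "metadata": {"user_id": user_id},
    }
    resp = requests.post(f"{PHOENIX_URL}/v1/span_annotations", json=payload)
    resp.raise_for_status()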

  • If the intention is that we have a separate system tracking the score and we update the final score directly, that's ok, but we need to be able to do it directly on the trace, which isn't currently possible (I get a 422 if I don't send the span_id field). I am forced to update the annotation through a span, which doesn't actually update the span but does update the trace annotation (and it has to be the original span that was annotated, or the request is just ignored).

The intention with score is that it allows floating-point based annotations and evaluations, e.g., if I computed a floating point number in code, I could upload. The score field in an individual annotation or evaluation is not intended to be an aggregate metric, and we definitely don't expect end users to compute and upload their own aggregate metrics. I think this probably ties back to the previous point where we may need to relax the constraint on annotations and automatically compute aggregate metrics.

  • In the case where users delete their feedback, there is no way to remove the annotations via the REST API. If I could save information about the annotation in its metadata (who created it and when) and then retrieve that information via the API, I could filter out and remove the specific annotations that need to go.

It sounds like having a DELETE route would solve this issue?
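Something like the following is what I have in mind. To be clear, this route does not exist today; the path and identifier are hypothetical, purely a sketch of the proposal:

import requests

PHOENIX_URL = "http://localhost:6006"  # placeholder base URL

def delete_span_annotation(annotation_id: str) -> None:
    """Sketch of a proposed (non-existent) DELETE endpoint for removing one annotation."""
    resp = requests.delete(f"{PHOENIX_URL}/v1/span_annotations/{annotation_id}")
    resp.raise_for_status()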

Describe alternatives you've considered Implementing this myself. I can keep track of the trace_ids and keep the scores in my system instead of sending them to Phoenix, then attach them using the trace_id if I need to. Again, the hard blocker here is the lack of consistency in the annotations API: if I don't keep track of the exact span I used for the initial feedback, I have no way to update the annotation via the REST API.

It sounds like the root cause of this pain is that you want multiple annotations of the same name on the same span, if I understand correctly.

TLDR;

The annotations API needs some love. The current documentation covers only a narrow edge case of the feedback feature, and the API could be improved across the board. If I am doing something wrong or my expectations are simply unreasonable, more context in the documentation would also help.

Thanks so much for the detailed feedback! It sounds like this has been painful, and we'd definitely like to accommodate your use-case.

@cdvv7788
Author

Currently, annotating the root span of a trace will make an annotation appear on both the root span and the trace itself. Annotations on non-root spans will appear on the span only.

From what I have seen, annotating any span will make the annotation appear on both the trace and the annotated span. Any further update has to go through that same span or it will be ignored. The updates are propagated to the trace, but the actual span stays stuck with the initial annotation.

Does the span feedback appear on the right-hand side for you when you select a particular span in the span tree or in the Spans tab?

It does. Again, only for the first span annotated. From there on, the annotations are just ignored.

It sounds like you would like all span annotations to show up as top-level annotations on the trace?

Ideally, I would like to just use Phoenix as my database of annotations and have it take care of aggregating/summarizing as needed. I don't know if it is necessary for all of them to show up at the top level, but there must be something we can do for the UX.

Our data model currently assumes that there is at most one annotation for a particular name per span. We may need to revisit that assumption. Can you help me understand your use-case that allows for multiple thumbs up for a single span? It sounds like multiple users are interacting with a single output from an LLM in your application.

I am using Slack threads, and anyone can give feedback there. I would like to have a score in the range -1 to 1 that moves in one direction or the other, e.g., via an average of the individual reactions.
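For concreteness, the kind of client-side aggregation I have in mind is roughly the following (a sketch only; the +1/-1 reaction values and the averaging rule are my own convention, not anything Phoenix defines):

def aggregate_score(reactions: list[int]) -> float:
    """Average +1 (thumbs up) / -1 (thumbs down) reactions into a score in [-1, 1].

    The convention here (thumbs up = +1, thumbs down = -1, average as the
    aggregate) is just the scheme described above, not something Phoenix prescribes.
    """
    if not reactions:
        return 0.0
    return sum(reactions) / len(reactions)


# Example: three thumbs up and one thumbs down -> 0.5
print(aggregate_score([1, 1, 1, -1]))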

The intention with score is that it allows floating-point based annotations and evaluations, e.g., if I computed a floating point number in code, I could upload. The score field in an individual annotation or evaluation is not intended to be an aggregate metric, and we definitely don't expect end users to compute and upload their own aggregate metrics. I think this probably ties back to the previous point where we may need to relax the constraint on annotations and automatically compute aggregate metrics.

That would be great. Do you know how popular this feature is? This would change its behaviour completely, and while I think it would be great, I don't want to break anyone's workflow.

It sounds like having a DELETE route would solve this issue?

Yes, but I also need to be able to save/retrieve metadata for this specific situation. If Phoenix takes care of the aggregations at some point, my application will need to try to keep the feedback in sync (if feedback is added or removed, act accordingly).

Thanks!

@axiomofjoy
Contributor

Currently, annotating the root span of a trace will make an annotation appear on both the root span and the trace itself. Annotations on non-root spans will appear on the span only.

From what I have seen, annotating any span will make the annotation appear on both the trace and the annotated span. Any further update has to go through that same span or it will be ignored. The updates are propagated to the trace, but the actual span stays stuck with the initial annotation.

This is a bug, but I am not able to reproduce it. Can you help me understand when you are hitting this issue, e.g., are you issuing multiple POST requests to /v1/span_annotations with the same span ID?

Does the span feedback appear on the right-hand side for you when you select a particular span in the span tree or in the Spans tab?

It does. Again, only for the first span annotated. From there on, the annotations are just ignored.

Sounds like the same issue as above.

It sounds like you would like all span annotations to show up as top-level annotations on the trace?

Ideally, I would like to just use Phoenix as my database of annotations and have it take care of aggregating/summarizing as needed. I don't know if it is necessary for all of them to show up at the top level, but there must be something we can do for the UX.

Got it, thanks!

Our data model currently assumes that there is at most one annotation for a particular name per span. We may need to revisit that assumption. Can you help me understand your use-case that allows for multiple thumbs up for a single span? It sounds like multiple users are interacting with a single output from an LLM in your application.

I am using Slack threads, and anyone can give feedback there. I would like to have a score in the range -1 to 1 that moves in one direction or the other, e.g., via an average of the individual reactions.

Good to know, thanks! I can definitely see how you might want multiple annotations of the same name associated with the same span in this case.

The intention with score is that it allows floating-point based annotations and evaluations, e.g., if I computed a floating point number in code, I could upload. The score field in an individual annotation or evaluation is not intended to be an aggregate metric, and we definitely don't expect end users to compute and upload their own aggregate metrics. I think this probably ties back to the previous point where we may need to relax the constraint on annotations and automatically compute aggregate metrics.

That would be great. Do you know how popular this feature is? This would change its behaviour completely, and while I think it would be great, I don't want to break anyone's workflow.

This is something we'll likely need to support.

It sounds like having a DELETE route would solve this issue?

Yes, but I also need to be able to save/retrieve metadata for this specific situation. If Phoenix takes care of the aggregations at some point, my application will need to try to keep the feedback in sync (if feedback is added or removed, act accordingly).

By metadata, do you have in mind something like user ID? It sounds like you want to be able to create, update, and delete annotations for particular users.

Thanks!

Thanks for the feedback! Much appreciated!

@cdvv7788
Author

cdvv7788 commented Feb 20, 2025

This is a bug, but I am not able to reproduce it. Can you help me understand when you are hitting this issue, e.g., are you issuing multiple POST requests to /v1/span_annotations with the same span ID?

Yes. I have a Slack conversation in a thread. Any reaction to any message (there is a different span_id per message) in that thread will trigger an annotation. The first reaction works well: it creates the annotation and propagates it to the trace. The second time is when problems show up:

  • If I add another reaction to the same message, it will override the trace annotation, but the span won't be updated.
  • If I add a reaction to another message, Phoenix will ignore the annotation completely (nothing will be added).

By metadata, do you have in mind something like user ID? It sounds like you want to be able to create, update, and delete annotations for particular users.

Something similar to how Slack handles message metadata: arbitrary payloads under a key (https://api.slack.com/metadata/using). In this case I would pass user_id, but it could be useful for other things too.

@axiomofjoy
Contributor

Thanks @cdvv7788!

This is a bug, but I am not able to reproduce it. Can you help me understand when you are hitting this issue, e.g., are you issuing multiple POST requests to /v1/span_annotations with the same span ID?

Yes. I have a Slack conversation in a thread. Any reaction to any message (there is a different span_id per message) in that thread will trigger an annotation. The first reaction works well: it creates the annotation and propagates it to the trace. The second time is when problems show up:

  • If I add another reaction to the same message, it will override the trace annotation, but the span won't be updated.
  • If I add a reaction to another message, Phoenix will ignore the annotation completely (nothing will be added).

This is definitely unexpected behavior. Can you help me understand the exact requests that are being issued to Phoenix? Are you just sending POST requests to /v1/span_annotations? If the span ID remains the same between different requests, I expect the annotation to update on the span.

{
  "span_id": "67f6740bbe1ddc3f",
  "name": "correctness",
  "annotator_kind": "HUMAN",
  "result": {
    "label": "correct",
    "score": 1,
    "explanation": "The response answered the question I asked"
   }
}
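In other words, issuing something like the following twice with the same span ID should update the existing annotation rather than be ignored (a sketch only; the base URL is a placeholder and the body shape follows the example above):

import requests

PHOENIX_URL = "http://localhost:6006"  # placeholder base URL

payload = {
    "span_id": "67f6740bbe1ddc3f",
    "name": "correctness",
    "annotator_kind": "HUMAN",
    "result": {
        "label": "correct",
        "score": 1,
        "explanation": "The response answered the question I asked",
    },
}

# Posting twice with the same span_id and name: the second request should
# update the existing annotation on that span.
for _ in range(2):
    resp = requests.post(f"{PHOENIX_URL}/v1/span_annotations", json=payload)
    resp.raise_for_status()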

Can you also send the version of Phoenix you are using?

By metadata, do you have in mind something like user ID? It sounds like you want to be able to create, update, and delete annotations for particular users.

Something similar to how Slack handles message metadata: arbitrary payloads under a key (https://api.slack.com/metadata/using). In this case I would pass user_id, but it could be useful for other things too.

Got it, thanks! We do currently add metadata to span annotations via the POST. It would probably unblock this flow if we added a GET route for annotations that included that metadata.

@cdvv7788
Author

This is definitely unexpected behavior. Can you help me understand the exact requests that are being issued to Phoenix? Are you just sending POST requests to /v1/span_annotations? If the span ID remains the same between different requests, I expect the annotation to update on the span.

This is the payload I am sending. Label is either thumbs_up or thumbs_down. Score is either 1, -1 or 0.

{
    "span_id": span_id,
    "name": "my_agent",
    "annotator_kind": "human",
    "result": {"label": label, "score": value},
    "metadata": {
        "user_id": user_id,
    },
}

Got it, thanks! We do currently add metadata to span annotations via the POST. It would probably unblock this flow if we added a GET route for annotations that included that metadata.

Yes, I have something in the code, but I am not sure where to check it.
The flow would still be blocked with single annotations on the trace, though.

Phoenix version: 7.12.1
I just saw there was a major version release recently. I will update tomorrow and report back 😃

@axiomofjoy
Contributor

This is definitely unexpected behavior. Can you help me understand the exact requests that are being issued to Phoenix? Are you just sending POST requests to /v1/span_annotations? If the span ID remains the same between different requests, I expect the annotation to update on the span.

This is the payload I am sending. Label is either thumbs_up or thumbs_down. Score is either 1, -1 or 0.

{
    "span_id": span_id,
    "name": "my_agent",
    "annotator_kind": "human",
    "result": {"label": label, "score": value},
    "metadata": {
        "user_id": user_id,
    },
}

That looks correct to my eye. This one is tough for me to debug since I am not able to reproduce the issue. If you are willing, I'd love to hop on a call to take a look with you and see if we can get to the bottom of it. https://calendly.com/xander-arize/30min

Got it, thanks! We do currently add metadata to span annotations via the POST. It would probably unblock this flow if we added a GET route for annotations that included that metadata.

Yes, I have something in the code, but I am not sure where to check it. The flow would still be blocked with single annotations on the trace, though.

Phoenix version: 7.12.1 I just saw there was a major version release recently. I will update tomorrow and report back 😃

Sounds great!

@cdvv7788
Author

cdvv7788 commented Feb 24, 2025

@axiomofjoy I updated with no luck.

Also, I have not been able to reproduce the exact same issue in isolation. While the issue persists in my main repository, I created a smaller project to try to reproduce it: https://github.com/cdvv7788/phoenix-debug

To make this work, just run docker compose up --build, go to http://localhost:6006, check the created traces, and copy the span ID of one of the inner spans.


Then run: docker compose run trace-generator span_feedback.py "whatever_span_id_i_got" "thumbs-up" 1
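For context, span_feedback.py takes the span ID, label, and score from the command line and posts a span annotation. Roughly, it does something like this (a simplified sketch, not the exact file contents from the repo):

# Simplified sketch of span_feedback.py: read span_id, label, score from the CLI
# and POST a span annotation, following the payload shape shown earlier.
import sys

import requests

PHOENIX_URL = "http://localhost:6006"  # inside docker compose this would point at the phoenix service

def main() -> None:
    span_id, label, score = sys.argv[1], sys.argv[2], float(sys.argv[3])
    payload = {
        "span_id": span_id,
        "name": "my_agent",
        "annotator_kind": "human",
        "result": {"label": label, "score": score},
    }
    resp = requests.post(f"{PHOENIX_URL}/v1/span_annotations", json=payload)
    resp.raise_for_status()

if __name__ == "__main__":
    main()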

It seems to be working as expected here at the span level, but I am not able to reproduce the whole issue, because the annotation is not being propagated to the parent span. From our discussion, this is unexpected behaviour, right? Can you please confirm whether you are observing it too?


While I have indications in the list that the spans are annotated, the parent trace doesn't give any hint.

To simplify my use case, I think that having trace-level annotations (multiple, and auto-aggregated somehow) would be enough to get going.

@RogerHYang added the cannot reproduce label and removed the triage label Mar 3, 2025
@axiomofjoy
Contributor

Hey @cdvv7788, we're addressing some of these issues as part of our annotation improvement and configuration milestone here and here.
