
feat(datadog): Improve Datadog plugin tag support #11943


Status: Open (4 commits to be merged into base branch master)


@deiwin deiwin commented Jan 31, 2025

Description

Added options to include path and method tags in the Datadog plugin and added support for constant_tags in route-level plugin configuration. More detailed reasoning in each commit's message.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@deiwin changed the title from "WIP: Include path in tags with Datadog plugin" to "feat: Include path in tags with Datadog plugin" on Jan 31, 2025
@deiwin force-pushed the add_path_to_datadog_plugin branch from 826b279 to 7767c7d on January 31, 2025 15:23
@Baoyuantop
Contributor

Hi @deiwin, I see this PR is still in draft status. Do you still have time to move it forward?

@deiwin
Author

deiwin commented Apr 1, 2025

Hi! I haven't been able to work on this recently, but would love to see this completed.

Based on how I read the code, the code changes should work as-is, but the docs & tests still need work.

Unfortunately I wasn't able to get the tests to run on mac (following this guide), so I put this on hold for now.

@Baoyuantop
Contributor

> Unfortunately I wasn't able to get the tests to run on mac

Can you share the specific problem?

@deiwin
Author

deiwin commented Apr 23, 2025

Following the linked guide, everything seems to work up to and including `make run`, but when I run any tests (e.g. with the suggested `docker exec -it apisix-dev-env prove t/admin/routes.t` command), I get many errors and failures in every test I've tried. E.g.:

❯ docker exec -it apisix-dev-env prove t/admin/routes.t
t/admin/routes.t .. 1/? t/admin/routes.t TEST 2: get route(id: 1) - timeout when waiting for the process 45487 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 2: get route(id: 1) - WARNING: killing the child process 45487 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 4/? t/admin/routes.t TEST 3: delete route(id: 1) - timeout when waiting for the process 45496 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 3: delete route(id: 1) - WARNING: killing the child process 45496 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 7/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 3: delete route(id: 1)
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 3: delete route(id: 1) - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 949.
#          got: ''
#     expected: '200'

#   Failed test 't/admin/routes.t TEST 3: delete route(id: 1) - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '[delete] code: 200 message: passed
# '
t/admin/routes.t TEST 4: delete route(id: not_found) - timeout when waiting for the process 45507 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 4: delete route(id: not_found) - WARNING: killing the child process 45507 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 11/? t/admin/routes.t TEST 5: post route + delete - timeout when waiting for the process 45516 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 5: post route + delete - WARNING: killing the child process 45516 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 14/? t/admin/routes.t TEST 6: uri + upstream - timeout when waiting for the process 45525 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 6: uri + upstream - WARNING: killing the child process 45525 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 17/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 6: uri + upstream
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 6: uri + upstream - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 949.
#          got: ''
#     expected: '200'

#   Failed test 't/admin/routes.t TEST 6: uri + upstream - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '[push] code: 200 message: passed
# '
t/admin/routes.t TEST 7: uri + plugins - timeout when waiting for the process 45534 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 7: uri + plugins - WARNING: killing the child process 45534 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 21/? t/admin/routes.t TEST 8: invalid route: duplicate method - timeout when waiting for the process 45543 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 8: invalid route: duplicate method - WARNING: killing the child process 45543 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 24/? t/admin/routes.t TEST 9: invalid method - timeout when waiting for the process 45552 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 9: invalid method - WARNING: killing the child process 45552 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 27/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 9: invalid method
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 9: invalid method - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 936.
#          got: ''
#     expected: '400'

#   Failed test 't/admin/routes.t TEST 9: invalid method - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '{"error_msg":"invalid configuration: property \"methods\" validation failed: failed to validate item 1: matches none of the enum values"}
# '
t/admin/routes.t .. 30/? t/admin/routes.t TEST 10: invalid service id - timeout when waiting for the process 45561 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 10: invalid service id - WARNING: killing the child process 45561 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 31/? t/admin/routes.t TEST 11: service id: not exist - timeout when waiting for the process 45570 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 11: service id: not exist - WARNING: killing the child process 45570 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 34/? t/admin/routes.t TEST 12: invalid id - timeout when waiting for the process 45579 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 12: invalid id - WARNING: killing the child process 45579 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 37/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 12: invalid id
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 12: invalid id - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 936.
#          got: ''
#     expected: '400'

#   Failed test 't/admin/routes.t TEST 12: invalid id - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '{"error_msg":"wrong route id"}
# '
t/admin/routes.t TEST 13: id in the rule - timeout when waiting for the process 45588 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 13: id in the rule - WARNING: killing the child process 45588 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 41/? t/admin/routes.t TEST 14: integer id less than 1 - timeout when waiting for the process 45597 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 14: integer id less than 1 - WARNING: killing the child process 45597 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 44/? t/admin/routes.t TEST 15: invalid upstream_id - timeout when waiting for the process 45606 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 15: invalid upstream_id - WARNING: killing the child process 45606 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 47/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 15: invalid upstream_id
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 15: invalid upstream_id - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 936.
#          got: ''
#     expected: '400'

#   Failed test 't/admin/routes.t TEST 15: invalid upstream_id - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '{"error_msg":"invalid configuration: property \"upstream_id\" validation failed: object matches none of the required"}
# '
t/admin/routes.t TEST 16: not exist upstream_id - timeout when waiting for the process 45615 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 16: not exist upstream_id - WARNING: killing the child process 45615 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 51/? t/admin/routes.t TEST 17: wrong route id, do not need it - timeout when waiting for the process 45624 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 17: wrong route id, do not need it - WARNING: killing the child process 45624 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 54/? t/admin/routes.t TEST 18: wrong route id, do not need it - timeout when waiting for the process 45633 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 18: wrong route id, do not need it - WARNING: killing the child process 45633 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 57/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 18: wrong route id, do not need it
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 18: wrong route id, do not need it - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 936.
#          got: ''
#     expected: '400'

#   Failed test 't/admin/routes.t TEST 18: wrong route id, do not need it - response_body - response is expected (repeated req 0, req 0)'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1660.
#          got: ''
#     expected: '{"error_msg":"wrong route id, do not need it"}
# '
t/admin/routes.t TEST 19: limit-count with `disable` option - timeout when waiting for the process 45642 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 19: limit-count with `disable` option - WARNING: killing the child process 45642 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 61/? t/admin/routes.t TEST 20: host: *.foo.com - timeout when waiting for the process 45651 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 20: host: *.foo.com - WARNING: killing the child process 45651 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 64/? t/admin/routes.t TEST 21: invalid host: a.*.foo.com - timeout when waiting for the process 45660 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 21: invalid host: a.*.foo.com - WARNING: killing the child process 45660 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 67/?
#   Failed test 'ERROR: client socket timed out - t/admin/routes.t TEST 21: invalid host: a.*.foo.com
# '
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 2206.

#   Failed test 't/admin/routes.t TEST 21: invalid host: a.*.foo.com - status code ok'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 936.
#          got: ''
#     expected: '400'

#   Failed test 't/admin/routes.t TEST 21: invalid host: a.*.foo.com - response_body_like - response is expected ()'
#   at /usr/local/share/perl/5.30.0/Test/Nginx/Socket.pm line 1706.
#                   ''
#     doesn't match '(?^s:{"error_msg":"invalid configuration: property \\"host\\" validation failed: failed to match pattern .*
# )'
t/admin/routes.t TEST 22: invalid host: *.a.*.foo.com - timeout when waiting for the process 45669 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 22: invalid host: *.a.*.foo.com - WARNING: killing the child process 45669 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 71/? t/admin/routes.t TEST 23: removing the init_dir key from etcd can still list all routes - timeout when waiting for the process 45678 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
t/admin/routes.t TEST 23: removing the init_dir key from etcd can still list all routes - WARNING: killing the child process 45678 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. 74/? END - timeout when waiting for the process 45687 to exit at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 668.
END - WARNING: killing the child process 45687 with force... at /usr/local/share/perl/5.30.0/Test/Nginx/Util.pm line 707.
t/admin/routes.t .. Failed 21/76 subtests

Test Summary Report
-------------------
t/admin/routes.t (Wstat: 0 Tests: 76 Failed: 21)
  Failed tests:  7-9, 17-19, 27-29, 37-39, 47-49, 57-59
                67-69
  Parse errors: No plan found in TAP output
Files=1, Tests=76, 210 wallclock secs ( 0.03 usr  0.01 sys +  1.24 cusr  0.98 csys =  2.26 CPU)
Result: FAIL

@deiwin
Author

deiwin commented Apr 23, 2025

I actually opened this PR hoping that I could use the CI system to validate my changes, but unfortunately it looks like not all PRs are automatically validated by CI.

@Baoyuantop
Contributor

Hi @deiwin, the running Docker container may have limited resources, causing the process to execute slowly and fail to complete within the time expected by the test framework. You can try to modify the timeout parameters of the test framework to give the process more time to complete the operation. You can check the configuration of the Test::Nginx::Socket module to see if the timeout parameter can be adjusted.

@deiwin
Author

deiwin commented May 7, 2025

Any specific conf changes (file & parameter) you could recommend? My Docker should have 8 (M1) cores & 15.5GB of memory available, so it shouldn't be too constrained.

@Baoyuantop
Contributor

> Any specific conf changes (file & parameter) you could recommend? My Docker should have 8 (M1) cores & 15.5GB of memory available, so it shouldn't be too constrained.

For example, the `timeout` setting in the `t/APISIX.pm` file.

deiwin added 2 commits May 7, 2025 18:08
Currently there is no way to distinguish Datadog metrics for different
HTTP endpoints if these endpoints are served through a single APISIX
route.

With these changes, if `include_path` is set to true, the path pattern
by which the HTTP request was matched to a route is included as a metric
tag with the `path:` key. This allows different endpoints to be
distinguished in metrics.
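
As a rough illustration of the behaviour this commit message describes (not the plugin's actual Lua code), the sketch below builds a DogStatsD-style tag list and appends a `path:` tag only when `include_path` is enabled. The function name `build_tags`, the `route_name` base tag, and the example pattern `/users/*` are assumptions made for illustration.

```python
from typing import List, Optional


def build_tags(route_name: str,
               matched_path: Optional[str],
               include_path: bool = False) -> List[str]:
    """Return metric tags; the path tag is added only on opt-in."""
    # Hypothetical base tag; the real plugin emits its own set of tags.
    tags = [f"route_name:{route_name}"]
    # Per the commit message, the tag carries the path *pattern* that matched
    # the route, not the raw request URI.
    if include_path and matched_path:
        tags.append(f"path:{matched_path}")
    return tags


if __name__ == "__main__":
    print(build_tags("api", "/users/*", include_path=True))
    print(build_tags("api", "/users/*"))
```

With `include_path` left at its default, the tag list is unchanged, which is what keeps the option backward compatible in this sketch.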
Currently there is no way to distinguish Datadog metrics for e.g. GET &
POST requests for the same endpoint within a single APISIX route,
although their performance characteristics are likely to be quite
different.

With these changes, if `include_method` is set to true, HTTP method is
included as a metric tag with the `method:` key, enabling such requests
to be differentiated.
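
A minimal sketch of the same opt-in idea for the HTTP method, assuming a hypothetical helper `with_method_tag` and an illustrative base tag. It only shows why GET and POST requests on one route become separable once the `method:` tag is present; it is not the plugin's implementation.

```python
from typing import List


def with_method_tag(base_tags: List[str],
                    http_method: str,
                    include_method: bool = False) -> List[str]:
    """Copy the base tags and append a method tag when opted in."""
    tags = list(base_tags)  # copy so the caller's list is not mutated
    if include_method:
        tags.append(f"method:{http_method}")
    return tags


if __name__ == "__main__":
    base = ["route_name:api"]  # hypothetical base tag
    get_tags = with_method_tag(base, "GET", include_method=True)
    post_tags = with_method_tag(base, "POST", include_method=True)
    # The two requests now carry different tag sets and can be grouped apart.
    print(get_tags, post_tags, get_tags != post_tags)
```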
@deiwin force-pushed the add_path_to_datadog_plugin branch from 7767c7d to 216c7b6 on May 7, 2025 15:08
@deiwin changed the title from "feat: Include path in tags with Datadog plugin" to "feat(datadog): Improve Datadog plugin tag support" on May 7, 2025
@deiwin
Author

deiwin commented May 7, 2025

Looks like that worked and I am now able to reliably run at least the relevant Datadog plugin tests! 🎉 Thank you, @Baoyuantop! 🙇

I've finished the changes now and updated the PR with them. I'll now open it for review.

@deiwin deiwin marked this pull request as ready for review May 7, 2025 15:18
@dosubot (bot) added the size:M, enhancement, and plugin labels on May 7, 2025
`constant_tags` are already supported at the plugin configuration level.
However, sometimes different values may be required for each route, but
this was previously not possible.

For example, if some routes are owned by team A and some routes by team
B, they could add `constant_tags: ["owner:team_a"]` or
`constant_tags: ["owner:team_b"]` to each route, and would then be able
to group metrics by team on Datadog.
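
The team-ownership example can be sketched as follows. How the PR actually combines plugin-level and route-level `constant_tags` is not shown in this thread, so the simple concatenation below is an assumption, as are the helper name `effective_constant_tags`, the route names, and the plugin-level value `source:apisix`.

```python
from typing import Dict, List


def effective_constant_tags(plugin_level: List[str],
                            route_level: List[str]) -> List[str]:
    """Combine plugin-level and route-level constant tags.

    Assumption: route-level tags are appended to the plugin-level ones;
    the PR may combine or override them differently.
    """
    return list(plugin_level) + list(route_level)


if __name__ == "__main__":
    plugin_level = ["source:apisix"]  # illustrative plugin-level value
    routes: Dict[str, List[str]] = {  # hypothetical routes and owners
        "orders": ["owner:team_a"],
        "billing": ["owner:team_b"],
    }
    for name, route_tags in routes.items():
        print(name, effective_constant_tags(plugin_level, route_tags))
```

Each route then reports its own `owner:` tag, which is what allows grouping by team on the Datadog side.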
@deiwin force-pushed the add_path_to_datadog_plugin branch from 216c7b6 to da581f7 on May 8, 2025 05:57
@deiwin
Author

deiwin commented May 8, 2025

I fixed the lint issue (I hope) but the other CI errors seemed unrelated to my changes.

@Baoyuantop Baoyuantop self-requested a review May 8, 2025 10:08
@dosubot (bot) added the size:L label and removed the size:M label on May 9, 2025
@deiwin force-pushed the add_path_to_datadog_plugin branch from 9b34b89 to ab93afb on May 9, 2025 13:55
@deiwin
Author

deiwin commented May 9, 2025

I realized there's one more change required for what I'm trying to achieve, so I added another commit to also add a response_status_class tag. Additional details in the commit message again.

Having a separate tag for the HTTP response code class allows useful
grouping in Datadog by e.g. successful requests (2xx), client errors
(4xx), and server errors (5xx). The existing `response_status` already
essentially includes that information, but because of how Datadog works,
one would need to list out each possible 4xx status code to
group all client errors, for example. Having a separate tag for the
class greatly simplifies this.

Generally it may be problematic to add new tags by default without an
opt-in configuration option, because Datadog charges based on [custom
metrics count][1] and additional tags can increase that count. However,
the additional tag is safe to add here, because it is guaranteed not to
increase the custom metric count. That is because the new tag is based
on the already included `response_status` value, which is more granular
than the new `response_status_class` value, so the number of unique tag
combinations will not change.

[1]: https://docs.datadoghq.com/metrics/custom_metrics/
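
The cardinality argument above can be checked with a small sketch. It assumes the class is derived as the first digit of the status code followed by `xx` (the exact tag value format used by the PR is not shown here), and the helper names are hypothetical. Because the class is a pure function of `response_status`, adding it cannot create new unique tag combinations.

```python
from typing import Iterable, Set, Tuple


def status_class(status: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> '4xx' (assumed format)."""
    return f"{status // 100}xx"


def tag_combinations(statuses: Iterable[int],
                     with_class: bool) -> Set[Tuple[str, ...]]:
    """Collect the unique tag combinations produced by a stream of responses."""
    combos = set()
    for status in statuses:
        tags = [f"response_status:{status}"]
        if with_class:
            tags.append(f"response_status_class:{status_class(status)}")
        combos.add(tuple(tags))
    return combos


if __name__ == "__main__":
    observed = [200, 201, 404, 404, 429, 500]  # illustrative sample
    without = tag_combinations(observed, with_class=False)
    with_cls = tag_combinations(observed, with_class=True)
    # Same number of unique combinations with and without the class tag.
    print(len(without) == len(with_cls))
```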
@deiwin force-pushed the add_path_to_datadog_plugin branch from ab93afb to 5120f05 on May 12, 2025 11:07
@deiwin
Author

deiwin commented May 12, 2025

I fixed the lint issue with the too long line: https://github.com/apache/apisix/compare/ab93afb5125b041bbc90ffab1cd2a0ec1b8834af..5120f05bed6eef9248d0a8a8f0dccc59cafdedcd

I also found a way to run the lint (or at least luacheck) locally in the container (`luarocks install luacheck` + `make lint`), so I won't have to wait for the CI run to see these issues in the future.

The other CI failures seem unrelated to my changes again.

Labels: enhancement, plugin, size:L