Skip to content

feat: parse server response and react to error messages #281

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 6, 2022

Conversation

kruskall
Copy link
Member

@kruskall kruskall commented Aug 22, 2022

Start backoff mechanism on critical error and log a warning on
client errors.

Any change to the transport status is guarded by a mutex but
some methods were using the field directly, causing race
conditions.

apmproxy tests are using a timeout of 5s to wait for the status
to change during backoff. However there is a subtle bug that
causes the test to fail because the initial delay is exactly 5s.
To avoid this, the tests wait 7s now.

Blocked by #280

Closes #225
Closes #205

@github-actions github-actions bot added the aws-λ-extension AWS Lambda Extension label Aug 22, 2022
@kruskall kruskall marked this pull request as draft August 22, 2022 19:36
@apmmachine
Copy link

apmmachine commented Aug 22, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-09-01T10:51:35.908+0000

  • Duration: 8 min 36 sec

Test stats 🧪

Test Results
Failed 0
Passed 160
Skipped 2
Total 162

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Any change to the transport status is guarded by a mutex but
some methods were using the field directly, causing race
conditions.

apmproxy tests are using a timeout of 5s to wait for the status
to change during backoff. However there is a subtle bug that
causes the test to fail because the initial delay is exactly 5s.
To avoid this, the tests wait 7s now.

Update status name and documentation.
Start backoff mechanism on critical error and log a warning on
client errors.
@kruskall kruskall force-pushed the fix/transport-state branch from d726fca to 6d5679d Compare August 23, 2022 08:07
@kruskall kruskall marked this pull request as ready for review August 23, 2022 08:18
@kruskall
Copy link
Member Author

/test

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, mostly just wondering about the decision not to backoff on auth & validation failures. We should probably have a test or two covering the new states.

Comment on lines +227 to +228
// No need to start backoff, this is a temporary status. It usually
// means we went over the limit of events/s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would ClientFailing be temporary? From what I can see above, it's either due to auth failure (probably not temporary without user intervention?) or some validation error (probably implies a bug in the agent or server?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see above, it's either due to auth failure (probably not temporary without user intervention?)

Ah, good point! I've made auth errors a critical failure

When would ClientFailing be temporary?

From what I could see from the middlewares in the APM server repository, this would happen on data decoding/validation errors, request body too large or invalid query. Those are errors tied to a specific request, I don't think we should trigger a backoff and associated delay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I could see from the middlewares in the APM server repository, this would happen on data decoding/validation errors, request body too large or invalid query. Those are errors tied to a specific request, I don't think we should trigger a backoff and associated delay.

good point, I think this is fine as is now.

@kruskall kruskall requested a review from axw August 27, 2022 20:43
Comment on lines +227 to +228
// No need to start backoff, this is a temporary status. It usually
// means we went over the limit of events/s.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I could see from the middlewares in the APM server repository, this would happen on data decoding/validation errors, request body too large or invalid query. Those are errors tied to a specific request, I don't think we should trigger a backoff and associated delay.

good point, I think this is fine as is now.

@kruskall kruskall requested a review from simitt August 30, 2022 20:05
@kruskall kruskall requested a review from simitt August 31, 2022 14:44
Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for the changes.

@simitt
Copy link
Contributor

simitt commented Sep 6, 2022

@axw any other concerns from your side? I believe this is ready to merge otherwise.

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the silence, I lost sight of this one. LGTM.

@kruskall kruskall merged commit d37b671 into elastic:main Sep 6, 2022
@kruskall kruskall deleted the fix/transport-state branch September 6, 2022 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws-λ-extension AWS Lambda Extension
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parse APM Server response and react to error codes Log a warning when authentication with APM server fails
4 participants