[Parallel Router] Random strong test failures #3029

Open
AmirhosseinPoolad opened this issue May 9, 2025 · 6 comments

@AmirhosseinPoolad
Contributor

AmirhosseinPoolad commented May 9, 2025

Strong tests for parallel routing can randomly fail. Here's the failed CI run for a PR that did not touch VTR's code in any way and only changed an unrelated workflow file:
https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/14929980706/job/41943464845#step:8:3197

There was a run failure and some QoR failures in some of the parallel router tests. When I ran the strong tests for that branch locally, they passed without any run or QoR failures, and VTR master doesn't have any strong test failures either.

@AmirhosseinPoolad
Contributor Author

@AlexandreSinger @ueqri Any ideas why this might have happened? Thanks a lot!

@AmirhosseinPoolad
Contributor Author

AmirhosseinPoolad commented May 9, 2025

Update: Just re-ran the test for the PR and it was successful.
https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/14929980706/job/41947736930

Update 2: The same tests failed on the master branch of my fork (identical to VTR master).
https://github.com/AmirhosseinPoolad/vtr-verilog-to-routing/actions/runs/14933274445/job/41954656119

@ueqri
Contributor

ueqri commented May 12, 2025

@AmirhosseinPoolad Hi Amir, I am investigating the test failures you mentioned. I have checked the stdout log files of the failed tests in the two action runs:

  • Log file of the test that failed at strong_multiclock when using two threads + four queues in the action run
  • Log file of the test that failed at strong_multiclock when using two threads + eight queues + queue draining (deterministic for Dijkstra mode) in the action run

(Updated) It turned out to be segmentation fault issues (the program crashed with signal 11) according to the CI logs. The log files seem to be incomplete (only partial routing logs were printed and no post-routing log is present), which might be because the corresponding VPR executions (specifically the routing stage) were interrupted. I was wondering if the GitHub runner could have killed the program due to some restrictions (?) on parallel execution.

Additionally, I ran these specific tests locally on wintermute 70 times with no errors or failures occurring. I therefore strongly suspect the issue is related to the GitHub runner environment.

If there are any GitHub runner docs I should refer to (for the limitations/restrictions) or more context beyond the test log files, please let me know and I will try to fix those tests based on that info. Also, could you please ping me the next time this happens? Hopefully that will provide more helpful context for debugging.

@vaughnbetz
Contributor

From looking at the log, there's nothing very strange about the command line or this circuit.
One possible clue: it appears the seg fault occurs either during a routing attempt in the binary search (over channel width), or right after an attempt where we have a disconnected rr-graph -- no path exists between some source and destination. That triggers an immediate routing failure (in the first routing iteration), and we return through a slightly different code path before trying the next routing attempt. The failure case is an empty heap -- maybe something bad happens with an empty heap and the multi-queue? Some race condition in that case?
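
To make the shape of the suspected hazard concrete, here is a minimal C++ sketch. The class and member names are invented and this is not VTR's actual MultiQueue code; it only shows why scanning several locked sub-queues for emptiness is not a consistent snapshot, which is one way an empty-heap/multi-queue interaction could misbehave when the heap is empty (or drains immediately) on a disconnected rr-graph.

```cpp
// Hypothetical illustration only -- invented names, NOT VTR's MultiQueue code.
// The termination check locks one sub-queue at a time, so its "all empty"
// conclusion can race with a concurrent pop-then-push on another thread.
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

class MultiQueueSketch {
  public:
    explicit MultiQueueSketch(std::size_t num_queues) : queues_(num_queues) {}

    void push(std::size_t q, float key) {
        std::lock_guard<std::mutex> guard(queues_[q].lock);
        queues_[q].heap.push(key);
    }

    std::optional<float> try_pop(std::size_t q) {
        std::lock_guard<std::mutex> guard(queues_[q].lock);
        if (queues_[q].heap.empty()) return std::nullopt;
        float key = queues_[q].heap.top();
        queues_[q].heap.pop();
        return key;
    }

    // RACY: each sub-queue is inspected under its own lock, one at a time.
    // A worker that has popped its last element but not yet pushed the
    // resulting neighbors makes every sub-queue look empty for a moment,
    // so this can report "done" while work is still in flight.
    bool looks_empty() const {
        for (const auto& q : queues_) {
            std::lock_guard<std::mutex> guard(q.lock);
            if (!q.heap.empty()) return false;
        }
        return true;
    }

  private:
    struct SubQueue {
        mutable std::mutex lock;
        std::priority_queue<float> heap;
    };
    std::vector<SubQueue> queues_;
};
```

Whether VTR's actual termination/empty-heap handling has any such window is exactly what needs to be verified; the sketch only illustrates the kind of race being suggested, not a confirmed root cause.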

@ueqri: if this issue can't be resolved quickly, we should temporarily comment out this test in CI and reactivate it once it is stable.

@AlexandreSinger
Contributor

@ueqri What is the status on this? Should we temporarily comment this test out? It is causing random failures.

ueqri added a commit to ueqri/vtr-verilog-to-routing that referenced this issue May 15, 2025
…outer

Temporarily disabled the `strong_multiclock` test in `vtr_reg_strong` CI
regression tests for the parallel connection router, due to some random
failures as mentioned in Issue verilog-to-routing#3029.

After fixing the problem with the `strong_multiclock` test, this will be
reactivated.
@ueqri
Contributor

ueqri commented May 15, 2025

Still investigating the issue. I tried ThreadSanitizer but nothing particularly interesting was found. Only a few data races were detected (e.g., in prune_node and should_not_explore_neighbors), which were written on purpose (as mentioned in the quote below) and do not compromise correctness. The sanitizer log can be found here if you are interested.

Quote from the paper: To further reduce lock contention, we added a cheap read-only check before acquiring the lock (Line 10), motivated by Shun et al.
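
For readers unfamiliar with the pattern being quoted, here is a minimal C++ sketch of a "cheap read-only check before acquiring the lock". The struct and field names are invented and this is not VTR's actual prune_node code; it only shows why ThreadSanitizer flags the first read as a data race while the update itself stays correct, because the decision is re-made under the lock.

```cpp
// Hypothetical sketch of the check-then-lock pattern; invented names, not
// VTR's implementation. TSan reports the unlocked first read as a data race,
// but a stale read only costs (or saves) one lock acquisition -- the write
// decision is always re-made while holding the lock.
#include <limits>
#include <mutex>

struct NodeCostEntry {
    std::mutex lock;
    float best_total_cost = std::numeric_limits<float>::infinity();

    // Returns true if new_cost improved the stored best cost.
    bool try_update(float new_cost) {
        // Cheap read-only check: if the new path is clearly not better,
        // skip the lock entirely (this read is the benign race).
        if (new_cost >= best_total_cost) return false;

        std::lock_guard<std::mutex> guard(lock);
        // Re-check under the lock; the unlocked read above may have been stale.
        if (new_cost >= best_total_cost) return false;
        best_total_cost = new_cost;
        return true;
    }
};
```

Strictly speaking, the unlocked read is still a data race under the C++ memory model; if the sanitizer noise ever becomes a nuisance, one option that keeps the optimization would be to make the field a std::atomic<float> and do the first check with a relaxed load.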

I am currently following the clue provided by Vaughn to see if anything interesting turns up. We can comment out this test in the meantime (PR #3047 created).

Also, since CI reproduces these random failures so frequently (I cannot reproduce them even once locally after ~100 runs), it might be worth building an identical CI environment myself and running the tests inside it, hopefully catching the seg faults with gdb.
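
A complementary option, in case attaching gdb inside a CI-like environment proves awkward: a SIGSEGV handler that dumps a raw backtrace to stderr, so the CI log itself records at least a stack trace the next time the crash fires. This is a generic glibc/Linux sketch, not code taken from (or claimed to be missing from) VTR:

```cpp
// Generic debugging aid (glibc/Linux), not VTR code: install a SIGSEGV handler
// that writes a raw backtrace to stderr before exiting, so an otherwise
// truncated CI log still shows where the crash happened.
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

static void segv_handler(int sig) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    // backtrace_symbols_fd() writes straight to a file descriptor and does not
    // allocate, which keeps the handler reasonably signal-safe.
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    _exit(128 + sig);
}

int main() {
    std::signal(SIGSEGV, segv_handler);

    // ... run the workload that crashes intermittently; the lines below only
    // demonstrate that the handler fires.
    volatile int* null_ptr = nullptr;
    return *null_ptr;
}
```

The printed addresses can then be symbolized offline (e.g., with addr2line against the same binary) to map the trace back to source lines.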
