[Parallel Router] Random strong test failures #3029

Open
AmirhosseinPoolad opened this issue May 9, 2025 · 6 comments

@AmirhosseinPoolad
Contributor

AmirhosseinPoolad commented May 9, 2025

Strong tests for parallel routing can randomly fail. Here's the failed CI run for a PR that did not touch VTR's code in any way and only changed an unrelated workflow file:
https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/14929980706/job/41943464845#step:8:3197

There was a run failure and some QoR failures in some of the parallel router tests. When I ran the strong tests for that branch locally, they passed without any run or QoR failures, and VTR master doesn't have any strong test failures either.

@AmirhosseinPoolad
Contributor Author

@AlexandreSinger @ueqri Any ideas why this might have happened? Thanks a lot!

@AmirhosseinPoolad
Contributor Author

AmirhosseinPoolad commented May 9, 2025

Update: Just re-ran the test for the PR and it was successful.
https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/14929980706/job/41947736930

Update 2: The same tests failed on the master branch of my fork (identical to VTR master).
https://github.com/AmirhosseinPoolad/vtr-verilog-to-routing/actions/runs/14933274445/job/41954656119

@ueqri
Contributor

ueqri commented May 12, 2025

@AmirhosseinPoolad Hi Amir, I am investigating the test failures you mentioned. I have checked the stdout log files of the failed tests in the two action runs:

  • Log file of the test that failed at strong_multiclock when using two threads + four queues in the action run
  • Log file of the test that failed at strong_multiclock when using two threads + eight queues + queue draining (deterministic for Dijkstra mode) in the action run

(Updated) It turned out to be segmentation fault issues (the program crashed with signal 11) according to the CI logs. The log files seem to be incomplete (only partial routing logs were printed and no post-routing log is present), which might be because the corresponding VPR executions (specifically the routing stage) were interrupted. I was wondering if the GitHub runner could have killed the program due to some restrictions (?) on parallel execution.

Additionally, I ran these specific tests locally on wintermute 70 times with no errors or failures occurring. I therefore strongly suspect the issue is related to the GitHub runner environment.

If there are any GitHub runner docs I should refer to (for the limitations/restrictions) or more context beyond the test log files, please let me know and I will try to fix those tests based on that info. Also, could you please ping me the next time this happens? Hopefully that will provide more helpful context for debugging.

@vaughnbetz
Contributor

From looking at the log, there's nothing very strange about the command line or this circuit.
One possible clue: it appears the seg fault occurs either during a routing attempt in the binary search (over channel width), or right after an attempt where we have a disconnected rr-graph -- no path exists between some source and destination. That triggers an immediate routing failure (in the first routing iteration), and we return through a slightly different code path before trying the next routing attempt. The failure case is an empty heap -- maybe something bad happens with an empty heap and the multi-queue? Some race condition in that case?
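
To make the shape of the suspected hazard concrete, here is a minimal C++ sketch. The class and member names are invented and this is not VTR's actual MultiQueue code; it only shows why scanning several locked sub-queues for emptiness is not a consistent snapshot, which is one way an empty-heap/multi-queue interaction could misbehave when the heap is empty (or drains immediately) on a disconnected rr-graph.

```cpp
// Hypothetical illustration only -- invented names, NOT VTR's MultiQueue code.
// The termination check locks one sub-queue at a time, so its "all empty"
// conclusion can race with a concurrent pop-then-push on another thread.
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

class MultiQueueSketch {
  public:
    explicit MultiQueueSketch(std::size_t num_queues) : queues_(num_queues) {}

    void push(std::size_t q, float key) {
        std::lock_guard<std::mutex> guard(queues_[q].lock);
        queues_[q].heap.push(key);
    }

    std::optional<float> try_pop(std::size_t q) {
        std::lock_guard<std::mutex> guard(queues_[q].lock);
        if (queues_[q].heap.empty()) return std::nullopt;
        float key = queues_[q].heap.top();
        queues_[q].heap.pop();
        return key;
    }

    // RACY: each sub-queue is inspected under its own lock, one at a time.
    // A worker that has popped its last element but not yet pushed the
    // resulting neighbors makes every sub-queue look empty for a moment,
    // so this can report "done" while work is still in flight.
    bool looks_empty() const {
        for (const auto& q : queues_) {
            std::lock_guard<std::mutex> guard(q.lock);
            if (!q.heap.empty()) return false;
        }
        return true;
    }

  private:
    struct SubQueue {
        mutable std::mutex lock;
        std::priority_queue<float> heap;
    };
    std::vector<SubQueue> queues_;
};
```

Whether VTR's actual termination/empty-heap handling has any such window is exactly what needs to be verified; the sketch only illustrates the kind of race being suggested, not a confirmed root cause.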

@ueqri: if this issue can't be resolved quickly, we should temporarily comment out this test in CI and reactivate it once it is stable.

@AlexandreSinger
Contributor

@ueqri What is the status on this? Should we temporarily comment this test out? It is causing random failures.

ueqri added a commit to ueqri/vtr-verilog-to-routing that referenced this issue May 15, 2025
…outer

Temporarily disabled the `strong_multiclock` test in `vtr_reg_strong` CI
regression tests for the parallel connection router, due to some random
failures as mentioned in Issue verilog-to-routing#3029.

After fixing the problem with the `strong_multiclock` test, this will be
reactivated.
@ueqri
Contributor

ueqri commented May 15, 2025

Still investigating the issue. I tried ThreadSanitizer but nothing particularly interesting was found. Only a few data races were detected (e.g., in prune_node and should_not_explore_neighbors), which were written on purpose (as mentioned in the quote below) and do not compromise correctness. The sanitizer log can be found here if you are interested.

Quote from the paper: To further reduce lock contention, we added a cheap read-only check before acquiring the lock (Line 10), motivated by Shun et al.
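
For readers unfamiliar with the pattern being quoted, here is a minimal C++ sketch of a "cheap read-only check before acquiring the lock". The struct and field names are invented and this is not VTR's actual prune_node code; it only shows why ThreadSanitizer flags the first read as a data race while the update itself stays correct, because the decision is re-made under the lock.

```cpp
// Hypothetical sketch of the check-then-lock pattern; invented names, not
// VTR's implementation. TSan reports the unlocked first read as a data race,
// but a stale read only costs (or saves) one lock acquisition -- the write
// decision is always re-made while holding the lock.
#include <limits>
#include <mutex>

struct NodeCostEntry {
    std::mutex lock;
    float best_total_cost = std::numeric_limits<float>::infinity();

    // Returns true if new_cost improved the stored best cost.
    bool try_update(float new_cost) {
        // Cheap read-only check: if the new path is clearly not better,
        // skip the lock entirely (this read is the benign race).
        if (new_cost >= best_total_cost) return false;

        std::lock_guard<std::mutex> guard(lock);
        // Re-check under the lock; the unlocked read above may have been stale.
        if (new_cost >= best_total_cost) return false;
        best_total_cost = new_cost;
        return true;
    }
};
```

Strictly speaking, the unlocked read is still a data race under the C++ memory model; if the sanitizer noise ever becomes a nuisance, one option that keeps the optimization would be to make the field a std::atomic<float> and do the first check with a relaxed load.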

I am currently following the clue provided by Vaughn to see if anything interesting turns up. We can comment out this test in the meantime (PR #3047 created).

Also, since CI reproduces these random failures so frequently (I cannot reproduce them even once locally after ~100 runs), it might be worth building an identical CI environment myself and running the tests inside it, hopefully catching the seg faults with gdb.
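
A complementary option, in case attaching gdb inside a CI-like environment proves awkward: a SIGSEGV handler that dumps a raw backtrace to stderr, so the CI log itself records at least a stack trace the next time the crash fires. This is a generic glibc/Linux sketch, not code taken from (or claimed to be missing from) VTR:

```cpp
// Generic debugging aid (glibc/Linux), not VTR code: install a SIGSEGV handler
// that writes a raw backtrace to stderr before exiting, so an otherwise
// truncated CI log still shows where the crash happened.
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

static void segv_handler(int sig) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    // backtrace_symbols_fd() writes straight to a file descriptor and does not
    // allocate, which keeps the handler reasonably signal-safe.
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    _exit(128 + sig);
}

int main() {
    std::signal(SIGSEGV, segv_handler);

    // ... run the workload that crashes intermittently; the lines below only
    // demonstrate that the handler fires.
    volatile int* null_ptr = nullptr;
    return *null_ptr;
}
```

The printed addresses can then be symbolized offline (e.g., with addr2line against the same binary) to map the trace back to source lines.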
