TestAffineQuantizedTensorParallel fails on H100 #1000

Closed
jainapurva opened this issue Oct 3, 2024 · 1 comment
Comments

@jainapurva
Contributor

TestAffineQuantizedTensorParallel fails on H100 for the bfloat16, float16, and float32 dtypes. We need to find the cause and fix it.

Error:

ERROR: test_tp_float32 (torchao.testing.utils.TorchAOTensorParallelTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/appy/.conda/envs/dev_ao/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    self._join_processes(fn)
  File "/home/appy/.conda/envs/dev_ao/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 767, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/home/appy/.conda/envs/dev_ao/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 821, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 300.0712020397186 seconds

----------------------------------------------------------------------
Ran 9 tests in 2701.009s

FAILED (errors=9)
/home/appy/.conda/envs/dev_ao/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 36 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d 
@jainapurva
Contributor Author

The issue was caused by the PyTorch version.
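
Since the failure came down to the installed PyTorch version, it may help to record package versions when reporting a failure like this. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are illustrative assumptions, not part of torchao or PyTorch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "torchao")):
    """Return {package name: version string, or None if not installed}.

    Hypothetical helper for bug reports; not a torchao or PyTorch API.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Pasting this output next to the traceback makes it easier to compare a failing environment against a passing one.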

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024
* Add warning comments referring to unimplemented functionality

* JSON formatted response using OpenAI API types for server completion requests

* Add models endpoint (pytorch#1000)