License cache not always reused #4273

Open
MarkusObendrauf opened this issue Apr 22, 2025 · 3 comments · May be fixed by #4274

@MarkusObendrauf

Description

The licensedcode cache is not always utilized when running multiple processes in parallel. This was noticed while running stress-tests on scancode. We observed that, when multiple tests were started in separate processes at the same time, each process would separately build its own cache instead of using the existing one. This had a considerable performance cost, eventually leading to a LockTimeout.

The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.
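
To illustrate the missing check, here is a minimal, self-contained sketch of the "re-check after acquiring the lock" pattern. It is not the code in licensedcode/cache.py or in the linked PR: `file_lock`, `load_or_build`, the `LockTimeout` stand-in, the pickle format and all parameter names are assumptions made for illustration only.

```python
import os
import pickle
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Hypothetical stand-in for scancode.lockfile.LockTimeout."""


@contextmanager
def file_lock(lock_path, timeout=10.0, poll=0.05):
    """Illustrative lock: atomically create lock_path, polling until timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(timeout)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def _load(cache_path):
    with open(cache_path, "rb") as f:
        return pickle.load(f)


def load_or_build(cache_path, lock_path, build, timeout=10.0):
    """Load the cache if present; otherwise build it at most once across callers."""
    # Fast path: the cache already exists, so no lock is needed.
    if os.path.exists(cache_path):
        return _load(cache_path)

    with file_lock(lock_path, timeout=timeout):
        # Re-check under the lock: another process may have finished
        # building the cache while this one was waiting.
        if os.path.exists(cache_path):
            return _load(cache_path)

        data = build()
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(data, f)
        # Publish atomically so readers never see a half-written cache.
        os.replace(tmp_path, cache_path)
        return data
```

With the second existence check in place, processes that queue up on the lock load the finished cache instead of rebuilding it, so each waiter only holds the lock for the duration of a load. A real implementation would also have to deal with stale lock files left behind by crashed processes, which this sketch ignores.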

How To Reproduce

This was noticed when stress-testing a local test:

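        # Body of a unittest.TestCase method. Requires:
        #   from tempfile import NamedTemporaryFile
        #   from scancode.api import get_licenses
        # MIT_LICENSE_TEXT is a str constant defined by the test module.
        with NamedTemporaryFile() as test_file:
            test_file.write(MIT_LICENSE_TEXT.encode("utf-8"))
            test_file.seek(0)  # seek() also flushes the buffered write to disk
            results = get_licenses(test_file.name)  # slow: loads or builds the license index
            license_expression = results["detected_license_expression"]
            self.assertEqual(license_expression, "mit")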

We ran this on 100 processes in parallel.

Traceback (most recent call last):
  File "scancode/api.py", line 200, in get_licenses
    for detection in detections:
  File "licensedcode/detection.py", line 1947, in detect_licenses
    index = cache.get_index()
  File "licensedcode/cache.py", line 459, in get_index
    return get_cache(
  File "licensedcode/cache.py", line 399, in get_cache
    return populate_cache(
  File "licensedcode/cache.py", line 419, in populate_cache
    _LICENSE_CACHE = LicenseCache.load_or_build(
  File "licensedcode/cache.py", line 136, in load_or_build
    with lockfile.FileLock(lock_file).locked(timeout=timeout):
  File "runtime/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "scancode/lockfile.py", line 29, in locked
    raise LockTimeout(timeout)
scancode.lockfile.LockTimeout: 360

System configuration

  • OS: Linux
  • What version of scancode-toolkit was used to generate the scan file? scancode-toolkit-mini 32.3.2
  • What installation method was used to install/run scancode? pip
@pombredanne
Member

> The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.

We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 22, 2025
Fix: aboutcode-org#4273

Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>
@MarkusObendrauf MarkusObendrauf linked a pull request Apr 22, 2025 that will close this issue
@MarkusObendrauf
Author

> We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.

Oh, thank you for the context! Is there an issue with the proposed fix in the PR? We don't have much control over how our stress tests are run, so we've needed to make these changes on our branch to fix failing tests.

@armijnhemel
Collaborator

> The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.
>
> We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.

If the cache shouldn't be generated when using many threads, then why is it possible to do (as Markus demonstrated)? It would be better to fail very early in the process, unless scancode is run in "cacheless mode" (if there is such a thing).
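
As a rough sketch of what "fail very early" could look like, the check below refuses to continue when no pre-built cache file exists unless the caller explicitly opts in to building one. `CacheNotBuiltError`, `require_prebuilt_cache` and the `allow_build` flag are hypothetical names for illustration; nothing here is an existing scancode API or option.

```python
import os


class CacheNotBuiltError(RuntimeError):
    """Hypothetical error raised when no pre-built cache is available."""


def require_prebuilt_cache(cache_path, allow_build=False):
    """Return True if the cache exists, False if a build is permitted.

    Raises CacheNotBuiltError when the cache is missing and building
    was not explicitly allowed by the caller.
    """
    if os.path.exists(cache_path):
        return True
    if allow_build:
        # The caller has taken responsibility for building the cache,
        # ideally in a single process with exclusive access.
        return False
    raise CacheNotBuiltError(
        f"No license cache found at {cache_path!r}; "
        "build it once in a single process before running in parallel."
    )
```

Under this assumption, parallel workers would call the check with the default `allow_build=False` and get an immediate, descriptive error instead of queueing on the lock until a timeout.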

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 24, 2025
Fix: aboutcode-org#4273

Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>