License cache not always reused #4273

MarkusObendrauf · 2025-04-22T13:37:16Z

Description

The licensedcode cache is not always reused when multiple processes run in parallel. We noticed this while running stress tests on scancode: when multiple tests were started in separate processes at the same time, each process built its own cache instead of using the existing one. This had a considerable performance cost and eventually led to a LockTimeout.

The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.
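To illustrate the check described above, here is a minimal sketch of a "re-check after acquiring the lock" pattern. It is not the code in licensedcode/cache.py: `load_or_build`, `build_index`, and `BUILD_CALLS` are hypothetical names, a `threading.Lock` stands in for the inter-process file lock, and a pickled dict stands in for the real index.

```python
import os
import pickle
import threading

# Hypothetical stand-in for an inter-process file lock.
_LOCK = threading.Lock()

# Records each (pretend) expensive build so the effect can be observed.
BUILD_CALLS = []


def build_index():
    """Pretend to be the slow index build (hypothetical)."""
    BUILD_CALLS.append(1)
    return {"mit": "MIT License"}


def _load(cache_file):
    with open(cache_file, "rb") as f:
        return pickle.load(f)


def load_or_build(cache_file):
    """Return the cached index, building it at most once."""
    # Fast path: an existing cache is reused without taking the lock.
    if os.path.exists(cache_file):
        return _load(cache_file)

    with _LOCK:
        # Re-check after acquiring the lock: another worker may have
        # finished the build while this one was waiting.
        if os.path.exists(cache_file):
            return _load(cache_file)

        index = build_index()
        # Write to a temporary name, then rename, so that readers on the
        # fast path never see a partially written file.
        tmp_file = cache_file + ".tmp"
        with open(tmp_file, "wb") as f:
            pickle.dump(index, f)
        os.replace(tmp_file, cache_file)
        return index
```

With this shape, workers that queue up behind the lock load the freshly written cache instead of rebuilding it, so the time spent holding the lock after the first build stays short.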

How To Reproduce

This was noticed when stress-testing a local test:

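        # Runs inside a unittest.TestCase method; MIT_LICENSE_TEXT is a str
        # defined elsewhere.
        # Imports: from tempfile import NamedTemporaryFile
        #          from scancode.api import get_licenses
        with NamedTemporaryFile() as test_file:
            test_file.write(MIT_LICENSE_TEXT.encode("utf-8"))
            test_file.seek(0)  # also flushes the buffered write to disk
            results = get_licenses(test_file.name)  # slow
            license_expression = results["detected_license_expression"]
            self.assertEqual(license_expression, "mit")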

We ran this in 100 parallel processes and got the following traceback:

Traceback (most recent call last):
  File "scancode/api.py", line 200, in get_licenses
    for detection in detections:
  File "licensedcode/detection.py", line 1947, in detect_licenses
    index = cache.get_index()
  File "licensedcode/cache.py", line 459, in get_index
    return get_cache(
  File "licensedcode/cache.py", line 399, in get_cache
    return populate_cache(
  File "licensedcode/cache.py", line 419, in populate_cache
    _LICENSE_CACHE = LicenseCache.load_or_build(
  File "licensedcode/cache.py", line 136, in load_or_build
    with lockfile.FileLock(lock_file).locked(timeout=timeout):
  File "runtime/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "scancode/lockfile.py", line 29, in locked
    raise LockTimeout(timeout)
scancode.lockfile.LockTimeout: 360
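
The parallel run can be approximated with a small launcher like the one below. This is only a sketch, not the harness we actually used: `run_in_parallel` is a hypothetical helper, and the snippet passed to it is a placeholder for code that calls `get_licenses()` as in the test above.

```python
import subprocess
import sys


def run_in_parallel(snippet, count):
    """Start `count` Python processes running `snippet`; return exit codes."""
    # Start every process before waiting on any of them, so that they
    # all compete for the cache lock at roughly the same time.
    procs = [
        subprocess.Popen([sys.executable, "-c", snippet])
        for _ in range(count)
    ]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Placeholder payload; a real run would import and call get_licenses().
    exit_codes = run_in_parallel("pass", 100)
    failed = sum(1 for code in exit_codes if code != 0)
    print(f"{failed} of {len(exit_codes)} processes failed")
```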

System configuration

OS: Linux
ScanCode version: scancode-toolkit-mini 32.3.2
Installation method: pip


pombredanne · 2025-04-22T13:54:09Z

> The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.

We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.
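
As a rough illustration of the "build once in a single process, then run in parallel" approach, a launcher could look like the sketch below. `warm_then_fan_out` and its parameters are hypothetical names, and the warm-up command shown in the comment assumes that `licensedcode.cache.get_index()` (the call visible in the traceback) is enough to populate the cache.

```python
import subprocess
import sys


def warm_then_fan_out(warm_cmd, worker_cmd, workers):
    """Run `warm_cmd` alone to completion, then start the workers."""
    # check=True raises CalledProcessError if the warm-up fails, so the
    # run stops early instead of letting every worker retry the build.
    subprocess.run(warm_cmd, check=True)
    procs = [subprocess.Popen(worker_cmd) for _ in range(workers)]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Assumed warm-up: load (or build) the license index once.
    warm = [sys.executable, "-c",
            "from licensedcode.cache import get_index; get_index()"]
    # Placeholder worker; a real worker would run the scan or test.
    worker = [sys.executable, "-c", "pass"]
    print(warm_then_fan_out(warm, worker, workers=4))
```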


MarkusObendrauf · 2025-04-22T14:03:17Z

> We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.

Oh, thank you for the context! Is there an issue with the proposed fix in the PR? We don't have much control over how our stress tests are run, so we've needed to make these changes on our branch to fix failing tests.

armijnhemel · 2025-04-22T14:05:54Z

> The root cause is in licensedcode/cache.py: After a process obtains a lock, it should check if another thread has already built the cache, but it does not.

> We used to have a procedure to check and wait for concurrent re/building of the cache, but this code has long been removed as this was complex, brittle, slow, and error prone: not a happy combo! ScanCode release builds always come with a pre-built, built-in cache now, so this is not an issue anymore. If you are ever building an index yourself, you should do it in a single process with exclusive access to the index IMHO.

If the cache shouldn't be generated when using many threads, then why is it possible to do (as Markus demonstrated)? It would be better to fail very early in the process, unless scancode is run in "cacheless mode" (if there is such a thing).


MarkusObendrauf added the bug label Apr 22, 2025

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 22, 2025

Check if license cache exists after obtaining lock

61eb276

Fix: aboutcode-org#4273 Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>

MarkusObendrauf linked a pull request Apr 22, 2025 that will close this issue

Check if license cache exists after obtaining lock #4274

Open

6 tasks

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 24, 2025

Check if license cache exists after obtaining lock

5a53575

Fix: aboutcode-org#4273 Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 24, 2025

Check if license cache exists after obtaining lock

35bdd47

Fix: aboutcode-org#4273 Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>

MarkusObendrauf pushed a commit to MarkusObendrauf/scancode-toolkit that referenced this issue Apr 24, 2025

Check if license cache exists after obtaining lock

42e065b

Fix: aboutcode-org#4273 Signed-off-by: Markus Obendrauf <markus.obendrauf@tngtech.com>
