@hugovk hugovk commented Jan 10, 2026

We can apply @henryiii's improvement to packaging in pypa/packaging#1030 (see also https://iscinumpy.dev/post/packaging-faster/) to improve the performance of canonicalize_name and make it ~3.7 times faster.
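For readers unfamiliar with the technique, here is a minimal sketch of the idea, not the merged code. It assumes the previous implementation was a regex substitution followed by `lower()` and `replace()`, and that the new one uses a `str.translate` table plus a loop to collapse repeated separators. The names `normalize_regex`, `normalize_translate`, and `_TABLE` are hypothetical, and the table only lowercases ASCII letters, so the two functions are only expected to agree on ASCII names.

```python
import re


def normalize_regex(name: str) -> str:
    # Assumed "before" shape: collapse runs of -, _ and . with a regex,
    # lowercase, then use underscores as the separator.
    return re.sub(r"[-_.]+", "-", name).lower().replace("-", "_")


# One table that lowercases ASCII letters and maps "-" and "." to "_".
_TABLE = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ-.",
    "abcdefghijklmnopqrstuvwxyz__",
)


def normalize_translate(name: str) -> str:
    # Assumed "after" shape: a single translate pass, then collapse any
    # repeated underscores left behind by runs of separators.
    value = name.translate(_TABLE)
    while "__" in value:
        value = value.replace("__", "_")
    return value


if __name__ == "__main__":
    for sample in ["Foo.Bar--baz_qux", "a-._b", "simple", "UPPER_case"]:
        assert normalize_regex(sample) == normalize_translate(sample), sample
    print("both implementations agree on the sample names")
```

The gain presumably comes from avoiding the regex engine and the intermediate strings for the common case, where a name contains no repeated separators and the `while` loop never runs.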

Benchmark

Run Prepared.normalize(n) on every name in PyPI:

# benchmark_names_stdlib.py
import sqlite3
import timeit
from importlib.metadata import Prepared

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite
# Or use pre-cached files from:
# https://gist.github.com/hugovk/efdbee0620cc64df7b405b52cf0b6e42

CACHE_FILE = "/tmp/bench/names.txt"
DB_FILE = "/tmp/bench/pypi-data.sqlite"

try:
    with open(CACHE_FILE) as f:
        TEST_ALL_NAMES = [line.rstrip("\n") for line in f]
except FileNotFoundError:
    TEST_ALL_NAMES = []
    with sqlite3.connect(DB_FILE) as conn:
        with open(CACHE_FILE, "w") as cache:
            for (name,) in conn.execute("SELECT name FROM projects"):
                if name:
                    TEST_ALL_NAMES.append(name)
                    cache.write(name + "\n")


def bench():
    for n in TEST_ALL_NAMES:
        Prepared.normalize(n)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_NAMES):,} names")
    t = timeit.timeit("bench()", globals=globals(), number=1)
    print(f"Time: {t:.4f} seconds")

Benchmark data can be found at https://gist.github.com/hugovk/efdbee0620cc64df7b405b52cf0b6e42

Before

With optimisations:

./python.exe benchmark_names_stdlib.py
Loaded 8,344,947 names
Time: 5.1483 seconds

After

./python.exe benchmark_names_stdlib.py
Loaded 8,344,947 names
Time: 1.3754 seconds

3.7 times faster.
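The speedup figure follows directly from the two timings above. A small sketch of the arithmetic, using only the numbers reported in the runs (the variable names are illustrative):

```python
# Timings and name count copied from the benchmark output above.
names = 8_344_947
before_s = 5.1483
after_s = 1.3754

speedup = before_s / after_s
print(f"Speedup: {speedup:.1f}x")

# Average cost per call in nanoseconds, before and after.
print(f"Before: {before_s / names * 1e9:.0f} ns per name")
print(f"After:  {after_s / names * 1e9:.0f} ns per name")
```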

@hugovk hugovk requested review from jaraco and warsaw as code owners January 10, 2026 13:44
@hugovk hugovk changed the title importlib.metadata: Use translate to improve performance of canonicalize_name gh-143658: importlib.metadata: Use translate to improve performance of canonicalize_name Jan 10, 2026
@hugovk hugovk added performance Performance or resource usage topic-importlib labels Jan 10, 2026
@picnixz picnixz changed the title gh-143658: importlib.metadata: Use translate to improve performance of canonicalize_name gh-143658: importlib.metadata: Use str.translate to improve performance of importlib.metadata.Prepared.normalized Jan 10, 2026
Co-authored-by: Bénédikt Tran <[email protected]>

@picnixz picnixz left a comment

Do we actually have tests for this? If not, it might be good to add some.

@johnslavik johnslavik left a comment

Small ideas

@hugovk hugovk merged commit cbf9b8c into python:main Jan 13, 2026
50 checks passed
@hugovk hugovk deleted the 3.15-importlib.metadata-canonicalize_name branch January 13, 2026 06:54