
Support tuples for [r]find() & [r]index() #2

Draft · wants to merge 135 commits into base: main

Conversation

@nineteendo nineteendo commented Jun 3, 2024

Motivation

For finding multiple substrings, there's currently no single algorithm that outperforms the others in most cases. Their performance varies significantly between the best and worst case, making it difficult to choose one:

| algorithm | loop in | loop startswith | re¹ | find str | unit |
| --- | --- | --- | --- | --- | --- |
| find chars best case | 1.00 | 1.19 | 1.56 | 1.37 | x 180 nsec |
| find chars mixed case | 1.00 | 1.22 | 1.54 | 91.01 | x 178 nsec |
| find chars worst case | 1262.20 | 1597.56 | 131.71 | 1.00 | x 32.80 usec |
| find subs best case | | 1.00 | 1.33 | 1.17 | x 212 nsec |
| find subs mixed case | | 1.00 | 1.30 | 3327.27 | x 220 nsec |
| find subs worst case | | 35.82 | 3.62 | 1.00 | x 1.46 msec |
| find many prefixes | | 1760.80 | 1.00 | 122.26 | x 301.00 usec |
| find many infixes | | 122.17 | 1.00 | 7.71 | x 4.33 msec |
| rfind chars best case | 1.00 | 2.61 | 4.34 | 1.96 | x 114 nsec |
| rfind chars mixed case | 1.00 | 2.66 | 4.34 | 4561.40 | x 114 nsec |
| rfind chars worst case | 50.10 | 55.25 | 6.94 | 1.00 | x 1.01 msec |
| rfind subs best case | | 1.58 | 2.55 | 1.00 | x 229 nsec |
| rfind subs mixed case | | 1.00 | 1.59 | 1921.05 | x 380 nsec |
| rfind subs worst case | | 38.07 | 4.97 | 1.00 | x 1.45 msec |
| rfind many suffixes | | 1094.46 | 1.00 | 50.51 | x 487.00 usec |
| rfind many infixes | | 54.70 | 1.00 | 2.41 | x 9.69 msec |

That's why I'm suggesting a dynamic algorithm that doesn't suffer from these problems:

| algorithm | loop in | loop startswith | re¹ | find str | find tuple | unit |
| --- | --- | --- | --- | --- | --- | --- |
| find chars best case | 2.67 | 3.19 | 4.15 | 3.64 | 1.00 | x 68 nsec |
| find chars mixed case | 2.29 | 2.80 | 3.53 | 208.23 | 1.00 | x 78 nsec |
| find chars worst case | 1489.21 | 1884.89 | 155.40 | 1.18 | 1.00 | x 27.80 usec |
| find subs best case | | 3.07 | 4.07 | 3.59 | 1.00 | x 69 nsec |
| find subs mixed case | | 2.32 | 3.02 | 7713.38 | 1.00 | x 95 nsec |
| find subs worst case | | 35.82 | 3.62 | 1.00 | 1.01 | x 1.46 msec |
| find many prefixes | | 1760.80 | 1.00 | 122.26 | 82.06 | x 301.00 usec |
| find many infixes | | 122.17 | 1.00 | 7.71 | 5.22 | x 4.33 msec |
| rfind chars best case | 1.33 | 3.45 | 5.76 | 2.59 | 1.00 | x 86 nsec |
| rfind chars mixed case | 1.08 | 2.86 | 4.67 | 4905.66 | 1.00 | x 106 nsec |
| rfind chars worst case | 50.10 | 55.25 | 6.94 | 1.00 | 1.10² | x 1.01 msec |
| rfind subs best case | | 3.70 | 5.98 | 2.34 | 1.00 | x 98 nsec |
| rfind subs mixed case | | 3.42 | 5.43 | 6576.58 | 1.00 | x 111 nsec |
| rfind subs worst case | | 38.07 | 4.97 | 1.00 | 1.04³ | x 1.45 msec |
| rfind many suffixes | | 1094.46 | 1.00 | 50.51 | 50.31 | x 487.00 usec |
| rfind many infixes | | 54.70 | 1.00 | 2.41 | 2.39 | x 9.69 msec |
algorithms

```python
# find_tuple.py
def find0(string, chars):
    for i, char in enumerate(string):
        if char in chars:
            break
    else:
        i = -1
    return i

def find1(string, subs):
    for i in range(len(string)):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

def find2(string, pattern):
    match = pattern.search(string)
    i = match.start() if match else -1
    return i

def find3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.find(sub, 0, None if i == -1 else i + len(sub))
        if new_i != -1:
            i = new_i
    return i

def find4(string, subs):
    i = string.find(subs)
    return i

def rfind0(string, chars):
    i = len(string) - 1
    while i >= 0 and string[i] not in chars:
        i -= 1
    return i

def rfind1(string, subs):
    for i in range(len(string), -1, -1):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

rfind2 = find2

def rfind3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.rfind(sub, 0 if i == -1 else i)
        if new_i != -1:
            i = new_i
    return i

def rfind4(string, subs):
    i = string.rfind(subs)
    return i
```
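As a quick cross-check, the loop-, regex-, and find-based variants above should agree on the position of the first match. A minimal self-contained sketch (the sample string here is made up for illustration; find1 behaves the same way via startswith, and find4/rfind4 are omitted because they rely on the proposed tuple support and only run on the patched build):

```python
import re

def find0(string, chars):
    # Linear scan: first index whose character is in `chars`
    for i, char in enumerate(string):
        if char in chars:
            return i
    return -1

def find2(string, pattern):
    # Compiled-regex variant
    match = pattern.search(string)
    return match.start() if match else -1

def find3(string, subs):
    # One str.find() per substring, keeping the earliest hit
    i = -1
    for sub in subs:
        new_i = string.find(sub, 0, None if i == -1 else i + len(sub))
        if new_i != -1:
            i = new_i
    return i

# 'b' occurs at index 50, 'a' at index 101; earliest match is 50
string = '_' * 50 + 'b' + '_' * 50 + 'a' + '_' * 50
assert find0(string, 'ab') == 50
assert find2(string, re.compile('[ab]')) == 50
assert find3(string, 'ab') == 50
```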
benchmark script

```shell
# find_tuple.sh
echo find chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'ab' + '_' * 999_998; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'b' + '_' * 999_999; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'abcd' + '_' * 999_996; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'cd' + '_' * 999_998; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find many prefixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'prefix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"
echo find many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'{i}infix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"

echo ---

echo rfind chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'ba'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_999 + 'b'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_996 + 'cdab'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'cd'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind many suffixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}suffix' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind4(string, subs)"
echo rfind many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}infix{i}' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind4(string, subs)"
```

Examples

```python
>>> "0123456789".find(("a", "b", "c"))
-1
>>> "0123456789".find(("0", "1", "2"))
0
>>> "0123456789".rfind(("7", "8", "9"))
9
```

Use cases

While I haven't found a lot of use cases, this new addition would improve the readability and performance of all of them:

cpython/Lib/ntpath.py, lines 238 to 240 @ 73906d5:

```python
i = len(p)
while i and p[i-1] not in seps:
    i -= 1
```

cpython/Lib/ntpath.py, lines 378 to 380 @ f90ff03:

```python
i, n = 1, len(path)
while i < n and path[i] not in seps:
    i += 1
```

cpython/Lib/genericpath.py, lines 164 to 167 @ 0d42ac9:

```python
sepIndex = p.rfind(sep)
if altsep:
    altsepIndex = p.rfind(altsep)
    sepIndex = max(sepIndex, altsepIndex)
```

More than 2,000 files on GitHub match the pattern `/max\(\w+\.rfind/`.
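For illustration, the genericpath.splitext idiom above can be wrapped in a helper (last_sep_index is a hypothetical name, not part of CPython):

```python
def last_sep_index(p, sep='/', altsep='\\'):
    # Current idiom: one rfind() per separator, combined with max()
    sep_index = p.rfind(sep)
    alt_index = p.rfind(altsep)
    return max(sep_index, alt_index)

# With the proposed tuple support this would collapse to a single call:
#     p.rfind((sep, altsep))

# The '/' at index 7 is the last separator of either kind
assert last_sep_index('dir\\sub/file.txt') == 7
```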

Python implementation

The implementation written in Python is clear and simple (in C, overflow and exceptions need to be handled manually):

```python
MIN_CHUNK_SIZE = 32
MAX_CHUNK_SIZE = 16384
EXP_CHUNK_SIZE = 2

def find_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + chunk_size, end)
        if chunk_end < end:
            chunk_end -= 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.find(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_start:
                    return new_result
                chunk_end = new_result - 1  # Only allow earlier match
                result = new_result
        if result != -1 or chunk_end >= end:
            return result  # Found match or searched entire range
        chunk_start = chunk_end + 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

def rfind_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_end = end
    while True:
        chunk_start = max(start, chunk_end - chunk_size)
        if chunk_start > start:
            chunk_start += 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.rfind(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_end:
                    return new_result
                chunk_start = new_result + 1  # Only allow later match
                result = new_result
        if result != -1 or chunk_start <= start:
            return result  # Found match or searched entire range
        chunk_end = chunk_start - 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)
```

Explanation

The search is split into chunks that overlap by the length of a substring. After the first match, we search for the next substring in the part before it (or after it, for a reverse search). Each chunk is twice as large as the previous one, capped at 16384. The dynamic size ensures good best- and worst-case performance.
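The pure-Python reference above can be exercised directly. A quick sanity check (the constants and find_tuple are repeated here so the snippet is self-contained; the sample string is made up for illustration):

```python
MIN_CHUNK_SIZE = 32
MAX_CHUNK_SIZE = 16384
EXP_CHUNK_SIZE = 2

def find_tuple(string, subs, start=0, end=None):
    # Chunked forward search: grow the window exponentially; after a match,
    # only allow an earlier one within the current chunk.
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + chunk_size, end)
        if chunk_end < end:
            chunk_end -= 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.find(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_start:
                    return new_result
                chunk_end = new_result - 1  # Only allow earlier match
                result = new_result
        if result != -1 or chunk_end >= end:
            return result  # Found match or searched entire range
        chunk_start = chunk_end + 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

# 'cd' occurs at index 1000, 'ab' at index 2002; the earliest match wins
string = '_' * 1000 + 'cd' + '_' * 1000 + 'ab' + '_' * 1000
assert find_tuple(string, ('ab', 'cd')) == 1000
assert find_tuple(string, ('xy',)) == -1
```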


C call hierarchy

find_sub() or find_subs() is called based on the argument type, using an inline function. For tuples, find_sub() is called for length 1 and chunk_find_sub() for more than one element:

```mermaid
graph TD;
    unicode_find_impl-.->find;
    unicode_index_impl-.->find;
    unicode_rfind_impl-.->find;
    unicode_rindex_impl-.->find;
    find-->find_sub;
    find-->find_subs;
    find_subs-->find_sub;
    find_subs-->chunk_find_sub;
    find_sub-->fast_find_sub;
    chunk_find_sub-->fast_find_sub;
```

Calibration

MIN_CHUNK_SIZE and MAX_CHUNK_SIZE were calibrated on this benchmark:

  • 32 was the highest minimum size that beats all other algorithms in the best case; setting it any lower would hurt performance for substrings found after the first chunk.
  • 16384 was the highest maximum size with a measurable improvement in the worst case; setting it any higher would only hurt performance in the average case.
| MIN_CHUNK_SIZE | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | unit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| find chars best case | 1.01 | 1.00 | 1.07 | 1.09 | 1.06 | 1.10 | 1.06 | 1.10 | 1.10 | x 66.7 nsec |
| find chars mixed case | 1.00 | 1.01 | 1.10 | 1.14 | 1.10 | 1.17 | 1.09 | 1.23 | 1.24 | x 75.3 nsec |
| find subs best case | 1.00 | 1.01 | 1.02 | 1.05 | 1.00 | 1.04 | 1.01 | 1.01 | 1.01 | x 70.9 nsec |
| find subs mixed case | 1.00 | 1.01 | 1.06 | 1.22 | 1.39 | 2.22 | 3.31 | 5.51 | 10.0 | x 84.1 nsec |
| rfind chars best case | 1.02 | 1.01 | 1.02 | 1.00 | 1.01 | 1.00 | 1.00 | 1.03 | 1.02 | x 92.9 nsec |
| rfind chars mixed case | 1.00 | 1.01 | 1.06 | 1.12 | 1.29 | 1.68 | 2.34 | 3.69 | 6.12 | x 98.8 nsec |
| rfind subs best case | 1.00 | 1.01 | 1.02 | 1.01 | 1.03 | 1.00 | 1.00 | 1.02 | 1.02 | x 96.9 nsec |
| rfind subs mixed case | 1.00 | 1.00 | 1.06 | 1.08 | 1.29 | 1.89 | 2.75 | 4.54 | 8.05 | x 106 nsec |
| MAX_CHUNK_SIZE | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | unit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| find chars worst case | 1.63 | 1.27 | 1.11 | 1.09 | 1.01 | 1.00 | 1.05 | x 27.7 usec |
| find subs worst case | 1.03 | 1.01 | 1.00 | 1.00 | 1.00 | 1.03 | 1.00 | x 1.47 msec |
| find many prefixes | 1.08 | 1.04 | 1.02 | 1.00 | 1.00 | 1.47 | 1.47 | x 24.7 msec |
| find many infixes | 1.08 | 1.03 | 1.01 | 1.00 | 1.00 | 1.45 | 1.44 | x 22.6 msec |
| rfind chars worst case | 1.20 | 1.00 | 1.17 | 1.17 | 1.01 | 1.01 | 1.15 | x 1.03 msec |
| rfind subs worst case | 1.04 | 1.02 | 1.01 | 1.01 | 1.01 | 1.01 | 1.00 | x 1.44 msec |
| rfind many suffixes | 1.09 | 1.04 | 1.02 | 1.01 | 1.00 | 1.00 | 1.00 | x 24.4 msec |
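For intuition, the window sizes implied by these constants (assuming the chunk starts at MIN_CHUNK_SIZE and doubles via EXP_CHUNK_SIZE until it reaches the cap) grow as follows:

```python
MIN_CHUNK_SIZE, MAX_CHUNK_SIZE, EXP_CHUNK_SIZE = 32, 16384, 2

# List each distinct chunk size the search can use
sizes = []
size = MIN_CHUNK_SIZE
while size < MAX_CHUNK_SIZE:
    sizes.append(size)
    size = min(size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)
sizes.append(size)  # all later chunks stay at the cap

print(sizes)  # [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]
```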

Previous discussion

Footnotes

  1. Using the regex module for reverse search²

  2. memrchr() is not available on macOS

  3. Expected, as find tuple does more work in the worst case

@nineteendo (Owner) commented Jun 27, 2024

> We cooperate further to make this fit into a wider picture so that the changes made are sensible from all aspects.

We can still do that afterwards; I first need to try to gain traction on Discourse (assuming Discord was a typo).

> send me a complete post

> By the way, the summary of this PR is rendering as raw markdown to me at the moment.

That's intentional, otherwise you can't copy it. It actually looks like this:

````md
...
````

@nineteendo (Owner) commented:

@dg-pb, are you running into problems?

@dg-pb commented Jun 27, 2024

You cannot choose both; it is one or the other.

Either you are on your own, doing your own thing regardless of what I say, or you start listening and concentrating on what needs to be done.

There is nothing wrong if you know what to do and you have your own plan, but in this case I don't see how I can be of any help.

@nineteendo (Owner) commented Jun 28, 2024

I would like to say you have been an invaluable help. I don't think this proposal would have been possible without your insight, especially the chunking. But I don't want to think about this any further, as I don't see how this can be done, sorry. It feels like we're just wasting our time thinking about an unproven idea (like Serhiy's).

If you still have other feedback, I'm of course willing to listen, e.g. if you want to expand the text from the PR summary.
Do you think we can incorporate your previous summary somehow? I'm fine with waiting to make it as detailed as possible.

@nineteendo nineteendo changed the title gh-118184: Support tuples for find, index, rfind & rindex gh-118184: Support tuples for find(), index(), rfind() & rindex() Jun 28, 2024
@nineteendo nineteendo changed the title gh-118184: Support tuples for find(), index(), rfind() & rindex() Support tuples for [r]find() & [r]index() Jun 28, 2024
@nineteendo: (comment marked as resolved)

@dg-pb commented Jun 28, 2024

My current feedback is as follows:

This PR explored the idea well.

If I was in your place I would take a step back and create a new PR to start clean.

And see how best to incorporate this into the current architecture, keeping in mind the whole set of string methods (as listed in https://discuss.python.org/t/string-search-overview-coverage-and-api/56711/7) and the fact that this algorithm will/should one day be replaced with a more optimal version - thus it should be integrated for an easy swap.

Current integration has a flavour of "let's implement this in the easiest and fastest way" as opposed to "let's implement this so that it is optimal, all things considered".

I understand this can be difficult (it always is for me when I aim higher than my current knowledge and experience) and might take some time. But to me what is important is the attitude; how long it takes is secondary.

If our attitudes differ, then it's no big deal. Maybe our situations in life differ - we have different goals, amounts of time at hand, experience, etc. In that case it is not of benefit to either of us to cooperate, as we will only get in each other's way.

@dg-pb commented Jun 28, 2024

> Do you think we can incorporate your previous summary somehow?

You are free to use whatever material I have written in relation to this idea, as long as the facts in it are not outdated and are still relevant. I think the most accurate, up-to-date evaluation from my side is at https://discuss.python.org/t/string-search-overview-coverage-and-api/56711/7. It is short, simple, straight to the point, and well put into the wider context. But as I said, you can use any material I have written on this as long as it is still relevant to the current state of this.

@dg-pb commented Jun 28, 2024

I don't think it is a good time to be gaining traction after there has already been a lot of it and no significant changes in this PR since.

I intend to keep my word and make one post on your behalf if you so choose.

However, I am not in agreement that either issuing a new PR or gaining traction on Discourse is a step in the right direction at this time.

There are others who think differently (such as @erlend-aasland or @pfmoore, as indicated by you), so maybe it would make more sense for you to ask them?

@nineteendo: (comment marked as resolved)

@erlend-aasland commented Jun 28, 2024

> If I was in your place I would take a step back and create a new PR to start clean.

No; please do not use the CPython repo for your own personal experiments! We already have 1.5k open PRs. Create the experimental PR on your own fork.

> I intend to keep my word and make one post on your behalf if you so choose.

Why are you helping circumvent the ban? Please don't.

@pfmoore commented Jun 28, 2024

> Why are you helping circumvent the ban? Please don't.

Agreed.

> Can someone else post a message for me, or do I need to wait until January?

@nineteendo, if you have been given a ban from Discourse, then the idea is that you spend that time reflecting on why you were banned, and come back with a better understanding of how to interact on the forum.

Having the same style of conversation here, and "waiting until January" to simply go back to Discourse with no change in behavior, will simply result in you getting another ban.

> Uh oh, you pinged a random person.

Hardly random - avoiding the people who have tried to give you advice will not help you improve how you interact with the community 🙁

> No; please do not use the CPython repo for your own personal experiments!

Exactly this.

> Paul Moore had said I could just submit a PR

I assumed you could (and would) develop something that was ready for review. Not that you would start another long discussion on the tracker. Please don't mischaracterise what I said on Discourse as support for what you're doing here.

@dg-pb commented Jun 28, 2024

> No; please do not use the CPython repo for your own personal experiments! We already have 1.5k open PRs. Create the experimental PR on your own fork.

This is what I meant.

@dg-pb: (comment marked as resolved)

@dg-pb commented Jun 28, 2024

> Hardly random - avoiding the people who have tried to give you advice will not help you improve how you interact with the community 🙁

No, I actually pinged another Paul Moore accidentally at first. :)

@nineteendo (Owner) commented Jun 28, 2024

> If I was in your place I would take a step back and create a new PR to start clean.
> And see how to best incorporate this into the current architecture keeping in mind the whole set of string methods

Making the necessary adjustments on a new branch would also work, most likely like this:

  • stringlib_[r]find_subs()
  • asciilib_[r]find_subs()
  • ucs1lib_[r]find_subs()
  • ucs2lib_[r]find_subs()
  • ucs4lib_[r]find_subs()

But the problem is that these methods either require a tuple (which requires separate handling for strings and bytes) or an array of SUB structs (which would need to be allocated on the heap, which is slow). So the current approach seems like the best option.

> Current integration has a flavour of "let's implement this in the easiest and fastest way"

I've tried a lot of different things in this pull request; the current implementation is simply the most performant.

> I intend to keep my word and make one post on your behalf if you so choose.

Just post a link in the existing thread; it was never my intention to circumvent the ban, sorry. I've done that on a different forum in the past and became a moderator afterwards, but the attitude is vastly different here, so that seems like a very bad idea. Their patience with me is gone. I only realised recently this might be seen as ban evasion.

> the idea is that you spend that time reflecting on why you were banned

I was banned for creating this thread, which is an "idea" to improve Discourse, while I was asked not to post ANY new ideas until 2025. I've already created drafts for then: this one, and #3, which I hope are more fleshed out. I will ask in September if I still need to wait until then, as the only thing left to do is write a PEP and I would like to get some feedback first.

> Please don't mischaracterise what I said on Discourse as support for what you're doing here.

Eh, this is my own repository; I don't think it's a problem to have a discussion here. Or did you mean "there"?

> No, I actually pinged another Paul Moore accidentally at first. :)

Which is why you shouldn't remove that from your message; it makes the conversation hard to follow for new people.

@dg-pb commented Jun 28, 2024

> But the problem is

Yes, there are difficult problems to be solved for a nice integration of this.

The fact that there is an emphasis on problems and why it can NOT be done better, as opposed to solutions, is the main reason why I don't want to be part of this anymore.

@pfmoore commented Jun 28, 2024

> Eh, this is my own repository; I don't think it's a problem to have a discussion here. Or did you mean "there"?

Sorry, I didn't spot this was your own repo (I get a lot of notifications). Pinging me (and Erlend) on a private development discussion was probably the mistake here.

I'll unsubscribe from this discussion, as I'm not interested in being involved.

@erlend-aasland commented:
@nineteendo, please do not edit my posts, even on your own repo. I'm unsubscribing.

@nineteendo (Owner) commented Jun 28, 2024

> The fact that there is an emphasis on problems and why it can NOT be done better, as opposed to solutions, is the main reason why I don't want to be part of this anymore.

I'm trying to find a solution, as you still want to improve this, but so far I haven't found anything.

The functions in stringlib are defined using STRINGLIB, so we would need to use STRINGLIB(find_subs). We need to somehow pass the substrings, either as a tuple or on the heap; you can't get around that:

  • A tuple requires separate handling for strings and bytes, maybe using a macro? If we define separate methods, we could just as well keep the implementation where it is now.
  • Using the heap was slow in my previous attempt; should I try again with a single alloc? I don't have too high hopes.

I suggest you look into it before deciding whether I need to pursue it. At some point you must give up when something is not feasible.

@nineteendo (Owner) commented Jun 28, 2024

I don't think it is good time to be gaining traction after there has already been a lot of it and there were no significant changes in this PR since.

I actually think there have been significant changes since the last benchmark posted there:

  • We're now comparing against the regex module, which has proper support for reverse search
  • The chunk size is now dynamic, so it's the fastest in 10/16 of the cases instead of 5/16 (with a proper comparison)
  • The code is a lot simpler now
  • I've written a much more detailed summary

I would like a proper review, so this can be accepted or rejected. Otherwise, I'll keep thinking about it. (This holds for my other PRs as well, but I'm more passionate about this one.)

So, could you please link to this PR, stating there have been significant changes in the last month? Or just tell me you won't? I'll leave you alone afterwards (also if you refuse). I'll lock this conversation after you've made your decision.

@dg-pb
Copy link

dg-pb commented Jun 28, 2024

I posted a link to this PR here: https://discuss.python.org/t/add-tuple-support-to-more-str-functions/50628/133

It has been mentioned twice in a short period of time. Once in the main thread and once in a comment.

@nineteendo
Copy link
Owner Author

nineteendo commented Jun 28, 2024

Thanks again for all the help, I couldn't have done it without you. I'll leave this open in case someone wants to talk to me directly.

@dg-pb
Copy link

dg-pb commented Jun 28, 2024

I'm trying to find a solution, as you still want to improve this, but so far I haven't found anything yet.

I don't have anything definite yet either, but I know that this can be done and probably should be (for this PR to have a fair chance). I had an initial look, and a rough view of how it could work is as follows:

1. Implement find_horspool_many in fastsearch.h

  1. It should use horspool_find and find_char under the hood.
  2. gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) python/cpython#120025 eliminates many function calls so it will be easy to make use of it - it is bi-directional and handles most of cases.
  3. It should support not only find, but count as well: essentially, all the same features as horspool_find.

This way, it will be easy to swap in another solution when/if someone decides to implement something else. This is also a big selling point for the PR, because the biggest criticism is that this is not a theoretically optimal solution. It provides a good answer to that: "yes, but it introduces an architecture where a more optimal solution is an easy drop-in, and it provides a very good and practical interim solution which is most efficient for 95% (or whatever) of use cases".
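To make the intended interface concrete, here is a naive plain-C stand-in for what a find_horspool_many could look like (the name and signature are hypothetical; the real version would reuse horspool_find's skip table rather than this quadratic scan): it returns the leftmost offset at which any of the needles matches, or -1.

```c
#include <stddef.h>
#include <string.h>

/* Naive multi-needle search: for each position, try every needle.
 * Only illustrates the contract, not the intended Horspool algorithm. */
static ptrdiff_t
find_many_naive(const char *s, size_t n,
                const char *const *subs, const size_t *lens, size_t count)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < count; j++) {
            /* needle must fit in the remaining haystack */
            if (lens[j] <= n - i && memcmp(s + i, subs[j], lens[j]) == 0) {
                return (ptrdiff_t)i;
            }
        }
    }
    return -1;
}
```

The bi-directional, counting version would keep this contract but take a direction and maxcount, matching horspool_find.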

2. Implement FASTSEARCH_MANY in fastsearch.h

  1. This has the same purpose as FASTSEARCH, but for many substrings. I.e. handles special cases and eventually calls find_horspool_many.
  2. Some special cases will need to be handled separately (same as in FASTSEARCH). However, that is good, because there are cases where clever things can be applied, such as:
    2.1. If all substrings are 1-character strings, then a similar approach to split_whitespace can be used, which will be much faster.
    2.2. The preparation step only needs to be done once (either here or in horspool_find_many).

3. Figure out how to plug it in.

This will be the trickiest part to do nicely and with minimal changes, at least for me, because I have not done any work in these parts. Maybe it would be easier for you, since you have?

From my initial look, the solution lies somewhere between unicodeobject.c:any_find_slice and the find.h methods. But to figure this out properly I would need 1-2 days to digest it.
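The 1-character special case from 2.1 could be sketched in plain C with a 256-entry membership table (the function name is hypothetical, and the real STRINGLIB code would also have to handle the wider unicode kinds): one pass over the haystack, one table lookup per byte.

```c
#include <stddef.h>

/* If every needle is a single byte, build a membership table once and
 * scan the haystack in a single pass. */
static ptrdiff_t
find_any_char(const char *s, size_t n, const char *chars, size_t count)
{
    unsigned char table[256] = {0};
    for (size_t i = 0; i < count; i++) {
        table[(unsigned char)chars[i]] = 1;
    }
    for (size_t i = 0; i < n; i++) {
        if (table[(unsigned char)s[i]]) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```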

My plan

Before python#120025 is resolved, I am giving strings a break.

If it is implemented, it simplifies things a fair bit, so it is more productive to wait for it.

Also, I want to work on something else for a while - I had my fair dose of strings in the last month or 2.

After I come back (when python#120025 is resolved), these are the things I will look at (most likely in this exact same order):

  1. Adding maxcount argument to find
  2. Implementing this PR
  3. Adding keepsep argument to split
  4. Implementing split(tuple)

For you

This is a starting point and things to think about if you want to work on this.

The way I see it, the most productive use of your time would be to implement a "mock" FASTSEARCH_MANY and work out how to plug it in.

The actual implementation of FASTSEARCH_MANY and horspool_find_many will be easy once this is done, because I have worked on fastsearch.h and know every single thing there, and the actual algorithm of this PR is in good shape.

Otherwise, you can concentrate on something else and come back to this once python#120025 is resolved and we can figure this out together then.

If you wish, we can continue on this path, but in this case there is 1 condition: no more posting in discourse and no more public PRs until the above is complete and in presentable state.

@nineteendo
Copy link
Owner Author

nineteendo commented Jun 28, 2024

Let's try tuples; that seems the easiest. Looks like chunk_find_sub() needs to be implemented for strings and bytes separately:

static int
PARSE_SUB_OBJ(PyObject **subobj, void **sub, int *sub_kind, Py_ssize_t *sub_len)
{
    *sub = PyUnicode_DATA(*subobj);
    *sub_kind = PyUnicode_KIND(*subobj);
    *sub_len = PyUnicode_GET_LENGTH(*subobj);
    return 0;
}

static Py_ssize_t
chunk_find_sub(const void *str, int kind, Py_ssize_t len,
               PyObject *subobj,
               Py_ssize_t chunk_start, Py_ssize_t chunk_end,
               Py_ssize_t end, int direction)
{
    int sub_kind;
    void *sub;
    Py_ssize_t sub_len, result;

    assert(chunk_end <= end);
    if (PARSE_SUB_OBJ(&subobj, &sub, &sub_kind, &sub_len)) {
        return -1;
    }

    if (sub_kind > kind) {
        return -1;
    }

    if (chunk_end >= end - sub_len) { // Guard against reading past `end`
        result = fast_find_sub(str, len, sub, sub_kind, sub_len, chunk_start,
                               end, direction);
    }
    else {
        result = fast_find_sub(str, len, sub, sub_kind, sub_len, chunk_start,
                               chunk_end + sub_len, direction);
    }

    if (subobj) {
        // Only needed for the bytes variant, where PARSE_SUB_OBJ would
        // fill a Py_buffer that must be released here.
        PyBuffer_Release(&subbuf);
    }

    return result;
}

@dg-pb
Copy link

dg-pb commented Jul 3, 2024

Think of it this way:

Current single-sub search looks as follows:

find_impl       _Py_bytes_find
    |                | 
any_find_slice    find_internal
       \             /
       find.h:functions
              |
          fastfind.c:FASTSEARCH
              |
          fastfind.c:search_algorithms

To integrate into the same architecture would mean something along the lines:

     unicodeobject.c         |  bytes_methods.c
                             |
       find_impl             |  _Py_bytes_find
       |       |             |     |        |
       | any_find_slice_multi|find_internal |
any_find_slice              \|/       find_internal_multi
#----------------------------+------------------------------
              \             / \          /
               \           /   \        /
                \         /     \      /
#-----------------------------------------------------------
find.h           functions   functions_multi
                     |              |
#-----------------------------------------------------------
fastfind.c      FASTSEARCH    FASTSEARCHMULTI
                         \     /
                     search_algorithms

chunk_find_sub would be placed in fastfind.c, named horspool_find_multi (or something similar), become part of fastfind.c:search_algorithms, and be common to both bytes and strings.

This would be a good initial implementation.

@dg-pb
Copy link

dg-pb commented Jul 3, 2024

This would be the first step and next steps (improvements/ simplifications/special cases) would become evident in the process.

@nineteendo
Copy link
Owner Author

Could you give mermaid diagrams a shot?

```mermaid
graph TD;
    root-->child1;
    root-->child2;
    child1-->grandchild1;
    child1-->grandchild2;
    child2-->grandchild2;
```

@dg-pb
Copy link

dg-pb commented Jul 3, 2024

```mermaid
graph TD;
    subgraph unicodeobject.c
        find_impl-->any_find_slice
        find_impl-->any_find_slice_multi
    end
    subgraph bytes_methods.c
        _Py_bytes_find-->find_internal
        _Py_bytes_find-->find_internal_multi
    end
    subgraph find.h
        any_find_slice-->functions
        find_internal-->functions
        any_find_slice_multi-->functions_multi
        find_internal_multi-->functions_multi
    end
    subgraph fastfind.c
        functions-->FASTSEARCH
        functions_multi-->FASTSEARCHMULTI
        FASTSEARCH-->search_algorithms
        FASTSEARCHMULTI-->search_algorithms
    end
```
