Support tuples for [r]find() & [r]index() #2
base: main
Conversation
Co-authored-by: d.grigonis <[email protected]>
We can still do that afterwards, I first need to try to gain traction on Discourse (assuming Discord was a typo).
That's intentional, otherwise you can't copy it. It actually looks like this: ````md
...
```` |
@dg-pb, are you running into problems? |
You cannot choose both; it is either one or the other. Either you are on your own, doing your own thing regardless of what I say, or you start listening and concentrating on what needs to be done. There is nothing wrong with that if you know what to do and have your own plan, but in that case I don't see how I can be of any help. |
I would like to say you have been an invaluable help. I don't think this proposal would have been possible without your insight, especially the chunking. But I don't want to think about this any further, as I don't see how this can be done, sorry. It feels like we're just wasting our time thinking about an unproven idea (like Serhiy's). If you still have other feedback, I'm of course willing to listen, e.g. if you want to expand the text from the PR summary. |
My current feedback is as follows: this PR explored the idea well. If I were in your place, I would take a step back and create a new PR to start clean, and see how to best incorporate this into the current architecture, keeping in mind the whole set of string methods (as listed in https://discuss.python.org/t/string-search-overview-coverage-and-api/56711/7) and the fact that this algorithm will/should one day be replaced with a more optimal version - thus it should be integrated for an easy swap. The current integration has a flavour of "let's implement this in the easiest and fastest way" as opposed to "let's implement this so that it is optimal, all things considered". I understand this can be difficult (it always is to me when I aim higher in relation to my current knowledge and experience) and might take some time. But to me what is important is the attitude; how long it takes is secondary. If our attitudes differ, then it's no big deal. Maybe our situations in life differ - we have different goals, amounts of time at hand, experience, etc. In that case it is not of benefit to either of us to cooperate, as we will only get into each other's way. |
You are free to use whatever material I have written in relation to this idea, as long as the facts in it are not outdated and are still relevant. I think the most accurate, up-to-date evaluation of this from my side is at https://discuss.python.org/t/string-search-overview-coverage-and-api/56711/7. It is short, simple, straight to the point, and well put into the wider context. But as I said, you can use any material I have written on this as long as it is still relevant to the current state of this. |
I don't think it is a good time to be gaining traction after there has already been a lot of it and there have been no significant changes in this PR since. I intend to keep my word and make one post on your behalf if you choose so. However, I am not in agreement that either issuing a new PR or gaining traction on Discourse is a step in the right direction at this time, while there are others who think differently (such as @erlend-aasland or @pfmoore, as indicated by you). So maybe it would make more sense for you to ask them? |
No; please do not use the CPython repo for your own personal experiments! We already have 1.5k open PRs. Create the experimental PR on your own fork.
Why are you helping circumvent the ban? Please don't. |
Agreed.
@nineteendo if you have been given a ban from Discourse, then the idea is that you spend that time reflecting on why you were banned, and come back with a better understanding of how to interact on the forum. Having the same style of conversation here, and "waiting until January" to simply go back to Discourse with no change in behavior will simply result in you getting another ban.
Hardly random - avoiding the people who have tried to give you advice will not help you improve how you interact with the community 🙁
Exactly this.
I assumed you could (and would) develop something that was ready for review. Not that you start another long discussion on the tracker. Please don't mischaracterise what I said on Discourse as support for what you're doing here. |
This is what I meant. |
No, I actually pinned another Paul Moore accidentally at first. :) |
Making the necessary adjustments on a new branch would also work, most likely like this:
But the problem is that these methods either require a tuple (which requires separate handling for strings and bytes) or an array of
I've tried a lot of different things in this pull request; the current implementation is simply the most performant.
Just post a link in the existing thread; it was never my intention to circumvent the ban, sorry. I've done that on a different forum in the past and became a moderator afterwards, but the attitude is vastly different here, so that seems like a very bad idea. Their patience with me is gone. I only realised recently this might be seen as ban evasion.
I was banned for creating this thread, which is an "idea" to improve Discourse, while I was asked to not post ANY new ideas until 2025. I've already created drafts for then: this one, and #3, which I hope are more fleshed out. I will ask in September if I still need to wait until then as the only thing left to do is write a PEP and I would like to get some feedback first.
Eh, this is my own repository, I don't think it's a problem to have a discussion here. Or did you mean "there"?
Which is why you shouldn't remove that from your message, it makes the conversation hard to follow for new people. |
Yes, there are difficult problems to be solved for a nice integration of this. The fact that there is an emphasis on problems and why it can NOT be done better as opposed to solutions is the main reason why I don't want to be part of this anymore. |
Sorry, I didn't spot this was your own repo (I get a lot of notifications). Pinging me (and Erlend) on a private development discussion was probably the mistake here. I'll unsubscribe from this discussion, as I'm not interested in being involved. |
@nineteendo, please do not edit my posts, even on your own repo. I'm unsubscribing. |
I'm trying to find a solution, as you still want to improve this, but so far I haven't found anything yet. The functions in strlib are defined using
I suggest you look into it before deciding whether I need to pursue it. At some point you must give up when something is not feasible. |
I actually think there have been significant changes since the last benchmark posted there:
I would like a proper review, so this can be accepted or rejected. Otherwise, I'll keep thinking about it. (This holds for my other PRs as well, but I'm more passionate about this one.) So, could you please link to this PR, stating there have been significant changes in the last month? Or just tell me you won't? I'll leave you alone afterwards (also if you refuse). |
I posted a link to this PR here: https://discuss.python.org/t/add-tuple-support-to-more-str-functions/50628/133 It has been mentioned twice in a short period of time. Once in the main thread and once in a comment. |
Thanks again for all the help, I couldn't have done it without you. I'll leave this open in case someone wants to talk to me directly. |
I don't have anything definite yet either, but I know that this can be done and probably should be (for this PR to have a fair chance). I had an initial look, and a rough view of how it could be done is as follows: 1. Implement
|
Let's try with tuples, that seems like the easiest. Looks like:

```c
static int
PARSE_SUB_OBJ(PyObject *subobj, const void **sub, int *sub_kind,
              Py_ssize_t *sub_len)
{
    *sub = PyUnicode_DATA(subobj);
    *sub_kind = PyUnicode_KIND(subobj);
    *sub_len = PyUnicode_GET_LENGTH(subobj);
    return 0;
}

static Py_ssize_t
chunk_find_sub(const void *str, int kind, Py_ssize_t len,
               PyObject *subobj,
               Py_ssize_t chunk_start, Py_ssize_t chunk_end,
               Py_ssize_t end, int direction)
{
    int sub_kind;
    const void *sub;
    Py_ssize_t sub_len, result;
    assert(chunk_end <= end);
    if (PARSE_SUB_OBJ(subobj, &sub, &sub_kind, &sub_len)) {
        return -1;
    }
    if (sub_kind > kind) {
        // A wider substring can't occur in a narrower string
        return -1;
    }
    if (chunk_end >= end - sub_len) { // Guard overflow
        result = fast_find_sub(str, len, sub, sub_kind, sub_len, chunk_start,
                               end, direction);
    }
    else {
        result = fast_find_sub(str, len, sub, sub_kind, sub_len, chunk_start,
                               chunk_end + sub_len, direction);
    }
    // For the bytes path, a Py_buffer acquired in PARSE_SUB_OBJ would have
    // to be released here.
    return result;
}
``` |
Think of it this way. The current single-sub search looks as follows:

```
find_impl            _Py_bytes_find
     |                     |
any_find_slice        find_internal
         \               /
        find.h: functions
               |
     fastfind.c: FASTSEARCH
               |
  fastfind.c: search_algorithms
```

To integrate into the same architecture would mean something along these lines:

```
unicodeobject.c                  | bytes_methods.c
  find_impl                      |   _Py_bytes_find
    |-- any_find_slice           |     |-- find_internal
    `-- any_find_slice_multi     |     `-- find_internal_multi

find.h
  functions         <-- any_find_slice, find_internal
  functions_multi   <-- any_find_slice_multi, find_internal_multi

fastfind.c
  FASTSEARCH        <-- functions
  FASTSEARCHMULTI   <-- functions_multi
  search_algorithms <-- FASTSEARCH, FASTSEARCHMULTI
```

This would be a good initial implementation. |
This would be the first step and next steps (improvements/ simplifications/special cases) would become evident in the process. |
Could you give mermaid diagrams a shot?

```mermaid
graph TD;
root-->child1;
root-->child2;
child1-->grandchild1;
child1-->grandchild2;
child2-->grandchild2;
```
|
```mermaid
graph TD;
subgraph unicodeobject.c
find_impl-->any_find_slice
find_impl-->any_find_slice_multi
end
subgraph bytes_methods.c
_Py_bytes_find-->find_internal
_Py_bytes_find-->find_internal_multi
end
subgraph find.h
any_find_slice-->functions
find_internal-->functions
any_find_slice_multi-->functions_multi
find_internal_multi-->functions_multi
end
subgraph fastfind.c
functions-->FASTSEARCH
functions_multi-->FASTSEARCHMULTI
FASTSEARCH-->search_algorithms
FASTSEARCHMULTI-->search_algorithms
end
```
|
Motivation
For finding multiple substrings, there's currently no single algorithm that outperforms the others in most cases. Their performance varies significantly between the best and worst case, making it difficult to choose one:
That's why I'm suggesting a dynamic algorithm that doesn't suffer from these problems:
algorithms
benchmark script
Examples
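A brief sketch of the proposed behaviour, assuming find() returns the leftmost match of any of the substrings and rfind() the rightmost (the exact exception message is illustrative):

```python
>>> 'abc-def_ghi'.find(('-', '_'))    # leftmost match of any substring
3
>>> 'abc-def_ghi'.rfind(('-', '_'))   # rightmost match
7
>>> 'abc'.find(('x', 'y'))            # no match
-1
>>> 'abc'.index(('x', 'y'))           # index() raises instead of returning -1
Traceback (most recent call last):
  ...
ValueError: substring not found
```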
Use cases
While I haven't found a lot of use cases, this new addition would improve the readability and performance of all of them (a rewrite sketch follows the list):
- cpython/Lib/ntpath.py, lines 238 to 240 (73906d5)
- cpython/Lib/ntpath.py, lines 378 to 380 (f90ff03)
- cpython/Lib/genericpath.py, lines 164 to 167 (0d42ac9)
- >2K files matching /max\(\w+\.rfind/
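For instance, the snippets above follow the pattern matched by that search; with tuple support they could be rewritten along these lines (variable names are illustrative):

```python
# Current idiom, as matched by /max\(\w+\.rfind/:
i = max(path.rfind('/'), path.rfind('\\'))

# With this proposal:
i = path.rfind(('/', '\\'))
```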
Python implementation
The implementation written in Python is clear and simple (in C, overflow and exceptions need to be handled manually):
Explanation
The search is split up into chunks, overlapping by the length of a substring. After the first match, we search for the next substring in the part before it (or after it, for a reverse search). Each chunk is twice as large as the previous one, but capped at 16384. The dynamic size ensures good best- and worst-case performance.
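To make that concrete, here is a minimal sketch of the chunked search in Python, for the forward direction only; the MIN_CHUNK_SIZE value and the helper's name and signature are illustrative, not the PR's actual code (a reverse search would mirror this, scanning chunks from the end):

```python
MIN_CHUNK_SIZE = 64      # illustrative starting size; the real value was calibrated
MAX_CHUNK_SIZE = 16384   # cap mentioned above

def find_subs(s, subs, start=0, end=None):
    """Return the lowest index in s[start:end] where any sub in subs starts."""
    if end is None:
        end = len(s)
    chunk_size = MIN_CHUNK_SIZE
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + chunk_size, end)
        result = -1
        for sub in subs:
            # Overlap the chunk by len(sub) so a match straddling the
            # boundary is found; after the first match, only the part
            # before it can contain an earlier match of another substring.
            stop = chunk_end if result == -1 else result
            i = s.find(sub, chunk_start, min(stop + len(sub), end))
            if i != -1 and i < chunk_end and (result == -1 or i < result):
                result = i
        if result != -1:
            return result
        chunk_start = chunk_end
        chunk_size = min(2 * chunk_size, MAX_CHUNK_SIZE)  # grow dynamically
    return -1
```

For example, find_subs('abc-def_ghi', ('-', '_')) returns 3 under this sketch.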
C call hierarchy
find_sub() and find_subs() are called based on the argument type using an inline function. find_sub() is called for tuples of length 1, chunk_find_sub() is called for tuples with more than 1 element:
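As a rough Python analogue of that dispatch (the wrapper name is illustrative; find_subs() is the chunk-driving search sketched above):

```python
def dispatch_find(s, subobj, start=0, end=None):
    # Mirrors the inline dispatcher described above: 1-tuples are unwrapped
    # onto the single-sub path, larger tuples take the chunked multi-sub path.
    if end is None:
        end = len(s)
    if isinstance(subobj, tuple):
        if len(subobj) == 1:
            subobj = subobj[0]                         # single-sub path
        else:
            return find_subs(s, subobj, start, end)    # uses chunk_find_sub()
    return s.find(subobj, start, end)                  # stand-in for find_sub()
```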
Calibration
MIN_CHUNK_SIZE and MAX_CHUNK_SIZE were calibrated on this benchmark: decreasing MIN_CHUNK_SIZE would hurt performance for substrings after the first chunk, and increasing MAX_CHUNK_SIZE would only hurt performance in the average case.
Previous discussion
find, index, rfind & rindex: python/cpython#119501
Footnotes
1. Using the regex module for reverse search
2. memrchr() is not available on macOS
3. Expected, as find with a tuple does more work in the worst case