Improve pit performance #1673
base: main
Conversation
Can anyone fix the main branch? CI fails because of a problem on main.
It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?
Lines 198 to 204 in 98f569e
    if not overwrite and index_file.exists():
        with open(index_file, "rb") as fi:
            (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
            n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
            if interval == self.INTERVAL_quarterly:
                n_years //= 4
            start_year = first_year + n_years
The whole
    s = FilePitStorage("000001.SZ", "ROE")
    s.write(np_data)
@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using
The current online update tools seem to be incompatible with these modifications. Would you mind checking?
@CharlieChi, which command failed? It has been a long time since this PR was created, and I am not sure about the current workflow.
Here: when using a model with PIT features and updating predictions over a short time range, such as one day, the dataset returns an empty DataFrame. With a long time range (one year between start_time and end_time), it works fine.
Description
see #1671
Consider PIT data, and assume we have T trade days and N report_period records.
We access the PIT table in 3 ways:
1. Observe the latest data on each trade day.
   Just loop through the table and keep only the latest report_date value. This consumes O(N).
2. Observe the latest several report_period data for expressions like P(Mean($$roewa_q, 2)).
   Read the data file once. X item
   The algorithm could be improved by looping back from the end until X different periods are found, but groupby uses a C-level loop, which should be faster.
3. Observe a specific period from each trade day.
   Get all data belonging to the given period.
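The three access patterns above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the `PitRecord` layout, the integer date and period encoding, and all three function names are hypothetical, and ordinary loops stand in for the groupby mentioned in pattern 2. Records are assumed to be sorted by the day they became public.

```python
from typing import List, NamedTuple, Optional


class PitRecord(NamedTuple):
    # Hypothetical record layout, used only for this sketch.
    date: int      # day the value became public, e.g. 20200415
    period: int    # report period the value describes, e.g. 202001
    value: float


def latest_per_trade_day(records: List[PitRecord], trade_days: List[int]) -> List[Optional[float]]:
    """Pattern 1: latest observable value on each trade day, one pass, O(N + T)."""
    out, latest, i = [], None, 0
    for day in trade_days:
        while i < len(records) and records[i].date <= day:
            rec = records[i]
            # A newer period, or a revision of the current one, replaces the view.
            if latest is None or rec.period >= latest.period:
                latest = rec
            i += 1
        out.append(latest.value if latest else None)
    return out


def latest_periods_per_trade_day(records: List[PitRecord], trade_days: List[int], x: int) -> List[List[float]]:
    """Pattern 2: values of the latest x periods known on each trade day."""
    out, seen, i = [], {}, 0
    for day in trade_days:
        while i < len(records) and records[i].date <= day:
            # Later revisions of the same period overwrite earlier ones.
            seen[records[i].period] = records[i].value
            i += 1
        out.append([seen[p] for p in sorted(seen)[-x:]])
    return out


def period_per_trade_day(records: List[PitRecord], trade_days: List[int], period: int) -> List[Optional[float]]:
    """Pattern 3: value of one fixed period as seen from each trade day."""
    out, current, i = [], None, 0
    for day in trade_days:
        while i < len(records) and records[i].date <= day:
            if records[i].period == period:
                current = records[i].value
            i += 1
        out.append(current)
    return out


# Made-up sample data: the third record revises an older period.
records = [
    PitRecord(20200110, 201904, 1.0),
    PitRecord(20200415, 202001, 2.0),
    PitRecord(20200420, 201904, 1.5),
    PitRecord(20200716, 202002, 3.0),
]
trade_days = [20200101, 20200201, 20200416, 20200421, 20200801]

print(latest_per_trade_day(records, trade_days))
print(latest_periods_per_trade_day(records, trade_days, 2))
print(period_per_trade_day(records, trade_days, 201904))
```

Each function walks the record list once with a pointer that only moves forward, so the file is never re-read per trade day. Pattern 2 sorts the known periods on every day for simplicity; a production version would likely avoid that.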
How Has This Been Tested?
pytest qlib/tests/test_all_pipeline.py, run from the parent directory of qlib.
Screenshots of Test Results (if appropriate):
Types of changes