Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve pit performance #1673

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

Conversation

PaleNeutron
Copy link

@PaleNeutron PaleNeutron commented Oct 20, 2023

Description

see #1671

Consider pit data, assume we have T trade days and N report_period record:

date report_period value
0 2011-10-18 00:00:00 201103 0.318919
1 2012-03-23 00:00:00 201104 0.4039
2 2012-04-11 00:00:00 201004 0.403925
3 2012-04-11 00:00:00 200904 0.403925

We access PIT table in 3 Ways:

1. observe latest data each trade day

Just loop through table and keep only latest report_date value. consume O(N)

2. observe latest several report_period data for expression like P(Mean($$roewa_q, 2))

Read data file once.

  • Loop through trade day, slice data[:tradeday],
    • groupby report_period, get the last item.
    • return last X item

Algorithm could be improved by loop back from the end until find X different period. But groupby use C level loop which should be faster.

3. observe specific period from each trade day

Get all data belong to given period

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

image

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@github-actions github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Oct 20, 2023
@PaleNeutron
Copy link
Author

Anyone can fix main branch? CI fails due to main branch problem.

Copy link
Contributor

@Fivele-Li Fivele-Li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

if not overwrite and index_file.exists():
with open(index_file, "rb") as fi:
(first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
if interval == self.INTERVAL_quarterly:
n_years //= 4
start_year = first_year + n_years

qlib/data/pit.py Show resolved Hide resolved
qlib/utils/__init__.py Outdated Show resolved Hide resolved
@PaleNeutron
Copy link
Author

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

if not overwrite and index_file.exists():
with open(index_file, "rb") as fi:
(first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
if interval == self.INTERVAL_quarterly:
n_years //= 4
start_year = first_year + n_years

The whole dump_pit.py should be rewrited since we implement FilePitStorage. So current dump file should look like

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)

@PaleNeutron
Copy link
Author

@Fivele-Li, I think rewrite dump scripts could be done in another PR, since normal feature dump script should also be rewrited using LocalFeatureStorage and LocalCalendarStorage.

@CharlieChi
Copy link

Current online update tools seem to be incompatible with these modifications, mind check it out?

@PaleNeutron
Copy link
Author

@CharlieChi , which command failed, it's a long time since this pr created and I am not sure about current workflow.

@CharlieChi
Copy link

CharlieChi commented Sep 3, 2024

@CharlieChi , which command failed, it's a long time since this pr created and I am not sure about current workflow.

qlib/workflow/online/update.py
        start_time_buffer = get_date_by_shift(
            self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq  # pylint: disable=E1130
        )
        start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
        seg = {"test": (start_time, self.to_date)}
        return self.rmdl.get_dataset(
            start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
        )

Here,when using model with PIT features, and update preds by short time range, like a day, these dataset will return empty dataframe。 while with long time range(one year between start_time and end_time), it works fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting for triage Cannot auto-triage, wait for triage.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants