DRAFT: Trying to improve pycbc_inspiral performance on GPU #5249
spxiwh wants to merge 45 commits into gwastro:master from
Conversation
* add forward zeros after fft
* fix generate data lisa ew
* remove tlen from PSD function
* add doc-string for data generation
* update LISA EW model with changes
* handle zero noise flag
* move PSD functions to pycbc.psd and update to match utils
* update model to use reworked PSD functions
* remove unused generator object
* comment about HorizonDistance
* refactor pre-merger waveform to use a common function
* rename early_warning_wform.py to pre_merger_waveform.py
* update pre-merger model to use updated waveform functions
* fix broken import
* rename lisa_ew.py to lisa_pre_merger.py
* rename lisa_ew model to lisa_pre_merger
* remove unused imports from lisa_pre_merger.py
* remove phase marginalization
* update __init__.py for pre-merger renaming
* add some basic logging statements
* fix return value for apply_pre_merger_kernel
* do not copy output for TD to FD in apply_pre_merger_kernel
* clarify doc-strings for generate_data_lisa_pre_merger
* fix duration and cutoff time in pre-merger model
* remove `kernel_length` from `generate_waveform_lisa_pre_merger`
* correct doc-string for `generate_pre_merger_psds`
* fix doc-strings for PSDFirKernel
* raise NotImplementedError when calling `set_phase`
* split data generation and pre-processing into two functions
* fix missing dict keys in pre-process data
* update inference code to read hdf files directly
* fix coalescence time
Co-authored-by: Gareth S Cabourn Davies <gareth.cabourndavies@ligo.org>
* fix calculation of cutoff_deltat: set both forward zeros and cutoff time based on dt
* add inline comments explaining times
* Changes to make threshold work and CPU work
* Avoid memory copies in threshold
Pull request overview
This PR attempts to improve GPU performance for pycbc_inspiral by implementing batched template processing using CuPy. The approach processes ~100 templates simultaneously on the GPU, with custom CUDA kernels for template generation, matched filtering, and chi-squared computation (a minimal sketch of the batching idea follows the list below).
Key changes:
- Batched template generation via `spa_tmplt_batch` with CUDA kernels
- GPU-optimized matched filtering and chi-squared computation
- New `inspiral_utils.py` with batched GPU workflow
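To make the batching approach concrete, here is a minimal, hypothetical sketch (not the PR's actual kernels; the function name and array shapes are illustrative assumptions) of how a batch of frequency-domain templates can be correlated against one data segment and inverse-FFTed in a single pass with CuPy:

```python
import cupy as cp

def batched_correlate_and_ifft(templates_fd, data_fd):
    # templates_fd: (n_templates, n_freq) complex64 array, one row per template.
    # data_fd: (n_freq,) complex64 array, the FFT of one data segment.
    # A single elementwise operation performs every correlation in the batch.
    corr = cp.conj(templates_fd) * data_fd[cp.newaxis, :]
    # cupy.fft.ifft transforms each row independently, so the whole batch
    # is handled by one planned FFT execution rather than ~100 launches.
    return cp.fft.ifft(corr, axis=-1)
```

The per-template SNR time series would then be thresholded and clustered on the GPU, consistent with the changes listed for `pycbc/events/threshold_cupy.py`.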
Reviewed changes
Copilot reviewed 21 out of 24 changed files in this pull request and generated 33 comments.
| File | Description |
|---|---|
| bin/pycbc_inspiral | Modified to support GPU batch processing; contains hardcoded path and incomplete implementation |
| pycbc/filter/inspiral_utils.py | New GPU-batched template processing engine |
| pycbc/waveform/spa_tmplt*.py | Batched TaylorF2 template generation with CUDA kernels |
| pycbc/vetoes/chisq_cupy.py | Batched chi-squared computation with optimized CUDA kernels |
| pycbc/filter/matchedfilter.py | Extended for batch operations on GPU |
| pycbc/waveform/pre_merger_waveform.py | Unrelated LISA code mixed into PR |
| pycbc/psd/lisa_pre_merger.py | Unrelated LISA code mixed into PR |
| pycbc/inference/models/lisa_pre_merger.py | Unrelated LISA code mixed into PR |
| requirements.txt | Commented out markupsafe constraint |
| pycbc/scheme.py | CUPYScheme initialization updates |
| pycbc/types/array*.py | CuPy array type handling |
| pycbc/events/threshold_cupy.py | Batched thresholding with CUDA kernels |
| examples/inspiral/run_cupy.sh | Example script for GPU execution |
```python
# EW CODE SWAP
# def _loglikelihood(self):
```
Comments like "EW CODE SWAP" suggest this is development/experimental code that shouldn't be merged in this state. These commented-out alternatives should either be removed or properly documented if they're needed for future work.
Suggested change:
```python
# EW CODE SWAP
# def _loglikelihood(self):
# Implement the log-likelihood via `_loglr`, consistent with the
# underlying model interface used by this class.
```
```python
import logging
from math import sqrt
import numpy
import cupy
```
Importing cupy globally without checking if it's available could cause import errors on systems without cupy installed. The import should be conditional or wrapped in a try/except block that provides a helpful error message.
Suggested change:
```python
try:
    import cupy
except ImportError as e:
    raise ImportError(
        "The 'cupy' package is required for GPU-accelerated matched filtering "
        "in 'pycbc.filter.matchedfilter'. Please install cupy (see "
        "https://docs.cupy.dev/ for installation instructions) or configure "
        "PyCBC to use a CPU-only backend."
    ) from e
```
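An alternative pattern, shown here only as a sketch under the assumption that the GPU path is optional, is to defer the import until the CuPy code path is actually exercised, so the module remains importable on CPU-only systems:

```python
def _get_cupy():
    # Hypothetical helper: import lazily so a missing cupy only fails
    # when the GPU matched-filter path is actually used.
    try:
        import cupy
    except ImportError as err:
        raise ImportError(
            "cupy is required for the GPU matched-filter path; "
            "install it or select a CPU scheme"
        ) from err
    return cupy
```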
```python
raise NotImplementedError("Needs writing properly")
# Not sure this function is accessed easily. However, right now, it has not
# been properly written. A starting point is here though!
```
This comment says "Needs writing properly" and raises NotImplementedError, indicating incomplete implementation. The function should either be completed or removed before merging.
Suggested change: remove these lines.
```python
snrv, idx = self.threshold_and_clusterers[segnum].threshold_and_cluster(self.snr_threshold / norm, window)

if len(idx) == 0:
    print("NO TRIGGERS")
```
This debug print statement should be removed or replaced with proper logging before merging.
Suggested change:
```python
logger.info("NO TRIGGERS")
```
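Note the suggestion assumes a module-level `logger` exists; if it does not, a minimal setup would look like the following (the logger name is an assumption):

```python
import logging

# Hypothetical module-level logger for pycbc/filter/matchedfilter.py.
logger = logging.getLogger("pycbc.filter.matchedfilter")

if len(idx) == 0:
    logger.info("NO TRIGGERS")
```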
```python
# Copyright (C) 2018 Collin Capano
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 3 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
# Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

"""
This modules provides models that have analytic solutions for the
log likelihood.
"""

import copy
import logging

import pycbc.types

from .base import BaseModel

import pycbc.psd

from pycbc.waveform.pre_merger_waveform import (
    pre_process_data_lisa_pre_merger,
    generate_waveform_lisa_pre_merger,
)
from pycbc.psd.lisa_pre_merger import generate_pre_merger_psds
from pycbc.waveform.waveform import parse_mode_array
from pycbc.waveform.utils import apply_fseries_time_shift
from .tools import marginalize_likelihood


class LISAPreMergerModel(BaseModel):
    r"""Model for pre-merger inference in LISA.

    Parameters
    ----------
    variable_params : (tuple of) string(s)
        A tuple of parameter names that will be varied.
    static_params : Dict[str: Any]
        Dictionary of static parameters used for waveform generation.
    psd_file : str
        Path to the PSD file. Uses the same PSD file for LISA_A and LISA_E
        channels.
    **kwargs :
        All other keyword arguments are passed to ``BaseModel``.
    """
    name = "lisa_pre_merger"

    def __init__(
        self,
        variable_params,
        static_params=None,
        psd_file=None,
        **kwargs
    ):
        # Pop relevant values from kwargs
        cutoff_time = int(kwargs.pop('cutoff_time'))
        kernel_length = int(kwargs.pop('kernel_length'))
        window_length = int(kwargs.pop('window_length'))
        extra_forward_zeroes = int(kwargs.pop('extra_forward_zeroes'))
        tlen = int(kwargs.pop('tlen'))
        sample_rate = float(kwargs.pop('sample_rate'))
        data_file = kwargs.pop('data_file')

        # set up base likelihood parameters
        super().__init__(variable_params, **kwargs)

        self.static_params = parse_mode_array(static_params)

        if psd_file is None:
            raise ValueError("Must specify a PSD file!")

        # Zero phase PSDs for whitening
        # Only store the frequency-domain PSDs
        logging.info("Generating pre-merger PSDs")
        self.whitening_psds = {}
        self.whitening_psds['LISA_A'] = generate_pre_merger_psds(
            psd_file,
            sample_rate=sample_rate,
            duration=tlen,
            kernel_length=kernel_length,
        )["FD"]
        self.whitening_psds['LISA_E'] = generate_pre_merger_psds(
            psd_file,
            sample_rate=sample_rate,
            duration=tlen,
            kernel_length=kernel_length,
        )["FD"]

        # Store data for doing likelihoods.
        self.kernel_length = kernel_length
        self.window_length = window_length
        self.sample_rate = sample_rate
        self.cutoff_time = cutoff_time
        self.extra_forward_zeroes = extra_forward_zeroes
        self.tlen = tlen

        # Load the data from the file
        data = {}
        for channel in ["LISA_A", "LISA_E"]:
            data[channel] = pycbc.types.timeseries.load_timeseries(
                data_file,
                group=f"/{channel}",
            )

        # Pre-process the pre-merger data
        # Returns time-domain data
        # Uses UIDs: 4235, 4236
        logging.info("Pre-processing pre-merger data")
        pre_merger_data = pre_process_data_lisa_pre_merger(
            data,
            sample_rate=sample_rate,
            psds_for_whitening=self.whitening_psds,
            window_length=0,
            cutoff_time=self.cutoff_time,
            forward_zeroes=self.kernel_length,
        )

        self.lisa_a_strain = pre_merger_data["LISA_A"]
        self.lisa_e_strain = pre_merger_data["LISA_E"]

        # Frequency-domain data for computing log-likelihood
        self.lisa_a_strain_fd = pycbc.strain.strain.execute_cached_fft(
            self.lisa_a_strain,
            copy_output=True,
            uid=3223965
        )
        self.lisa_e_strain_fd = pycbc.strain.strain.execute_cached_fft(
            self.lisa_e_strain,
            copy_output=True,
            uid=3223967
        )
        # Data epoch
        self._epoch = self.lisa_a_strain_fd._epoch

    def get_waveforms(self, params):
        """Generate the waveforms given the parameters.

        Note: `params` should already include the static parameters.
        """
        dt = params["tc"] - self._epoch
        # Time between the end of the data and the time of coalescence
        dt_end = params.get("cutoff_deltat", self.tlen - dt)
        # Actual time between the end of the data and the cutoff time
        # since cutoff time is specified relative to the merger
        cutoff_time = self.cutoff_time - dt_end
        # Additional zeros at the beginning of the data, these:
        # - manually specified zeroes
        # - kernel length zeros
        # - zeros that will be wrapped around when the data is shifted
        forward_zeroes = (
            self.extra_forward_zeroes
            + self.kernel_length
            + int(dt_end * self.sample_rate)
        )
        # Generate the pre-merger waveform
        # These waveforms are whitened
        # Uses UIDs: 1235(0), 1236(0)
        ws = generate_waveform_lisa_pre_merger(
            params,
            psds_for_whitening=self.whitening_psds,
            window_length=self.window_length,
            sample_rate=self.sample_rate,
            cutoff_time=cutoff_time,
            forward_zeroes=forward_zeroes,
        )

        wf = {}
        # Adjust epoch to match data and shift merger to the
        # correct time.
        # Can safely set copy=False since ws won't be used again.
        for channel in ws.keys():
            wf[channel] = apply_fseries_time_shift(
                ws[channel], float(dt), copy=False,
            )
            wf[channel]._epoch = self._epoch
        return wf

    def _loglikelihood(self):
        """Compute the pre-merger log-likelihood."""
        cparams = copy.deepcopy(self.static_params)
        cparams.update(self.current_params)

        # Generate the waveforms
        wforms = self.get_waveforms(cparams)

        # Compute <h|d> for each channel
        snr_A = pycbc.filter.overlap_cplx(
            wforms["LISA_A"],
            self.lisa_a_strain_fd,
            normalized=False,
        )
        snr_E = pycbc.filter.overlap_cplx(
            wforms["LISA_E"],
            self.lisa_e_strain_fd,
            normalized=False,
        )
        # Compute <h|h> for each channel
        a_norm = pycbc.filter.sigmasq(wforms["LISA_A"])
        e_norm = pycbc.filter.sigmasq(wforms["LISA_E"])

        hs = snr_A + snr_E
        hh = (a_norm + e_norm)

        return marginalize_likelihood(complex(hs), float(hh), phase=False)
```
This file contains LISA-specific inference model code that is unrelated to the GPU performance optimization goals of this PR. It should be removed.
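For context on what the quoted `_loglikelihood` computes: with `hs` the sum of the complex overlaps ⟨d|h⟩ over the A and E channels and `hh` the corresponding sum of ⟨h|h⟩, and phase marginalization disabled, `marginalize_likelihood(hs, hh, phase=False)` should reduce to the standard Gaussian-noise log-likelihood ratio (assuming the usual PyCBC convention):

```latex
\log \mathcal{L} = \mathrm{Re}\,\langle d|h \rangle - \tfrac{1}{2}\,\langle h|h \rangle
```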
```python
# setup up the ifft we will do
self.ifft = IFFT(self.corr_mem, self.snr_mem)
# Detect if we're using CUPY scheme
import pycbc.scheme
```
This import of module pycbc.scheme is redundant, as it was previously imported on line 37.
Suggested change: remove the redundant `import pycbc.scheme`.
```python
else:
    raise ValueError(_INV_FFT_MSG.format("IFFT", itype, otype))


class BatchFFT(_BaseFFT):
```
This class does not call `_BaseFFT.__init__` during initialization (`BatchFFT.__init__` may be missing a call to the base class `__init__`).
```python
class BatchIFFT(_BaseIFFT):
    """Class for performing batched IFFTs via the cupy interface"""
    def __init__(self, invecs, outvecs, batch_size):
```
This class does not call `_BaseIFFT.__init__` during initialization (`BatchIFFT.__init__` may be missing a call to the base class `__init__`).
Suggested change:
```python
def __init__(self, invecs, outvecs, batch_size):
    # Initialize base _BaseIFFT with a representative pair and batch size
    super(BatchIFFT, self).__init__(invecs[0], outvecs[0], nbatch=batch_size, size=None)
```
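A hedged alternative sketch, assuming `_BaseIFFT.__init__(self, invec, outvec, nbatch, size)` performs the type and length validation (the exact signature should be checked against `pycbc/fft/core.py`; the stored attribute names here are illustrative):

```python
class BatchIFFT(_BaseIFFT):
    """Class for performing batched IFFTs via the cupy interface."""

    def __init__(self, invecs, outvecs, batch_size):
        # Run the base class's validation once, using the first
        # (invec, outvec) pair as representative of the whole batch.
        super().__init__(invecs[0], outvecs[0], batch_size, len(outvecs[0]))
        self.invecs = invecs
        self.outvecs = outvecs
        self.batch_size = batch_size
```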
```python
    is_cupy = hasattr(test_arr._data, 'device')
    use_batched_gen = all_spatmplt and no_compression and is_cupy
except ImportError:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
```python
# Optional GPU/CuPy support is not available; fall back to non-batched generation.
logging.debug("CuPy/CUPYScheme not available; using non-batched waveform generation.")
```
```python
kernel, latency, sample_rate = self.psd_to_linear_phase_whitening_fir_kernel(psd)
kernel, phase = self.linear_phase_fir_kernel_to_minimum_phase_whitening_fir_kernel(kernel, sample_rate)

# get merger model for SNR = 1.
f_psd = psd.f0 + np.arange(len(psd.data.data)) * psd.deltaF
horizon_distance = HorizonDistance(f_low, f_psd[-1], psd.deltaF, m1, m2)
f_model, model = horizon_distance(psd, 1.)[1]

# find the range of frequency bins covered by the merger
# model
kmin, kmax = f_psd.searchsorted(f_model[0]), f_psd.searchsorted(f_model[-1]) + 1

# compute SNR=1 model's (d SNR^2 / df) spectral density
unit_snr2_density = np.zeros_like(phase)
unit_snr2_density[kmin:kmax] = model / psd.data.data[kmin:kmax]

# integrate across each frequency bin, converting to
# snr^2/bin. NOTE: this step is here for the record, but
# is commented out because it has no effect on the result
# given the renormalization that occurs next.
#unit_snr2_density *= psd.deltaF

# take 16th root, then normalize so max=1. why? I don't
# know, just feels good, on the whole.
unit_snr2_density = unit_snr2_density**(1./16)
unit_snr2_density /= unit_snr2_density.max()

# record phase vector and SNR^2 density vector
self.target_phase = phase
self.target_phase_mask = unit_snr2_density
```
This statement is unreachable.
Suggested change: remove the unreachable block.
@xangma and I have been playing about on and off over the last year with trying to make `pycbc_inspiral` perform more efficiently on the GPU (largely also using this as a test of how powerful AI-assisted tools can be for stuff like this). We've moved to a batch model (where we process O(100) templates at the same time) and have a number of additional cupy codes and underlying cupy-cuda kernels in here to enable this. The code agrees within the sort of precision we might expect for GPU vs CPU on the standard `examples/inspiral` test code.

HOWEVER, while this is better than what we've had before, we're nowhere near the "ideal" speedups that GPUs should be able to achieve. (I reckon we're seeing a speedup of between 10x and 100x relative to running on a single CPU, but there are all sorts of caveats in those numbers and I don't want to present this as a "proper" benchmark.) Still, all the code is here if someone else wants to try this out! Just run `examples/inspiral/run_cupy.sh`. We pulled most of the main processing loop out into an `inspiral_utils.py` to make it easier to poke about with this. (Some bits are not yet there, e.g. we only have TaylorF2 waveforms working, and not all of the less-used chi-squared tests are batched, but these could be added if this works in principle.)

SO @ahnitz @josh-willis @titodalcanton, if this is of interest or useful, please feel free to have a look over. Currently the chi-squared seems to be the slowest part, but I don't think I know enough about what the GPU actually does under the hood to have a proper sense of where to focus further optimization.

... I seem to have this mixed up with some LISA code as well. I won't clean that up at this point unless we think this gets to the point where it's worth merging.
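On the chi-squared being the slowest part: for reference, the batched form of the standard Allen power chi-squared is, in outline, a per-bin reduction across the batch. A hedged sketch with illustrative shapes and names, not the PR's CUDA kernel:

```python
import cupy as cp

def batched_power_chisq(bin_snrs, total_snr, nbins):
    # bin_snrs: (n_templates, nbins) complex array of per-bin filter
    # outputs at the trigger sample; total_snr: (n_templates,) complex.
    # Allen chi^2 with p bins: p * sum_l |rho_l - rho / p|^2.
    expected = total_snr[:, cp.newaxis] / nbins
    return nbins * cp.sum(cp.abs(bin_snrs - expected) ** 2, axis=1)
```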