DRAFT: Trying to improve pycbc_inspiral performance on GPU #5249

Open

spxiwh wants to merge 45 commits into gwastro:master from spxiwh:gpu_dev2

Conversation

spxiwh (Contributor) commented Jan 9, 2026

@xangma and I have been playing about on and off over the last year with trying to make pycbc_inspiral run more efficiently on the GPU (largely also using this as a test of how powerful AI-assisted tools can be for this kind of work). We've moved to a batch model, where we process O(100) templates at the same time, and this PR adds a number of CuPy codes and underlying CUDA kernels to enable that. With the standard examples/inspiral test code, the output agrees with the CPU code to within the sort of precision we would expect for GPU vs CPU arithmetic.
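
As a minimal sketch of the batching idea (with illustrative names, not the PR's actual API): the frequency-domain templates are stacked into a single 2D CuPy array, so the correlation and the inverse FFT each run once per batch rather than once per template.

import cupy as cp

# Sketch only: filter a batch of frequency-domain templates against one
# data segment. Shapes assumed: htilde (nbatch, nfreq) complex64,
# stilde (nfreq,) complex64, inv_psd (nfreq,) float32.
def batch_matched_filter(htilde, stilde, inv_psd):
    # Broadcasting applies the single data segment to every template row.
    corr = cp.conj(htilde) * stilde * inv_psd
    # cuFFT treats the leading axis as a batch dimension in a single call.
    return cp.fft.ifft(corr, axis=-1)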

HOWEVER, while this is better than what we've had before, we're nowhere near the "ideal" speedups that GPUs should be able to achieve. (I reckon we're seeing a speedup of somewhere between 10x and 100x relative to running on a single CPU, but there are all sorts of caveats in those numbers and I don't want to present this as a "proper" benchmark.) Still, all the code is here if someone else wants to try this out: just run examples/inspiral/run_cupy.sh. We pulled most of the main processing loop out into inspiral_utils.py to make it easier to poke about with. (Some bits are not yet there, e.g. we only have TaylorF2 waveforms working, and not all of the less-used chi-squared tests are batched, but these could be added if the approach works in principle.)
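
For flavour, the underlying raw kernels look roughly like this (a hypothetical sketch; the actual kernels in this branch live in the *_cupy modules and differ in detail):

import cupy as cp

# Hypothetical batched correlation kernel: one y-block per template.
_batch_correlate = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void batch_correlate(const complex<float>* htilde,  // (nbatch, nfreq)
                     const complex<float>* stilde,  // (nfreq,)
                     complex<float>* out,           // (nbatch, nfreq)
                     const int nfreq)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // frequency bin
    int b = blockIdx.y;                             // template in batch
    if (i < nfreq) {
        // conj(h) * s: the matched-filter integrand for one bin
        out[b * nfreq + i] = conj(htilde[b * nfreq + i]) * stilde[i];
    }
}
''', 'batch_correlate')

# Launched over a 2D grid, e.g.:
# _batch_correlate(((nfreq + 255) // 256, nbatch), (256, 1),
#                  (htilde, stilde, out, cp.int32(nfreq)))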

SO @ahnitz @josh-willis @titodalcanton: if this is of interest or useful, please feel free to have a look over it. Currently the chi-squared computation seems to be the slowest part, but I don't think I know enough about what the GPU actually does under the hood to have a proper sense of where to focus further optimization.
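
(If anyone wants to poke at the hotspots: cupyx.profiler.benchmark times kernels with CUDA events, which avoids the usual trap of timing asynchronous GPU launches with time.time(). The call below is only a sketch; batch_chisq and its arguments are placeholders for whichever step you want to time.)

from cupyx.profiler import benchmark

# Reports both CPU-side and GPU-side times, synchronizing per repeat.
print(benchmark(batch_chisq, (corr_batch, num_bins), n_repeat=50))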

... I seem to have some LISA code mixed up in this as well. I won't clean that up at this point unless we think this gets to the point where it's worth merging.

spxiwh and others added 30 commits December 5, 2024 09:47
* add forward zeros after fft

* fix generate data lisa ew

* remove tlen from PSD function

* add doc-string for data generation

* update LISA EW model with changes

* handle zero noise flag

* move PSD functions to pycbc.psd and update to match utils

* update model to use reworked PSD functions

* remove unused generator object

* comment about HorizonDistance

* refactor pre-merger waveform to use a common function

* rename early_warning_wform.py to pre_merger_waveform.py

* update pre-merger model to use updated waveform functions

* fix broken import

* rename lisa_ew.py to lisa_pre_merger.py

* rename lisa_ew model to lisa_pre_merger

* remove unused imports from lisa_pre_merger.py

* remove phase marginalization

* update __init__.py for pre-merger renaming

* add some basic logging statements

* fix return value for apply_pre_merger_kernel

* do not copy output for TD to FD in apply_pre_merger_kernel

* clarify doc-strings for generate_data_lisa_pre_merger

* fix duration and cutoff time in pre-merger model

* remove `kernel_length` from `generate_waveform_lisa_pre_merger`

* correct doc-string for `generate_pre_merger_psds`

* fix doc-strings for PSDFirKernel

* raise NotImplementedError when calling `set_phase`

* split data generation and pre-processing into two functions

* fix missing dict keys in pre-process data

* update inference code to read hdf files directly

* fix coalescence time
Co-authored-by: Gareth S Cabourn Davies <gareth.cabourndavies@ligo.org>
* fix calculation of cutoff_deltat

Set both forward zeros and cutoff time based on dt

* add inline comments explaining times
Copilot AI review requested due to automatic review settings January 9, 2026 13:28

Copilot AI left a comment

Pull request overview

This PR attempts to improve GPU performance for pycbc_inspiral by implementing batched template processing using CuPy. The approach processes ~100 templates simultaneously on GPU with custom CUDA kernels for template generation, matched filtering, and chi-squared computation.

Key changes:

  • Batched template generation via spa_tmplt_batch with CUDA kernels
  • GPU-optimized matched filtering and chi-squared computation
  • New inspiral_utils.py with batched GPU workflow

Reviewed changes

Copilot reviewed 21 out of 24 changed files in this pull request and generated 33 comments.

Show a summary per file

File | Description
bin/pycbc_inspiral | Modified to support GPU batch processing; contains a hardcoded path and an incomplete implementation
pycbc/filter/inspiral_utils.py | New GPU-batched template processing engine
pycbc/waveform/spa_tmplt*.py | Batched TaylorF2 template generation with CUDA kernels
pycbc/vetoes/chisq_cupy.py | Batched chi-squared computation with optimized CUDA kernels
pycbc/filter/matchedfilter.py | Extended for batch operations on GPU
pycbc/waveform/pre_merger_waveform.py | Unrelated LISA code mixed into the PR
pycbc/psd/lisa_pre_merger.py | Unrelated LISA code mixed into the PR
pycbc/inference/models/lisa_pre_merger.py | Unrelated LISA code mixed into the PR
requirements.txt | Commented-out markupsafe constraint
pycbc/scheme.py | CUPYScheme initialization updates
pycbc/types/array*.py | CuPy array type handling
pycbc/events/threshold_cupy.py | Batched thresholding with CUDA kernels
examples/inspiral/run_cupy.sh | Example script for GPU execution


Comment on lines +161 to +162
# EW CODE SWAP
# def _loglikelihood(self):
Copilot AI Jan 9, 2026

Comments like "EW CODE SWAP" suggest this is development/experimental code that shouldn't be merged in this state. These commented-out alternatives should either be removed or properly documented if they're needed for future work.

Suggested change
# EW CODE SWAP
# def _loglikelihood(self):
# Implement the log-likelihood via `_loglr`, consistent with the
# underlying model interface used by this class.

import logging
from math import sqrt
import numpy
import cupy
Copilot AI Jan 9, 2026

Importing cupy globally without checking if it's available could cause import errors on systems without cupy installed. The import should be conditional or wrapped in a try/except block that provides a helpful error message.

Suggested change
import cupy
try:
    import cupy
except ImportError as e:
    raise ImportError(
        "The 'cupy' package is required for GPU-accelerated matched filtering "
        "in 'pycbc.filter.matchedfilter'. Please install cupy (see "
        "https://docs.cupy.dev/ for installation instructions) or configure "
        "PyCBC to use a CPU-only backend."
    ) from e

Comment on lines +208 to +210
raise NotImplementedError("Needs writing properly")
# Not sure this function is accessed easily. However, right now, it has not
# been properly written. A starting point is here though!
Copilot AI Jan 9, 2026

This comment says "Needs writing properly" and raises NotImplementedError, indicating incomplete implementation. The function should either be completed or removed before merging.

Suggested change
raise NotImplementedError("Needs writing properly")
# Not sure this function is accessed easily. However, right now, it has not
# been properly written. A starting point is here though!

snrv, idx = self.threshold_and_clusterers[segnum].threshold_and_cluster(self.snr_threshold / norm, window)

if len(idx) == 0:
    print("NO TRIGGERS")
Copilot AI Jan 9, 2026

This debug print statement should be removed or replaced with proper logging before merging.

Suggested change
    print("NO TRIGGERS")
    logger.info("NO TRIGGERS")

Comment on lines +1 to +213
# Copyright (C) 2018 Collin Capano
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 3 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
# Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

"""
This modules provides models that have analytic solutions for the
log likelihood.
"""

import copy
import logging

import pycbc.types

from .base import BaseModel

import pycbc.psd
# used below for overlap_cplx/sigmasq and execute_cached_fft
import pycbc.filter
import pycbc.strain

from pycbc.waveform.pre_merger_waveform import (
    pre_process_data_lisa_pre_merger,
    generate_waveform_lisa_pre_merger,
)
from pycbc.psd.lisa_pre_merger import generate_pre_merger_psds
from pycbc.waveform.waveform import parse_mode_array
from pycbc.waveform.utils import apply_fseries_time_shift
from .tools import marginalize_likelihood


class LISAPreMergerModel(BaseModel):
    r"""Model for pre-merger inference in LISA.

    Parameters
    ----------
    variable_params : (tuple of) string(s)
        A tuple of parameter names that will be varied.
    static_params : Dict[str: Any]
        Dictionary of static parameters used for waveform generation.
    psd_file : str
        Path to the PSD file. Uses the same PSD file for LISA_A and LISA_E
        channels.
    **kwargs :
        All other keyword arguments are passed to ``BaseModel``.
    """
    name = "lisa_pre_merger"

    def __init__(
        self,
        variable_params,
        static_params=None,
        psd_file=None,
        **kwargs
    ):
        # Pop relevant values from kwargs
        cutoff_time = int(kwargs.pop('cutoff_time'))
        kernel_length = int(kwargs.pop('kernel_length'))
        window_length = int(kwargs.pop('window_length'))
        extra_forward_zeroes = int(kwargs.pop('extra_forward_zeroes'))
        tlen = int(kwargs.pop('tlen'))
        sample_rate = float(kwargs.pop('sample_rate'))
        data_file = kwargs.pop('data_file')

        # set up base likelihood parameters
        super().__init__(variable_params, **kwargs)

        self.static_params = parse_mode_array(static_params)

        if psd_file is None:
            raise ValueError("Must specify a PSD file!")

        # Zero phase PSDs for whitening
        # Only store the frequency-domain PSDs
        logging.info("Generating pre-merger PSDs")
        self.whitening_psds = {}
        self.whitening_psds['LISA_A'] = generate_pre_merger_psds(
            psd_file,
            sample_rate=sample_rate,
            duration=tlen,
            kernel_length=kernel_length,
        )["FD"]
        self.whitening_psds['LISA_E'] = generate_pre_merger_psds(
            psd_file,
            sample_rate=sample_rate,
            duration=tlen,
            kernel_length=kernel_length,
        )["FD"]

        # Store data for doing likelihoods.
        self.kernel_length = kernel_length
        self.window_length = window_length
        self.sample_rate = sample_rate
        self.cutoff_time = cutoff_time
        self.extra_forward_zeroes = extra_forward_zeroes
        self.tlen = tlen

        # Load the data from the file
        data = {}
        for channel in ["LISA_A", "LISA_E"]:
            data[channel] = pycbc.types.timeseries.load_timeseries(
                data_file,
                group=f"/{channel}",
            )

        # Pre-process the pre-merger data
        # Returns time-domain data
        # Uses UIDs: 4235, 4236
        logging.info("Pre-processing pre-merger data")
        pre_merger_data = pre_process_data_lisa_pre_merger(
            data,
            sample_rate=sample_rate,
            psds_for_whitening=self.whitening_psds,
            window_length=0,
            cutoff_time=self.cutoff_time,
            forward_zeroes=self.kernel_length,
        )

        self.lisa_a_strain = pre_merger_data["LISA_A"]
        self.lisa_e_strain = pre_merger_data["LISA_E"]

        # Frequency-domain data for computing log-likelihood
        self.lisa_a_strain_fd = pycbc.strain.strain.execute_cached_fft(
            self.lisa_a_strain,
            copy_output=True,
            uid=3223965
        )
        self.lisa_e_strain_fd = pycbc.strain.strain.execute_cached_fft(
            self.lisa_e_strain,
            copy_output=True,
            uid=3223967
        )
        # Data epoch
        self._epoch = self.lisa_a_strain_fd._epoch

    def get_waveforms(self, params):
        """Generate the waveforms given the parameters.

        Note: `params` should already include the static parameters.
        """
        dt = params["tc"] - self._epoch
        # Time between the end of the data and the time of coalescence
        dt_end = params.get("cutoff_deltat", self.tlen - dt)
        # Actual time between the end of the data and the cutoff time
        # since cutoff time is specified relative to the merger
        cutoff_time = self.cutoff_time - dt_end
        # Additional zeros at the beginning of the data, these are:
        # - manually specified zeroes
        # - kernel length zeros
        # - zeros that will be wrapped around when the data is shifted
        forward_zeroes = (
            self.extra_forward_zeroes
            + self.kernel_length
            + int(dt_end * self.sample_rate)
        )
        # Generate the pre-merger waveform
        # These waveforms are whitened
        # Uses UIDs: 1235(0), 1236(0)
        ws = generate_waveform_lisa_pre_merger(
            params,
            psds_for_whitening=self.whitening_psds,
            window_length=self.window_length,
            sample_rate=self.sample_rate,
            cutoff_time=cutoff_time,
            forward_zeroes=forward_zeroes,
        )

        wf = {}
        # Adjust epoch to match data and shift merger to the
        # correct time.
        # Can safely set copy=False since ws won't be used again.
        for channel in ws.keys():
            wf[channel] = apply_fseries_time_shift(
                ws[channel], float(dt), copy=False,
            )
            wf[channel]._epoch = self._epoch
        return wf

    def _loglikelihood(self):
        """Compute the pre-merger log-likelihood."""
        cparams = copy.deepcopy(self.static_params)
        cparams.update(self.current_params)

        # Generate the waveforms
        wforms = self.get_waveforms(cparams)

        # Compute <h|d> for each channel
        snr_A = pycbc.filter.overlap_cplx(
            wforms["LISA_A"],
            self.lisa_a_strain_fd,
            normalized=False,
        )
        snr_E = pycbc.filter.overlap_cplx(
            wforms["LISA_E"],
            self.lisa_e_strain_fd,
            normalized=False,
        )
        # Compute <h|h> for each channel
        a_norm = pycbc.filter.sigmasq(wforms["LISA_A"])
        e_norm = pycbc.filter.sigmasq(wforms["LISA_E"])

        hs = snr_A + snr_E
        hh = a_norm + e_norm

        return marginalize_likelihood(complex(hs), float(hh), phase=False)
Copilot AI Jan 9, 2026

This file contains LISA-specific inference model code that is unrelated to the GPU performance optimization goals of this PR. It should be removed.

# set up the ifft we will do
self.ifft = IFFT(self.corr_mem, self.snr_mem)
# Detect if we're using CUPY scheme
import pycbc.scheme
Copilot AI Jan 9, 2026

This import of module pycbc.scheme is redundant, as it was previously imported on line 37.

Suggested change
import pycbc.scheme

else:
    raise ValueError(_INV_FFT_MSG.format("IFFT", itype, otype))

class BatchFFT(_BaseFFT):
Copilot AI Jan 9, 2026

This class does not call `_BaseFFT.__init__` during initialization. (`BatchFFT.__init__` may be missing a call to a base class `__init__`.)


class BatchIFFT(_BaseIFFT):
    """Class for performing batched IFFTs via the cupy interface"""
    def __init__(self, invecs, outvecs, batch_size):
Copilot AI Jan 9, 2026

This class does not call `_BaseIFFT.__init__` during initialization. (`BatchIFFT.__init__` may be missing a call to a base class `__init__`.)

Suggested change
def __init__(self, invecs, outvecs, batch_size):
def __init__(self, invecs, outvecs, batch_size):
    # Initialize base _BaseIFFT with a representative pair and batch size
    super(BatchIFFT, self).__init__(invecs[0], outvecs[0], nbatch=batch_size, size=None)

    is_cupy = hasattr(test_arr._data, 'device')
    use_batched_gen = all_spatmplt and no_compression and is_cupy
except ImportError:
    pass
Copilot AI Jan 9, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
pass
# Optional GPU/CuPy support is not available; fall back to non-batched generation.
logging.debug("CuPy/CUPYScheme not available; using non-batched waveform generation.")

Comment on lines +34 to +64
kernel, latency, sample_rate = self.psd_to_linear_phase_whitening_fir_kernel(psd)
kernel, phase = self.linear_phase_fir_kernel_to_minimum_phase_whitening_fir_kernel(kernel, sample_rate)

# get merger model for SNR = 1.
f_psd = psd.f0 + np.arange(len(psd.data.data)) * psd.deltaF
horizon_distance = HorizonDistance(f_low, f_psd[-1], psd.deltaF, m1, m2)
f_model, model = horizon_distance(psd, 1.)[1]

# find the range of frequency bins covered by the merger
# model
kmin, kmax = f_psd.searchsorted(f_model[0]), f_psd.searchsorted(f_model[-1]) + 1

# compute SNR=1 model's (d SNR^2 / df) spectral density
unit_snr2_density = np.zeros_like(phase)
unit_snr2_density[kmin:kmax] = model / psd.data.data[kmin:kmax]

# integrate across each frequency bin, converting to
# snr^2/bin. NOTE: this step is here for the record, but
# is commented out because it has no effect on the result
# given the renormalization that occurs next.
#unit_snr2_density *= psd.deltaF

# take 16th root, then normalize so max=1. why? I don't
# know, just feels good, on the whole.
unit_snr2_density = unit_snr2_density**(1./16)
unit_snr2_density /= unit_snr2_density.max()

# record phase vector and SNR^2 density vector
self.target_phase = phase
self.target_phase_mask = unit_snr2_density

Copilot AI Jan 9, 2026

This statement is unreachable.

Suggested change
kernel, latency, sample_rate = self.psd_to_linear_phase_whitening_fir_kernel(psd)
kernel, phase = self.linear_phase_fir_kernel_to_minimum_phase_whitening_fir_kernel(kernel, sample_rate)
# get merger model for SNR = 1.
f_psd = psd.f0 + np.arange(len(psd.data.data)) * psd.deltaF
horizon_distance = HorizonDistance(f_low, f_psd[-1], psd.deltaF, m1, m2)
f_model, model = horizon_distance(psd, 1.)[1]
# find the range of frequency bins covered by the merger
# model
kmin, kmax = f_psd.searchsorted(f_model[0]), f_psd.searchsorted(f_model[-1]) + 1
# compute SNR=1 model's (d SNR^2 / df) spectral density
unit_snr2_density = np.zeros_like(phase)
unit_snr2_density[kmin:kmax] = model / psd.data.data[kmin:kmax]
# integrate across each frequency bin, converting to
# snr^2/bin. NOTE: this step is here for the record, but
# is commented out because it has no effect on the result
# given the renormalization that occurs next.
#unit_snr2_density *= psd.deltaF
# take 16th root, then normalize so max=1. why? I don't
# know, just feels good, on the whole.
unit_snr2_density = unit_snr2_density**(1./16)
unit_snr2_density /= unit_snr2_density.max()
# record phase vector and SNR^2 density vector
self.target_phase = phase
self.target_phase_mask = unit_snr2_density
