This directory contains scripts for dataset building, model training, and evaluation. To reproduce the experiments in the paper, follow these steps:
Note

We recommend downloading the dataset and preprocessing the data by following the data section in our manuscript. However, to make a quick trial easier, we provide a script, create_dataset_from_STEAD.py, that generates raw_waveforms.h5 from the STEAD dataset (linked in the script). This STEAD example will NOT reproduce the results reported in our manuscript.
To set up the environment and install all dependencies follow the steps below.
- First, download the tqdne code. There are two ways:

  a) Recommended: download the latest release if you do not require the commit history. Releases have been tested and reproduced by us and our partners.

  b) Alternatively, clone the repository using:

     git clone --depth 1 https://github.com/highfem/tqdne.git

  Omit --depth 1 if you want access to the full commit history.
- Second, create and activate a conda environment. Again, there are multiple options:

  a) If you prefer to create the environment in conda's default path, use:

     conda env create -f envs/environment.yaml
     conda activate tqdne

  If conda is not installed, download it from Miniconda.

  b) If you prefer to install the environment in a custom path, e.g., in cluster environments, run:

     conda env create -f envs/environment.yaml -p <PATH>
     conda activate <PATH>

  Replace <PATH> with your desired installation directory.
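After activating the environment, a quick import check can confirm that it resolved correctly. The sketch below is not part of the repository, and the module names passed to it are placeholders; substitute the dependencies pinned in envs/environment.yaml (e.g. torch, h5py).

```python
import importlib.util

def env_ok(modules):
    """Return True if every module in `modules` is importable."""
    return all(importlib.util.find_spec(m) is not None for m in modules)

# Placeholder check -- swap in the packages your experiments actually need,
# e.g. env_ok(("torch", "h5py")):
print(env_ok(("importlib", "json")))  # prints True
```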
To make running experiments as easy as possible, we expect the user to adopt the following folder structure. We refer to the base folder as workdir; all results will be stored there automatically. The structure is as follows:
workdir/
/data/
/data/preprocessed_waveforms.h5
/data/raw_waveforms.h5
/evaluation/
/figures/
/outputs/
/outputs/Autoencoder-32x32x4-LogSpectrogram
/outputs/Latent-EDM-32x32x4-LogSpectrogram

To create each file, follow the steps below.
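The empty layout can be created up front. A minimal Python sketch using pathlib, with folder names taken from the structure above (the .h5 files in data/ are produced by the scripts, not by this snippet):

```python
from pathlib import Path

# Create the expected workdir layout; the .h5 files in data/ are
# generated later by create_dataset_from_STEAD.py and build_dataset.py.
workdir = Path("workdir")
for sub in (
    "data",
    "evaluation",
    "figures",
    "outputs/Autoencoder-32x32x4-LogSpectrogram",
    "outputs/Latent-EDM-32x32x4-LogSpectrogram",
):
    (workdir / sub).mkdir(parents=True, exist_ok=True)
```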
Download the STEAD Dataset, extract the *.zip file in this directory, and then simply run create_dataset_from_STEAD.py. It will generate
workdir/
/data/raw_waveforms.h5

Note that some of the parameters (e.g., the length of the waveforms, the data sampling rate, vs30 values, and the starting time of the waveforms) are hardcoded; please adjust them accordingly.
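To illustrate the kind of hardcoded settings to review, consider the sketch below. The names and values are hypothetical, not copied from the repository; check create_dataset_from_STEAD.py for the actual ones and edit them in the script itself.

```python
# Hypothetical illustration of parameters hardcoded in
# create_dataset_from_STEAD.py -- the real names and values may differ,
# so edit the script itself rather than copying these.
STEAD_PARAMS = {
    "waveform_length": 6000,  # samples per trace (hypothetical)
    "sampling_rate": 100.0,   # Hz (hypothetical)
    "default_vs30": 760.0,    # m/s, fallback when vs30 is missing
    "start_time_s": 0.0,      # offset from the trace start, in seconds
}

# e.g., the implied trace duration:
duration_s = STEAD_PARAMS["waveform_length"] / STEAD_PARAMS["sampling_rate"]
print(duration_s)  # 60.0
```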
Run build_dataset.py to create the cleaned dataset
workdir/
/data/preprocessed_waveforms.h5

from the raw_waveforms.h5 file.
Run train_classifier.py to train a classifier predicting the earthquake distance-magnitude bin. This classifier will be used to evaluate the generated data. Model checkpoints will be saved as
workdir/
/outputs/Classifier-LogSpectrogram

Train the latent diffusion model using the EDM diffusion framework:
- Run train_autoencoder.py to train the autoencoder, the first stage of the latent diffusion model.
- Run train_latent_edm.py to train the diffusion model, the second stage of the latent diffusion model.
This will create
workdir/
/outputs/Autoencoder-32x32x4-LogSpectrogram
/outputs/Latent-EDM-LogSpectrogram

Make sure to create a soft link best.ckpt in /outputs/Autoencoder-32x96x4-LogSpectrogram so that the best checkpoint is used for training the latent EDM.
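The soft link can be created with ln -s, or in Python as sketched below. The directory name is taken from the output listing above and "epoch-009.ckpt" is a placeholder; substitute the folder and filename of your actual best checkpoint.

```python
from pathlib import Path

# Sketch: expose the best autoencoder checkpoint as best.ckpt so that
# latent EDM training picks it up. "epoch-009.ckpt" is a placeholder --
# use the filename of your actual best checkpoint.
ae_dir = Path("workdir/outputs/Autoencoder-32x32x4-LogSpectrogram")
ae_dir.mkdir(parents=True, exist_ok=True)
ckpt = ae_dir / "epoch-009.ckpt"
ckpt.touch(exist_ok=True)      # placeholder; your real checkpoint already exists
link = ae_dir / "best.ckpt"
if link.is_symlink() or link.exists():
    link.unlink()
link.symlink_to(ckpt.name)     # relative link within the same folder
```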
Conduct the following ablation studies:
- (No Latent) Diffusion: Run train_edm.py to train the diffusion model without the autoencoder.
- 1D Diffusion: Run train_1d_edm.py to train the diffusion model generating the signal in the time domain instead of the log-spectrogram.
- Latent 1D Diffusion: Run train_1d_autoencoder.py followed by train_1d_latent_edm.py to train the latent diffusion model generating the signal in the time domain.
This will create
workdir/
/outputs/Autoencoder-1024x16-MovingAvg
/outputs/Latent-EDM-MovingAvg
/outputs/EDM-LogSpectrogram
/outputs/EDM-MovingAvg

Run evaluate.py to generate synthetic seismograms conditioned on the parameters of real seismograms and evaluate them using the classifier trained in step 2. The model and classifier checkpoints and the dataset split (train, validation, or full) must be specified. Waveforms generated using the conditional features of the corresponding dataset split will be saved in an .h5 file in a specified subfolder of the outputs directory, along with the real waveforms and classifier predictions. This file can be read by the evaluate.ipynb notebook to compute metrics and generate the figures presented in the paper. Check the script documentation for usage details. Example call:
python evaluate.py --split "test" --batch_size 32

Run generate.py with a model checkpoint as an argument to generate synthetic seismograms. The generated data will be saved as an .h5 file. Check the script documentation for usage details. Example call:
python generate.py \
--hypocentral_distance 10.0 \
--magnitude 5.5 \
--vs30 760 \
--hypocentre_depth 10.0 \
--azimuthal_gap 130 \
--num_samples 100 \
--outfile workdir/generated_waveforms.h5 \
--batch_size 32

Although the KiK-net and K-NET datasets are freely accessible, their licenses prohibit redistribution. Consequently, we cannot share the preprocessed data required for full reproducibility.
Instead, you can access all preprocessing scripts in tqdne/scripts/preprocessing.
For a fully reproducible example, we provide a small dataset derived from the STEAD repository (Mousavi et al., 2019) — https://github.com/smousavi05/STEAD.
To reproduce the basic HighFEM analysis with this STEAD sample, follow the steps outlined in Supplementary Text S3.
python stead_download.py --local_path /absolute/path/to/STEAD

Edit local_path inside stead_download.py or pass the --local_path flag so that files are stored where you want.
python create_dataset_from_STEAD.py \
--file_name /absolute/path/to/STEAD/waveforms \
--csv_file /absolute/path/to/STEAD/metadata.csv \
--output_file_path /absolute/path/to/data/raw_waveforms.h5

Update the script arguments (or the variables inside the script) to point to the freshly downloaded STEAD files and choose an output location.
python build_dataset.py --workdir /absolute/path/to/data

This command assumes the file data/raw_waveforms.h5 is inside the specified --workdir.
torchrun \
--nproc_per_node=4 \
train_autoencoder.py \
--workdir experiments/workdir/stead

torchrun \
--nproc_per_node=4 \
train_latent_edm.py \
--workdir experiments/workdir/stead \
--batchsize 128

torchrun --nproc_per_node=1 generate_stead.py \
--workdir experiments/workdir/stead \
--outfile experiments/workdir/stead/gwm_stead_v1.h5 \
--edm_checkpoint experiments/workdir/stead/outputs/Latent-EDM-32x32x4-LogSpectrogram/last.ckpt \
--autoencoder_checkpoint experiments/workdir/stead/outputs/Autoencoder-32x32x4-LogSpectrogram/last.ckpt

Open and run Residual_plot_stead.ipynb to plot residuals for the generated waveforms.