Synthetic Data Generation with Gaussian Mixture Models

Synthetic data is a valuable asset for testing newly developed methods. This repository provides an implementation of a synthetic data generation algorithm in Python using Gaussian Mixture Models.
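For reference, a two-dimensional Gaussian mixture with K components can be written as below. This form assumes that the x and y coordinates are independent within each component, which the separate x and y standard deviations in the usage example suggest but which this README does not state explicitly.

$$
p(x, y) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left(x \mid \mu_{x,k}, \sigma_{x,k}^{2}\right) \mathcal{N}\!\left(y \mid \mu_{y,k}, \sigma_{y,k}^{2}\right), \qquad \sum_{k=1}^{K} w_k = 1
$$

Under that assumption, $w_k$ corresponds to `weights`, $\mu_{x,k}$ and $\sigma_{x,k}$ to `x_mu` and `x_sd`, and $\mu_{y,k}$ and $\sigma_{y,k}$ to `y_mu` and `y_sd`.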

Getting Started

Installation

  1. Clone this repository:

    git clone https://github.com/MekanMyradov/Simulations.git
    cd Simulations

  2. Create and activate a conda environment:

    conda env create -f environment.yml
    conda activate simulate_batches

Usage Examples

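The following example simulates two batches that share the same three components (cell types) but differ in their mixing proportions, then saves the combined data as an AnnData file: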

from Batch import Batch
import numpy as np
import anndata as ad
import pandas as pd

np.random.seed(0)

# Create 2 batches with same means and standard deviations
x_mu = np.array([0, 5, 5])    # x means
x_sd = np.array([1, 0.1, 1])  # x standard deviations

y_mu = np.array([5, 5, 0])    # y means
y_sd = np.array([1, 0.1, 1])  # y standard deviations


# Weights of components
weights01 = np.array([0.30, 0.50, 0.20])    # fractions of cell types for batch 1
weights02 = np.array([0.65, 0.30, 0.05])    # fractions of cell types for batch 2


n_cells = 1000  # number of cells
n_genes = 100   # number of genes


# Random (n_genes x 2) projection matrix, sampled from a standard normal
# distribution, used to map the 2-D data into n_genes-dimensional space
proj = np.random.randn(n_genes * 2).reshape(n_genes, 2)

colors = np.array(["#0000FF", "#964B00", "#FFD700"])     # colors of components [blue, brown, gold]

batch01 = Batch(
    x_mu=x_mu,
    x_sd=x_sd,
    y_mu=y_mu,
    y_sd=y_sd,
    weights=weights01,
    colors=colors,
    proj=proj,
    batch_id="Batch_01",
    n_cells=n_cells,
    n_genes=n_genes,
    gene_specific=False     # no need for gene specific noise (batch effect) in batch 1
)

batch02 = Batch(
    x_mu=x_mu,
    x_sd=x_sd,
    y_mu=y_mu,
    y_sd=y_sd,
    weights=weights02,
    colors=colors,
    proj=proj,
    batch_id="Batch_02",
    n_cells=n_cells,
    n_genes=n_genes,
    gene_specific=True     # employ gene specific noise (batch effect)
)

# Generate n_cells x 2 data
batch01.generate_data()
batch02.generate_data()

# Save the plot of low-dimensional data
batch01.save_clusters()
batch02.save_clusters()

# Project the 2-D data into n_genes-dimensional space
data01 = batch01.project_data()
data02 = batch02.project_data()

# Concatenate along axis 0 (rows)
data = np.concatenate((data01, data02), axis=0)

# Convert to AnnData
adata = ad.AnnData(data)

# Name the variables (genes) and observations (cells)
adata.var_names = [f"Gene_{i}" for i in range(adata.n_vars)]

obs01 = [f"B01_Cell_{i}" for i in range(batch01.n_cells)]
obs02 = [f"B02_Cell_{i}" for i in range(batch02.n_cells)]
adata.obs_names = np.concatenate((obs01, obs02), axis=0)

# Add some annotations to 'obs' (i.e., cells)
tech01 = pd.Categorical([batch01.batch_id] * batch01.n_cells)
tech02 = pd.Categorical([batch02.batch_id] * batch02.n_cells)
adata.obs["tech"] = np.concatenate((tech01, tech02), axis=0)

celltype01 = pd.Categorical(batch01.components)
celltype02 = pd.Categorical(batch02.components)
adata.obs["celltype"] = np.concatenate((celltype01, celltype02), axis=0)

# Write the annotated data to disk in h5ad format
adata.write("simulated_data.h5ad")

Adjust the means, standard deviations, component weights, `n_cells`, and `n_genes` to generate synthetic data for your own use case.
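The `Batch` class itself is not shown in this README, so the following is only a minimal sketch of the two steps the example relies on. It assumes that `generate_data()` draws each cell's component according to `weights` and then samples independent x and y coordinates, and that `project_data()` multiplies the 2-D points by `proj`. The function names `sample_mixture` and `project`, and the per-gene Gaussian offset used as the batch effect, are hypothetical and may differ from the repository's implementation.

```python
import numpy as np


def sample_mixture(rng, x_mu, x_sd, y_mu, y_sd, weights, n_cells):
    """Draw n_cells points from a 2-D Gaussian mixture with independent x and y."""
    # Pick a component index for every cell according to the mixture weights.
    components = rng.choice(len(weights), size=n_cells, p=weights)
    # Sample each coordinate from the normal distribution of the chosen component.
    x = rng.normal(x_mu[components], x_sd[components])
    y = rng.normal(y_mu[components], y_sd[components])
    return np.column_stack((x, y)), components


def project(rng, points, proj, gene_specific=False):
    """Map (n_cells, 2) points to (n_cells, n_genes) using an (n_genes, 2) matrix."""
    data = points @ proj.T
    if gene_specific:
        # Assumed batch effect: one random offset per gene, shared by all cells.
        data = data + rng.normal(size=proj.shape[0])
    return data


rng = np.random.default_rng(0)
x_mu, x_sd = np.array([0, 5, 5]), np.array([1, 0.1, 1])
y_mu, y_sd = np.array([5, 5, 0]), np.array([1, 0.1, 1])
weights = np.array([0.30, 0.50, 0.20])
proj = rng.standard_normal((100, 2))

points, components = sample_mixture(rng, x_mu, x_sd, y_mu, y_sd, weights, 1000)
data = project(rng, points, proj, gene_specific=True)
print(points.shape, data.shape)  # (1000, 2) (1000, 100)
```

Passing the same `proj` to both batches, as the example above does, keeps the cell types in a shared gene space, so under these assumptions the only systematic difference between batches is the gene-specific offset.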

Figures

Batch 1 has 3 components with proportions 0.30, 0.50, and 0.20.

[Figure: Batch 1]

Batch 2 has 3 components with proportions 0.65, 0.30, and 0.05.

[Figure: Batch 2]
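Because components are sampled at random, the fractions observed in a finite batch only approximate the requested weights. A minimal sketch of that check follows, assuming component labels are integer indices drawn according to `weights`. The helper `component_fractions` is hypothetical and not part of the repository.

```python
import numpy as np


def component_fractions(labels, n_components):
    """Return the fraction of cells assigned to each component."""
    counts = np.bincount(labels, minlength=n_components)
    return counts / counts.sum()


rng = np.random.default_rng(0)
n_cells = 1000
for name, weights in [("Batch_01", [0.30, 0.50, 0.20]),
                      ("Batch_02", [0.65, 0.30, 0.05])]:
    # Draw one component label per cell, then compare with the target weights.
    labels = rng.choice(len(weights), size=n_cells, p=weights)
    print(name, component_fractions(labels, len(weights)))
```

With a weight of 0.05, the third component of Batch 2 is expected to contain only about 50 of the 1000 cells, so its observed fraction fluctuates the most in relative terms.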

References

[1] Haghverdi, Laleh, et al. "Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors." Nature Biotechnology 36.5 (2018): 421-427.
