ttrenty · Jun 29, 2025
diff --git a/Makefile b/Makefile
@@ -2,9 +2,9 @@
 
 .PHONY: all
 all:
-	# @pixi run test
 	@pixi run main
-	@pixi run bench
+	@pixi run test
+	# @pixi run bench
 
 .PHONY: %
 %:

diff --git a/README.md b/README.md
@@ -2,49 +2,107 @@
 
 **A Quantum Circuit Composer & Simulator in Mojo** 🔥⚛️
 
+QLabs is a quantum circuit simulation library implemented in Mojo, designed for educational purposes and high-performance quantum circuit simulation.
 
-## Education 
+## 🎓 Educational Purposes
 
-This project reimplements and extends the ideas from the following tutorial paper:
+### 🎯 Project Objectives
+
+- **Mojo Implementation**: Re-implement the approach from the referenced paper [1] in Mojo for a Pythonic syntax and enhanced readability.
+- **Learning by Doing**: Gain hands-on experience with quantum circuit simulation to understand the capabilities and limitations of classical simulation.
+- **Performance & Safety**: Leverage Mojo's strong static typing and compilation for blazing-fast and safe operations.
+- **Hardware Acceleration**: Utilize Mojo’s universal GPU programming support to accelerate simulations.
+
+### 🛠️ Implementations
+
+- ✅ **State Vector and Gate Circuit Implementations**
+  - Low-level: See `examples/low_level.mojo` using `qlabs.base` tools.
+  - High-level: See `examples/circuit_level.mojo` using `qlabs.abstractions` tools.
+- ✅ **Partial GPU Support (Cross-Platform: NVIDIA/AMD)**
+  - Low-level: See `examples/gpu_low_level.mojo` using `qlabs.base` and `qlabs.base.gpu` tools.
+  - High-level: See `examples/circuit_level.mojo` using `qlabs.abstractions` tools.
+- ✅ **Quantum States Statistics Calculation**
+- ☐ **Qubit Measurements** (Access to the full State Vector is already available, providing even more flexibility)
+- ☐ **Continuous Statistics Tracking During Circuit Execution**
+- ☐ **Gradient Computations**
+- ☐ **Tensor Network Implementation**
+
+[1] This project reimplements and extends the ideas from the following tutorial paper:
 
 > **How to Write a Simulator for Quantum Circuits from Scratch: A Tutorial**  
 > *Michael J. McGuffin, Jean-Marc Robert, and Kazuki Ikeda*  
 > Published: 2025-06-09 on [arXiv:2506.08142v1](https://arxiv.org/abs/2506.08142v1) (last accessed: 2025-06-12)
 
-###  🎯 Project Objectives
-
-* **Mojo Implementation:** Re-implement the approach from the paper in Mojo for more Pythonic synthax and better readability.
-* **Learning by Doing:** Gain hands-on experience with quantum circuit simulation to better understand the capabilities and limitations of classical simulation.
-* **Performance & Safety:** Leverage Mojo's strong static typing and compilation for blazing-fast and safe operations.
-* **Hardware Acceleration:** Utilize Mojo’s universal GPU programming support to accelerate simulations.
-
-### 🔥 Current Implementation
+### 🔥 Library Performance
 
-The current implementation uses a State Vector approach, which is an efficient method for simulating small-scale quantum circuits (20–30 qubits) with high precision. This approach also enables relatively straightforward exact gradient computations.
+Mojo's compilation gives high-speed execution, and GPU acceleration yields up to a 100x speedup over the CPU at larger qubit counts.
 
-An alternative implementation for the futur could be using the Tensor Network approach. This method is more suitable for larger circuits but offers lower precision and would involves more computationally expensive gradient calculations.
+![Benchmark Results](img/benchmark_H100.png)
 
-## Usage
+## 🚀 Usage
 
-### ⚙️ Environment Setup
+To get started, you need to install `pixi` to manage project dependencies.
 
-Follow these steps to set up your environment, build the library and run some examples:
+### 📦 Install Pixi
 
-If you don't have Pixi installed yet:
 ```bash
 curl -sSf https://pixi.sh/install.sh | bash
 ```
-Install all project dependencies:
-```
-pixi install
-```
 
-Build and run examples of the simulator:
+### ⚙️ Main Commands
+
+Run the following commands using `pixi run` or `make`:
+
 ```bash
-pixi run main
+pixi run format   # Format the repository's Mojo code with the Mojo formatter
+pixi run package  # Compile the qlabs package into a .mojopkg file in build/
+pixi run test     # Run all tests in tests/
+pixi run main     # Run all example files in examples/
+pixi run bench    # Run all benchmarks as defined in benchmarks/main.mojo
+pixi run plot     # Run benchmarks and plot their results in data/
 ```
 
+### 🧑‍💻 Example: Quantum Circuit with `qlabs.abstractions`
+
+```python
+from qlabs.base import StateVector, Hadamard, SWAP, NOT, PauliY, PauliZ
+from qlabs.abstractions import GateCircuit, StateVectorSimulator, ShowAfterEachGate
+
+num_qubits = 3
+qc = GateCircuit(num_qubits)
+
+qc.apply_gates(
+    Hadamard(0),
+    SWAP(0, 2),
+    NOT(1, anti_controls=[2]),
+    NOT(0, controls=[1]),
+    PauliY(0),
+    SWAP(1, 2, controls=[0]),
+    PauliZ(1),
+)
+
+print("Quantum circuit created:\n", qc)  # Visualization not fully implemented
+# Expected output:
+# --|H|---x--------|X|--|Y|---*-------
+#         |         |         |
+# --------|---|X|---*---------x--|Z|--
+#         |    |              |
+# --------x----o--------------x-------
+
+qsimu = StateVectorSimulator(
+    qc,
+    initial_state=StateVector.from_bitstring("0" * num_qubits),
+    use_gpu_if_available=True,  # GPU support not fully implemented
+    verbose=True,
+    verbose_step_size=ShowAfterEachGate,  # Options: ShowAfterEachGate, ShowOnlyEnd
+)
+
+final_state = qsimu.run()
+
+print("Final quantum state:\n", final_state)
+print("Normalised purity of qubit 0 (the top one):", final_state.normalised_purity([0]))
+```
 
 ## 📄 License
 
-This project is open-source and licensed under Apache License 2.0.
+This project is open-source and licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Contributions are welcome!
diff --git a/TODOs.md b/TODOs.md
@@ -4,39 +4,26 @@
 
 ### Implementations
 
-- 5 / 5 : Start adding support for GPU in the base classes if needed (not possible to use SIMD(ComplexFloat32) anymore, or keep them but seperate them when moving data to GPU)
-    - struct StateVector
-    - struct ComplexMatrix
-    - struct Gate
-
-- 5 / ? : GPU implementation of:
-    - qubit_wise_multiply()
+- 5 / 4 : GPU implementation of:
+    - qubit_wise_multiply() (with different types of control gates and for multiple qubits)
     - apply_swap()
     - partial_trace()
     - StateVector.to_density_matrix()
 
-- 4 / 3 : Export benchmark results as plots.
-
 - 2 / 4 : Efficient support for tracking a state statistic like entropy during the execution of the circuit by the simulator.
 
-- 3 / 3 : Implement naive implementation of the functions to compare performances
-    - matrix multiplication (but starting from right or smart)
-    - partial trace
-
 ### Tests
 
 - 5 / 2 : Test qubit_wise_multiply_extended() that can take multiple qubits gates (2 and more, iSWAP for example)
 
 - 5 / 2 : Test for everything that will be implement in GPU
-    - qubit_wise_multiply()
+    - qubit_wise_multiply() (with different types of control gates and for multiple qubits)
     - apply_swap()
-    - struct StateVector's methods
-    - struct ComplexMatrix's methods
-    - struct Gate's Gate
+    - partial_trace()
 
 ### Benchmarks
 
-- 3 / 2 : Reproduce table from page 10
+- 3 / 2 : partial_trace() Reproduce table from page 10
 
 ## Droped for now
 
@@ -60,3 +47,7 @@
 - 2 / 4 : qubit_wise_multiply_extended() but for gates applied to non-adjacent qubits
 
 - 2 / 3 : Implement concurence (2-qubits entanglement metric) computePairwiseQubitConcurrences()
+
+- 1 / 3 : Implement naive implementation of the functions to compare performances
+    - matrix multiplication (but starting from right or smart)
+    - partial trace
diff --git a/benchmarks/all_benchmarks.mojo → benchmarks/bench_main.mojo b/benchmarks/all_benchmarks.mojo → benchmarks/bench_main.mojo
@@ -15,29 +15,29 @@ def main():
     print("Running all benchmarks...")
     # bench_qubit_wise_multiply()
     bench_qubit_wise_multiply_inplace[
-        min_number_qubits=5,
-        max_number_qubits=25,
-        number_qubits_step_size=2,
+        min_number_qubits=1,
+        max_number_qubits=20,
+        number_qubits_step_size=1,
         min_number_layers=5,
-        max_number_layers=4000,
+        max_number_layers=3500,
         number_layers_step_size=400,
-        fixed_number_qubits=11,
-        fixed_number_layers=20,
+        fixed_number_qubits=13,
+        fixed_number_layers=5,
     ]()
 
     @parameter
     if not has_accelerator():
         print("No compatible GPU found")
     else:
-        bench_qubit_wise_multiply_inplace[
-            min_number_qubits=5,
-            max_number_qubits=25,
-            number_qubits_step_size=2,
+        bench_qubit_wise_multiply_inplace_gpu[
+            min_number_qubits=1,
+            max_number_qubits=26,  # 29 is OOM for my 3070 Ti Laptop GPU
+            number_qubits_step_size=1,
             min_number_layers=5,
-            max_number_layers=4000,
+            max_number_layers=7000,
             number_layers_step_size=400,
-            fixed_number_qubits=11,
-            fixed_number_layers=20,
+            fixed_number_qubits=13,
+            fixed_number_layers=5,
         ]()
 
     # bench_qubit_wise_multiply_extended()
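
Note on the qubit limits above: a state vector over n qubits holds 2^n complex amplitudes, so memory doubles with each added qubit. The following is a rough sketch of that estimate, not code from this repository. It assumes 32-bit floats for the separate real and imaginary tensors and two state buffers (input and output) resident at once, as the GPU benchmark appears to use; `state_vector_bytes` and its parameters are hypothetical names.

```python
def state_vector_bytes(num_qubits, bytes_per_real=4, num_buffers=2):
    """Estimate the memory needed to hold the state buffers.

    Assumptions (not confirmed by the source): float32 components,
    separate real and imaginary arrays, and `num_buffers` full copies
    of the state (for example one input and one output buffer).
    """
    amplitudes = 1 << num_qubits  # 2**num_qubits complex amplitudes
    components_per_amplitude = 2  # real part + imaginary part
    return amplitudes * components_per_amplitude * bytes_per_real * num_buffers


if __name__ == "__main__":
    for n in (20, 26, 29):
        gib = state_vector_bytes(n) / 2**30
        print(f"{n} qubits: about {gib:.3f} GiB for the state buffers")
```

Under these assumptions, 26 qubits need 1 GiB and 29 qubits need 8 GiB for the state buffers alone, which is consistent with the out-of-memory comment on `max_number_qubits`.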

diff --git a/benchmarks/bench_qubit_wise_multiply_gpu.mojo b/benchmarks/bench_qubit_wise_multiply_gpu.mojo
@@ -244,7 +244,12 @@ fn benchmark_qubit_wise_multiply_inplace_gpu[
             for qubit in range(num_qubits):
                 if current_state == 0:
                     ctx.enqueue_function[
-                        qubit_wise_multiply_inplace_gpu[number_control_bits=0]
+                        qubit_wise_multiply_inplace_gpu[
+                            state_vector_size=state_vector_size,
+                            gate_set_size=gate_set_size,
+                            circuit_number_control_gates=circuit_number_control_gates,
+                            number_control_bits=0,
+                        ]
                     ](
                         gate_set_re_tensor,
                         gate_set_im_tensor,
@@ -259,7 +264,6 @@ fn benchmark_qubit_wise_multiply_inplace_gpu[
                         quantum_state_re_tensor,
                         quantum_state_im_tensor,
                         num_qubits,  # number_qubits
-                        state_vector_size,  # quantum_state_size
                         quantum_state_out_re_tensor,
                         quantum_state_out_im_tensor,
                         control_bits_circuit_tensor,
@@ -270,7 +274,12 @@ fn benchmark_qubit_wise_multiply_inplace_gpu[
                     current_state = 1
                 else:
                     ctx.enqueue_function[
-                        qubit_wise_multiply_inplace_gpu[number_control_bits=0]
+                        qubit_wise_multiply_inplace_gpu[
+                            state_vector_size=state_vector_size,
+                            gate_set_size=gate_set_size,
+                            circuit_number_control_gates=circuit_number_control_gates,
+                            number_control_bits=0,
+                        ]
                     ](
                         gate_set_re_tensor,
                         gate_set_im_tensor,
@@ -285,7 +294,6 @@ fn benchmark_qubit_wise_multiply_inplace_gpu[
                         quantum_state_out_re_tensor,
                         quantum_state_out_im_tensor,
                         num_qubits,  # number_qubits
-                        state_vector_size,  # quantum_state_size
                         quantum_state_re_tensor,
                         quantum_state_im_tensor,
                         control_bits_circuit_tensor,
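
Note on the two branches above: the benchmark alternates which pair of tensors is the input and which is the output on every launch (the `current_state` toggle), so no copy is needed between gates. The following is a minimal pure-Python sketch of that ping-pong pattern combined with a single-qubit gate update. It is an illustration under stated assumptions, not the repository's kernel: the function names are hypothetical, qubit 0 is assumed to be the least significant bit of the amplitude index, and control bits are left out.

```python
import math


def qubit_wise_multiply(gate, target, src, dst):
    """Apply a 2x2 gate to qubit `target`, reading from src and writing to dst.

    Assumes qubit 0 maps to the least significant bit of the index.
    """
    stride = 1 << target
    for i in range(len(src)):
        if i & stride:
            continue  # handle each (i, j) amplitude pair once, from its lower index
        j = i | stride
        a0, a1 = src[i], src[j]
        dst[i] = gate[0][0] * a0 + gate[0][1] * a1
        dst[j] = gate[1][0] * a0 + gate[1][1] * a1


def apply_layer(gate, num_qubits, state):
    """Apply `gate` to every qubit, swapping input and output buffers each step."""
    buffers = [list(state), [0j] * len(state)]
    current = 0  # index of the buffer holding the latest state
    for qubit in range(num_qubits):
        qubit_wise_multiply(gate, qubit, buffers[current], buffers[1 - current])
        current = 1 - current  # the output becomes the next input
    return buffers[current]


h = 1 / math.sqrt(2)
HADAMARD = [[h, h], [h, -h]]

num_qubits = 3
state = [0j] * (1 << num_qubits)
state[0] = 1 + 0j  # start in |000>
final_state = apply_layer(HADAMARD, num_qubits, state)
```

Applying a Hadamard to every qubit of |000> should give a uniform superposition, which makes this a convenient sanity check for the buffer swapping.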

diff --git a/benchmarks/bench_simulate_random_circuit.mojo b/benchmarks/bench_simulate_random_circuit.mojo
@@ -100,7 +100,6 @@ fn simulate_random_circuit[num_qubits: Int, number_layers: Int]() -> None:
     qsimu = StateVectorSimulator(
         qc,
         initial_state=initial_state,
-        optimisation_level=0,  # No optimisations for now
         verbose=False,
         # verbose_step_size=ShowAfterEachLayer,  # ShowAfterEachGate, ShowOnlyEnd
         verbose_step_size=ShowAfterEachGate,  # ShowAfterEachGate, ShowOnlyEnd

diff --git a/benchmarks/plot_results.py b/benchmarks/plot_results.py
@@ -39,7 +39,6 @@ def process_benchmark_data(filepath):
 layers_cpu_df = process_benchmark_data("data/qubit_wise_multiply_inplace_layers.csv")
 qubits_cpu_df = process_benchmark_data("data/qubit_wise_multiply_inplace_qubits.csv")
 
-
 # --- 3. Plotting ---
 
 # Create a figure with two subplots side-by-side
@@ -48,21 +47,26 @@ def process_benchmark_data(filepath):
 
 # Plot 1: Performance vs. Number of Layers
 ax1.plot(
-    layers_cpu_df["layers"],
+    layers_cpu_df["layers"] * layers_cpu_df["qubits"][0],
     layers_cpu_df["time_ms"],
     marker="o",
     linestyle="-",
     label="CPU",
 )
 ax1.plot(
-    layers_gpu_df["layers"],
+    layers_gpu_df["layers"]
+    * layers_cpu_df["qubits"][0],  # Scale x-axis by number of qubits
     layers_gpu_df["time_ms"],
     marker="s",
     linestyle="--",
     label="GPU",
 )
-ax1.set_title("Performance vs. Number of Layers (13 Qubits)")
-ax1.set_xlabel("Number of Layers")
+ax1.set_title(
+    f"Execution Time vs. Number of Layers\n({layers_cpu_df['qubits'][0]} Qubits)"
+)
+ax1.set_xlabel(
+    f"Number of Gates\n(Number of Layers x {layers_cpu_df['qubits'][0]} Qubits)"
+)
 ax1.set_ylabel("Mean Execution Time (ms)")
 ax1.legend()
 ax1.grid(True, linestyle="--", alpha=0.6)
@@ -82,13 +86,21 @@ def process_benchmark_data(filepath):
     linestyle="--",
     label="GPU",
 )
-ax2.set_title("Performance vs. Number of Qubits (20 Layers)")
+ax2.set_title(
+    f"Execution Time vs. Number of Qubits\n({qubits_cpu_df['layers'][0]} Layers)"
+)
 ax2.set_xlabel("Number of Qubits")
 # We can make the y-axis a log scale if the values vary widely
 ax2.set_ylabel("Mean Execution Time (ms) - Log Scale")
 ax2.set_yscale("log")  # Use a logarithmic scale to better see the differences
 ax2.legend()
 ax2.grid(True, which="both", linestyle="--", alpha=0.6)
+# set the x-ticks to be the x-values of the qubits
+unique_cpu_qubits = qubits_cpu_df["qubits"].unique()
+unique_gpu_qubits = qubits_gpu_df["qubits"].unique()
+unique_qubits = sorted(set(unique_cpu_qubits) | set(unique_gpu_qubits))
+ax2.set_xticks(unique_qubits)
+ax2.set_xticklabels(unique_qubits, rotation=45)
 
 
 # Adjust layout to prevent labels from overlapping
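
Note on the plotting changes above: the first plot now scales the x-axis from layers to gates, and the second plot merges the CPU and GPU qubit values into one tick list because the two sweeps cover different ranges. The sketch below restates both steps as small pure-Python helpers. The helper names and sample values are hypothetical, and it assumes one gate per qubit per layer, as the new x-axis label suggests.

```python
def gate_counts(layers, num_qubits):
    """Convert layer counts to total gate counts.

    Assumes each layer applies exactly one gate to every qubit.
    """
    return [n * num_qubits for n in layers]


def merged_ticks(cpu_qubits, gpu_qubits):
    """Return the sorted union of qubit counts seen in either sweep."""
    return sorted(set(cpu_qubits) | set(gpu_qubits))


# Hypothetical sample values for illustration only.
sample_layers = [5, 405, 805]
sample_gates = gate_counts(sample_layers, num_qubits=13)

cpu_sweep = range(1, 21)  # CPU sweep up to 20 qubits
gpu_sweep = range(1, 27)  # GPU sweep up to 26 qubits
ticks = merged_ticks(cpu_sweep, gpu_sweep)
```

Taking the union keeps every measured qubit count on the axis even when one device stops earlier than the other.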