Core Concepts
This page explains the core concepts behind QDP (Quantum Data Plane): the architecture, the supported encoding methods, GPU memory management, DLPack zero-copy integration, and key performance characteristics.
1. Architecture overview
At a high level, QDP is organized as layered components:
- Python API (Qumat): `qumat.qdp` exposes a friendly Python entry point (`QdpEngine`).
- Python native extension (PyO3): `qdp/qdp-python` builds the `_qdp` module, bridging Python ↔ Rust and implementing the Python-side DLPack protocol (`__dlpack__`).
- Rust core: `qdp/qdp-core` contains the engine (`QdpEngine`), encoder implementations, IO readers, GPU pipelines, and DLPack export.
- CUDA kernels: `qdp/qdp-kernels` provides the CUDA kernels invoked by the Rust encoders.
Data flow (conceptual):

```text
Python (qumat.qdp) → _qdp (PyO3) → qdp-core (Rust) → qdp-kernels (CUDA)
      │                                  │
      └── torch.from_dlpack(qtensor) ◄───┴── DLPack DLManagedTensor (GPU ptr)
```
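To make the flow concrete, here is a minimal sketch of the Python side. The exact `QdpEngine` constructor and `encode` signature below are assumptions for illustration, not the verified API.

```python
import torch
from qumat.qdp import QdpEngine

# Hypothetical constructor and encode() signature -- for illustration only.
engine = QdpEngine()
qtensor = engine.encode([0.6, 0.8], num_qubits=1, encoding_method="amplitude")

# PyTorch wraps the GPU state vector via DLPack without copying it.
state = torch.from_dlpack(qtensor)
```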
2. What QDP produces: a GPU-resident state vector
QDP encodes classical data into a state vector for `n` qubits.

- State length: `2^n`
- Type: complex numbers (on GPU)
- Shape exposed via DLPack:
  - Single sample: `[1, 2^n]` (always 2D)
  - Batch: `[batch_size, 2^n]`

QDP supports two output precisions:

- complex64 (2×float32) when the output precision is `float32`
- complex128 (2×float64) when the output precision is `float64`
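The shapes above are easy to compute directly. This helper is a plain-Python sketch for sanity checking, not part of QDP's API.

```python
def dlpack_shape(num_qubits: int, batch_size: int | None = None) -> tuple[int, int]:
    """Shape QDP exposes via DLPack: [1, 2^n] for one sample, [batch_size, 2^n] for a batch."""
    dim = 2 ** num_qubits
    return (1, dim) if batch_size is None else (batch_size, dim)

assert dlpack_shape(3) == (1, 8)                  # single sample is always 2D
assert dlpack_shape(3, batch_size=64) == (64, 8)
```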
3. Encoder selection and validation
QDP uses a strategy pattern for encoding methods:
- `encoding_method` is a string, e.g. `"amplitude"`, `"basis"`, `"angle"`
- QDP maps it to a concrete encoder at runtime
All encoders perform input validation (at minimum):
- input is not empty
- for vector-based encodings: `len(data) <= 2^n`
Note: `2^n` is already very large for moderate `n`: the output state for a single sample alone is `2^n` complex numbers.
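As a minimal sketch, the checks above amount to the following (plain Python for illustration; the real validation lives in the Rust encoders):

```python
def validate_vector_input(data: list[float], num_qubits: int) -> None:
    """Illustrative mirror of the minimum checks described above."""
    if len(data) == 0:
        raise ValueError("input must not be empty")
    if len(data) > 2 ** num_qubits:
        raise ValueError(
            f"len(data)={len(data)} exceeds 2^{num_qubits} = {2 ** num_qubits}"
        )
```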
4. Encoding methods (amplitude, basis, angle)
4.1 Amplitude encoding
Goal: represent a real-valued feature vector `x` as quantum amplitudes:

|ψ⟩ = (1 / ‖x‖) · Σᵢ xᵢ |i⟩
Key properties in QDP:
- Normalization: QDP computes the L2 norm `||x||` and rejects zero-norm inputs.
- Padding: if `len(x) < 2^n`, the remaining amplitudes are treated as zeros.
- GPU execution: the normalization and the write into the GPU state vector are performed by CUDA kernels (see the CPU sketch after this list).
- Batch support: amplitude encoding supports a batch path to reduce kernel launch / allocation overhead (recommended when encoding many samples).
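The following NumPy sketch mirrors the normalize-and-pad behavior on the CPU. It is a reference for understanding the semantics, not QDP's actual CUDA implementation.

```python
import numpy as np

def amplitude_encode_reference(x, num_qubits: int) -> np.ndarray:
    """CPU reference for amplitude encoding (illustrative; QDP runs this on the GPU)."""
    dim = 2 ** num_qubits
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0 or x.size > dim:
        raise ValueError("need 0 < len(x) <= 2^num_qubits")
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("zero-norm input cannot be normalized")
    state = np.zeros(dim, dtype=np.complex128)  # complex128 = 2 x float64
    state[: x.size] = x / norm                  # normalize; the tail stays zero (padding)
    return state.reshape(1, dim)                # DLPack shape: [1, 2^n]
```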
When to use it:
- You have dense real-valued vectors and want a direct amplitude representation.
Trade-offs:
- Output size grows exponentially with `num_qubits` (the state has `2^n` amplitudes), so it can become memory-heavy quickly.
4.2 Basis encoding
Goal: map an integer index into a computational basis state `|index⟩`.

For `n` qubits with `0 <= index < 2^n`:

- `amplitude[index] = 1`
- `amplitude[j] = 0` for all `j != index`
Key properties in QDP:
- Input shape:
  - single sample expects exactly one value: `[index]`
  - batch expects one index per sample (effectively `sample_size = 1`)
- Range checking: indices must be valid for the chosen `num_qubits`
- GPU execution: a kernel sets the one-hot amplitude at the requested index (see the sketch after this list)
When to use it:
- Your data naturally represents discrete states (categories, token IDs, hashed features, etc.).
4.3 Angle encoding (planned)
Angle encoding typically maps features to rotation angles (e.g., via `RX`, `RY`, or `RZ` rotations) and constructs a state by applying rotations across qubits.
Current status in this codebase:
- The `"angle"` encoder exists as a placeholder and returns an error stating it is not implemented yet.
- Use `"amplitude"` or `"basis"` for now.
5. GPU memory management
QDP is designed to keep the encoded states on the GPU and to avoid unnecessary allocations/copies where possible.
5.1 Output state vector allocation
For each encoded sample, QDP allocates a state vector of size `2^n`. Memory usage grows exponentially:

- complex128 uses 16 bytes per element
- rough output size (single sample) is:
  - `2^n × 16` bytes for complex128
  - `2^n × 8` bytes for complex64
QDP performs pre-flight checks before large allocations to fail fast with an OOM-aware message (e.g., suggesting a smaller `num_qubits` or batch size).
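A quick back-of-the-envelope helper for these sizes (plain Python, not part of QDP):

```python
def state_bytes(num_qubits: int, precision: str = "float64", batch_size: int = 1) -> int:
    """Rough output size: 2^n elements x 16 B (complex128) or 8 B (complex64)."""
    bytes_per_element = 16 if precision == "float64" else 8
    return batch_size * (2 ** num_qubits) * bytes_per_element

# A single complex128 sample at 30 qubits is already ~17 GB, which is
# exactly the kind of allocation the pre-flight check rejects early.
assert state_bytes(30) == 2**30 * 16    # 17,179,869,184 bytes
```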
5.2 Pinned host memory and streaming pipelines
For high-throughput IO → GPU workflows (especially streaming from Parquet), QDP uses:
- Pinned host buffers (page-locked memory) to speed up host↔device transfers.
- Double buffering (ping-pong) so one buffer can be filled while another is being processed.
- Device staging buffers (for streaming) so that copies and compute can overlap.
Streaming Parquet encoding is implemented as a producer/consumer pipeline:
- a background IO thread reads chunks into pinned host buffers
- the GPU side processes each chunk while IO continues
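The shape of that pipeline, as a toy Python sketch (the real implementation lives in `qdp-core` in Rust, using pinned buffers and CUDA streams; `read_chunk` and `encode_chunk` are hypothetical callables):

```python
import queue
import threading

def stream_encode(read_chunk, encode_chunk, num_chunks: int) -> None:
    """Toy producer/consumer pipeline (illustrative only)."""
    buffers: queue.Queue = queue.Queue(maxsize=2)  # two in-flight chunks: ping-pong

    def producer() -> None:
        for i in range(num_chunks):
            buffers.put(read_chunk(i))   # background IO thread fills host buffers
        buffers.put(None)                # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    while (chunk := buffers.get()) is not None:
        encode_chunk(chunk)              # GPU work proceeds while IO continues
```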
In the current implementation, streaming Parquet supports:

- `"amplitude"`
- `"basis"`

(`"angle"` is not supported for streaming yet.)
5.3 Asynchronous copy/compute overlap (dual streams)
For large workloads, QDP uses multiple CUDA streams and CUDA events to overlap:
- H2D copies (copy stream)
- kernel execution (compute stream)
This reduces time spent waiting on PCIe transfers and can improve throughput substantially for large batches.
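For intuition, here is the same overlap pattern expressed with PyTorch streams. QDP implements the equivalent in Rust/CUDA inside `qdp-core`; this snippet is only an analogy.

```python
import torch

copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

host = torch.randn(1 << 20, pin_memory=True)   # pinned host buffer
device = torch.empty(1 << 20, device="cuda")

with torch.cuda.stream(copy_stream):
    device.copy_(host, non_blocking=True)      # H2D copy on the copy stream

compute_stream.wait_stream(copy_stream)        # event: compute waits for the copy
with torch.cuda.stream(compute_stream):
    result = device * 2.0                      # kernel runs on the compute stream

torch.cuda.synchronize()
```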
6. DLPack zero-copy integration
QDP exposes results using the DLPack protocol, which allows frameworks like PyTorch to consume GPU memory without copying.
Conceptually:
- Rust allocates GPU memory for the state vector.
- Rust wraps it into a DLPack `DLManagedTensor`.
- Python returns an object that implements `__dlpack__`.
- PyTorch calls `torch.from_dlpack(qtensor)` and takes ownership via DLPack's deleter.
Important details:
- The returned DLPack capsule is single-consume (can only be used once). This prevents double-free bugs.
- Memory lifetime is managed safely via reference counting on the Rust side, and freed by the DLPack deleter when the consumer releases it.
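In practice, consuming the result looks like this (assuming `qtensor` is an object returned by a QDP encode call):

```python
import torch

# qtensor implements __dlpack__; from_dlpack wraps the existing GPU buffer.
state = torch.from_dlpack(qtensor)   # zero-copy: no device transfer happens
assert state.is_cuda

# Per the single-consume rule above, the underlying capsule can only be
# used once; consuming it a second time raises instead of double-freeing.
```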
7. Performance characteristics and practical guidance
7.1 What makes QDP fast
- GPU kernels replace circuit-based state preparation for the supported encodings.
- Batch APIs reduce allocation and kernel launch overhead.
- Streaming pipelines overlap IO and GPU compute.
7.2 Choosing parameters wisely
- Prefer batch encoding when encoding many samples (lower overhead, better GPU utilization).
- Keep `num_qubits` realistic. Output size is `2^n` and becomes the dominant cost quickly.
- Pick the right encoding:
- amplitude: dense real-valued vectors
- basis: discrete indices / categorical states
- angle: planned, not implemented yet in this version
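For example, a batched workflow might look like the following. The `encode_batch` name and signature are assumptions for illustration, not the verified API.

```python
import torch
from qumat.qdp import QdpEngine

engine = QdpEngine()                       # hypothetical constructor
samples = [[0.1, 0.2, 0.3, 0.4]] * 1024    # many small dense vectors

# One batched call instead of 1024 single-sample calls:
# fewer allocations and kernel launches, better GPU utilization.
qtensor = engine.encode_batch(samples, num_qubits=2, encoding_method="amplitude")
states = torch.from_dlpack(qtensor)        # shape: [1024, 4]
```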
7.3 Profiling
If you need to understand where time is spent (copy vs compute), QDP supports NVTX-based profiling. See `qdp/docs/observability/NVTX_USAGE.md`.