Batch Processing¶

What is batch processing?¶

By default, SecActPy loads the entire expression matrix into memory and runs ridge regression on all samples at once. This works well for most datasets, but for large-scale analyses (e.g., 50,000+ single cells or spatial spots) the memory required for permutation testing can exceed available RAM or GPU memory.

Batch processing splits the work into smaller pieces. The expensive projection matrix T = (X'X + λI)^{-1} X' is computed once from the signature, then samples are processed in chunks of batch_size at a time. Each chunk goes through the full permutation-testing pipeline independently, and partial results are concatenated at the end. The final output is mathematically identical to processing all samples at once — only peak memory usage is reduced. Y itself can also be streamed from disk in column blocks (see Streaming a large Y from disk), so even the input matrix need not fit in RAM.

All three high-level functions support batch_size and output_path: - secact_activity_inference() — bulk RNA-seq - secact_activity_inference_scrnaseq() — scRNA-seq - secact_activity_inference_st() — spatial transcriptomics

Set batch_size to enable it:

# Without batch processing: all samples at once (default)
result = secact_activity_inference(expr_df, ...)

# With batch processing: 5000 samples per chunk
result = secact_activity_inference(expr_df, ..., batch_size=5000)

# Works the same way for scRNA-seq and ST:
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)

In-memory vs streaming output¶

By default, batch results are accumulated in memory and returned as a dictionary of DataFrames — this is the in-memory mode. You get back a dict with result['zscore'], result['pvalue'], etc., just like the non-batched case.

For very large datasets, even the output matrices (beta, zscore, pvalue, se — each of shape n_proteins × n_samples) may not fit in memory. Streaming output solves this: set output_path to write each batch's results directly to an HDF5 file on disk as it completes. The function returns None in this mode — no results are held in memory. You load them back from the file when needed. All three high-level functions support this.

Mode	Parameter	Return value	Memory for output
In-memory (default)	`output_path=None`	`dict` of DataFrames	All results in RAM
Streaming	`output_path="results.h5ad"`	`None`	Only one batch at a time

# Streaming works with any high-level function:
secact_activity_inference(..., batch_size=5000, output_path="bulk_results.h5ad")
secact_activity_inference_scrnaseq(..., batch_size=5000, output_path="sc_results.h5ad")
secact_activity_inference_st(..., batch_size=5000, output_path="st_results.h5ad")

Streaming a large Y from disk (read streaming)¶

The section above streams results out. ridge_batch can also stream the expression matrix Y in: pass an on-disk source instead of an in-memory array and it is read one column block at a time, so the full n_genes × n_samples matrix is never resident. This removes the memory floor where a dense Y would otherwise have to fit in RAM. Combine it with output_path for end-to-end read + write streaming — neither the input nor the output is ever held in full.

Supported on-disk sources (all oriented genes × samples):

Source	Reader	Notes
`.npy` (dense)	`DenseChunkReader` (memmap)	lazy; avoids a full float64 copy
HDF5 dense dataset	`DenseChunkReader` (slab read)	hard peak-RSS bound
HDF5 CSC group (sparse)	`SparseChunkReader` (column slab)	native lazy column reads
`.npz` (scipy sparse)	`SparseChunkReader`	loaded fully in RAM (convenience)

from secactpy.batch import ridge_batch

# Dense Y on disk (HDF5 dataset or .npy) — streamed in column blocks:
result = ridge_batch(X, "Y.h5", batch_size=200_000)          # or "Y.npy"

# Sparse Y on disk (HDF5 CSC group / .npz) — in-flight column normalization:
result = ridge_batch(X, "Y_csc.h5", batch_size=200_000,
                     sparse_mode=True, col_center=True, col_scale=True)

# End-to-end read + write streaming — Y in, results out, nothing held in RAM:
ridge_batch(X, "Y.h5", batch_size=200_000, output_path="results.h5ad")

# HDF5 with multiple datasets: name the one to read
result = ridge_batch(X, "atlas.h5", y_dataset_key="expression", batch_size=200_000)

Results are bit-identical to running the same Y in memory — the streaming loop reuses the in-memory per-block kernels, and sparse column statistics are per-cell so a single pass matches the sliced full-Y statistics exactly.

Memory note. An HDF5 dataset gives a hard resident-memory ceiling: each column slab is read into a transient array and freed per block. A .npy memmap is lazy and avoids a full float64 copy (handy when the on-disk dtype is float32), but the mapped pages accumulate as evictable page cache, so ru_maxrss still approaches the file size. Prefer HDF5 when you need a strict RSS bound.

Dense Y must be pre-scaled (as in the in-memory dense path); sparse Y gets in-flight centering/scaling via col_center / col_scale. This is the general-purpose path for an already-normalized Y. For the scRNA-seq / ST pipeline that reads a raw h5ad and does CPM → log2 → row-centering on the fly, use the two-pass streaming=True path in the next section instead.

Example: batch processing with `secact_activity_inference`¶

secact_activity_inference handles gene subsetting, z-score normalization, signature grouping, and row expansion automatically — you just pass your expression data and set batch_size.

# Download example data (788 OV CD4 T cells, 34 MB)
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad

from secactpy import secact_activity_inference
import anndata as ad

# Load multi-sample expression data
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")

# --- In-memory mode (default) ---
# Results are returned as a dict of DataFrames
result = secact_activity_inference(
    adata.to_df().T,         # genes × cells DataFrame
    is_differential=False,   # center by row means across samples
    batch_size=200,          # process 200 cells per batch
    verbose=True
)
print(result['zscore'].head())  # (proteins × cells) DataFrame

# --- Streaming mode ---
# Results are written to disk; function returns None
secact_activity_inference(
    adata.to_df().T,
    is_differential=False,
    batch_size=200,
    output_path="results.h5ad",       # write here instead of returning
    output_compression="gzip",        # compress on disk (default)
    verbose=True
)
# Load results back when needed:
import h5py
with h5py.File("results.h5ad", "r") as f:
    zscore = f['zscore'][:]           # NumPy array (proteins × cells)

Sparse mode for memory-efficient processing¶

By default, sparse Y matrices are converted to dense before matrix multiplication. For very large, highly sparse datasets (e.g., scRNA-seq with 100k+ cells at <5% density), this can require hundreds of GB of RAM.

Setting sparse_mode=True keeps Y sparse throughout the entire pipeline. Instead of densifying Y and computing T @ Y_dense, it uses the algebraic identity (Y.T @ T.T).T to perform the multiplication with Y in sparse format. Column z-scoring and row-mean centering are applied as lightweight corrections on the small output matrix, never on Y itself.

All three high-level functions support sparse_mode:

# scRNA-seq: sparse CPM → log2 → ridge, Y never densified
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="cell_type",
    is_single_cell_level=True,
    batch_size=5000,
    sparse_mode=True,   # keep Y sparse end-to-end
    verbose=True
)

# Spatial transcriptomics: same sparse pipeline
result = secact_activity_inference_st(
    adata,
    batch_size=5000,
    sparse_mode=True,
    verbose=True
)

When sparse_mode=True, the scrnaseq and ST functions bypass the standard dense normalization pipeline. CPM normalization and log2 transform are applied directly on the sparse matrix (both are zero-preserving), and row-mean centering + column z-scoring are handled in-flight by ridge_batch.

	Default (`sparse_mode=False`)	`sparse_mode=True`
Memory	Allocates full dense Y	Y stays sparse
Speed (<5% density)	Baseline	~1.8x faster
Speed (5-10% density)	Baseline	~25% slower
Results	Identical	Identical

Streaming H5AD for very large datasets (>5M cells)¶

Batch processing and sparse mode assume the full expression matrix fits in memory. For datasets with >5M cells (e.g., 6.5M-cell atlases), even the sparse matrix can exceed available RAM (60+ GB for the sparse data alone, 200+ GB peak after transposition, CPM normalization, and log2 transform).

Streaming H5AD (streaming=True) solves this by reading the H5AD file in chunks via h5py, never loading the full matrix. It uses a two-pass algorithm:

Pass 1 (Statistics): For each chunk of cells, read CSR rows from H5AD, transpose, CPM-normalize, log2-transform, subset to signature genes, and accumulate row sums, column sums, and column sum-of-squares. The chunk data is then discarded.
Between passes: Compute row means, the projection matrix T, and the permutation table (one-time cost, same as standard batch mode).
Pass 2 (Inference): Re-read each chunk, apply the same normalization, compute per-chunk cross terms (Y_chunk.T @ row_means), derive per-cell sigma, and feed sub-batches to the existing sparse batch inference functions. Results are written to disk via StreamingResultWriter.

Usage¶

from secactpy import secact_activity_inference_scrnaseq

# scRNA-seq: process 6.5M cells with ~3 GB peak memory
result = secact_activity_inference_scrnaseq(
    "cima_atlas.h5ad",               # must be a file path, not AnnData
    cell_type_col="cell_type",
    is_single_cell_level=True,       # required for streaming
    streaming=True,                  # enable two-pass chunk reading
    streaming_chunk_size=50_000,     # cells per chunk (default)
    output_path="cima_results.h5ad", # stream results to disk
    sparse_mode=True,                # keep chunks sparse (recommended)
    backend="auto",                  # GPU if available
    verbose=True,
)
# result is None when output_path is set; load with h5py or anndata

from secactpy import secact_activity_inference_st

# Spatial transcriptomics: same interface
result = secact_activity_inference_st(
    "large_spatial.h5ad",
    streaming=True,
    streaming_chunk_size=50_000,
    output_path="spatial_results.h5ad",
    verbose=True,
)

Requirements¶

adata / input_data must be a file path (str or Path), not an in-memory AnnData object
is_single_cell_level=True (scRNA-seq) or is_spot_level=True (ST)
The H5AD file must store X in sparse (CSR or CSC) format
h5py must be installed

Memory comparison¶

Component	Standard (5M cells)	Streaming (5M cells)
Full sparse X	~60 GB	0
Transpose + normalize	~120 GB peak	0
One chunk (50K cells)	—	~1 GB
Stats accumulators	—	~80 MB
T matrix + permutations	~25 MB	~25 MB
Peak	~200 GB	~3 GB

Low-level API¶

For advanced users, the streaming components are also available directly:

from secactpy import H5ADChunkReader, ridge_batch_streaming

# Read H5AD in chunks
with H5ADChunkReader("data.h5ad", chunk_size=50_000) as reader:
    print(f"Shape: {reader.n_obs} cells × {reader.n_vars} genes")
    var_names = reader.read_var_names()
    obs_names = reader.read_obs_names()
    for start, end, chunk_csr in reader.iter_chunks():
        print(f"  Chunk [{start}:{end}]: {chunk_csr.shape}")