Batch Processor Module

The batch_processor module is part of the QhX package and is designed for processing large datasets in parallel batches. It uses the DataManager for data loading and preprocessing, and the ParallelSolver for executing data processing tasks in parallel.

Overview

The module allows for the processing of data in specified batch sizes using a predetermined number of parallel workers. It aims to enhance processing efficiency when dealing with large datasets.

Functions

QhX.batch_processor.process_batches(batch_size, num_workers=25, start_i=0)

Processes data in parallel batches.

Parameters:
  • batch_size (int) – The number of data points to process in each batch.

  • num_workers (int, optional) – The number of parallel workers to use for processing. Defaults to 25.

  • start_i (int, optional) – The index from which to start processing the dataset. Defaults to 0.

This function orchestrates the batch processing workflow: it loads the dataset, optionally groups the data, splits it into batches of the given size, and processes each batch in parallel. Each batch is processed in its own directory to keep the results organized.
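The workflow above can be sketched as follows. This is a minimal illustration of the split-then-process pattern using the standard library, not the actual QhX implementation; the names split_into_batches, process_one_batch, and process_batches_sketch are hypothetical, and the real function delegates the parallel execution to ParallelSolver.

```python
# Hypothetical sketch of the batch workflow; not the actual QhX code.
from concurrent.futures import ThreadPoolExecutor

def split_into_batches(data, batch_size, start_i=0):
    """Yield successive batches of batch_size items, starting at index start_i."""
    for i in range(start_i, len(data), batch_size):
        yield data[i:i + batch_size]

def process_one_batch(batch):
    # Placeholder for the per-batch work; here we just count the items.
    return len(batch)

def process_batches_sketch(data, batch_size, num_workers=25, start_i=0):
    """Split data into batches and process each batch in parallel."""
    batches = list(split_into_batches(data, batch_size, start_i))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves batch order in the returned results
        return list(pool.map(process_one_batch, batches))
```

Note that start_i shifts where batching begins, so a run interrupted partway through can be resumed from a later index without reprocessing earlier records.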

Usage

The module can be used as a standalone script or imported into other scripts or modules within the QhX package. When executed as a script, it requires the batch size as a mandatory argument, with optional arguments for the number of workers and the starting index for processing.

python -m QhX.batch_processor 100 25 0

This command processes the dataset in batches of 100 using 25 parallel workers, starting from the first record.
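The positional arguments of the command presumably map one-to-one onto the parameters of process_batches. A hedged sketch of that mapping, assuming the defaults documented above (the parse_args helper is hypothetical; the actual QhX argument handling may differ):

```python
# Hypothetical sketch of how the CLI arguments could map to the
# process_batches parameters; not the actual QhX entry point.
def parse_args(argv):
    """Map positional CLI arguments to (batch_size, num_workers, start_i)."""
    batch_size = int(argv[0])                            # mandatory
    num_workers = int(argv[1]) if len(argv) > 1 else 25  # optional, default 25
    start_i = int(argv[2]) if len(argv) > 2 else 0       # optional, default 0
    return batch_size, num_workers, start_i
```

With this mapping, `python -m QhX.batch_processor 100` alone would fall back to the documented defaults of 25 workers and a starting index of 0.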

Installation and Requirements

Ensure that the QhX package is properly installed and configured in your environment. The batch_processor module depends on other components of the QhX package, such as DataManager and ParallelSolver.

See Also