output_parallel
This module processes datasets of detected periods for quasars, performing classification based on intersection over union (IoU) calculations and various error metrics. It provides functionality for loading, classifying, and aggregating information on detected periods in large datasets that may not fit entirely in memory.
The key functionalities in this module include:
- Flattening nested data structures and filtering out entries with missing values.
- Calculating the intersection over union (IoU) for overlapping circles, representing period uncertainties.
- Classifying individual periods based on IoU, error bounds, and significance metrics.
- Aggregating classifications to provide a summary classification for each unique object ID.
- Processing large datasets in chunks to handle memory limitations, with chunk-wise classifications.
- Saving processed data and aggregated statistics to CSV files.
This module is designed for high-throughput time-domain astronomical data analysis, specifically to help identify and classify variability patterns in quasar light curves. It can be used as a standalone script or imported into larger workflows.
Example Usage:
from QhX.output_parallel import process_large_dataset, save_to_csv
# Classify the detected periods chunk by chunk, then aggregate per object
file_path = 'path/to/your/individual_classified_dataset.csv'
combined_data, aggregated_stats = process_large_dataset(file_path)
# Write both result tables to CSV
combined_data_path = 'path/to/your/combined_classified_data.csv'
aggregated_stats_path = 'path/to/your/aggregated_statistics.csv'
save_to_csv(combined_data, combined_data_path)
save_to_csv(aggregated_stats, aggregated_stats_path)
print(f"Combined classified data saved to {combined_data_path}")
print(f"Aggregated statistics saved to {aggregated_stats_path}")
Functions:
flatten_detected_periods(detected_periods): Flatten nested data and skip records with NaN values.
calculate_iou(radius1, radius2, distance): Calculate IoU for two circles.
classify_periods(detected_periods): Classify periods based on IoU and error metrics.
classify_period(row): Classify an individual period based on predefined thresholds.
aggregate_classifications(group): Aggregate classifications to summarize period reliability for each object.
group_periods(data): Apply classification and aggregation to all detected periods in the dataset.
process_chunk(chunk): Process a chunk of data, handling chunk-wise classification.
aggregate_statistics(classified_data): Aggregate statistics for reliable and medium-reliable classifications.
save_to_csv(data, file_path): Save a DataFrame to a CSV file.
process_large_dataset(file_path, chunksize=10000): Process large datasets by reading in chunks, classifying, and aggregating.
- QhX.output_parallel.aggregate_classifications(group)
Aggregates individual period classifications within a group, determining a final classification for each unique object ID.
Parameters:
- group : pd.DataFrame
A subset of the original DataFrame grouped by object ID, containing all period detections.
Returns:
- pd.DataFrame
The input DataFrame with an additional column ‘final_classification’ representing the aggregated classification result for each object ID.
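As a minimal sketch of per-object aggregation, the example below assigns each object ID the most reliable label found among its detections. The ranking rule and the helper name `summarize_group` are assumptions made for illustration; the rule the module actually applies is not stated on this page. The column names `ID` and `individual_classification` follow those mentioned elsewhere in this reference.

```python
import pandas as pd

# Hypothetical ranking: a lower number means a more reliable label.
RANK = {"reliable": 0, "medium reliable": 1, "poor": 2, "NAN": 3}

def summarize_group(group: pd.DataFrame) -> pd.DataFrame:
    """Attach the best label in the group as 'final_classification' (assumed rule)."""
    best = min(group["individual_classification"], key=lambda label: RANK[label])
    out = group.copy()
    out["final_classification"] = best
    return out

detections = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "individual_classification": ["poor", "reliable", "NAN", "poor"],
})

# Apply the helper to each object's rows and stitch the pieces back together.
summary = pd.concat([summarize_group(g) for _, g in detections.groupby("ID")])
print(summary)
```

Other rules, such as a majority vote or requiring agreement between bands, would fit the same group-then-label pattern.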
- QhX.output_parallel.aggregate_statistics(classified_data)
Aggregates statistics for object IDs classified as ‘reliable’ or ‘medium reliable’.
Parameters:
- classified_data : pd.DataFrame
The dataset with ‘final_classification’ for each object ID.
Returns:
- pd.DataFrame
A DataFrame containing aggregated statistics for each ‘reliable’ and ‘medium reliable’ classified object ID.
- QhX.output_parallel.calculate_iou(radius1, radius2, distance)
Calculates the Intersection over Union (IoU) for two circles given their radii and the distance between their centers.
Parameters:
radius1 (float): Radius of the first circle.
radius2 (float): Radius of the second circle.
distance (float): Distance between the centers of the two circles.
Returns:
float: IoU value.
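The geometry behind a circle IoU can be sketched as follows. This is an illustrative implementation built on the standard lens-area formula, not necessarily the code used by the module; the name `circle_iou` and the choice to return 0.0 for non-positive radii are assumptions.

```python
import math

def circle_iou(radius1: float, radius2: float, distance: float) -> float:
    """IoU of two circles from their radii and center distance (illustrative)."""
    if radius1 <= 0 or radius2 <= 0:
        return 0.0  # assumed convention for degenerate circles
    if distance >= radius1 + radius2:
        return 0.0  # circles are disjoint or only touch
    area1 = math.pi * radius1 ** 2
    area2 = math.pi * radius2 ** 2
    if distance <= abs(radius1 - radius2):
        # One circle lies entirely inside the other.
        intersection = min(area1, area2)
    else:
        # Half-angles subtended by the common chord at each center.
        a1 = math.acos((distance ** 2 + radius1 ** 2 - radius2 ** 2)
                       / (2 * distance * radius1))
        a2 = math.acos((distance ** 2 + radius2 ** 2 - radius1 ** 2)
                       / (2 * distance * radius2))
        # Lens area = sum of the two circular segments.
        intersection = (radius1 ** 2 * (a1 - math.sin(2 * a1) / 2)
                        + radius2 ** 2 * (a2 - math.sin(2 * a2) / 2))
    union = area1 + area2 - intersection
    return intersection / union

print(circle_iou(1.0, 1.0, 0.0))  # identical circles
print(circle_iou(1.0, 1.0, 1.0))  # partial overlap
print(circle_iou(1.0, 1.0, 2.5))  # no overlap
```

The value ranges from 0 for non-overlapping circles to 1 for identical ones, so it can be thresholded directly.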
- QhX.output_parallel.classify_period(row)
Classify the detected period as ‘reliable’, ‘medium reliable’, ‘poor’, or ‘NAN’ based on the significance of the detected period, the relative lower and upper errors, and the IoU of the error circles.
Parameters:
row (pd.Series): A row from the DataFrame containing detected period data.
Returns:
str: Classification of the period (‘reliable’, ‘medium reliable’, ‘poor’, ‘NAN’).
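A threshold-based classifier of this kind might look like the sketch below. Everything specific in it is hypothetical: the field names (`significance`, `err_lower`, `err_upper`, `iou`), the helper name `classify_row`, and all threshold values are placeholders chosen for illustration, since the module's actual column names and cut-offs are not listed on this page.

```python
import math

def classify_row(row, sig=(0.99, 0.95), err=(0.1, 0.3), iou=(0.9, 0.7)):
    """Label a detection using hypothetical (strict, relaxed) threshold pairs."""
    values = [row["significance"], row["err_lower"], row["err_upper"], row["iou"]]
    # Any missing metric makes the detection unclassifiable.
    if any(v is None or math.isnan(v) for v in values):
        return "NAN"
    worst_err = max(row["err_lower"], row["err_upper"])
    # Try the strict thresholds first, then the relaxed ones.
    for label, level in (("reliable", 0), ("medium reliable", 1)):
        if (row["significance"] >= sig[level]
                and worst_err <= err[level]
                and row["iou"] >= iou[level]):
            return label
    return "poor"

example = {"significance": 0.96, "err_lower": 0.2, "err_upper": 0.25, "iou": 0.75}
print(classify_row(example))
```

Because the function only uses key lookups, it accepts a plain dict or a `pd.Series`, so it could be applied row-wise with `DataFrame.apply(..., axis=1)`.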
- QhX.output_parallel.classify_periods(detected_periods)
Classifies periods based on IoU and other metrics, adjusted to work with specified column names. Assumes ‘Band1’ and ‘Band2’ columns are already present in the DataFrame.
Parameters:
- detected_periods : list of dict
A list of dictionaries representing detected periods.
Returns:
- pd.DataFrame
A DataFrame containing classified periods and their relevant metrics.
- QhX.output_parallel.flatten_detected_periods(detected_periods)
Flatten the nested list of dictionaries of detected periods, skipping records with any NaN values.
Parameters:
- detected_periods : list of dict
A list of dictionaries, each representing a detected period with keys such as ‘ID’, ‘Sampling_1’, etc.
Returns:
- list of dict
A flattened list of dictionaries, excluding records with NaN values.
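The flatten-and-filter step can be illustrated with the sketch below. It assumes the input mixes single dictionaries with lists of dictionaries (one level of nesting) and that missing values appear as float NaN; the exact nesting the module handles is not specified here, and `flatten_records` and the `period` key are hypothetical names.

```python
import math

def _has_nan(record: dict) -> bool:
    """True if any value in the record is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in record.values())

def flatten_records(nested):
    """Flatten one level of nesting and drop records containing NaN (sketch)."""
    flat = []
    for item in nested:
        # Treat a bare dict as a one-element list so both shapes are handled.
        records = item if isinstance(item, list) else [item]
        flat.extend(rec for rec in records if not _has_nan(rec))
    return flat

nested = [
    [{"ID": 1, "period": 100.0}, {"ID": 1, "period": float("nan")}],
    {"ID": 2, "period": 50.0},
]
clean = flatten_records(nested)
print(clean)
```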
- QhX.output_parallel.group_periods(data)
Classifies periods for all detected periods in a dataset and aggregates classifications for each object ID.
Parameters:
- data : pd.DataFrame
The dataset containing detected periods and necessary metrics.
Returns:
- pd.DataFrame
The input DataFrame enhanced with ‘individual_classification’ and ‘final_classification’ columns.
- QhX.output_parallel.process_chunk(chunk)
Processes a chunk of the dataset by applying period classification logic, used for chunk-wise processing.
Parameters:
- chunk : pd.DataFrame
A chunk of the dataset, containing a subset of the detected periods.
Returns:
- pd.DataFrame
The chunk with period classifications applied, ready for further aggregation.
- QhX.output_parallel.process_large_dataset(file_path, chunksize=10000)
Processes a large dataset by reading it in chunks, classifying periods, and then aggregating results.
Parameters:
- file_path : str
The path to the dataset file.
- chunksize : int, optional
The number of rows per chunk.
Returns:
- tuple of (pd.DataFrame, pd.DataFrame)
Two DataFrames: one with classified data and one with aggregated statistics.
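The chunked-reading pattern can be sketched with `pandas.read_csv` and its `chunksize` argument. The helper names `process_in_chunks` and `tag_long_periods`, the `period` column, and the 500 cut-off are hypothetical stand-ins for the module's own per-chunk classification; an in-memory CSV replaces the file so the example is self-contained.

```python
import io
import pandas as pd

def process_in_chunks(source, process, chunksize=10000):
    """Read CSV `source` in chunks, apply `process` to each, and concatenate."""
    parts = [process(chunk) for chunk in pd.read_csv(source, chunksize=chunksize)]
    return pd.concat(parts, ignore_index=True)

def tag_long_periods(chunk):
    # Hypothetical per-row step standing in for period classification.
    chunk = chunk.copy()
    chunk["long_period"] = chunk["period"] > 500
    return chunk

csv_text = "ID,period\n1,300\n1,800\n2,450\n2,900\n3,1200\n"
combined = process_in_chunks(io.StringIO(csv_text), tag_long_periods, chunksize=3)

# Per-object aggregation runs on the combined frame, because a chunk boundary
# can split one object's rows (ID 2 spans both chunks here).
per_object = combined.groupby("ID")["long_period"].sum()
print(per_object)
```

Only row-wise work is done inside each chunk; anything that needs all rows of an object is deferred until the chunks are concatenated.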