Unlocking Medical Data

The Auto-Preprocessing Pipeline for Nested, Semi-Structured EHR Data

80% of data analysis effort is data preprocessing.

Electronic Health Records (EHRs) hold vital patient insights but are notoriously difficult to analyze. Data is often exported as CSV files with nested, semi-structured fields, a format that standard tabular tools cannot analyze directly.

This "messy" data creates a major bottleneck that slows critical research and machine learning development. Our pipeline automates the preprocessing from raw export to analysis-ready table.

How It Works: The Automated Pipeline

1. Load YAML Configuration
2. Load Raw CSV Dataset
3. Consolidate Fragmented Rows (by Unique ID)
4. Expand Nested Data (Core Innovation)
   - Uses: Regular Expressions (for Data Extraction)
   - Uses: Fuzzy Matching (for Disambiguation)
5. Clean & Normalize Data
6. Auto-Labeler (Feature Engineering)
   - Range-Based (e.g., Age: 68 → 'Elderly')
   - Keyword-Based (e.g., 'high blood' → 'High Risk')
   - Model-Based (e.g., ClinicalBERT Classification)
7. Output: Structured, Analysis-Ready Data
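
Steps 4 and 6 above can be illustrated with a minimal sketch. It is not the pipeline's actual code: the nested cell format ("key: value" pairs separated by semicolons), the canonical key list, the age thresholds, and every function name here are assumptions made for illustration. Fuzzy matching uses Python's standard difflib module as a stand-in for whatever matcher the pipeline uses, and model-based labeling (ClinicalBERT) is left out.

```python
import re
from difflib import get_close_matches

# ASSUMPTION: a nested cell holds "key: value" pairs separated by semicolons.
PAIR_RE = re.compile(r"\s*([^:;]+?)\s*:\s*([^;]*?)\s*(?:;|$)")

# ASSUMPTION: hypothetical canonical feature names for disambiguation.
CANONICAL_KEYS = ["blood_pressure", "heart_rate", "diagnosis"]


def normalize_key(raw_key, cutoff=0.6):
    """Map a noisy key to the closest canonical name, or None if none is close."""
    candidate = raw_key.strip().lower().replace(" ", "_")
    matches = get_close_matches(candidate, CANONICAL_KEYS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def expand_nested(cell):
    """Extract key/value pairs with a regex, then disambiguate keys by fuzzy match."""
    expanded = {}
    for raw_key, value in PAIR_RE.findall(cell):
        key = normalize_key(raw_key)
        if key is not None:
            expanded[key] = value
    return expanded


def label_by_range(value, bins):
    """Return the label of the first bin whose exclusive upper bound exceeds value."""
    for upper, label in bins:
        if value < upper:
            return label
    return None


def label_by_keyword(text, rules, default="Unlabeled"):
    """Return the label of the first keyword found in text (case-insensitive)."""
    lowered = text.lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return default


# ASSUMPTION: illustrative thresholds; only 68 -> 'Elderly' comes from the list above.
AGE_BINS = [(18, "Child"), (65, "Adult"), (float("inf"), "Elderly")]
RISK_RULES = [("high blood", "High Risk")]

# The misspelled key "Blood Presure" is resolved by the fuzzy match.
features = expand_nested("Blood Presure: 140/90; heart rate: 72; Diagnosis: high blood pressure")
age_label = label_by_range(68, AGE_BINS)
risk_label = label_by_keyword(features["diagnosis"], RISK_RULES)
```

In a configuration-driven pipeline, the bins, keyword rules, and canonical keys would presumably come from the YAML file loaded in step 1 rather than from module-level constants.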

Case Study: The Transformation

We ran the pipeline on a raw cardiology dataset. It converted a large, fragmented export into a clean, structured dataset ready for machine learning.

Raw Data vs. Processed Data

The pipeline consolidated 88,786 fragmented rows into 12,544 clean, patient-visit-specific rows and expanded 10 nested columns into 22 structured features.
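
The drop in row count comes from merging fragments that share a unique ID. The following is a minimal sketch of that consolidation under stated assumptions: the column name `visit_id`, the rule of joining non-empty fragments with a space, and the toy CSV are all hypothetical and do not come from the cardiology dataset.

```python
import csv
import io


def consolidate_rows(rows, id_column="visit_id", sep=" "):
    """Merge rows sharing the same ID, joining non-empty cell fragments per column."""
    merged = {}
    for row in rows:
        key = row[id_column]
        target = merged.setdefault(key, {id_column: key})
        for column, value in row.items():
            # Skip the ID itself, missing cells, and blank cells.
            if column == id_column or not isinstance(value, str) or not value.strip():
                continue
            fragment = value.strip()
            if column in target:
                target[column] = target[column] + sep + fragment
            else:
                target[column] = fragment
    # Dicts keep insertion order, so IDs appear in first-seen order.
    return list(merged.values())


# Toy input (made up): visit V1 is split across two physical rows.
RAW = """visit_id,age,notes
V1,68,BP: 140/90;
V1,,HR: 72
V2,54,BP: 120/80
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
clean = consolidate_rows(rows)
```

Here three raw rows collapse into two visit-level rows, and the `notes` fragments for V1 are joined into one cell that the expansion step can then parse.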

Pipeline Performance & Scalability

Baseline Performance

The full pipeline, including model-based labeling with ClinicalBERT, runs efficiently. The auto-labeling step is the most computationally intensive stage, which is expected because it runs ClinicalBERT inference.

Scalability Under Load

Execution time scales linearly and predictably as dataset volume increases, which indicates the pipeline can handle large-scale, real-world datasets.
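
One way to check a linear-scaling claim like this is to time the pipeline on inputs of increasing size and fit a straight line to the measurements. The sketch below assumes nothing about the real pipeline: `time_runs` and `linear_fit` are hypothetical helpers, and the workload is a stand-in (summing a list), not the preprocessing code.

```python
import time


def time_runs(func, inputs):
    """Run func on each input and return the wall-clock seconds for each run."""
    timings = []
    for data in inputs:
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings


def linear_fit(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Stand-in workload: replace `sum` with the real pipeline entry point.
sizes = [10_000, 20_000, 40_000, 80_000]
inputs = [list(range(n)) for n in sizes]
timings = time_runs(sum, inputs)
slope, intercept, r_squared = linear_fit(sizes, timings)
```

An R^2 close to 1 across a wide range of sizes supports linear scaling. Single timings are noisy, so repeating each measurement and taking the median gives a more reliable picture.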

The Payoff: Real-World Utility

Was the transformation useful? We used the processed data to conduct a Kaplan-Meier survival analysis to measure patient outcomes. This analysis was not feasible on the raw data, which lacked one clean row per patient visit.

Survival Analysis: Impact of Cardiac Rehabilitation

The processed data shows that, among patients with a history of cardiovascular disease (CVD), those who attended rehabilitation had a significantly higher survival probability over time than those who did not.
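
For readers unfamiliar with the method, the Kaplan-Meier estimator multiplies, at each time an event occurs, the fraction of at-risk patients who survive it: S(t) is the product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i is the number still at risk just before t_i. The sketch below is a plain-Python illustration, not the analysis code used for the study. The function name and the toy durations are hypothetical and do not come from the cardiology dataset.

```python
from collections import Counter


def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    event_observed: 1 if the event occurred at that time, 0 if censored.
    """
    if len(durations) != len(event_observed):
        raise ValueError("durations and event_observed must have equal length")

    events = Counter()   # events per time point
    leaving = Counter()  # events plus censored patients per time point
    for t, observed in zip(durations, event_observed):
        leaving[t] += 1
        if observed:
            events[t] += 1

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            # Patients censored at t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy data (made up): five patients, two of them censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Comparing two groups, such as rehabilitation attendees and non-attendees, amounts to computing one curve per group from the processed rows. A test such as the log-rank test is the usual way to judge whether the difference between the curves is statistically significant.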