The Auto-Preprocessing Pipeline for Nested, Semi-Structured EHR Data
A large share of the effort in any data analysis project is spent on data preprocessing.
Electronic Health Records (EHRs) hold vital patient insights but are notoriously difficult to analyze. The data is often exported as complex, nested CSV files, a format that standard tabular tools struggle to parse.
This messy data creates a major bottleneck, stalling critical research and machine learning development. Our pipeline automates the entire preprocessing step.
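To make the problem concrete, here is a minimal sketch of what "nested CSV" means in practice. The column and field names (`patient_id`, `visits`, `dx`) are hypothetical stand-ins, not the actual export schema: a CSV cell contains a JSON-encoded structure, so a single pass with a CSV reader is not enough.

```python
import csv
import io
import json

# Hypothetical nested EHR export: the "visits" column holds a
# JSON-encoded list of visit records inside a single CSV cell.
raw = io.StringIO(
    'patient_id,visits\n'
    '"P001","[{""date"": ""2021-03-01"", ""dx"": ""CVD""}]"\n'
)

rows = []
for record in csv.DictReader(raw):
    # The cell is itself JSON; a second parse recovers the structure.
    visits = json.loads(record["visits"])
    rows.append({"patient_id": record["patient_id"], "visits": visits})

# The nested field is now addressable as ordinary Python data.
first_dx = rows[0]["visits"][0]["dx"]
```

A flat CSV reader alone would return the `visits` cell as an opaque string; the second, per-cell JSON parse is what the pipeline automates at scale.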
We ran the pipeline on a raw cardiology dataset. The transformation was substantial: a large, unusable export became a clean, structured dataset ready for machine learning.
The pipeline consolidated 88,786 fragmented rows into 12,544 clean patient-visit rows and expanded 10 nested columns into 22 structured features.
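The two operations described above can be sketched together on toy data. All names here (`visit_id`, `obs`, the vital-sign keys) are illustrative assumptions, not the real schema: fragments sharing a visit key are merged into one record, and each nested JSON cell is flattened into named feature columns.

```python
import json

# Toy rows standing in for the fragmented export: several raw rows
# belong to one patient visit, and the "obs" cell is a JSON object.
raw_rows = [
    {"patient_id": "P001", "visit_id": "V1", "obs": '{"hr": 72}'},
    {"patient_id": "P001", "visit_id": "V1", "obs": '{"bp_sys": 130}'},
    {"patient_id": "P002", "visit_id": "V2", "obs": '{"hr": 88}'},
]

# Step 1: consolidate fragments into one record per patient visit.
visits = {}
for row in raw_rows:
    key = (row["patient_id"], row["visit_id"])
    rec = visits.setdefault(key, {"patient_id": row["patient_id"],
                                  "visit_id": row["visit_id"]})
    # Step 2: expand the nested JSON cell into flat, structured columns.
    rec.update(json.loads(row["obs"]))

flat = list(visits.values())
# Three fragmented rows become two patient-visit rows with flat features.
```

This is why the row count drops while the column count grows: fragments collapse into visits, and nested fields fan out into features.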
The full pipeline, including AI-driven labeling with ClinicalBERT, runs efficiently; the auto-labeling step is the most computationally intensive part.
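The actual labeling step runs ClinicalBERT, whose model inference is what makes it the most expensive stage. As a lightweight stand-in, the sketch below keeps the same interface (free-text clinical note in, label out) but substitutes simple keyword rules; the function name, label set, and keyword list are all hypothetical and are not the pipeline's method.

```python
# Keyword rules as a stand-in for the ClinicalBERT labeler; the real
# step replaces this lookup with transformer model inference.
CVD_TERMS = ("cardiovascular", "myocardial", "cvd", "heart failure")

def label_note(note: str) -> str:
    """Hypothetical labeling interface: returns 'CVD' or 'non-CVD'."""
    text = note.lower()
    return "CVD" if any(term in text for term in CVD_TERMS) else "non-CVD"

labels = [label_note(n) for n in
          ["History of myocardial infarction.",
           "Routine checkup, no issues."]]
```

Keeping the interface this narrow is what lets the expensive model be swapped in or out without touching the rest of the pipeline.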
Execution time scales linearly and predictably with dataset size, indicating the pipeline can handle large-scale, real-world workloads.
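A scaling check like the one behind this claim can be reproduced with a simple benchmark harness. The `flatten` function and the synthetic payload are assumptions used only to illustrate the method: time the unpacking step at doubling input sizes and inspect how the timings grow.

```python
import json
import time

def flatten(cells):
    # The same kind of work as the unpacking step: parse every
    # nested JSON cell into a structured record.
    return [json.loads(cell) for cell in cells]

# Time the step at doubling synthetic dataset sizes; roughly linear
# scaling shows up as timings that grow in proportion to row count.
sizes = [10_000, 20_000, 40_000]
timings = []
for n in sizes:
    data = ['{"hr": 72, "bp_sys": 130}'] * n
    start = time.perf_counter()
    flatten(data)
    timings.append(time.perf_counter() - start)
```

In practice each size should be run several times and averaged, since single measurements are noisy.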
Was the transformation useful? We used the processed data to conduct a Kaplan-Meier survival analysis to measure patient outcomes. This analysis was impossible with the raw data.
The processed data clearly shows that patients with a history of Cardiovascular Disease (CVD) who attended rehabilitation had a significantly higher survival probability over time compared to those who did not.
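The Kaplan-Meier estimator behind this comparison is simple enough to sketch from scratch: at each event time t with d deaths among n subjects still at risk, the survival estimate is multiplied by (1 - d/n). The follow-up durations below are illustrative toy numbers, not the study's data.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve. `events[i]` is 1 for a death at
    `durations[i]`, 0 for a censored observation (lost to follow-up)."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    counts = Counter(durations)
    at_risk = len(durations)
    survival, curve = 1.0, {}
    for t in sorted(counts):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk   # step down at each death time
        curve[t] = survival
        at_risk -= counts[t]  # deaths and censored both leave the risk set
    return curve

# Toy follow-up data in months (illustrative only): the rehab group
# has fewer deaths, so its curve stays higher.
rehab = kaplan_meier([6, 12, 18, 24, 24], [0, 1, 0, 0, 0])
no_rehab = kaplan_meier([3, 6, 9, 12, 24], [1, 1, 1, 0, 0])
```

Handling censored patients correctly (they shrink the risk set without counting as deaths) is exactly why this analysis needs clean per-patient rows with explicit follow-up times, which the raw export could not provide.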