Major Refactor: Unified Event Stream, YAML Config, Multimodal Processor, Simplified Model #319

Merged: jhnwu3 merged 14 commits into dev from feat/event-stream on Apr 8, 2025

Conversation

@zzachw zzachw commented Apr 7, 2025

This PR delivers a full-stack refactor of PyHealth’s data and model pipeline.

  • Unified Event Stream

    • All patient data is now represented as a unified event stream (Polars DataFrame).
    • Simplifies data handling across modalities and supports flexible, on-demand querying.
    • Patient object redesigned to work directly with the event stream for efficient event access and filtering.
  • YAML Configuration

    • Dataset loading is fully driven by YAML config files.
    • Clean, declarative specification of table paths, attributes, and data sources.
    • Makes it easy to add new datasets.
  • Multimodal Processor Framework

    • Introduced a modular processor system for handling diverse data types (sequences, signals, images, labels, time series).
    • Processors are easily extendable and managed through a centralized registry.
    • Standardized interface for preprocessing, feature extraction, and data transformation.
  • Simplified Model Structure

    • Refactored model classes for clarity and modularity.
    • Added EmbeddingModel to unify input embedding across models.
    • Cleaned up BaseModel and RNN class implementations, removing deprecated logic and improving readability.
  • Additional Improvements

    • Adopted Polars backend for faster data manipulation.
    • Unified dataset APIs and removed legacy code.
    • New tasks added: InHospitalMortalityMIMIC4, Readmission30DaysMIMIC4.
    • Updated .gitignore and utility functions for YAML parsing.
    • Improved internal typing and documentation.
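
To make the event-stream idea concrete, here is a minimal sketch of a patient object that stores every modality as one time-ordered stream and filters it on demand. It uses a plain Python list in place of the Polars DataFrame the PR describes, and all names (`SketchEvent`, `SketchPatient`, `get_events`, the sample event types) are hypothetical illustrations, not PyHealth's actual API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class SketchEvent:
    """One row of the event stream (hypothetical shape)."""
    event_type: str
    timestamp: datetime
    attrs: Dict[str, Any] = field(default_factory=dict)


class SketchPatient:
    """Holds one patient's events from all tables, sorted by time."""

    def __init__(self, patient_id: str, events: List[SketchEvent]):
        self.patient_id = patient_id
        self.events = sorted(events, key=lambda e: e.timestamp)

    def get_events(
        self,
        event_type: Optional[str] = None,
        start: Optional[datetime] = None,
        end: Optional[datetime] = None,
    ) -> List[SketchEvent]:
        """Filter by type and/or the half-open time window [start, end)."""
        selected = []
        for event in self.events:
            if event_type is not None and event.event_type != event_type:
                continue
            if start is not None and event.timestamp < start:
                continue
            if end is not None and event.timestamp >= end:
                continue
            selected.append(event)
        return selected


patient = SketchPatient("p1", [
    SketchEvent("labevents", datetime(2025, 1, 2), {"value": 7.1}),
    SketchEvent("admissions", datetime(2025, 1, 1), {"type": "EMERGENCY"}),
    SketchEvent("diagnoses", datetime(2025, 1, 3), {"code": "I10"}),
])
labs = patient.get_events(event_type="labevents")
early = patient.get_events(end=datetime(2025, 1, 3))
```

Because every table lands in the same stream, a task can ask for "all events before time T" without knowing which tables exist.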

⚠️ Breaking Changes & Follow-Up

  • This is a major refactor that may break existing functionality.
  • Extensive testing and integration with downstream modules are still required.
  • Documentation is outdated and will need a full revision to match the new pipeline.
  • Reviewers should focus on overall structure and architecture.
  • Follow-up PRs will address testing, doc updates, and additional validation.

zzachw added 8 commits April 7, 2025 01:12
1. Patient is now a sequence of events.
2. Updated Patient class to initialize with a Polars DataFrame for event management.
- Unified APIs for all modalities.
- Enabled data loading based on YAML configs.
- Switched to Polars backend.
- Removed deprecated base_dataset, sample_dataset.
- Renamed base_dataset_v2 as base_dataset.
- Renamed sample_dataset_v2 as sample_dataset.
- Moved padding to collate_fn.
- Cleaned up unused featurizer classes.
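
As an illustration of "moved padding to collate_fn", the sketch below pads variable-length sequences at batch time rather than when samples are built. It is pure Python and only shows the idea; the function name `pad_collate`, the `key` argument, and the sample layout are assumptions, and the PR's real collate function may differ.

```python
from typing import Dict, List, Sequence, Tuple


def pad_collate(
    batch: Sequence[Dict[str, List[int]]],
    key: str,
    pad_value: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Pad batch[i][key] on the right to the longest sequence in the batch.

    Returns the padded sequences and the original lengths, so a model can
    mask out the padding later.
    """
    lengths = [len(sample[key]) for sample in batch]
    max_len = max(lengths, default=0)
    padded = [
        sample[key] + [pad_value] * (max_len - len(sample[key]))
        for sample in batch
    ]
    return padded, lengths


batch = [{"codes": [5, 3]}, {"codes": [9]}, {"codes": [4, 4, 2]}]
padded, lengths = pad_collate(batch, key="codes")
```

Padding per batch means each batch is only as wide as its longest member, instead of every sample being padded to a dataset-wide maximum.
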
Simplified MIMIC4Dataset class by merging loading functions
Introduced a YAML configuration file for dataset management, detailing file paths and attributes for various tables.
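
A rough sketch of what a config-driven loader could validate is shown below. It starts from the dictionary a YAML loader would return, so it needs no third-party package. The schema (`tables`, `file_path`, `patient_id`, `timestamp`, `attributes`), the file paths, and the column names are all hypothetical and are not taken from the PR's actual config file.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class TableSpec:
    """Declarative description of one source table (assumed fields)."""
    file_path: str
    patient_id: str
    timestamp: Optional[str]
    attributes: List[str]


def parse_config(raw: Dict[str, Any]) -> Dict[str, TableSpec]:
    """Turn the parsed config mapping into validated TableSpec objects."""
    tables: Dict[str, TableSpec] = {}
    for name, spec in raw.get("tables", {}).items():
        missing = {"file_path", "patient_id"} - spec.keys()
        if missing:
            raise ValueError(f"table {name!r} is missing keys: {sorted(missing)}")
        tables[name] = TableSpec(
            file_path=spec["file_path"],
            patient_id=spec["patient_id"],
            timestamp=spec.get("timestamp"),  # static tables may have none
            attributes=list(spec.get("attributes", [])),
        )
    return tables


# Stand-in for the mapping a YAML loader would produce from a config file.
raw_config = {
    "tables": {
        "admissions": {
            "file_path": "example/admissions.csv",
            "patient_id": "patient_id",
            "timestamp": "admit_time",
            "attributes": ["admission_type", "discharge_time"],
        },
        "demographics": {
            "file_path": "example/demographics.csv",
            "patient_id": "patient_id",
            "attributes": ["gender"],
        },
    }
}
tables = parse_config(raw_config)
```

Under this kind of scheme, supporting a new dataset is mostly a matter of writing a new config file rather than a new loader class.
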
- Renamed `TaskTemplate` to `BaseTask`.
- Introduced `InHospitalMortalityMIMIC4`.
- Introduced `Readmission30DaysMIMIC4`.
- Introduced a new processor registry to manage different data processors.
- Implemented base processor classes: `Processor`, `FeatureProcessor`, `SampleProcessor`, and `DatasetProcessor`.
- Added specific processors for images (`ImageProcessor`), labels (`BinaryLabelProcessor`, `MultiClassLabelProcessor`, `MultiLabelProcessor`, `RegressionLabelProcessor`), sequences (`SequenceProcessor`), signals (`SignalProcessor`), and time series (`TimeseriesProcessor`).
- Each processor includes methods for processing data and managing state, with appropriate error handling and configuration options.
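
The registry pattern described in this commit can be sketched as a name-to-class mapping filled in by a decorator. This is a simplified illustration: `PROCESSOR_REGISTRY`, `register_processor`, `get_processor`, and the `Sketch*` classes are hypothetical names and do not reproduce the signatures in `pyhealth/processors`.

```python
from typing import Any, Callable, Dict, Iterable

# Central mapping from a short name to a processor class.
PROCESSOR_REGISTRY: Dict[str, type] = {}


def register_processor(name: str) -> Callable[[type], type]:
    """Class decorator that records a processor under `name`."""
    def decorator(cls: type) -> type:
        if name in PROCESSOR_REGISTRY:
            raise ValueError(f"processor {name!r} is already registered")
        PROCESSOR_REGISTRY[name] = cls
        return cls
    return decorator


class SketchProcessor:
    """Minimal base interface: optionally fit state, then process values."""

    def fit(self, values: Iterable[Any]) -> "SketchProcessor":
        return self

    def process(self, value: Any) -> Any:
        raise NotImplementedError


@register_processor("binary_label")
class SketchBinaryLabel(SketchProcessor):
    """Maps exactly two distinct labels to 0.0 and 1.0 (sorted order)."""

    def __init__(self) -> None:
        self.label_to_index: Dict[Any, int] = {}

    def fit(self, values: Iterable[Any]) -> "SketchBinaryLabel":
        labels = sorted(set(values))
        if len(labels) != 2:
            raise ValueError(f"expected 2 distinct labels, got {len(labels)}")
        self.label_to_index = {label: i for i, label in enumerate(labels)}
        return self

    def process(self, value: Any) -> float:
        return float(self.label_to_index[value])


def get_processor(name: str, **kwargs: Any) -> SketchProcessor:
    """Look up a processor class by name and instantiate it."""
    return PROCESSOR_REGISTRY[name](**kwargs)


proc = get_processor("binary_label").fit(["alive", "dead", "alive"])
encoded = [proc.process(v) for v in ["dead", "alive"]]
```

A task schema can then refer to processors by string name, and adding a new data type only requires defining and registering one more class.
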
- Updated `BaseModel` to streamline initialization and remove deprecated parameters.
- Introduced `EmbeddingModel` for handling embedding layers for various input types.
- Refactored `RNN` class to utilize `EmbeddingModel` for embedding inputs, enhancing modularity.
- Cleaned up unused code and improved type annotations for better clarity and maintainability.
@zzachw zzachw requested a review from Copilot April 7, 2025 06:48

Copilot AI left a comment

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

pyhealth/processors/image_processor.py:75

  • [nitpick] The `__repr__` method in `ImageProcessor` returns 'ImageLoadingProcessor', which is inconsistent with the class name. Consider updating it to 'ImageProcessor' to reflect the correct class.
f"ImageLoadingProcessor(image_size={self.image_size}, to_tensor={self.to_tensor}, normalize={self.normalize}, mean={self.mean}, std={self.std})"
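
One way to avoid this kind of drift is to derive the name from the class rather than hard-coding it. The sketch below uses a stand-in class with only two of the attributes, so it illustrates the idea and is not the actual `ImageProcessor` code.

```python
class ImageProcessorSketch:
    """Stand-in class; the attribute set is reduced for illustration."""

    def __init__(self, image_size: int = 224, to_tensor: bool = True) -> None:
        self.image_size = image_size
        self.to_tensor = to_tensor

    def __repr__(self) -> str:
        # type(self).__name__ follows renames and subclasses automatically.
        return (
            f"{type(self).__name__}("
            f"image_size={self.image_size}, to_tensor={self.to_tensor})"
        )


class RenamedSketch(ImageProcessorSketch):
    """Subclass that inherits __repr__ and still reports its own name."""


text = repr(ImageProcessorSketch())
renamed_text = repr(RenamedSketch(image_size=128))
```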

pyhealth/data/data.py:35

  • The filtering of dictionary keys in Event.from_dict assumes that attribute keys are prefixed with the event_type. This behavior might be fragile if keys are not uniformly prefixed; consider adding validation or documentation to ensure consistency.
for k, v in d.items() if k.startswith(event_type)
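
A short sketch of why a bare `startswith(event_type)` check can be fragile, and one stricter alternative. The separator `"/"`, the reserved key names, the sample row, and the helper `split_attributes` are assumptions made for illustration; the real column naming in `Event.from_dict` may differ.

```python
from typing import Any, Dict, Iterable

SEP = "/"  # assumed separator between the event type and the attribute name


def split_attributes(
    row: Dict[str, Any],
    event_type: str,
    reserved: Iterable[str] = ("event_type", "timestamp"),
) -> Dict[str, Any]:
    """Return only the attributes belonging to `event_type`, prefix removed.

    Matching on `event_type + SEP` instead of `event_type` alone keeps an
    event type that is a prefix of another one from picking up its columns.
    """
    prefix = event_type + SEP
    reserved_keys = set(reserved)
    return {
        key[len(prefix):]: value
        for key, value in row.items()
        if key not in reserved_keys and key.startswith(prefix)
    }


row = {
    "event_type": "lab",
    "timestamp": "2025-01-01",
    "lab/value": 7.1,
    "labevents/itemid": 5,   # belongs to a different, longer event type
    "diagnoses/code": None,
}

# Bare prefix check: also matches the "labevents/..." column.
loose = {k: v for k, v in row.items() if k.startswith("lab")}
# Prefix plus separator: only the "lab/..." column survives.
strict = split_attributes(row, "lab")
```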

@jhnwu3 jhnwu3 left a comment

Looks good tbh

@jhnwu3 jhnwu3 left a comment

I see no red flags.

@jhnwu3 jhnwu3 left a comment

Signal processor may need Jathurshan to look at later.

@jhnwu3 jhnwu3 merged commit b09f6d4 into dev Apr 8, 2025
@zzachw zzachw deleted the feat/event-stream branch April 8, 2025 21:43
3 participants