Skip to content

Fix/processors fit process#744

Merged
jhnwu3 merged 4 commits intomasterfrom
fix/processors_fit_process
Dec 18, 2025
Merged

Fix/processors fit process#744
jhnwu3 merged 4 commits intomasterfrom
fix/processors_fit_process

Conversation

@jhnwu3
Copy link
Collaborator

@jhnwu3 jhnwu3 commented Dec 18, 2025

This pull request introduces several improvements and fixes across documentation, benchmarking scripts, dataset processing, and processor design. The most significant update is the addition of comprehensive documentation for writing custom feature processors, clarifying the use of fit() and process() methods. There are also bug fixes and enhancements to the dataset loading logic, as well as updates to benchmarking scripts for more consistent development and memory settings.

Documentation Improvements:

  • Added a detailed section to the processors documentation explaining how to write custom FeatureProcessor classes, including guidance on when to use fit() vs. process(), example implementations, and processor registration instructions.

Benchmarking Script Updates:

  • Changed the default development and memory limit settings in benchmark_workers_1.py and benchmark_workers_4.py to ensure benchmarks run with production-like configurations by default. [1] [2]
  • Updated cache directory usage and naming in benchmark_workers_4.py for clarity and to avoid overwriting previous results. [1] [2]
  • Added a new example script, timeseries_mimic4.py, demonstrating dataset creation, task assignment, and dataset splitting for time series data.

Dataset Loading and Processing Enhancements:

  • Improved timestamp handling in BaseDataset.load_table() to avoid errors with missing or invalid values, using errors="coerce" and ensuring proper datetime conversion.
  • Refactored code for readability and maintainability, including better formatting of warning messages, merging logic, and limiting patient selection in dev mode. [1] [2] [3] [4] [5] [6]
  • Added a fit() method to TimeseriesProcessor to automatically determine the number of features (n_features) from input data, improving processor usability and reliability.

Versioning:

  • Bumped the project version from 2.0a11 to 2.0a12 in pyproject.toml.

@jhnwu3 jhnwu3 requested a review from Logiquo December 18, 2025 20:04
The ``process()`` Method
~~~~~~~~~~~~~~~~~~~~~~~~

The ``process()`` method is called **on-the-fly** during training or when accessing data from a ``StreamingDataset``. It transforms a single raw feature value into the format your model needs. This method can return:
Copy link
Collaborator

@Logiquo Logiquo Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process is always only called once during the cache phase. But it should be stateless as whatever mutation done in the process

  1. may or may not share with another sample running process call
  2. is not saved at the final SampleDataset.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's why the timeseries processor kept erroring as self.n_features is never set unless in fit(). I will update the docs to correct this info though as it's still being pretransformed.

Comment on lines +475 to +476
# Single timestamp column - don't convert to string yet
timestamp_series: dd.Series = df[timestamp_col]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work for the else case, though i think the if case is also problematic is some of the columns are NA. But we can worry about it in a later PR.

timestamp_series,
format=timestamp_format,
errors="raise",
errors="coerce", # Convert unparseable values to NaT instead of raising
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may hide errors? e.g. if someone specify the timestamp_format wrong, they don't see any errors but see lots of null timestamp, which may lead to confussion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this in next commit.

@Logiquo Logiquo added component: dataset Contribute a new dataset to PyHealth component: processor Contribute a new processor to PyHealth infra Infrastructure: data loading, caching, pipelines labels Dec 18, 2025
…added another bug fix to where dask leverages its temporary file usage to be in line with the cache_dir
@jhnwu3
Copy link
Collaborator Author

jhnwu3 commented Dec 18, 2025

Will merge once some jobs finish running, and then update pypi to have a 2.0a13

@Logiquo Logiquo added the bug Something isn't working label Dec 18, 2025
@jhnwu3 jhnwu3 merged commit 41d190a into master Dec 18, 2025
1 check passed
@jhnwu3 jhnwu3 deleted the fix/processors_fit_process branch December 18, 2025 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component: dataset Contribute a new dataset to PyHealth component: processor Contribute a new processor to PyHealth infra Infrastructure: data loading, caching, pipelines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants