Fix/processors fit process by jhnwu3 · Pull Request #744 · sunlabuiuc/PyHealth

jhnwu3 · 2025-12-18T19:26:52Z

This pull request introduces several improvements and fixes across documentation, benchmarking scripts, dataset processing, and processor design. The most significant update is the addition of comprehensive documentation for writing custom feature processors, clarifying the use of fit() and process() methods. There are also bug fixes and enhancements to the dataset loading logic, as well as updates to benchmarking scripts for more consistent development and memory settings.

Documentation Improvements:

Added a detailed section to the processors documentation explaining how to write custom FeatureProcessor classes, including guidance on when to use fit() vs. process(), example implementations, and processor registration instructions.

Benchmarking Script Updates:

Changed the default development and memory limit settings in benchmark_workers_1.py and benchmark_workers_4.py to ensure benchmarks run with production-like configurations by default. [1] [2]
Updated cache directory usage and naming in benchmark_workers_4.py for clarity and to avoid overwriting previous results. [1] [2]
Added a new example script, timeseries_mimic4.py, demonstrating dataset creation, task assignment, and dataset splitting for time series data.

Dataset Loading and Processing Enhancements:

Improved timestamp handling in BaseDataset.load_table() to avoid errors with missing or invalid values, using errors="coerce" and ensuring proper datetime conversion.
Refactored code for readability and maintainability, including better formatting of warning messages, merging logic, and limiting patient selection in dev mode. [1] [2] [3] [4] [5] [6]
Added a fit() method to TimeseriesProcessor to automatically determine the number of features (n_features) from input data, improving processor usability and reliability.

Versioning:

Bumped the project version from 2.0a11 to 2.0a12 in pyproject.toml.

…etails

Logiquo · 2025-12-18T21:01:27Z

docs/api/processors.rst

+The ``process()`` Method
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``process()`` method is called **on-the-fly** during training or when accessing data from a ``StreamingDataset``. It transforms a single raw feature value into the format your model needs. This method can return:


process is always only called once during the cache phase. But it should be stateless as whatever mutation done in the process

may or may not share with another sample running process call

is not saved at the final SampleDataset.

Yeah I think that's why the timeseries processor kept erroring as self.n_features is never set unless in fit(). I will update the docs to correct this info though as it's still being pretransformed.

Logiquo · 2025-12-18T21:06:50Z

pyhealth/datasets/base_dataset.py

+                # Single timestamp column - don't convert to string yet
+                timestamp_series: dd.Series = df[timestamp_col]


This would work for the else case, though i think the if case is also problematic is some of the columns are NA. But we can worry about it in a later PR.

Logiquo · 2025-12-18T21:08:30Z

pyhealth/datasets/base_dataset.py

                timestamp_series,
                format=timestamp_format,
-                errors="raise",
+                errors="coerce",  # Convert unparseable values to NaT instead of raising


I think this may hide errors? e.g. if someone specify the timestamp_format wrong, they don't see any errors but see lots of null timestamp, which may lead to confussion.

Will change this in next commit.

…added another bug fix to where dask leverages its temporary file usage to be in line with the cache_dir

jhnwu3 · 2025-12-18T21:58:29Z

Will merge once some jobs finish running, and then update pypi to have a 2.0a13

jhnwu3 added 3 commits December 18, 2025 12:03

fixes for timeseries processor, bugs in datetime parsing, and other d…

6dbc8dc

…etails

update to examples to run these things

e13220c

version update again

26cda48

jhnwu3 requested a review from Logiquo December 18, 2025 20:04

Logiquo reviewed Dec 18, 2025

View reviewed changes

Logiquo approved these changes Dec 18, 2025

View reviewed changes

Logiquo mentioned this pull request Dec 18, 2025

[Tracking] Tracking issue for the new memory efficient dataset. #740

Open

13 tasks

Logiquo added component: dataset Contribute a new dataset to PyHealth component: processor Contribute a new processor to PyHealth infra Infrastructure: data loading, caching, pipelines labels Dec 18, 2025

doc updates to be more accurate, changed the error raising, and also …

62a262e

…added another bug fix to where dask leverages its temporary file usage to be in line with the cache_dir

Logiquo added the bug Something isn't working label Dec 18, 2025

jhnwu3 merged commit 41d190a into master Dec 18, 2025
1 check passed

jhnwu3 deleted the fix/processors_fit_process branch December 18, 2025 22:29

Logiquo mentioned this pull request Dec 19, 2025

Time-series processor, process() doesn't seem to properly set the self.n_features or self.size() function properly #742

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/processors fit process#744

Fix/processors fit process#744
jhnwu3 merged 4 commits intomasterfrom
fix/processors_fit_process

jhnwu3 commented Dec 18, 2025

Uh oh!

Logiquo Dec 18, 2025 •

edited

Loading

Uh oh!

jhnwu3 Dec 18, 2025

Uh oh!

Logiquo Dec 18, 2025

Uh oh!

Logiquo Dec 18, 2025

Uh oh!

jhnwu3 Dec 18, 2025

Uh oh!

jhnwu3 commented Dec 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Single timestamp column - don't convert to string yet
		timestamp_series: dd.Series = df[timestamp_col]

Conversation

jhnwu3 commented Dec 18, 2025

Uh oh!

Logiquo Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhnwu3 Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Logiquo Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Logiquo Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

jhnwu3 Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

jhnwu3 commented Dec 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Logiquo Dec 18, 2025 •

edited

Loading