docs/api/processors.rst
Outdated
| The ``process()`` Method | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The ``process()`` method is called **on-the-fly** during training or when accessing data from a ``StreamingDataset``. It transforms a single raw feature value into the format your model needs. This method can return: |
`process` is only ever called once, during the cache phase. It should still be stateless, because any mutation made inside `process`:
- may or may not be shared with another sample's `process` call
- is not saved in the final `SampleDataset`.
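A minimal sketch of the stateless pattern described in this comment. `VocabProcessor` is a hypothetical stand-in, not a PyHealth class; it only mirrors the `fit()`/`process()` split. All state is learned in `fit()`, and `process()` reads that state without changing it.

```python
from typing import Dict, Iterable, List


class VocabProcessor:
    """Hypothetical processor: maps string codes to integer indices."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {"<unk>": 0}

    def fit(self, samples: Iterable[dict], field: str) -> None:
        # All state is built here, before any sample is processed.
        for sample in samples:
            for code in sample[field]:
                self.vocab.setdefault(code, len(self.vocab))

    def process(self, value: List[str]) -> List[int]:
        # Read-only lookup: unseen codes map to <unk> instead of growing
        # the vocabulary, so the result does not depend on call order.
        return [self.vocab.get(code, 0) for code in value]


samples = [{"codes": ["A", "B"]}, {"codes": ["B", "C"]}]
processor = VocabProcessor()
processor.fit(samples, "codes")
vocab_before = dict(processor.vocab)
encoded = [processor.process(s["codes"]) for s in samples]
```

Because `process()` never writes to `self`, it does not matter whether samples are processed in one worker or several, or whether the processor object is copied between them.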
Yeah, I think that's why the timeseries processor kept erroring: `self.n_features` is never set except in `fit()`. I will update the docs to correct this info, though, as it's still being pre-transformed.
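The failure mode mentioned here can be sketched as follows. `TinyTimeseriesProcessor` is a hypothetical class written for illustration, not PyHealth's `TimeseriesProcessor`; it assumes each sample stores its series as a list of equal-length rows.

```python
from typing import Iterable, List, Optional, Sequence


class TinyTimeseriesProcessor:
    """Hypothetical processor that infers the feature count in fit()."""

    def __init__(self) -> None:
        self.n_features: Optional[int] = None

    def fit(self, samples: Iterable[dict], field: str) -> None:
        # Infer the feature count from the first non-empty series.
        for sample in samples:
            rows = sample[field]
            if rows:
                self.n_features = len(rows[0])
                return

    def process(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        if self.n_features is None:
            # Fail loudly instead of hitting an AttributeError later.
            raise RuntimeError("call fit() before process()")
        for row in rows:
            if len(row) != self.n_features:
                raise ValueError(
                    f"expected {self.n_features} features, got {len(row)}"
                )
        # Return a new structure; self is not modified here.
        return [[float(x) for x in row] for row in rows]


samples = [{"vitals": []}, {"vitals": [[1, 2], [3, 4]]}]
proc = TinyTimeseriesProcessor()
proc.fit(samples, "vitals")
out = proc.process(samples[1]["vitals"])
```

Setting `n_features` only in `fit()` keeps `process()` free of side effects, and the explicit check gives a readable error when `fit()` was skipped.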
| # Single timestamp column - don't convert to string yet | ||
| timestamp_series: dd.Series = df[timestamp_col] |
This would work for the else case, though I think the if case is also problematic if some of the columns are NA. But we can worry about it in a later PR.
pyhealth/datasets/base_dataset.py
Outdated
| timestamp_series, | ||
| format=timestamp_format, | ||
| errors="raise", | ||
| errors="coerce", # Convert unparseable values to NaT instead of raising |
I think this may hide errors? E.g. if someone specifies the `timestamp_format` wrong, they won't see any errors but will see lots of null timestamps, which may lead to confusion.
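The concern can be reproduced with plain pandas (the diff itself operates on a dask series; pandas is used here only to keep the sketch small). The sample values and the deliberately wrong format string are made up for illustration.

```python
import pandas as pd

raw = pd.Series(["2020-01-02 03:04:05", "2021-06-07 08:09:10", None])
wrong_format = "%d/%m/%Y"  # deliberately does not match the data

# errors="coerce": every value silently becomes NaT.
coerced = pd.to_datetime(raw, format=wrong_format, errors="coerce")

# Count values that were present in the input but failed to parse.
n_failed = int((coerced.isna() & raw.notna()).sum())

# errors="raise": the format mismatch surfaces immediately.
try:
    pd.to_datetime(raw, format=wrong_format, errors="raise")
    raised = False
except ValueError:
    raised = True
```

If coercion were kept, comparing `n_failed` against zero (or a small threshold) would be one way to warn about a bad `timestamp_format` while still tolerating genuinely missing values; this is a suggestion, not what the PR does.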
Will change this in next commit.
…added another bug fix to where dask leverages its temporary file usage to be in line with the cache_dir
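How the commit wires this up is not shown here. As a rough sketch, dask's `temporary-directory` configuration key can be pointed at a folder under the cache directory; the `dask_tmp` subdirectory name and the temporary `cache_dir` below are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import dask

cache_dir = Path(tempfile.mkdtemp())  # stand-in for the dataset's cache_dir
tmp_dir = cache_dir / "dask_tmp"      # hypothetical subdirectory name
tmp_dir.mkdir(parents=True, exist_ok=True)

# Scope the setting so it only applies while the dataset is being built.
with dask.config.set({"temporary-directory": str(tmp_dir)}):
    active_tmp = dask.config.get("temporary-directory")
```

Using the context-manager form restores the previous setting on exit, so other dask users in the same process are not affected.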
Will merge once some jobs finish running, and then update PyPI to 2.0a13.
This pull request introduces several improvements and fixes across documentation, benchmarking scripts, dataset processing, and processor design. The most significant update is the addition of comprehensive documentation for writing custom feature processors, clarifying the use of the `fit()` and `process()` methods. There are also bug fixes and enhancements to the dataset loading logic, as well as updates to benchmarking scripts for more consistent development and memory settings.

Documentation Improvements:
- Added `processors` documentation explaining how to write custom `FeatureProcessor` classes, including guidance on when to use `fit()` vs. `process()`, example implementations, and processor registration instructions.

Benchmarking Script Updates:
- Updated `benchmark_workers_1.py` and `benchmark_workers_4.py` to ensure benchmarks run with production-like configurations by default.
- Updated `benchmark_workers_4.py` for clarity and to avoid overwriting previous results.
- Added `timeseries_mimic4.py`, demonstrating dataset creation, task assignment, and dataset splitting for time series data.

Dataset Loading and Processing Enhancements:
- Updated `BaseDataset.load_table()` to avoid errors with missing or invalid values, using `errors="coerce"` and ensuring proper datetime conversion.
- Added a `fit()` method to `TimeseriesProcessor` to automatically determine the number of features (`n_features`) from input data, improving processor usability and reliability.

Versioning:
- Bumped the version from `2.0a11` to `2.0a12` in `pyproject.toml`.