The main constraint for solving large problems often isn't time but memory. For example, on my laptop with 32 GB of memory, the largest nmod_poly multiplication with a 64-bit modulus that I can do has inputs of length around 4 · 10^8, i.e. 4 · 10^8 × 64 bits = 3.2 GB per input, so each input takes up just 10% of RAM; the rest of the memory is consumed by temporary buffers in fft_small. With a more memory-efficient multiplication algorithm, I should be able to solve a 10x larger problem entirely in RAM, and if FLINT were able to use file storage, I should be able to do a multiplication 100x as large.
## Memory limits
A first idea is to add an interface along the following lines:
```c
void flint_set_local_memory_limit(slong bytes);
slong flint_get_local_memory_limit(void);
```
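For reference, a minimal sketch of an implementation, in the spirit of flint_get/set_num_threads(); the thread-local storage and the WORD_MAX default (meaning "no limit") are assumptions on my part:

```c
#include "flint.h"

/* Sketch only: thread-local storage and a default of WORD_MAX
   ("no limit") are assumptions, not a settled design. */
static FLINT_TLS_PREFIX slong _flint_local_memory_limit = WORD_MAX;

void flint_set_local_memory_limit(slong bytes)
{
    _flint_local_memory_limit = bytes;
}

slong flint_get_local_memory_limit(void)
{
    return _flint_local_memory_limit;
}
```

A caller would then do e.g. flint_set_local_memory_limit(WORD(8) << 30) to cap temporary allocations at 8 GiB.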
The idea is that when FLINT selects between algorithms, e.g. for polynomial multiplication, it could try to estimate in advance the size of temporary allocations. If this exceeds flint_get_local_memory_limit() for the fastest algorithm (e.g. an FFT), we switch to a potentially slower, but less memory-hungry algorithm. Note that there are already various hardcoded cutoffs in FLINT along these lines. An example is the hardcoded 4 GB sieve limit for the Riemann zeta function:
flint/src/acb_poly/zeta_em_sum.c, line 33 in d29dc3c:

```c
#define SIEVE_ALLOC_LIMIT 4e9  /* 4 GB */
```

Another example is the cutoff in arb_mat_mul, line 327 in d29dc3c:

```c
if (A_max_bits > 8000 || B_max_bits > 8000 || ...
```
Such hardcoded limits could be replaced by a global, runtime-selectable tuning parameter.
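As a sketch of what the dispatch could look like (the underscore-prefixed functions are hypothetical placeholders, not existing FLINT API):

```c
#include "nmod_poly.h"

/* Hypothetical dispatcher: _nmod_poly_mul_fft_temp_bytes() would be an
   upfront estimate of fft_small's scratch space, and _nmod_poly_mul_toom()
   a slower but less memory-hungry fallback; neither exists today. */
void
nmod_poly_mul_memory_aware(nmod_poly_t res, const nmod_poly_t poly1,
                           const nmod_poly_t poly2)
{
    slong temp_bytes = _nmod_poly_mul_fft_temp_bytes(poly1->length,
                                                     poly2->length);

    if (temp_bytes <= flint_get_local_memory_limit())
        nmod_poly_mul(res, poly1, poly2);        /* fastest algorithm */
    else
        _nmod_poly_mul_toom(res, poly1, poly2);  /* low-memory fallback */
}
```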
Like flint_get/set_num_threads(), I think this is a parameter that needs to be set manually by the user; doing something automatic based on the total amount of memory available on the machine is probably going to be fragile, and the optimal setting is going to depend on what the user plans to do anyway.
What are some optimizations we would need to implement to economize on memory?
One of the main memory hogs, as noted above, is FFT multiplication for huge polynomials and integers. An easy general-purpose solution is to switch to Toom-Cook for extremely large multiplications, using FFT for the "pointwise" multiplications (or even doing another round of Toom-Cook recursively if the operands are truly enormous). Toom-Cook with k-way splitting shrinks the pointwise products to roughly 1/k of the operand size, and the evaluation points can be processed one at a time, so the peak temporary space drops accordingly; see the sketch below.
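To illustrate, here is a minimal sketch of one Toom-2 (Karatsuba) layer on top of nmod_poly_mul: the three half-size products still go through the fast multiplication internally, but with correspondingly smaller temporaries. This is an illustration written for this issue, not existing FLINT code:

```c
#include "nmod_poly.h"

/* One level of Toom-2 (Karatsuba): a = a0 + a1*x^m, b = b0 + b1*x^m,
   a*b = a0*b0 + ((a0+a1)*(b0+b1) - a0*b0 - a1*b1)*x^m + a1*b1*x^(2m).
   The three products have about half the length of a direct a*b. */
void
nmod_poly_mul_toom2_top(nmod_poly_t res,
                        const nmod_poly_t a, const nmod_poly_t b)
{
    slong m = (FLINT_MAX(a->length, b->length) + 1) / 2;
    ulong mod = a->mod.n;
    nmod_poly_t a0, a1, b0, b1, t, u;

    nmod_poly_init(a0, mod); nmod_poly_init(a1, mod);
    nmod_poly_init(b0, mod); nmod_poly_init(b1, mod);
    nmod_poly_init(t, mod);  nmod_poly_init(u, mod);

    nmod_poly_set_trunc(a0, a, m);  nmod_poly_shift_right(a1, a, m);
    nmod_poly_set_trunc(b0, b, m);  nmod_poly_shift_right(b1, b, m);

    nmod_poly_mul(t, a0, b0);       /* low product */
    nmod_poly_mul(u, a1, b1);       /* high product */

    nmod_poly_add(a0, a0, a1);      /* a0 + a1 */
    nmod_poly_add(b0, b0, b1);      /* b0 + b1 */
    nmod_poly_mul(res, a0, b0);     /* middle product */
    nmod_poly_sub(res, res, t);
    nmod_poly_sub(res, res, u);

    nmod_poly_shift_left(res, res, m);
    nmod_poly_add(res, res, t);
    nmod_poly_shift_left(u, u, 2 * m);
    nmod_poly_add(res, res, u);

    nmod_poly_clear(a0); nmod_poly_clear(a1);
    nmod_poly_clear(b0); nmod_poly_clear(b1);
    nmod_poly_clear(t);  nmod_poly_clear(u);
}
```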
The existing FFT code could also be made more memory-efficient (e.g. not storing tables of roots of unity; processing primes one at a time when using CRT), but that would be harder to implement. Even harder would be to implement some in-place FFT variant.
Matrix multiplication is analogous: instead of using Strassen or evaluation-interpolation type algorithms for the largest sizes, do blockwise classical multiplication and rely on the faster algorithms for sufficiently small blocks.
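For example (a sketch, not existing FLINT API; the block size bs would be derived from flint_get_local_memory_limit()):

```c
#include "nmod_mat.h"

/* Sketch: blockwise classical multiplication C = A*B with blocks of
   size at most bs x bs, so the fast algorithms only ever see small
   blocks. C must not alias A or B. */
void
nmod_mat_mul_blocked(nmod_mat_t C, const nmod_mat_t A,
                     const nmod_mat_t B, slong bs)
{
    slong m = A->r, n = B->c, k = A->c;
    slong i, j, l;

    nmod_mat_zero(C);

    for (i = 0; i < m; i += bs)
    for (j = 0; j < n; j += bs)
    for (l = 0; l < k; l += bs)
    {
        slong i2 = FLINT_MIN(i + bs, m);
        slong j2 = FLINT_MIN(j + bs, n);
        slong l2 = FLINT_MIN(l + bs, k);
        nmod_mat_t Aw, Bw, Cw, T;

        nmod_mat_window_init(Aw, A, i, l, i2, l2);
        nmod_mat_window_init(Bw, B, l, j, l2, j2);
        nmod_mat_window_init(Cw, C, i, j, i2, j2);
        nmod_mat_init(T, i2 - i, j2 - j, A->mod.n);

        nmod_mat_mul(T, Aw, Bw);    /* fast algorithm on a small block */
        nmod_mat_add(Cw, Cw, T);    /* accumulate into the output window */

        nmod_mat_clear(T);
        nmod_mat_window_clear(Aw);
        nmod_mat_window_clear(Bw);
        nmod_mat_window_clear(Cw);
    }
}
```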
For various other operations, we'd want to replace divide-and-conquer algorithms that recurse to size n/2 with variants that peel off blocks of bounded size, so that only a bounded amount of temporary storage is live at any point in the recursion.
## File-based operands
A second idea is to support file storage when even the operands are too large to fit in RAM.
It would be pretty neat if fmpz and nmod_poly (etc.) could switch to file storage automatically, but dedicated file-backed types are probably more realistic.
For composite types this could be done quite generically: if gr rings provide a method to serialize/deserialize vectors of elements to vectors of bytes, they could be wrapped by some kind of gr_file_vec, gr_file_poly, gr_file_mat types that stream flint_get_local_memory_limit()-sized chunks in and out of memory.
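A rough sketch of what the chunked processing could look like; everything here is hypothetical (the gr_file_vec layout, the read/write helpers, and the assumption of a fixed serialized size per element):

```c
#include <stdio.h>
#include "gr.h"

/* Hypothetical file-backed vector: elements are stored serialized in
   a file, elem_bytes per element. None of this exists in FLINT. */
typedef struct
{
    FILE * file;       /* backing storage */
    slong length;      /* number of elements */
    slong elem_bytes;  /* serialized size of one element */
}
gr_file_vec_struct;

typedef gr_file_vec_struct gr_file_vec_t[1];

/* Apply a unary ring operation elementwise, streaming chunks sized to
   stay within flint_get_local_memory_limit(). gr_file_vec_read() and
   gr_file_vec_write() are placeholders for fread/fwrite combined with
   the ring's (de)serialization method. */
int
gr_file_vec_apply(gr_file_vec_t res, const gr_file_vec_t vec,
                  int (*f)(gr_ptr, gr_srcptr, gr_ctx_t), gr_ctx_t ctx)
{
    slong sz = ctx->sizeof_elem;
    slong chunk = FLINT_MAX(1,
        flint_get_local_memory_limit() / (vec->elem_bytes + sz));
    int status = GR_SUCCESS;
    slong i, j, n;
    gr_ptr buf;

    GR_TMP_INIT_VEC(buf, chunk, ctx);

    for (i = 0; i < vec->length; i += chunk)
    {
        n = FLINT_MIN(chunk, vec->length - i);

        status |= gr_file_vec_read(buf, vec, i, n, ctx);   /* placeholder */
        for (j = 0; j < n; j++)
            status |= f(GR_ENTRY(buf, j, sz), GR_ENTRY(buf, j, sz), ctx);
        status |= gr_file_vec_write(res, buf, i, n, ctx);  /* placeholder */
    }

    GR_TMP_CLEAR_VEC(buf, chunk, ctx);
    return status;
}
```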