Closed
Description
Enabling threading in the stan-math library with the STAN_THREADS compiler macro leads to significant slowdowns. The cause is the thread_local storage (TLS) implementation used in the Meyers singleton approach for the AD tape.
Example
Running the cmdstan performance benchmark suite (vanilla cmdstan 2.18.1) with 5 replications on the stat_comp_benchmarks set of examples produces the results below without TLS (STAN_THREADS not defined). The system is macOS Mojave with clang++ (Apple LLVM version 10.0.0, clang-1000.11.45.5) on a 2.9 GHz Intel Core i9:
~/work/performance-tests-cmdstan]$ cat performance_2_18_1.csv
stat_comp_benchmarks/benchmarks/garch/garch.stan,0.371638154984
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.15906739235
stat_comp_benchmarks/benchmarks/sir/sir.stan,119.077228165
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0128675460815
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.5181746006
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.078618812561
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.181181192398
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0303950309753
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.271839380264
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,18.0576345444
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.63970680237
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.43102579117
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.22463345528
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.51917424202
With -DSTAN_THREADS turned on, the same benchmark run produces these results:
[21:55:38][sebi@sebastians-macbook-pro-1:~/work/performance-tests-cmdstan]$ cat performance_2_18_1_tls.csv
stat_comp_benchmarks/benchmarks/garch/garch.stan,0.655950117111
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,9.04262123108
stat_comp_benchmarks/benchmarks/sir/sir.stan,161.493054867
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0135178565979
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.01960897446
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0919272422791
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.219263839722
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0315888404846
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.282732009888
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,20.0654606819
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.04950909615
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.34022479057
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.82001562119
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.834125375748
Comparing the two runs gives the ratio of runtime with TLS to runtime without TLS. For the model runs it ranges from 1.04 to 1.77:
~/work/performance-tests-cmdstan]$ ./comparePerformance.py performance_2_18_1_tls.csv performance_2_18_1.csv
('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.2)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.05)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.17)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.21)
('performance.compilation', 0.95)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.26)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.11)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.36)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.04)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.77)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.04)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.64)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.61)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.27)
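For readers who want to check the ratios by hand, here is a minimal sketch of the comparison. It is not the actual comparePerformance.py; it assumes the script divides each timing in the first file by the matching timing in the second and rounds to two decimals, which is consistent with the numbers above. The helper names `read_timings` and `slowdown` are hypothetical, and only three of the models are included.

```python
import csv
import io


def read_timings(text):
    """Parse 'model,timing' lines into a dict of model -> float."""
    return {row[0]: float(row[1]) for row in csv.reader(io.StringIO(text)) if row}


def slowdown(with_tls, without_tls):
    """Ratio of TLS timing to non-TLS timing, rounded to two decimals.

    Assumption: this mirrors what the comparison script reports.
    """
    return {
        model: round(with_tls[model] / without_tls[model], 2)
        for model in with_tls
        if model in without_tls
    }


# Timings copied from the two CSV files in this issue (three models only).
without_tls = read_timings(
    "stat_comp_benchmarks/benchmarks/garch/garch.stan,0.371638154984\n"
    "stat_comp_benchmarks/benchmarks/sir/sir.stan,119.077228165\n"
    "stat_comp_benchmarks/benchmarks/arK/arK.stan,1.43102579117\n"
)
with_tls = read_timings(
    "stat_comp_benchmarks/benchmarks/garch/garch.stan,0.655950117111\n"
    "stat_comp_benchmarks/benchmarks/sir/sir.stan,161.493054867\n"
    "stat_comp_benchmarks/benchmarks/arK/arK.stan,2.34022479057\n"
)

ratios = slowdown(with_tls, without_tls)
for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print((model, ratio))
```

A ratio above 1 means the STAN_THREADS build was slower for that model.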
Expected Output
It would be nice if the speed difference were smaller.
Current Version:
v2.18.1