Thread performance penalty #1102

@wds15

Description

Enabling threading in the stan-math library with the STAN_THREADS compiler macro leads to significant slowdowns. The cause is the thread_local storage (TLS) implementation used in the Meyers singleton that holds the AD tape.
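For readers unfamiliar with the pattern, here is a minimal sketch of a Meyers singleton whose storage class is switched by STAN_THREADS. It is an illustration only, not the stan-math source: `Tape`, `tape_instance`, and `record` are hypothetical names. Because every recorded operation goes through the accessor, any per-access cost of a function-local `thread_local` (compilers typically emit an initialization check or a TLS wrapper call for such variables) would be paid on each call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the AD tape; not the stan-math type.
struct Tape {
  std::vector<double> values;
};

// Meyers singleton: the instance is a function-local static that is
// constructed on first use. With STAN_THREADS defined, each thread gets
// its own instance through thread_local storage.
inline Tape& tape_instance() {
#ifdef STAN_THREADS
  static thread_local Tape tape;
#else
  static Tape tape;
#endif
  return tape;
}

// Every recorded value goes through the accessor, so the accessor sits
// on the hot path of any computation that writes to the tape.
inline void record(double x) { tape_instance().values.push_back(x); }

// Number of values currently on the calling thread's tape.
inline std::size_t tape_size() { return tape_instance().values.size(); }
```

Compiling the same translation unit with and without `-DSTAN_THREADS` and timing a tight loop over `record` is one way to isolate the accessor cost from the rest of a model.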

Example

Running the cmdstan performance benchmark suite (vanilla cmdstan 2.18.1) with 5 replications on the stat_comp_benchmarks set of examples gives the results below. The system is macOS Mojave with clang++ (Apple LLVM version 10.0.0, clang-1000.11.45.5) on a 2.9 GHz Intel Core i9. These runs are without TLS, i.e. STAN_THREADS is not defined:

~/work/performance-tests-cmdstan]$ cat performance_2_18_1.csv
stat_comp_benchmarks/benchmarks/garch/garch.stan,0.371638154984
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.15906739235
stat_comp_benchmarks/benchmarks/sir/sir.stan,119.077228165
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0128675460815
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.5181746006
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.078618812561
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.181181192398
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0303950309753
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.271839380264
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,18.0576345444
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.63970680237
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.43102579117
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.22463345528
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.51917424202

The same benchmark run with -DSTAN_THREADS defined gives these results:

[21:55:38][sebi@sebastians-macbook-pro-1:~/work/performance-tests-cmdstan]$ cat performance_2_18_1_tls.csv
stat_comp_benchmarks/benchmarks/garch/garch.stan,0.655950117111
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,9.04262123108
stat_comp_benchmarks/benchmarks/sir/sir.stan,161.493054867
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0135178565979
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.01960897446
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0919272422791
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.219263839722
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0315888404846
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.282732009888
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,20.0654606819
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.04950909615
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.34022479057
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.82001562119
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.834125375748

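With TLS enabled, model runtimes increase by 4% to 77% depending on the model. The ratios of TLS runtime to non-TLS runtime are (values above 1 mean the threaded build is slower):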

~/work/performance-tests-cmdstan]$ ./comparePerformance.py performance_2_18_1_tls.csv performance_2_18_1.csv
('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.2)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.05)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.17)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.21)
('performance.compilation', 0.95)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.26)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.11)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.36)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.04)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.77)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.04)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.64)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.61)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.27)
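The ratios above can be reproduced from the two CSV files by dividing each TLS runtime by the matching non-TLS runtime. The sketch below shows that arithmetic. It is not the `comparePerformance.py` script, whose source is not shown here; `parse_timings` and `slowdown_ratios` are hypothetical helpers, and the `name,runtime` layout is assumed from the output listed earlier.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parses "name,runtime" lines (layout assumed from the CSV output above)
// into a map from model name to runtime.
inline std::map<std::string, double> parse_timings(const std::string& csv) {
  std::map<std::string, double> timings;
  std::istringstream in(csv);
  std::string line;
  while (std::getline(in, line)) {
    const auto comma = line.rfind(',');
    if (comma == std::string::npos) continue;  // skip lines without a value
    timings[line.substr(0, comma)] = std::stod(line.substr(comma + 1));
  }
  return timings;
}

// Ratio of TLS runtime to baseline runtime for every model present in
// both inputs; a value above 1 means the TLS build is slower.
inline std::map<std::string, double> slowdown_ratios(
    const std::map<std::string, double>& tls,
    const std::map<std::string, double>& baseline) {
  std::map<std::string, double> ratios;
  for (const auto& entry : tls) {
    const auto it = baseline.find(entry.first);
    if (it != baseline.end() && it->second > 0.0) {
      ratios[entry.first] = entry.second / it->second;
    }
  }
  return ratios;
}
```

For garch.stan this gives 0.655950117111 / 0.371638154984, which rounds to the 1.77 reported above.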

Expected Output

It would be nice if the speed difference were smaller.

Current Version:

v2.18.1
