
Faster TLS AD tape #1103

Closed
wds15 wants to merge 24 commits into develop from feature/faster-ad-tls

Conversation


@wds15 wds15 commented Jan 15, 2019

This PR speeds up access to the AD tape whenever thread-local storage (TLS) is used for it.

Summary

The main goal of this PR is to reduce the performance penalty incurred by turning on the threading capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler, which is not the case for the current threading implementation in develop. So the PR also fixes a bug which affects many users, given the popularity of Windows.

The performance benefits of this PR are documented below. These benchmarks demonstrate that we currently pay about a 20% performance penalty for turning on threading, while this PR reduces the penalty in many instances to nothing (at most 5% degradation).

The key change of this PR is how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared:

  • for the non-threading case as a static global instance.
  • for the threading case as a TLS which is declared as a static local instance inside a static function (Meyers singleton design).

This PR changes this to a global pointer to the AD tape, which is a TLS pointer in the threading case, and which is initialized with the constant expression nullptr. Initialization with a constant expression (as opposed to the dynamic initialization currently in develop) allows the compiler to access the TLS global pointer more efficiently. Furthermore, the thread_local keyword used to declare TLS can be swapped for the compiler-specific __thread keyword whenever the initializer is a constant expression. The __thread keyword has been around much longer than the C++11 thread_local keyword.
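The declaration change can be sketched as follows. This is a minimal stand-in, not the actual stan-math code: AutodiffStackStorage is reduced to a stub and init_ad_tape() is a hypothetical name for the initialization step.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the AD tape storage (illustrative only).
struct AutodiffStackStorage {
  std::size_t stack_size = 0;
};

#ifdef STAN_THREADS
// Constant initialization with nullptr: the compiler can use a cheap TLS
// access pattern, and the older __thread keyword would also be admissible.
thread_local AutodiffStackStorage* ad_tape = nullptr;
#else
// Without threading there is just one global pointer.
AutodiffStackStorage* ad_tape = nullptr;
#endif

// Must run once (per thread, with STAN_THREADS) before any AD use;
// dereferencing the uninitialized pointer would segfault.
void init_ad_tape() {
  if (ad_tape == nullptr)
    ad_tape = new AutodiffStackStorage();
}
```

Because the pointer starts as nullptr rather than being dynamically constructed, the compiler does not need to emit a lazy-initialization guard at every access site.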

Side-effects

The main side-effect is that the AD tape must be initialized before it can be used... otherwise the program will segfault!

The initialization is done with the new method init(). I propose to link a call to init() by default into any program which uses the stan-math makefiles. This implies that there is no change needed to any existing program like cmdstan.

However, any program not using the stan-math makefiles must initialize the AD tape before performing autodiff. Moreover, any program which intends to use AD in a child thread must call init() once within that thread before any AD operations take place.
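Under these rules, a program spawning its own threads would look roughly like the sketch below. The tape is again reduced to a stub thread-local pointer and the free-function form of init() is an assumption for illustration, not the exact stan-math signature.

```cpp
#include <cassert>
#include <thread>

// Minimal stand-in for the thread-local AD tape pointer (illustrative).
struct AutodiffStackStorage {};
thread_local AutodiffStackStorage* ad_tape = nullptr;

// Stand-in for the proposed init() method.
void init() {
  if (ad_tape == nullptr)
    ad_tape = new AutodiffStackStorage();
}

bool run_child_ad_work() {
  bool ok = false;
  std::thread child([&ok] {
    // The child thread's TLS pointer starts out as nullptr, so it must
    // call init() once before doing any AD operations.
    assert(ad_tape == nullptr);
    init();
    ok = (ad_tape != nullptr);  // now safe to autodiff in this thread
  });
  child.join();
  return ok;
}
```

Skipping the init() call in the child thread is exactly the segfault scenario described above.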

The stan-math library itself creates threads in the map_rect_concurrent function. In order to make this function safe to use, I suggest moving its definition into the rev branch. Doing so allows calling the init() method inside the execute_chunk functor, which is potentially executed in a new child thread. The downside is that there will be no prim definition of map_rect_concurrent for now; this is not ideal, but not meant to stay this way. I propose to later include a thread-pool library like the Intel TBB, which can be set up to manage AD tape initialization independently. Once we have a thread pool we should revert the change to map_rect_concurrent (take out the init() call and place the function definition back into prim where it belongs).

Tests

test/unit/math/rev/core/init_test.cpp is a new test written to cover both the threading and non-threading cases. It checks different things depending on whether threading is enabled:

  • STAN_THREADS NOT defined (no threading). In this case there is only a single global AD tape. All child threads created by a given program see the same AD tape, so initializing the AD tape on the main thread also initializes it for the child threads.

  • STAN_THREADS IS defined (threading case). Every new thread sees its own TLS AD tape pointer, which is not initialized at thread start and is independent of the main thread's AD tape when performing AD operations.

In addition, I have added code to the map_rect_concurrent tests which enforces that these tests always run with 4 threads regardless of the STAN_NUM_THREADS definition (which is often not set, so these tests would otherwise not run threaded).
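The threading case above can be expressed as a small "paranoia" check: the child thread must end up with its own tape instance, distinct from the main thread's. This is a toy model of the test's intent, not the actual init_test.cpp code; the tape is reduced to an int pointer.

```cpp
#include <cassert>
#include <thread>

thread_local int* tape_ptr = nullptr;  // stand-in for the TLS AD tape pointer

void init() {
  if (tape_ptr == nullptr)
    tape_ptr = new int(0);
}

// The child thread's tape must be a distinct instance from the main
// thread's: its TLS pointer starts as nullptr and init() gives it a
// fresh allocation.
bool child_tape_is_independent() {
  init();                       // main-thread tape
  int* main_tape = tape_ptr;
  int* child_tape = nullptr;
  std::thread t([&child_tape] {
    init();                     // child starts with nullptr, gets its own
    child_tape = tape_ptr;
  });
  t.join();
  return child_tape != nullptr && child_tape != main_tape;
}
```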

Summary of Performance Impact

Here are timings for the cmdstan performance benchmark suite with 5 replications on the stat_comp_benchmark set of examples, run on macOS Mojave with clang++ (Apple LLVM version 10.0.0, clang-1000.11.45.5) on a 2.9 GHz Intel Core i9 system. All runs use the Hoard memory allocator.

Current develop no TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.391575193405
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.12034959793
stat_comp_benchmarks/benchmarks/sir/sir.stan,94.8758684158
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.015190410614
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.69290657043
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0820732593536
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.174080991745
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0340455532074
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.282129526138
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.1371281624
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.67680997849
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.46782159805
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.24409885406
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.533155632019

develop, TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.652153968811
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,8.7458925724
stat_comp_benchmarks/benchmarks/sir/sir.stan,118.049730635
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0142640113831
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.4129802227
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.083340215683
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.20528254509
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0332769870758
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.284078598022
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.7065485477
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.07638401985
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.28887238503
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.81610770226
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.834881353378

This PR with new AD tape and no TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.398648643494
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.46248960495
stat_comp_benchmarks/benchmarks/sir/sir.stan,96.4491852283
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0159543991089
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.70668153763
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.080451965332
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.17337064743
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0327137470245
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.281925201416
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.3148930073
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.67591238022
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.49667320251
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.31186356544
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.549070644379

New AD tape of this PR with TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.4567237854
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.36154813766
stat_comp_benchmarks/benchmarks/sir/sir.stan,98.4717896461
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.014261007309
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.82799501419
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0829592227936
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.187496757507
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0368892669678
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.288218021393
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,19.0674182892
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.87766156197
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.78501296043
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.38194732666
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.605814552307

Relative Comparisons

develop TLS vs no TLS (these first looked weird to me, TLS seemed way too slow, so I rechecked the results from the run of January 16, 2019; UPDATE: they look OK now. Hoard was not loaded for the TLS case originally; Hoard is now used in all settings shown):

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.27)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.94)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.02)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.18)
('performance.compilation', 0.95)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.23)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.1)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.24)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.01)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.67)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.56)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.57)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.25)
average: 1.19328781019

new AD tape NO TLS vs develop NO TLS:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.05)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 0.98)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.0)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.0)
('performance.compilation', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.05)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.01)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.02)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.96)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.02)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.03)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.03)
average: 1.01087904094

New AD tape TLS vs new AD tape NO TLS:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.04)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.89)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.03)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.04)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.08)
('performance.compilation', 0.97)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 0.99)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.17)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.15)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.13)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.19)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.1)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.03)
average: 1.05392186842

In summary this suggests:

  • ~20% performance loss for develop TLS vs no TLS.
  • The new AD tape approach without TLS is almost as fast as the current develop implementation. The average difference is just 1%, which is easily within simulation noise.
  • There is almost no slowdown when switching on TLS with the new approach. Turning on TLS on the new AD tape shows a very minor speed decrease which essentially vanishes for expensive models; only very cheap-to-compute models show a noticeable difference. The average difference for this set is just 5%! If one were to weight by total execution time per model, the difference would probably be negligible.

Benchmarks without Hoard

Develop, NO TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.364757537842
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,6.98111424446
stat_comp_benchmarks/benchmarks/sir/sir.stan,92.9554965973
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0131373405457
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.69111056328
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0864352226257
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.167731571198
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.031219625473
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.275588512421
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.1579001904
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.64153895378
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.4336502552
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.22408385277
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.521640396118

Develop, TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.64807677269
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,8.83160867691
stat_comp_benchmarks/benchmarks/sir/sir.stan,121.385628176
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0121698379517
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.43887701035
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0869189739227
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.208812379837
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0304401397705
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.27582449913
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.965182209
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.03453731537
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.31388001442
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.82726144791
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.847529029846

New tape, NO TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.389703369141
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.10135984421
stat_comp_benchmarks/benchmarks/sir/sir.stan,96.8512422085
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0118281364441
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.71869359016
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0778662204742
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.166544389725
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0303031921387
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.271680641174
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.4858063698
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.91060729027
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.52340941429
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.31667103767
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.540799951553

New tape, TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.439257955551
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.52502818108
stat_comp_benchmarks/benchmarks/sir/sir.stan,105.122972965
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0125532150269
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.88401002884
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0802386283875
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.173125886917
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0306114673615
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.281007575989
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.3150982857
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,6.2066637516
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.72195749283
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.37005095482
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.606473207474

Relative comparisons without Hoard

develop TLS vs NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.28)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.93)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.01)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.24)
('performance.compilation', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.27)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.11)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.31)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.78)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.61)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.62)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.27)
average: 1.2204335761

new tape NO TLS vs develop NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.9)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 0.9)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.05)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 0.99)
('performance.compilation', 0.99)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.04)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 0.99)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.07)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.97)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.06)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.04)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.04)
average: 1.00439422031

new tape TLS vs new tape NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.06)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.06)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.03)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.05)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.04)
('performance.compilation', 1.03)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.06)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.05)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.09)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.03)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.13)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.01)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.13)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.12)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.02)
average: 1.06039034984

So without the Hoard memory allocator the results are comparable. This is on a vanilla macOS Mojave system as described above.

Tests

  • test/unit/math/rev/core/init_test.cpp

Side Effects

The current design only initializes the AD tape in the main process. Any child thread has to initialize the AD tape itself, once, before using AD.

Checklist

  • Math issue Thread performance penalty #1102

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@seantalts seantalts self-requested a review January 17, 2019 16:37
#ifndef STAN_THREADS
static inline AutodiffStackStorage *instantiate() {
static thread_local std::mutex init_mutex;
std::lock_guard<std::mutex> init_lock(init_mutex);
Member

Do we actually need a mutex here? Since this is a thread-local instance, by definition only one thread can access it at a time, right? :P

Member

Well, I guess it's not thread-local if we have no threads. But in that case as well seems fine with no mutex?

Member

(forgot to submit these comments)

Contributor Author

Yeah, I agree that this code does not need a mutex.
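Without the mutex, the lazy instantiation discussed above reduces to a plain check on the thread-local pointer. This is a sketch along the lines of the quoted diff, with AutodiffStackStorage reduced to a stub; no other thread can race on a thread_local, so no locking is needed.

```cpp
#include <cassert>

// Stub for the AD tape storage (illustrative only).
struct AutodiffStackStorage {};

// The pointer is thread_local, so only the owning thread ever touches it.
static thread_local AutodiffStackStorage* instance_ = nullptr;

// Lazy per-thread instantiation without a mutex.
static inline AutodiffStackStorage* instantiate() {
  if (instance_ == nullptr)
    instance_ = new AutodiffStackStorage();
  return instance_;
}
```

Repeated calls on the same thread return the same instance, which is the singleton property the original mutex was (unnecessarily) guarding.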

@@ -0,0 +1,15 @@
#ifndef STAN_MATH_REV_CORE_CHAINABLESTACK_INST_CPP
Member

Can we get rid of this file if we have a stan::math::init() function or something like that?

Contributor Author

No, we still need the file in place so that the respective code gets executed to initialize the AD stack of the main thread. The alternative would be to require that any client code calls init() before doing any AD. I think we want to keep the behavior such that there is no change to client code other than using our updated makefiles.

Member

Yeah, fair enough.

make/libraries Outdated

endif

STACK_INSTANTIATION_CPP := $(shell find $(MATH)stan -type f -name 'chainablestack_inst.cpp')
Member

why not just refer to the specific file here?

Contributor Author

OK. I just copied what we did for MPI, but we can just refer to the file here.

@seantalts
Member

Could you add paranoia tests for a couple of things:

  1. that the child threads do actually get their own instances of the AD stack
  2. that each time map_rect is called, child threads get a new empty instance

@wds15
Contributor Author

wds15 commented Jan 19, 2019

Sure, I'll add some tests.

@wds15
Contributor Author

wds15 commented Jan 19, 2019

In the context of map_rect the child thread does not necessarily get an empty AD tape. It is admissible that the child thread did some other work in the meantime, so the AD tape may be in any state. This is not a problem for map_rect since it executes a nested AD sweep, which is safe because it cleans off the tape whatever it put there.
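Why a non-empty tape is safe can be seen with a toy model of a nested sweep: the nested block records the tape position on entry and truncates back to it on exit, leaving any pre-existing entries untouched. This is illustrative only, not the stan-math nesting API.

```cpp
#include <cassert>
#include <vector>

// Toy AD tape: just a vector of recorded operations.
std::vector<int> tape;

// A nested sweep remembers the tape size on entry and truncates back to
// it on exit, so entries from earlier work on this thread survive intact.
void nested_sweep() {
  auto nested_start = tape.size();
  tape.push_back(42);           // work recorded during the nested sweep
  tape.push_back(43);
  tape.resize(nested_start);    // recover: clean off only what we added
}
```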

@wds15
Contributor Author

wds15 commented Jan 19, 2019

It looks like we want to move this forward... how do we manage the upstream repos best? If you agree to the changes as they are I would open issues and PRs for stan and cmdstan, but it's not clear how we merge that.

@wds15 wds15 changed the title from "WIP faster TLS AD tape" to "Faster TLS AD tape" Jan 19, 2019
@wds15
Contributor Author

wds15 commented Jan 19, 2019

Things are weird! I am getting segfaults with the new TLS approach in a Stan program which uses map_rect. The segfaults go away when I call init() in the prim variant of map_rect_reduce. The same error cannot be triggered with the unit tests.

Moreover, the unit tests for map_rect_concurrent should probably be changed so that they do fire off child threads. Right now they do not unless STAN_NUM_THREADS is set, which is why I haven't caught the trouble there yet (I forgot to put init() calls there, as you noted already).

@SteveBronder
Collaborator

SteveBronder commented Jan 19, 2019

@wds15 can you run the tests without Hoard? It looks cool, but I tried installing it and ran into some issues:

emeryberger/Hoard#50

Using optimized software is cool and good, but it's probably also a good idea to see the tests with the generic allocator as well, since that's what most users will probably be using.

EDIT: After messing with a bunch of these different mallocs, if this is a hassle then no worries

@wds15
Contributor Author

wds15 commented Jan 20, 2019

Ok, now I understand what's going on. The Stan program I tested used the cvodes integrator. This integrator always uses the AD stack, even if we only want the solution of the ODE system without any sensitivities, since the Jacobian is needed.

Thus, we need to move the double-only map_rect_reduce specialization into the rev tree and initialize the thread's AD tape upon entry into the function.

I am also going to sharpen the map_rect_concurrent tests and force them to fire off child threads (no matter what the status of STAN_NUM_THREADS is).

With those changes we are hopefully good to go. It would be very interesting to see if the Windows gcc 4.9.3 can deal with the new TLS approach.

@seantalts
Member

Thanks for putting in so much validation effort here - super important for us to feel good about this. Seems like GCC is having issues with the tests still :/

Also, it would be good to pull out the call that spawns threads into a single place that then initializes each thread appropriately as well. That way, as long as we use that abstraction for doing threaded stuff, we'll have the AD tape initialized properly (and anything else we may need can be added in a similar way eventually). In the current system, this probably means placing the call in the execute_chunk lambda in map_rect_concurrent. Any reason that wouldn't work or wouldn't be as good? Just trying to make it as hard as possible for us to forget to initialize. When we switch to a thread pool (with TBB?) we will just call init() once per thread at the creation of the thread pool.
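The suggestion of funneling all thread spawning through one helper can be sketched as below. Everything here is a hypothetical stand-in: run_chunks plays the role of the single spawning abstraction and its lambda the role of the execute_chunk functor, with the tape reduced to a thread-local flag.

```cpp
#include <cassert>
#include <thread>
#include <vector>

thread_local bool tape_ready = false;   // stand-in for the TLS AD tape
void init() { tape_ready = true; }      // stand-in for the init() method

// All thread spawning funnels through one helper whose chunk functor
// first initializes the per-thread AD tape, so callers cannot forget it.
void run_chunks(int num_threads, std::vector<int>& results) {
  results.assign(num_threads, 0);
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back([i, &results] {
      init();                           // per-thread setup happens here
      results[i] = tape_ready ? 1 : 0;  // then the actual chunk work
    });
  }
  for (auto& w : workers) w.join();
}
```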

@wds15
Contributor Author

wds15 commented Jan 21, 2019

Yack! The problem we have right now is with any map_rect* test living in prim, since I moved the map_rect_reduce specialization for prim into the rev part so that I can make the call to init(). Is that fine for you/us?

The problem with putting the init() inside the lambda is that we will have to move map_rect_concurrent to the rev branch of stan-math. So if that is fine for everyone, we should probably do that and move the init() into the functor.

(BTW, based on the design of referring to the AD tape through a pointer, I now have a much leaner version of the nesting code; at least the first version of this new approach passes all tests for rev/core.)

@seantalts
Member

The problem with putting the init inside the lambda is that we will have to move the map_rect_concurrent to the rev branch of stan-math.

Can we create a function that is something like thread_init() that has a prim header that does nothing and when rev gets included, does the correct AD tape initialization?

I think I do have now a version of nesting code up which is much leaner;

Where's that? I'm still skeptical we want to allow nested threaded computation, unless you've found some paradigm where that's just as performant as user-controlled parallelism?

@seantalts
Member

Can we create a function that is something like thread_init() that has a prim header that does nothing and when rev gets included, does the correct AD tape initialization?

The only ways I can think of to do this are kind of ugly. I would be okay with only having map_rect_concurrent available in code using rev. I don't think we have many (if any) users of just the prim version of the library, and I certainly don't think we have any relying on map_rect. @syclik what do you think about moving map_rect_concurrent to rev?

@wds15
Contributor Author

wds15 commented Jan 21, 2019

I have a local branch based on this PR which does nested AD in a much leaner way that doesn't require globally acting nesting... I can push it if you want to see it, but I guess we should get this PR in first.

Yeah, I think the cleanest way for the moment is to include in prim only a header which declares map_rect_concurrent but doesn't define it. The actual definition would be placed in rev.

We can revert the move of map_rect_concurrent to rev once we have Intel's TBB thread pool available.
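The declare-in-prim, define-in-rev split can be sketched in one translation unit like this. The function name is a hypothetical stand-in, and the two halves model what would live in a prim header versus a rev source file.

```cpp
#include <cassert>

// --- prim-style header (declaration only; sketch) -------------------
// A prim header would only declare the function, with no definition:
int map_rect_concurrent_sketch(int job);   // hypothetical stand-in

// --- rev-style translation unit (definition; sketch) ----------------
// The definition lives with rev, where it can initialize the tape first.
thread_local bool tape_ready = false;      // stand-in for the TLS AD tape
void init() { tape_ready = true; }

int map_rect_concurrent_sketch(int job) {
  init();            // safe: the rev definition initializes the tape
  return job * 2;    // stand-in for the actual mapped computation
}
```

Prim-only clients see the declaration but get a link error if they call it, which makes the "rev only for now" restriction explicit rather than a runtime segfault.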

@seantalts
Member

Sounds reasonable to me.

@syclik
Member

syclik commented Feb 22, 2019 via email

@seantalts
Member

I would trust the microbenchmarks and use the end-to-end as a sanity check (rather than the other way around).

This way lies madness! :P Take a gander at e.g. this CppCon video on how difficult it is to construct an informative microbenchmark, never mind one that captures all of the behavior you care about. It's a really good starter on this topic and has a pretty entertaining speaker, luckily: https://www.youtube.com/watch?v=nXaxk27zwlk It should probably be required viewing for all C++ devs, haha.

End-to-end makes sense from a logical perspective, too - of course you will only get information about what you measure, so measuring your entire system end-to-end is much more informative than a specific line of code. CPUs and compilers are extremely complicated these days and generate different code and execute it in a different way in different scenarios. End-to-end for a specific model is the only way to be authoritative and only on that model, but we hope to get a wide enough spread of models to be somewhat characteristic of the general trend, while ignoring stuff in the <3% or so range as noise.

As written clearly in the stat_comp_benchmark’s README.md, it wasn’t designed to be an end-to-end test.

Those are just the source of models + data. We all agreed they were the best thing we had towards an end-to-end test, and the performance-tests-cmdstan repo is clearly meant to serve in that capacity as best as we currently know how (and if we know better now, we should upgrade!). That repo also contains code to run dozens of example models, which do not always fit very well but execute additional math library functionality with slightly different usage patterns. Generally, I have seen the stat_comp_benchmark "known good" models test, on average, about the same as the shotgun tests with many more models.

Optically, raising the bar to such an extreme level on this particular PR appears somewhat capricious at the moment. Would you mind confirming that you aren't using Stan Math with threading at Generable?

At the end of the day, I think we need automated tooling for whatever additional bars you are adding to the code review process here, so if you want to build all of the above and require them for all PRs going forward, I'd be happy to try that out for a while and see how it goes.

@syclik
Copy link
Member

syclik commented Feb 22, 2019

@seantalts, thanks for chiming in.

This way lies madness! :P Take a gander at, e.g., this CppCon video on how difficult it is to construct an informative microbenchmark, never mind one that captures all of the behavior you care about. It's a really good starter on this topic and, luckily, has a pretty entertaining speaker: https://www.youtube.com/watch?v=nXaxk27zwlk It should probably be required viewing for all C++ devs, haha.

I'll definitely take a look.

But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?

End-to-end makes sense from a logical perspective, too - of course you will only get information about what you measure, so measuring your entire system end-to-end is much more informative than a specific line of code. CPUs and compilers are extremely complicated these days and generate different code and execute it in a different way in different scenarios. End-to-end for a specific model is the only way to be authoritative and only on that model, but we hope to get a wide enough spread of models to be somewhat characteristic of the general trend, while ignoring stuff in the <3% or so range as noise.

So... one thing that makes this particular implementation trickier is that there's an interaction with a ton of other things, down to linking, compiler flags, use of different keywords, pointers, compilers, etc. If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.

Those are just the source of models + data. We all agreed they were the best thing we had towards an end-to-end test, and the performance-tests-cmdstan repo is clearly meant to serve in that capacity as best as we currently know how (and if we know better now, we should upgrade!). That repo also contains code to run dozens of example models, which do not always fit very well but execute additional math library functionality with slightly different usage patterns. Generally, I have seen the stat_comp_benchmark "known good" models test, on average, about the same as the shotgun tests with many more models.

Cool. Good enough for me. Want to update the documentation?

At the end of the day, I think we need automated tooling for whatever additional bars you are adding to the code review process here, so if you want to build all of the above and require them for all PRs going forward, I'd be happy to try that out for a while and see how it goes.

I agree that we do need it. It'll be good to have it so we can apply it to all PRs, but really for the ones that touch the autodiff stack.

I don't want to hold up the PR to get all the way automated, if that's the last thing. If we determine that this design is the only sensible way to move forward, I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the autodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.)

I'm still hoping we don't have to do that, but I'm becoming more and more convinced by @wds15 that this is the right way to proceed.

Would you mind confirming that you aren't using Stan Math with threading at Generable?

We're threading, but at the moment, through map_rect().

@syclik
Copy link
Member

syclik commented Feb 22, 2019

This was all executed while being logged into the machine. So, yes, it is the same node for all tests.

Thanks!

@wds15
Copy link
Contributor Author

wds15 commented Feb 23, 2019

BTW, requiring the performance tests on Windows is a bit moot given that map_rect has never worked on Windows in a multi-threaded way... things do work, though, if you only use a single thread.

I just pushed onto this branch a small example which can serve as a microbenchmark. If you compile and run this code on develop with TLS / without TLS, I get on my Mac a runtime of 2700ms / 1700ms. On this branch with TLS the runtime is 2000ms, so the performance hit is drastically reduced (from a 60% slowdown to 17%). The example is the modified ODE stress test, this time using the RK45 integrator, which stresses the AD stack in an extreme way: it generates an enormous number of function calls where each call is cheap to compute, but every call triggers a build-up and tear-down of the nested AD tape.

Does this count as a microbenchmark?

@syclik
Copy link
Member

syclik commented Feb 25, 2019

@wds15: I think we should use a different process to benchmark. Using the coupled ODE system actually introduces a different complication: nested autodiff. I don't know how, but there might be different performance penalties.

We should just be able to write a large function, right?

@seantalts
Copy link
Member

seantalts commented Feb 26, 2019

But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?

It's not about the size of the effect, it's about making sure you're measuring what you think you're measuring. For example, in the talk he goes through a simple example (I think measuring whether calling reserve on a std::vector is helpful or not) and shows that if you're not careful, the compiler will easily optimize your code away. He looks at the assembly to see why he is getting unintuitive results.

If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.

I agree that if we were changing the design of the autodiff stack we should look at it, but we're just changing how we're storing the global reference to it. And also agreed that good benchmarks (either with Google Benchmark for relatively localized stuff or end-to-end for holistic changes) should be required every time we do something JUST for alleged performance reasons. I just also want to make that easier for contributors.

Sure, let's not hold up the PR for the benchmarks. I'm reasonably convinced that the performance on Mac and Linux is better and it enables threading for the first time on Windows, so I don't think we need to benchmark in such detail, especially when we still maintain that #ifdef STAN_THREADS logic around it.

I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the atuodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.)

That's a huge change. We will have to update literally every test, for example. Why do you think we need to require all clients call an init() function under all conditions? Just for a sort of consistency with a use-case we don't support? Let's bring it up at the meeting Thursday - I'd like to hear Bob's input on this.

PS Versioning in the mono-repo world will look very different. We should talk about it sometime; for example, we won't be able to tag a Math 3.0.0 with Stan 2.19 on the repo. What is the unit of a Math release now? Just a .tar.gz?

@syclik
Copy link
Member

syclik commented Feb 26, 2019 via email

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@wds15 and @seantalts: I'm becoming more and more unconvinced. (I can still be convinced... but I'll explain why.)

I just put something up using the Google Benchmark framework on @seantalts's perf-math repo on the faster-tls-tape branch: https://github.com/seantalts/perf-math/tree/faster-tls-tape. There are at least a few places I could have messed up: 1) maybe the benchmark is still wrong. 2) maybe I linked the tls stuff incorrectly. 3) maybe XCode clang++ isn't doing the right thing.

To run, I created a shell script. Follow the instructions to set up the benchmark (README.md), then run

sh compare-tls.sh

This is only designed to run on Mac with the clang++ compiler, but it's pretty easy to swap that out. I'm running with XCode clang++, which is also what @wds15 was reporting on. I won't claim that this holds for any other configuration.

What I see:

  • develop without using threading at all: 14020 ns
  • develop with -DSTAN_THREADS: 13875 ns
  • feature/faster-ad-tls with -DSTAN_THREADS: 14186 ns

If these results hold on other platforms / compilers, it's a wash, right?

The thing I'm using to test is the GARCH example. I just made the model using stan::math operations and just loaded the data once. That's one of the examples that had a high ratio in the original example.

Btw, I'm not convinced I did this completely right. I'm kinda tired right now, so I'm going to give it a rest, but could one of you verify that it looks right?

@wds15: in this version, I wasn't able to actually statically link the chainable_stack_inst.o. What am I doing wrong? Every time I try to link it, I get an error:

$ clang++ -std=c++11 -DSTAN_THREADS  -Imath/lib/eigen_3.3.3/ -Ibenchmark/include -std=c++1y -Imath/ -Imath/lib/boost_1.66.0 -O3 -Imath/lib/sundials_4.1.0/include   -Lbenchmark/build/src  math/stan/math/rev/core/chainablestack_inst.o  tls_speed_eval.cpp  -lbenchmark -o tls_speed_eval -DFEATURE_TLS
ld: illegal thread local variable reference to regular symbol __ZN4stan4math22AutodiffStackSingletonINS0_4variENS0_15chainable_allocEE9instance_E for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

[edited for clarity]

@chriselrod
Copy link

For fun, I ran those benchmarks. I tried 4 runs: clang vs gcc, and with or without -march=native.
Additionally, on Linux, I had to add -pthread.

Results:
clang (clang -march=native)

  • develop without using threading at all: 21076 ns (19793 ns)
  • develop with -DSTAN_THREADS: 21156 ns (19653 ns)
  • feature/faster-ad-tls with -DSTAN_THREADS: 21993 ns (23750 ns)

gcc (gcc -march=native)

  • develop without using threading at all: 14414 ns (14346 ns)
  • develop with -DSTAN_THREADS: 14544 ns (14423 ns)
  • feature/faster-ad-tls with -DSTAN_THREADS: 14872 ns (16962 ns)

The CPU was a Ryzen Threadripper 1950x. -march=native triggers a regression on this platform when using faster-ad-tls.
Otherwise, I get the best performance when not using threads at all. Which is a little disappointing on a 16 core/32 thread "Threadripper", but 14 microseconds isn't much time to amortize the overhead.

I'm on recent versions of both compilers:

$ clang++ --version
clang version 7.0.1 (tags/RELEASE_701/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ g++ --version
g++ (GCC) 8.2.1 20181127
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@chriselrod, thanks for confirming.

@wds15, I also included your "too much work" ode example into the benchmark and I'm really seeing no difference. But... I'm compiling it differently than we do on that branch. Notice, everything is stripped down to the minimal set. If this result is replicated across different compilers and platforms, then it would point to a problem in the way we're building and linking.

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@seantalts, I'll put together a Jenkins job to run on branches. It'll take me a while, though.

@seantalts
Copy link
Member

seantalts commented Feb 26, 2019 via email

@seantalts
Copy link
Member

Also, and this is really important, it's STAN_THREADS and not STAN_THREAD. I put in that and a few other fixes to the branch on perf-math. The performance differences are quite obvious now even on the simplest test; for me it's 18800, 33410, 23314 for that first benchmark w/o the ODE system. I also updated the clobbering and escaping to be more fastidious.

@syclik
Copy link
Member

syclik commented Feb 26, 2019 via email

@seantalts
Copy link
Member

I don't think we take this design change on the basis of always compiling with STAN_THREADS. We'll be taking a 10%+ penalty for running on one core. What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Yep, same page. I'm sad for the loss there. @wds15 is very creative; maybe he can come up with some way to get it even lower and get rid of -DSTAN_THREADS (did you end up playing with putting a pointer on the var class?).

What if we had end-to-end tests that showed that in real-life the difference between STAN_THREADS on and off was only 5%? Would that be worth it, do you think? These microbenchmarks are really hitting the exact pain point head on, though I'm still inclined to follow them towards keeping STAN_THREADS. Sad.

What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Does this mean a new microbenchmark with different threads performing gradients and testing with develop vs. the pointer approach? Makes sense to me.

@wds15
Copy link
Contributor Author

wds15 commented Feb 26, 2019

Uff... finally we seem to converge... I am very glad to see this (and I am going to ask for a beer on this one ;) ).

From my perspective we can still consider to go with STAN_THREADS only .... at least at some point. First off, I would not like to jump from where we are (with/without STAN_THREADS) towards always turning it on right away. I would do that over 1-2 releases at least.

However, we can still seriously consider turning on STAN_THREADS for all cases, given that the end-to-end benchmarks are not that sensitive to it. If you check out my last benchmarks above, you see that for almost all models the difference is tiny. Essentially only those models which use the AD stack in a really inefficient way will take a slight performance hit. An inefficient use of the AD stack is, for example, non-stiff ODE solving, since the computations done are extremely cheap, but we access the AD stack a huge number of times in rapid succession. Any model which does some sensible computation between AD stack accesses will not notice the additional cost we introduce, as it is relatively tiny.

So for real-world models I would argue that we are good with STAN_THREADS only and this will serve our user-base well and keep our maintenance burden down. It is a trade-off which I think is reasonable.

(I do recall that the performance hit got even smaller once you start using the Intel TBB malloc proxy....but I was never that precise with these measurements).

What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Does this mean a new microbenchmark with different threads performing gradients and testing with develop vs. the pointer approach? Makes sense to me.

I am not sure if I follow this one. What are threaded operations? À la map_rect with threads?

EDIT: @seantalts I don't have more tricks ATM other than using the TBB. Like I say, I think that overall end-to-end speed is affected in a negligible way from my perspective and from what I have seen so far.

@chriselrod
Copy link

Just to give updated Linux benchmarks, the three mean times...
...with clang (7.0.1):

  • 1, 1.085, 0.9041
  • 1, 1.142, 1.076
  • 1, 1.15, 1.09

...with gcc (8.2.1)

  • 1, 1.295, 1.035
  • 1, 1.513, 1.04
  • 1, 1.48, 1.034

FWIW, gcc/clang baselines: 0.7875, 0.9444, 0.9653

Which benchmarks? I don't think any of them are using map_rect, right?

Sorry, didn't double check that!

@wds15
Copy link
Contributor Author

wds15 commented Feb 26, 2019

@chriselrod Thanks! That now looks like how I would expect this to land.

@wds15
Copy link
Contributor Author

wds15 commented Feb 27, 2019

BTW, there is one figure I am missing: the speed of this PR without -DSTAN_THREADS.

@wds15 wds15 mentioned this pull request Mar 2, 2019
11 tasks
@wds15
Copy link
Contributor Author

wds15 commented Mar 2, 2019

I suggest closing this PR and moving to #1135, which is a cleaned-up version of this one.

@syclik
Copy link
Member

syclik commented Mar 2, 2019

Uff... finally we seem to converge... I am very glad to see this (and I am going to ask for a beer on this one ;) ).

@wds15, definitely! We're going to drink some whiskey next time we're together in person.

Thank you for putting together the PR and working through the concerns.

I'll close as suggested. For the other PR, I'll try to ask for the right benchmarks this time. I didn't know exactly what needed to be benchmarked when we started this one. As you can tell by now, this is a very big change to the autodiff stack and should not be taken lightly. Sometimes big changes are warranted.

@syclik syclik closed this Mar 2, 2019
@wds15
Copy link
Contributor Author

wds15 commented Mar 2, 2019

I know... this is a bummer. It's a big price we pay for threading all the time right now - and we need a change right at the "heart of Stan" which is the AD stack handling.

This was referenced Mar 21, 2019
@syclik syclik deleted the feature/faster-ad-tls branch September 15, 2022 14:10