
Faster TLS AD tape #1103

Closed
wds15 wants to merge 24 commits into develop from feature/faster-ad-tls

Conversation


@wds15 wds15 commented Jan 15, 2019

This PR speeds up access to the AD tape whenever thread-local storage (TLS) is used for it.

Summary

The main goal of this PR is to reduce the performance penalty incurred by turning on the threading capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler, which is not the case for the current threading implementation in develop. So the PR also fixes a bug which affects many users, given the popularity of Windows.

The performance benefits of this PR are documented below. These benchmarks demonstrate that we currently pay about a 20% performance penalty for turning on threading, while this PR reduces the penalty in many instances to nothing (at most 5% degradation).

The key change of this PR is how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared:

  • for the non-threading case as a static global instance.
  • for the threading case as a TLS which is declared as a static local instance inside a static function (Meyers singleton design).

This PR changes this to a global pointer to the AD tape, which is a TLS pointer in the threading case, and which is initialized with the constant expression nullptr. Initialization with a constant expression (as opposed to the dynamic initialization currently in develop) allows the compiler to access the TLS global pointer more efficiently. Furthermore, the thread_local keyword used to declare TLS can be swapped for the compiler-specific __thread keyword whenever the initializer is a constant expression. The __thread keyword has been around much longer than the C++11 thread_local keyword.
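The declaration change can be sketched as follows. This is a minimal stand-in, not the actual stan-math code: AutodiffStackStorage is reduced to a stub and init_ad_tape() is a hypothetical name for the initialization step.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the AD tape storage (illustrative only).
struct AutodiffStackStorage {
  std::size_t stack_size = 0;
};

#ifdef STAN_THREADS
// Constant initialization with nullptr: the compiler can use a cheap TLS
// access pattern, and the older __thread keyword would also be admissible.
thread_local AutodiffStackStorage* ad_tape = nullptr;
#else
// Without threading there is just one global pointer.
AutodiffStackStorage* ad_tape = nullptr;
#endif

// Must run once (per thread, with STAN_THREADS) before any AD use;
// dereferencing the uninitialized pointer would segfault.
void init_ad_tape() {
  if (ad_tape == nullptr)
    ad_tape = new AutodiffStackStorage();
}
```

Because the pointer starts as nullptr rather than being dynamically constructed, the compiler does not need to emit a lazy-initialization guard at every access site.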

Side-effects

The main side-effect is that the AD tape must be initialized before it can be used... otherwise the program will segfault!

The initialization is done with the new method init(). I propose to link a call to init() by default into any program which uses the stan-math makefiles. This implies that there is no change needed to any existing program like cmdstan.

However, any program not using the stan-math makefiles must initialize the AD tape before performing autodiff. Moreover, any program which intends to use AD in a child thread must call init() once within that thread before any AD operations take place.
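Under these rules, a program spawning its own threads would look roughly like the sketch below. The tape is again reduced to a stub thread-local pointer and the free-function form of init() is an assumption for illustration, not the exact stan-math signature.

```cpp
#include <cassert>
#include <thread>

// Minimal stand-in for the thread-local AD tape pointer (illustrative).
struct AutodiffStackStorage {};
thread_local AutodiffStackStorage* ad_tape = nullptr;

// Stand-in for the proposed init() method.
void init() {
  if (ad_tape == nullptr)
    ad_tape = new AutodiffStackStorage();
}

bool run_child_ad_work() {
  bool ok = false;
  std::thread child([&ok] {
    // The child thread's TLS pointer starts out as nullptr, so it must
    // call init() once before doing any AD operations.
    assert(ad_tape == nullptr);
    init();
    ok = (ad_tape != nullptr);  // now safe to autodiff in this thread
  });
  child.join();
  return ok;
}
```

Skipping the init() call in the child thread is exactly the segfault scenario described above.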

The stan-math library itself creates threads in the map_rect_concurrent function. In order to make this function safe to use, I suggest moving its definition into the rev branch. Doing so allows calling the init() method inside the execute_chunk functor, which is potentially executed in a new child thread. The downside is that there will be no prim definition of map_rect_concurrent for now; this is not ideal, but not meant to stay this way. I propose to later include a thread-pool library like the Intel TBB, which can be set up to manage AD tape initialization independently. Once we have a thread pool we should revert the change to map_rect_concurrent (take out the init() call and place the function definition back into prim where it belongs).

Tests

test/unit/math/rev/core/init_test.cpp is a new test written to cover both the threading and non-threading cases. It checks different things depending on whether threading is enabled:

  • STAN_THREADS NOT defined (no threading). In this case there is only a single global AD tape. All child threads created by a given program see the same AD tape, so initializing the AD tape on the main thread also initializes it for the child threads.

  • STAN_THREADS IS defined (threading case). Every new thread sees its own TLS AD tape pointer, which is not initialized at thread start and is independent of the main thread's AD tape when performing AD operations.

In addition, I have added code to the map_rect_concurrent tests which enforces that these tests always run with 4 threads regardless of the STAN_NUM_THREADS definition (which is often not set, so these tests would otherwise not run threaded).
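The threading case above can be expressed as a small "paranoia" check: the child thread must end up with its own tape instance, distinct from the main thread's. This is a toy model of the test's intent, not the actual init_test.cpp code; the tape is reduced to an int pointer.

```cpp
#include <cassert>
#include <thread>

thread_local int* tape_ptr = nullptr;  // stand-in for the TLS AD tape pointer

void init() {
  if (tape_ptr == nullptr)
    tape_ptr = new int(0);
}

// The child thread's tape must be a distinct instance from the main
// thread's: its TLS pointer starts as nullptr and init() gives it a
// fresh allocation.
bool child_tape_is_independent() {
  init();                       // main-thread tape
  int* main_tape = tape_ptr;
  int* child_tape = nullptr;
  std::thread t([&child_tape] {
    init();                     // child starts with nullptr, gets its own
    child_tape = tape_ptr;
  });
  t.join();
  return child_tape != nullptr && child_tape != main_tape;
}
```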

Summary of Performance Impact

Here are timings for the cmdstan performance benchmark suite with 5 replications on the stat_comp_benchmark set of examples, run on macOS Mojave with clang++ (Apple LLVM version 10.0.0, clang-1000.11.45.5) on a 2.9 GHz Intel Core i9 system. All runs use the Hoard memory allocator.

Current develop no TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.391575193405
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.12034959793
stat_comp_benchmarks/benchmarks/sir/sir.stan,94.8758684158
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.015190410614
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.69290657043
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0820732593536
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.174080991745
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0340455532074
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.282129526138
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.1371281624
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.67680997849
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.46782159805
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.24409885406
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.533155632019

develop, TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.652153968811
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,8.7458925724
stat_comp_benchmarks/benchmarks/sir/sir.stan,118.049730635
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0142640113831
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.4129802227
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.083340215683
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.20528254509
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0332769870758
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.284078598022
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.7065485477
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.07638401985
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.28887238503
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.81610770226
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.834881353378

This PR with new AD tape and no TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.398648643494
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.46248960495
stat_comp_benchmarks/benchmarks/sir/sir.stan,96.4491852283
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0159543991089
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.70668153763
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.080451965332
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.17337064743
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0327137470245
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.281925201416
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.3148930073
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.67591238022
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.49667320251
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.31186356544
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.549070644379

New AD tape of this PR with TLS:

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.4567237854
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.36154813766
stat_comp_benchmarks/benchmarks/sir/sir.stan,98.4717896461
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.014261007309
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.82799501419
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0829592227936
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.187496757507
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0368892669678
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.288218021393
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,19.0674182892
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.87766156197
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.78501296043
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.38194732666
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.605814552307

Relative Comparisons

develop TLS vs no TLS (these first looked weird to me, TLS seemed way too slow, so I rechecked the results from the run of January 16, 2019; UPDATE: they look OK now. Hoard was not loaded for the TLS case originally; Hoard is now used in all settings shown):

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.27)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.94)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.02)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.18)
('performance.compilation', 0.95)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.23)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.1)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.24)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.01)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.67)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.56)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.57)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.25)
average: 1.19328781019

new AD tape NO TLS vs develop NO TLS:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.05)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 0.98)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.0)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.0)
('performance.compilation', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.05)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.01)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.02)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.96)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.02)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.03)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.03)
average: 1.01087904094

New AD tape TLS vs new AD tape NO TLS:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.04)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.89)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.03)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.04)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.08)
('performance.compilation', 0.97)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 0.99)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.17)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.15)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.13)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.19)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.1)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.03)
average: 1.05392186842

In summary this suggests:

  • ~20% performance loss for develop TLS vs no TLS.
  • The new AD tape approach without TLS is almost as fast as the current develop implementation. The average difference is just 1%, which is easily within simulation noise.
  • There is almost no slowdown when switching on TLS with the new approach. Turning on TLS on the new AD tape shows a very minor speed decrease which essentially vanishes for expensive models; only very cheap-to-compute models show a noticeable difference. The average difference for this set is just 5%! If one were to weight by total execution time per model, the difference would probably be negligible.

Benchmarks without Hoard

Develop, NO TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.364757537842
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,6.98111424446
stat_comp_benchmarks/benchmarks/sir/sir.stan,92.9554965973
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0131373405457
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.69111056328
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0864352226257
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.167731571198
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.031219625473
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.275588512421
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.1579001904
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.64153895378
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.4336502552
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.22408385277
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.521640396118

Develop, TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.64807677269
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,8.83160867691
stat_comp_benchmarks/benchmarks/sir/sir.stan,121.385628176
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0121698379517
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,3.43887701035
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0869189739227
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.208812379837
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0304401397705
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.27582449913
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.965182209
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,7.03453731537
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.31388001442
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.82726144791
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.847529029846

New tape, NO TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.389703369141
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.10135984421
stat_comp_benchmarks/benchmarks/sir/sir.stan,96.8512422085
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0118281364441
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.71869359016
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0778662204742
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.166544389725
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0303031921387
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.271680641174
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.4858063698
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.91060729027
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.52340941429
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.31667103767
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.540799951553

New tape, TLS

stat_comp_benchmarks/benchmarks/garch/garch.stan,0.439257955551
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.52502818108
stat_comp_benchmarks/benchmarks/sir/sir.stan,105.122972965
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0125532150269
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.88401002884
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0802386283875
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.173125886917
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0306114673615
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.281007575989
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,17.3150982857
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,6.2066637516
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.72195749283
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.37005095482
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.606473207474

Relative comparisons without Hoard

develop TLS vs NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.28)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.93)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.01)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.25)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.24)
('performance.compilation', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.27)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.11)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.31)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.78)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.61)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.62)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.27)
average: 1.2204335761

new tape NO TLS vs develop NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.01)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.9)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 0.9)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.05)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 0.99)
('performance.compilation', 0.99)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.02)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.04)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 0.99)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.07)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.97)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.06)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.04)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.04)
average: 1.00439422031

new tape TLS vs new tape NO TLS

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.06)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 1.06)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.03)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.05)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.04)
('performance.compilation', 1.03)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.06)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.05)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.09)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.03)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.13)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 1.01)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.13)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.12)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.02)
average: 1.06039034984

So without the Hoard memory allocator the results are comparable. This is on a vanilla macOS Mojave system as described above.

Tests

  • test/unit/math/rev/core/init_test.cpp

Side Effects

The current design only initializes the AD tape in the main process. Any child thread has to initialize the AD tape itself, once, before using AD.

Checklist

  • Math issue Thread performance penalty #1102

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@seantalts seantalts self-requested a review January 17, 2019 16:37
#ifndef STAN_THREADS
static inline AutodiffStackStorage *instantiate() {
static thread_local std::mutex init_mutex;
std::lock_guard<std::mutex> init_lock(init_mutex);
Member

Do we actually need a mutex here? Since this is a thread-local instance, by definition only one thread can access it at a time, right? :P

Member

Well, I guess it's not thread-local if we have no threads. But in that case as well seems fine with no mutex?

Member

(forgot to submit these comments)

Contributor Author

Yeah, I agree that this code does not need a mutex.
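Without the mutex, the lazy instantiation discussed above reduces to a plain check on the thread-local pointer. This is a sketch along the lines of the quoted diff, with AutodiffStackStorage reduced to a stub; no other thread can race on a thread_local, so no locking is needed.

```cpp
#include <cassert>

// Stub for the AD tape storage (illustrative only).
struct AutodiffStackStorage {};

// The pointer is thread_local, so only the owning thread ever touches it.
static thread_local AutodiffStackStorage* instance_ = nullptr;

// Lazy per-thread instantiation without a mutex.
static inline AutodiffStackStorage* instantiate() {
  if (instance_ == nullptr)
    instance_ = new AutodiffStackStorage();
  return instance_;
}
```

Repeated calls on the same thread return the same instance, which is the singleton property the original mutex was (unnecessarily) guarding.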

@@ -0,0 +1,15 @@
#ifndef STAN_MATH_REV_CORE_CHAINABLESTACK_INST_CPP
Member

Can we get rid of this file if we have a stan::math::init() function or something like that?

Contributor Author

No, we still need the file in place so that the respective code gets executed to initialize the AD stack of the main thread. The alternative would be to require that any client code calls init() before doing any AD. I think we want to keep the behavior such that there is no change to client code other than using our updated makefiles.

Member

Yeah, fair enough.

make/libraries Outdated

endif

STACK_INSTANTIATION_CPP := $(shell find $(MATH)stan -type f -name 'chainablestack_inst.cpp')
Member

why not just refer to the specific file here?

Contributor Author

OK. I just copied what we did for MPI, but we can just refer to the file here.

@seantalts
Member

Could you add paranoia tests for a couple of things:

  1. that the child threads do actually get their own instances of the AD stack
  2. that each time map_rect is called, child threads get a new empty instance

@wds15
Contributor Author

wds15 commented Jan 19, 2019

Sure, I'll add some tests.

@wds15
Contributor Author

wds15 commented Jan 19, 2019

In the context of map_rect the child thread does not necessarily get an empty AD tape. It is admissible that the child thread did some other work in the meantime, so the AD tape may be in any state. This is not a problem for map_rect since it executes a nested AD sweep, which is safe because it cleans off the tape whatever it put there.
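Why a non-empty tape is safe can be seen with a toy model of a nested sweep: the nested block records the tape position on entry and truncates back to it on exit, leaving any pre-existing entries untouched. This is illustrative only, not the stan-math nesting API.

```cpp
#include <cassert>
#include <vector>

// Toy AD tape: just a vector of recorded operations.
std::vector<int> tape;

// A nested sweep remembers the tape size on entry and truncates back to
// it on exit, so entries from earlier work on this thread survive intact.
void nested_sweep() {
  auto nested_start = tape.size();
  tape.push_back(42);           // work recorded during the nested sweep
  tape.push_back(43);
  tape.resize(nested_start);    // recover: clean off only what we added
}
```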

@wds15
Contributor Author

wds15 commented Jan 19, 2019

It looks like we want to move this forward... how do we manage the upstream repos best? If you agree to the changes as they are I would open issues and PRs for stan and cmdstan, but it's not clear how we merge that.

@wds15 wds15 changed the title from "WIP faster TLS AD tape" to "Faster TLS AD tape" Jan 19, 2019
@wds15
Contributor Author

wds15 commented Jan 19, 2019

Things are weird! I am getting segfaults with the new TLS approach in a Stan program which uses map_rect. The segfaults go away when I call init() in the prim variant of map_rect_reduce. The same error cannot be triggered with the unit tests.

Moreover, the unit tests for map_rect_concurrent should probably be changed so that they do fire off child threads. Right now they do not unless STAN_NUM_THREADS is set, which is why I haven't caught the trouble there yet (I forgot to put init() calls there, as you noted already).

@SteveBronder
Collaborator

SteveBronder commented Jan 19, 2019

@wds15 can you run the tests without Hoard? It looks cool, but I tried installing it and ran into some issues:

emeryberger/Hoard#50

Using optimized software is cool and good, but it's probably also a good idea to see the tests with the generic allocator as well, since that's what most users will probably be using.

EDIT: After messing with a bunch of these different mallocs, if this is a hassle then no worries

@wds15
Contributor Author

wds15 commented Jan 20, 2019

Ok, now I understand what's going on. The Stan program I tested used the cvodes integrator. This integrator always uses the AD stack, even if we only want the solution of the ODE system without any sensitivities, since the Jacobian is needed.

Thus, we need to move the double-only map_rect_reduce specialization into the rev tree and initialize the thread's AD tape upon entry into the function.

I am also going to sharpen the map_rect_concurrent tests and force them to fire off child threads (no matter what the status of STAN_NUM_THREADS is).

With those changes we are hopefully good to go. It would be very interesting to see if the Windows gcc 4.9.3 can deal with the new TLS approach.

@seantalts
Member

Thanks for putting in so much validation effort here - super important for us to feel good about this. Seems like GCC is having issues with the tests still :/

Also, it would be good to pull out the call that spawns threads into a single place that then initializes each thread appropriately as well. That way, as long as we use that abstraction for doing threaded stuff, we'll have the AD tape initialized properly (and anything else we may need can be added in a similar way eventually). In the current system, this probably means placing the call in the execute_chunk lambda in map_rect_concurrent. Any reason that wouldn't work or wouldn't be as good? Just trying to make it as hard as possible for us to forget to initialize. When we switch to a thread pool (with TBB?) we will just call init() once per thread at the creation of the thread pool.
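The suggestion of funneling all thread spawning through one helper can be sketched as below. Everything here is a hypothetical stand-in: run_chunks plays the role of the single spawning abstraction and its lambda the role of the execute_chunk functor, with the tape reduced to a thread-local flag.

```cpp
#include <cassert>
#include <thread>
#include <vector>

thread_local bool tape_ready = false;   // stand-in for the TLS AD tape
void init() { tape_ready = true; }      // stand-in for the init() method

// All thread spawning funnels through one helper whose chunk functor
// first initializes the per-thread AD tape, so callers cannot forget it.
void run_chunks(int num_threads, std::vector<int>& results) {
  results.assign(num_threads, 0);
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back([i, &results] {
      init();                           // per-thread setup happens here
      results[i] = tape_ready ? 1 : 0;  // then the actual chunk work
    });
  }
  for (auto& w : workers) w.join();
}
```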

@wds15
Contributor Author

wds15 commented Jan 21, 2019

Yack! The problem we have right now is with any map_rect* test living in prim, since I moved the map_rect_reduce specialization for prim into the rev part so that I can make the call to init(). Is that fine for you/us?

The problem with putting the init() inside the lambda is that we will have to move map_rect_concurrent to the rev branch of stan-math. So if that is fine for everyone, we should probably do that and move the init() into the functor.

(BTW, based on the design of referring to the AD tape through a pointer, I now have a much leaner version of the nesting code; at least the first version of this new approach passes all tests for rev/core.)

@seantalts
Member

The problem with putting the init inside the lambda is that we will have to move the map_rect_concurrent to the rev branch of stan-math.

Can we create a function that is something like thread_init() that has a prim header that does nothing and when rev gets included, does the correct AD tape initialization?

I think I do have now a version of nesting code up which is much leaner;

Where's that? I'm still skeptical we want to allow nested threaded computation, unless you've found some paradigm where that's just as performant as user-controlled parallelism?

@seantalts
Member

Can we create a function that is something like thread_init() that has a prim header that does nothing and when rev gets included, does the correct AD tape initialization?

The only ways I can think of to do this are kind of ugly. I would be okay with only having map_rect_concurrent available in code using rev. I don't think we have many (if any) users of just the prim version of the library, and I certainly don't think we have any relying on map_rect. @syclik what do you think about moving map_rect_concurrent to rev?

@wds15
Contributor Author

wds15 commented Jan 21, 2019

I have a local branch based on this PR which does nested AD in a much leaner way that doesn't require globally acting nesting... I can push it if you want to see it, but I guess we should get this PR in first.

Yeah, I think the cleanest way for the moment is to include in prim only a header which declares map_rect_concurrent but doesn't define it. The actual definition would be placed in rev.

We can revert the move of map_rect_concurrent to rev once we have Intel's TBB thread pool available.
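The declare-in-prim, define-in-rev split can be sketched in one translation unit like this. The function name is a hypothetical stand-in, and the two halves model what would live in a prim header versus a rev source file.

```cpp
#include <cassert>

// --- prim-style header (declaration only; sketch) -------------------
// A prim header would only declare the function, with no definition:
int map_rect_concurrent_sketch(int job);   // hypothetical stand-in

// --- rev-style translation unit (definition; sketch) ----------------
// The definition lives with rev, where it can initialize the tape first.
thread_local bool tape_ready = false;      // stand-in for the TLS AD tape
void init() { tape_ready = true; }

int map_rect_concurrent_sketch(int job) {
  init();            // safe: the rev definition initializes the tape
  return job * 2;    // stand-in for the actual mapped computation
}
```

Prim-only clients see the declaration but get a link error if they call it, which makes the "rev only for now" restriction explicit rather than a runtime segfault.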

@seantalts
Member

Sounds reasonable to me.

@syclik
Member

syclik commented Feb 22, 2019 via email

@seantalts
Member

I would trust the microbenchmarks and use the end-to-end as a sanity check (rather than the other way around).

This way lies madness! :P Take a gander at e.g. this CppCon video on how difficult it is to construct an informative microbenchmark, never mind one that captures all of the behavior you care about. It's a really good starter on this topic and has a pretty entertaining speaker, luckily: https://www.youtube.com/watch?v=nXaxk27zwlk It should probably be required viewing for all C++ devs, haha.

End-to-end makes sense from a logical perspective, too - of course you will only get information about what you measure, so measuring your entire system end-to-end is much more informative than a specific line of code. CPUs and compilers are extremely complicated these days and generate different code and execute it in a different way in different scenarios. End-to-end for a specific model is the only way to be authoritative and only on that model, but we hope to get a wide enough spread of models to be somewhat characteristic of the general trend, while ignoring stuff in the <3% or so range as noise.

As written clearly in the stat_comp_benchmark’s README.md, it wasn’t designed to be an end-to-end test.

Those are just the source of models + data. We all agreed they were the best thing we had towards an end-to-end test, and the performance-tests-cmdstan repo is clearly meant to serve in that capacity as best as we currently know how (and if we know better now, we should upgrade!). That repo also contains code to run dozens of example models, which do not always fit very well but execute additional math library functionality with slightly different usage patterns. Generally, I have seen the stat_comp_benchmark "known good" models test, on average, about the same as the shotgun tests with many more models.

Optically, raising the bar to such an extreme level on this particular PR appears somewhat capricious at the moment. Would you mind confirming that you aren't using Stan Math with threading at Generable?

At the end of the day, I think we need automated tooling for whatever additional bars you are adding to the code review process here, so if you want to build all of the above and require them for all PRs going forward, I'd be happy to try that out for a while and see how it goes.

@syclik
Copy link
Member

syclik commented Feb 22, 2019

@seantalts, thanks for chiming in.

This way lies madness! :P Take a gander at, e.g., this CppCon video on how difficult it is to construct an informative microbenchmark, never mind one that captures all of the behavior you care about. It's a really good starter on this topic and, luckily, has a pretty entertaining speaker: https://www.youtube.com/watch?v=nXaxk27zwlk It should probably be required viewing for all C++ devs, haha.

I'll definitely take a look.

But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?

End-to-end makes sense from a logical perspective, too - of course you will only get information about what you measure, so measuring your entire system end-to-end is much more informative than a specific line of code. CPUs and compilers are extremely complicated these days and generate different code and execute it in a different way in different scenarios. End-to-end for a specific model is the only way to be authoritative and only on that model, but we hope to get a wide enough spread of models to be somewhat characteristic of the general trend, while ignoring stuff in the <3% or so range as noise.

So... one thing that makes this particular implementation trickier is that there's an interaction with a ton of other things, down to linking, compiler flags, use of different keywords, pointers, compilers, etc. If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.

Those are just the source of models + data. We all agreed they were the best thing we had towards an end-to-end test, and the performance-tests-cmdstan repo is clearly meant to serve in that capacity as best as we currently know how (and if we know better now, we should upgrade!). That repo also contains code to run dozens of example models, which do not always fit very well but execute additional math library functionality with slightly different usage patterns. Generally, I have seen the stat_comp_benchmark "known good" models test, on average, about the same as the shotgun tests with many more models.

Cool. Good enough for me. Want to update the documentation?

At the end of the day, I think we need automated tooling for whatever additional bars you are adding to the code review process here, so if you want to build all of the above and require them for all PRs going forward, I'd be happy to try that out for a while and see how it goes.

I agree that we do need it. It'll be good to have it so we can apply it to all PRs, but really for the ones that touch the autodiff stack.

I don't want to hold up the PR to get all the way automated, if that's the last thing. If we determine that this design is the only sensible way to move forward, I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the autodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.)

I'm still hoping we don't have to do that, but I'm becoming more and more convinced by @wds15 that this is the right way to proceed.

Would you mind confirming that you aren't using Stan Math with threading at Generable?

We're threading, but at the moment, through map_rect().

@syclik
Copy link
Member

syclik commented Feb 22, 2019

This was all executed while being logged into the machine. So, yes, it is the same node for all tests.

Thanks!

@wds15
Copy link
Contributor Author

wds15 commented Feb 23, 2019

BTW, requiring the performance tests on Windows is a bit moot given that map_rect has never worked on Windows in a multi-threaded way... things do work, though, if you only use a single thread.

I just pushed onto this branch a small example which can serve as a microbenchmark. If you compile and run this code on develop with TLS / without TLS, I get on my Mac a runtime of 2700ms / 1700ms. On this branch with TLS the runtime is 2000ms, so the performance hit is drastically reduced (from a 60% slowdown to 17%). The example is the modified ODE stress test, this time using the RK45 integrator, which stresses the AD stack in an extreme way: it generates an enormous number of function calls where each call is cheap to compute, but every call triggers a build-up and tear-down of the nested AD tape.

Does this count as a microbenchmark?

@syclik
Copy link
Member

syclik commented Feb 25, 2019

@wds15: I think we should use a different process to benchmark. Using the coupled ODE system actually introduces a different complication: nested autodiff. I don't know how, but there might be different performance penalties.

We should just be able to write a large function, right?

@seantalts
Copy link
Member

seantalts commented Feb 26, 2019

But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?

It's not about the size of the effect, it's about making sure you're measuring what you think you're measuring. For example, in the talk he goes through a simple example (I think measuring whether calling reserve on a std::vector is helpful or not) and shows that if you're not careful, the compiler will easily optimize your code away. He looks at the assembly to see why he is getting unintuitive results.

If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.

I agree that if we were changing the design of the autodiff stack we should look at it, but we're just changing how we're storing the global reference to it. And also agreed that good benchmarks (either with Google Benchmark for relatively localized stuff or end-to-end for holistic changes) should be required every time we do something JUST for alleged performance reasons. I just also want to make that easier for contributors.

Sure, let's not hold up the PR for the benchmarks. I'm reasonably convinced that the performance on Mac and Linux is better and it enables threading for the first time on Windows, so I don't think we need to benchmark in such detail, especially when we still maintain that #ifdef STAN_THREADS logic around it.

I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the atuodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.)

That's a huge change. We will have to update literally every test, for example. Why do you think we need to require all clients call an init() function under all conditions? Just for a sort of consistency with a use-case we don't support? Let's bring it up at the meeting Thursday - I'd like to hear Bob's input on this.

PS Versioning in the mono-repo world will look very different. We should talk about it sometime; for example, we won't be able to tag a Math 3.0.0 with Stan 2.19 on the repo. What is the unit of a Math release now? Just a .tar.gz?

@syclik
Copy link
Member

syclik commented Feb 26, 2019 via email

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@wds15 and @seantalts: I'm becoming more and more unconvinced. (I can still be convinced... but I'll explain why.)

I just put something up using the Google Benchmark framework on @seantalts's perf-math repo on the faster-tls-tape branch: https://github.com/seantalts/perf-math/tree/faster-tls-tape. There are at least a few places I could have messed up: 1) maybe the benchmark is still wrong. 2) maybe I linked the tls stuff incorrectly. 3) maybe XCode clang++ isn't doing the right thing.

To run, I created a shell script. Follow the instructions to set up the benchmark (README.md), then run

sh compare-tls.sh

This is only designed to run on Mac with the clang++ compiler, but it's pretty easy to swap that out. I'm running with XCode clang++, which is also what @wds15 was reporting on. I won't claim that this holds for any other configuration.

What I see:

  • develop without using threading at all: 14020 ns
  • develop with -DSTAN_THREADS: 13875 ns
  • feature/faster-ad-tls with -DSTAN_THREADS: 14186 ns

If these results hold on other platforms / compilers, it's a wash, right?

The thing I'm using to test is the GARCH example. I just made the model using stan::math operations and just loaded the data once. That's one of the examples that had a high ratio in the original example.

Btw, I'm not convinced I did this completely right. I'm kinda tired right now, so I'm going to give it a rest, but could one of you verify that it looks right?

@wds15: in this version, I wasn't able to actually statically link the chainable_stack_inst.o. What am I doing wrong? Every time I try to link it, I get an error:

$ clang++ -std=c++11 -DSTAN_THREADS  -Imath/lib/eigen_3.3.3/ -Ibenchmark/include -std=c++1y -Imath/ -Imath/lib/boost_1.66.0 -O3 -Imath/lib/sundials_4.1.0/include   -Lbenchmark/build/src  math/stan/math/rev/core/chainablestack_inst.o  tls_speed_eval.cpp  -lbenchmark -o tls_speed_eval -DFEATURE_TLS
ld: illegal thread local variable reference to regular symbol __ZN4stan4math22AutodiffStackSingletonINS0_4variENS0_15chainable_allocEE9instance_E for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

[edited for clarity]

@chriselrod
Copy link

For fun, I ran those benchmarks. I tried 4 runs: clang vs gcc, and with or without -march=native.
Additionally, on Linux, I had to add -pthread.

Results:
clang (clang -march=native)

  • develop without using threading at all: 21076 ns (19793 ns)
  • develop with -DSTAN_THREADS: 21156 ns (19653 ns)
  • feature/faster-ad-tls with -DSTAN_THREADS: 21993 ns (23750 ns)

gcc (gcc -march=native)

  • develop without using threading at all: 14414 ns (14346 ns)
  • develop with -DSTAN_THREADS: 14544 ns (14423 ns)
  • feature/faster-ad-tls with -DSTAN_THREADS: 14872 ns (16962 ns)

The CPU was a Ryzen Threadripper 1950x. -march=native triggers a regression on this platform when using faster-ad-tls.
Otherwise, I get the best performance when not using threads at all. Which is a little disappointing on a 16 core/32 thread "Threadripper", but 14 microseconds isn't much time to amortize the overhead.

I'm on recent versions of both compilers:

$ clang++ --version
clang version 7.0.1 (tags/RELEASE_701/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ g++ --version
g++ (GCC) 8.2.1 20181127
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@chriselrod, thanks for confirming.

@wds15, I also included your "too much work" ode example into the benchmark and I'm really seeing no difference. But... I'm compiling it differently than we do on that branch. Notice, everything is stripped down to the minimal set. If this result is replicated across different compilers and platforms, then it would point to a problem in the way we're building and linking.

@syclik
Copy link
Member

syclik commented Feb 26, 2019

@seantalts, I'll put together a Jenkins job to run on branches. It'll take me a while, though.

@seantalts
Copy link
Member

seantalts commented Feb 26, 2019 via email

@seantalts
Copy link
Member

Also, and this is really important, it's STAN_THREADS and not STAN_THREAD. I put in that and a few other fixes to the branch on perf-math. The performance differences are quite obvious now even on the simplest test; for me it's 18800, 33410, 23314 for that first benchmark w/o the ODE system. I also updated the clobbering and escaping to be more fastidious.

@syclik
Copy link
Member

syclik commented Feb 26, 2019 via email

@seantalts
Copy link
Member

I don't think we take this design change on the basis of always compiling with STAN_THREADS. We'll be taking a 10%+ penalty for running on one core. What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Yep, same page. I'm sad for the loss there. @wds15 is very creative; maybe he can come up with some way to get it even lower and get rid of -DSTAN_THREADS (did you end up playing with putting a pointer on the var class?).

What if we had end-to-end tests that showed that in real-life the difference between STAN_THREADS on and off was only 5%? Would that be worth it, do you think? These microbenchmarks are really hitting the exact pain point head on, though I'm still inclined to follow them towards keeping STAN_THREADS. Sad.

What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Does this mean a new microbenchmark with different threads performing gradients and testing with develop vs. the pointer approach? Makes sense to me.

@wds15
Copy link
Contributor Author

wds15 commented Feb 26, 2019

Uff... finally we seem to converge... I am very glad to see this (and I am going to ask for a beer on this one ;) ).

From my perspective we can still consider to go with STAN_THREADS only .... at least at some point. First off, I would not like to jump from where we are (with/without STAN_THREADS) towards always turning it on right away. I would do that over 1-2 releases at least.

However, we can still seriously consider turning on STAN_THREADS for all cases, given that the end-to-end benchmarks are not that sensitive to it. If you check out my last benchmarks above, you see that for almost all models the difference is tiny. Essentially only those models which use the AD stack in a really inefficient way will take a slight performance hit. An inefficient use of the AD stack is, for example, non-stiff ODE solving, since the computations done are extremely cheap, but we access the AD stack a huge number of times in rapid succession. Any model which does some sensible computation between AD stack accesses will not notice the additional cost we introduce, as it is relatively tiny.

So for real-world models I would argue that we are good with STAN_THREADS only and this will serve our user-base well and keep our maintenance burden down. It is a trade-off which I think is reasonable.

(I do recall that the performance hit got even smaller once you start using the Intel TBB malloc proxy....but I was never that precise with these measurements).

What we now have to do is measure the relative impact of running threaded operations. Does that make sense?

Does this mean a new microbenchmark with different threads performing gradients and testing with develop vs. the pointer approach? Makes sense to me.

I am not sure if I follow this one. What are threaded operations? À la map_rect with threads?

EDIT: @seantalts I don't have more tricks ATM other than using the TBB. Like I say, I think that overall end-to-end speed is affected in a negligible way from my perspective and from what I have seen so far.

@chriselrod
Copy link

Just to give updated Linux benchmarks, the three mean times...
...with clang (7.0.1):

  • 1, 1.085, 0.9041
  • 1, 1.142, 1.076
  • 1, 1.15, 1.09

...with gcc (8.2.1)

  • 1, 1.295, 1.035
  • 1, 1.513, 1.04
  • 1, 1.48, 1.034

FWIW, gcc/clang baselines: 0.7875, 0.9444, 0.9653

Which benchmarks? I don't think any of them are using map_rect, right?

Sorry, didn't double check that!

@wds15
Copy link
Contributor Author

wds15 commented Feb 26, 2019

@chriselrod Thanks! That now looks like how I would expect this to land.

@wds15
Copy link
Contributor Author

wds15 commented Feb 27, 2019

BTW, there is one figure I am missing: the speed of this PR without -DSTAN_THREADS.

@wds15 wds15 mentioned this pull request Mar 2, 2019
11 tasks
@wds15
Copy link
Contributor Author

wds15 commented Mar 2, 2019

I suggest closing this PR and moving to #1135, which is a cleaned-up version of this one.

@syclik
Copy link
Member

syclik commented Mar 2, 2019

Uff... finally we seem to converge... I am very glad to see this (and I am going to ask for a beer on this one ;) ).

@wds15, definitely! We're going to drink some whiskey next time we're together in person.

Thank you for putting together the PR and working through the concerns.

I'll close as suggested. For the other PR, I'll try to ask for the right benchmarks this time. I didn't know exactly what needed to be benchmarked when we started this one. As you can tell by now, this is a very big change to the autodiff stack and should not be taken lightly. Sometimes big changes are warranted.

@syclik syclik closed this Mar 2, 2019
@wds15
Copy link
Contributor Author

wds15 commented Mar 2, 2019

I know... this is a bummer. It's a big price we pay for threading all the time right now - and we need a change right at the "heart of Stan" which is the AD stack handling.

This was referenced Mar 21, 2019
@syclik syclik deleted the feature/faster-ad-tls branch September 15, 2022 14:10