Conversation
```cpp
#ifndef STAN_THREADS
static inline AutodiffStackStorage *instantiate() {
  static thread_local std::mutex init_mutex;
  std::lock_guard<std::mutex> init_lock(init_mutex);
```
Do we actually need a mutex here? Since this is a thread-local instance, by definition only one thread can access it at a time, right? :P
Well, I guess it's not thread-local if we have no threads. But in that case as well seems fine with no mutex?
(forgot to submit these comments)
Yeah, I agree that this code does not need a mutex.
```diff
@@ -0,0 +1,15 @@
+#ifndef STAN_MATH_REV_CORE_CHAINABLESTACK_INST_CPP
```
Can we get rid of this file if we have a stan::math::init() function or something like that?
No, we still need the file in place so that the respective code gets executed to initialize the AD stack of the main thread. The alternative is to require that any client code must call init before doing any AD. I think we want to keep the behavior such that there is no change to client code other than using our updated makefiles.
make/libraries
Outdated

```make
endif

STACK_INSTANTIATION_CPP := $(shell find $(MATH)stan -type f -name 'chainablestack_inst.cpp')
```
why not just refer to the specific file here?
OK. I just copied what we did for MPI, but we can just refer to the file here.
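For illustration, naming the file directly might look like this (the exact path is an assumption based on the include guard above, not confirmed by this thread):

```make
# refer to the single known file instead of searching the tree for it
STACK_INSTANTIATION_CPP := $(MATH)stan/math/rev/core/chainablestack_inst.cpp
```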
|
Could you add paranoia tests for a couple of things:
|
|
Sure, I'll add some tests.
…ere, add tests for init function
|
In the context of […]
|
It looks like we want to move this forward... how do we manage the upstream repos best? If you agree to the changes as they are, I would open issues and PRs for […]
|
Things are weird!!! I am getting segfaults with the new TLS with a Stan program which uses map_rect. These are solved when I call […]. Moreover, the unit tests for […]
|
@wds15 can you run the tests without Hoard? It looks cool but I tried installing it and ran into some issues. Using optimized software is cool and good, but it's probably also a good idea to see the tests on the generic stuff as well, since that's what most users will probably be using. EDIT: After messing with a bunch of these different mallocs - if this is a hassle then no worries.
|
Ok, now I understand what's going on. The Stan program I tested used the cvodes integrator. This integrator will always use the AD stack, even if we only want the solution of the ODE system without any sensitivities, since the Jacobian is needed. Thus, we need to move the […]. I am also going to sharpen the […]. With those changes we are hopefully good to go. It would be very interesting to see if the Windows […]
|
Thanks for putting in so much validation effort here - super important for us to feel good about this. Seems like GCC is having issues with the tests still :/ Also, it would be good to perhaps pull out the call that spawns threads into a single place that then initializes each thread appropriately as well? That way, as long as we use that abstraction for doing threaded stuff, we'll have the AD initialized properly (and anything else we may need added in a similar way eventually). In the current system, this probably means placing the call in the |
|
Yack! The problem we have right now is with any […]. The problem with putting the […]. (BTW, based on the design of having the AD tape referred to via a pointer, I think I now have a version of the nesting code which is much leaner; at least the first version of this new approach passes all tests for […])
Can we create a function that is something like thread_init() that has a
Where's that? I'm still skeptical we want to allow nested threaded computation, unless you've found some paradigm where that's just as performant as user-controlled parallelism? |
The only ways I can think of to do this are kind of ugly. I would be okay with only having map_rect concurrent available in code using […]
|
I have a local branch which is based on this PR which does nested AD in a much leaner way, which doesn't require a globally acting nesting... I can push it if you want to see it, but I guess we should get this PR in first. Yeah, I think the cleanest way for the time being is to include in […]. We can revert the move of […]
|
Sounds reasonable to me. |
|
For things at this level, I really would like to see microbenchmarks. If we
want to make any improvements to the backend, we should have these around.
I would trust the microbenchmarks and use the end-to-end as a sanity check
(rather than the other way around). As written clearly in the
stat_comp_benchmark’s README.md, it wasn’t designed to be an end-to-end
test.
There’s always a chance of something else that we’re missing at that level.
I still remember the days when disk i/o could tank any reasonable
performance numbers.
On Fri, Feb 22, 2019 at 1:35 PM seantalts wrote:

> Wait, if we're doing full end-to-end comparison benchmarks on stat_comp_benchmark, we surely don't need to do microbenchmarks, right? Would you be satisfied with just microbenchmarks? Or should we drop them and just build a Jenkins job that runs end-to-end stat_comp_benchmark tests on our 3 OS+compiler pairs for a given math commit (can build off of http://github.com/stan-dev/performance-tests-cmdstan but not trivially, at least not until monorepo happens)?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1103 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F7KMKwcsOnIuNsY0SIDJxDiTMX8Vks5vQDhqgaJpZM4aB0zR>
.
|
This way lies madness! :P Take a gander at e.g. this CppCon video on how difficult it is to construct an informative microbenchmark, never mind one that captures all of the behavior you care about. It's a really good starter on this topic and luckily has a pretty entertaining speaker: https://www.youtube.com/watch?v=nXaxk27zwlk It should probably be required viewing for all C++ devs, haha. End-to-end makes sense from a logical perspective, too - of course you will only get information about what you measure, so measuring your entire system end-to-end is much more informative than a specific line of code. CPUs and compilers are extremely complicated these days and generate different code and execute it in different ways in different scenarios. End-to-end for a specific model is only authoritative for that model, but we hope to get a wide enough spread of models to be somewhat characteristic of the general trend, while ignoring stuff in the <3% or so range as noise.
Those are just the source of models + data. We all agreed they were the best thing we had towards an end-to-end test, and the […]. Optically, raising the bar to such an extreme level on this particular PR appears somewhat capricious at the moment. Would you mind confirming that you aren't using Stan Math with threading at Generable? At the end of the day, I think we need automated tooling for whatever additional bars you are adding to the code review process here, so if you want to build all of the above and require them for all PRs going forward, I'd be happy to try that out for a while and see how it goes.
|
@seantalts, thanks for chiming in.
I'll definitely take a look. But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?
So... one thing that makes this particular implementation trickier is that there's an interaction with a ton of other things, down to linking, compiler flags, use of different keywords, pointers, compilers, etc. If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.
Cool. Good enough for me. Want to update the documentation?
I agree that we do need it. It'll be good to have it so we can apply it to all PRs, but really for the ones that touch the autodiff stack. I don't want to hold up the PR to get all the way automated, if that's the last thing. If we determine that this design is the only sensible way to move forward, I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the autodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.) I'm still hoping we don't have to do that, but I'm becoming more and more convinced by @wds15 that this is the right way to proceed.
We're threading, but at the moment, through […]
Thanks! |
|
BTW, requiring the performance tests on Windows is a bit moot given that map_rect never worked on Windows in a multi-threaded way... things do work, though, if you only use a single thread. I just pushed onto this branch a small example which can serve as a microbenchmark. If you compile and run this code on develop with TLS / no TLS then I get on my Mac system a runtime of 2700 ms / 1700 ms. On this branch with TLS the runtime is 2000 ms; so the performance hit is drastically reduced (60% slowdown vs 17%). The example is the modified ODE stress test... this time using the RK45 integrator, which represents an extreme stress of the AD stack, as it creates an enormous number of function calls where each call is cheap to compute, but each triggers a build-up and tear-down of the nested AD tree. Does this count as a microbenchmark?
|
@wds15: I think we should use a different process to benchmark. Using the coupled ODE system actually introduces a different complication: nested autodiff. I don't know how, but there might be different performance penalties. We should just be able to write a large function, right?
It's not about the size of the effect, it's about making sure you're measuring what you think you're measuring. For example, in the talk he goes through a simple example (I think measuring whether calling reserve on a std::vector is helpful or not) and shows that if you're not careful, the compiler will easily optimize your code away. He looks at the assembly to see why he is getting unintuitive results.
I agree that if we were changing the design of the autodiff stack we should look at it, but we're just changing how we're storing the global reference to it. And also agreed that good benchmarks (either with Google Benchmark for relatively localized stuff or end-to-end for holistic changes) should be required every time we do something JUST for alleged performance reasons. I just also want to make that easier for contributors. Sure, let's not hold up the PR for the benchmarks. I'm reasonably convinced that the performance on Mac and Linux is better and it enables threading for the first time on Windows, so I don't think we need to benchmark in such detail, especially when we still maintain that ifdef STAN_THREADS logic around it.
That's a huge change. We will have to update literally every test, for example. Why do you think we need to require all clients call an init() function under all conditions? Just for a sort of consistency with a use-case we don't support? Let's bring it up at the meeting Thursday - I'd like to hear Bob's input on this. PS Versioning in the mono-repo world will look very different. We should talk about it sometime - for example, we won't be able to tag a Math 3.0.0 with Stan 2.19 on the repo. What is the unit of a Math release now? Just a .tar.gz?
|
On Mon, Feb 25, 2019 at 8:47 PM seantalts wrote:

> > But just so we're clear, we're not looking at 2% performance hit. We're looking at a 15-40% hit, so we should be able to trigger that, no?
>
> It's not about the size of the effect, it's about making sure you're measuring what you think you're measuring. For example, in the talk he goes through a simple example (I think measuring whether calling reserve on a std::vector is helpful or not) and shows that if you're not careful, the compiler will easily optimize your code away. He looks at the assembly to see why he is getting unintuitive results.

Yes, I'm very aware. I watched that video and something that really stood out was one of the questions at the end. Someone asked if you should use an end-to-end test or a microbenchmark or both. The answer was along the lines that I've been advocating for: do both. But make sure you can create a benchmark that shows the behavior.

I still don't quite trust that the problem is actually where we think it is. I'm still trying to be convinced that it is there. I'd rather be shown that it really is in the TLS than not.

> > If we're going to actually go with changing the design of our autodiff stack, then I'd like to know that we needed to do this.
>
> I agree that if we were changing the design of the autodiff stack we should look at it, but we're just changing how we're storing the global reference to it. And also agreed that good benchmarks (either with Google Benchmark for relatively localized stuff or end-to-end for holistic changes) should be required every time we do something JUST for alleged performance reasons. I just also want to make that easier for contributors.

Yup.

Just so we're clear, almost everything we've done in the Math library has not been on the basis of performance. Yes, we are fast and care about speed, but almost all of it was building new features. When we claim we do something for speed, that's the point where we should make sure we can back up that claim. (Just as when someone introduces a new feature, they should make sure it works.) It's at this point that we care about timing. I've always been about build first, optimize second.

Anyway, we're here trying to optimize. We need to be able to measure. One of the real difficulties is that C++ code really doesn't behave the same for performance under different compilers, operating systems, compiler flags, linker flags, cache sizes, disk speeds, and a number of other reasons. When we want to do something like this, we need to make sure it's faster for the bulk of the users. It's great if it's faster on all three OSes we support, but we have decisions to make if it makes things faster in one and not the other two. Or if it makes it slower.

> Sure, let's not hold up the PR for the benchmarks though. I'm reasonably convinced that the performance on Mac and Linux is better and it enables threading for the first time on Windows, so I don't think we need to benchmark in such detail especially when we still maintain that ifdef STAN_THREADS logic around it.

I'm not convinced yet and I'd like to be there. I don't want to hold this up for us to make it automatic, but I'd like for us to isolate it a bit just to confirm. That make sense? I'm going to put some time in now, so hopefully I can hunt it down in a few hours.

> > I'm going to request we bump major version numbers for the Math library. It's a fundamental shift in the design of the autodiff stack and we should update the arXiv paper and all our documentation. It'll also give us an opportunity to align how the autodiff stack is instantiated. (@wds15: if we go this way, I'm thinking we don't have to automatically instantiate the stack, but require clients to actually call an init() function, under all conditions.)
>
> That's a huge change. We will have to update literally every test, for example. Why do you think we need to require all clients call an init() function under all conditions? Just for a sort of consistency with a use-case we don't support? Let's bring it up at the meeting Thursday - I'd like to hear Bob's input on this.

Do you also agree with the converse of your statement: it's a huge change to call an init() function under one condition when under the others you don't have to? Before thinking too far about this, let's make sure we have to use this uninitialized pointer.

As I've mentioned before, I haven't been convinced that this is the problem, and that's due to reading up on the performance of thread_local variables and my own efforts in trying to confirm it.

Also, this does affect almost everyone. I thought the proposed plan once this got in was to only build with threading enabled. So... this means if we don't initialize, then it'll seg fault. Did I misread that? (Meaning, we should be initializing for all the tests anyway.)

Mind if we just had an online discussion or perhaps a separate meeting with the stakeholders that should be in the call? I don't think the recurring Thursday meetings are a good way to discuss the depths of the Math library. Feel free to discuss with Bob and have him weigh in here or on discourse.

> PS Versioning in the mono-repo world will look very different. We should talk about it sometime - for example, we won't be able to tag a Math 3.0.0 with Stan 2.19 on the repo. What is the unit of a Math release now? Just a .tar.gz?

Oh yeah... thanks for bringing that up. I was just thinking it wouldn't be a big deal to bump the Math version to v3 to support this change. (It'd actually give us a chance to clear out a lot of deprecated code too.)

The Math release is just a git hash of the stan-dev/math repo. It isn't even packaged... we let GitHub do that work for us.
|
|
@wds15 and @seantalts: I'm becoming more and more unconvinced. (I can still be convinced... but I'll explain why.) I just put something up using the Google Benchmark framework on @seantalts's perf-math repo on the faster-tls-tape branch: https://github.com/seantalts/perf-math/tree/faster-tls-tape. There are at least a few places I could have messed up: 1) maybe the benchmark is still wrong; 2) maybe I linked the TLS stuff incorrectly; 3) maybe XCode clang++ isn't doing the right thing. To run, I created a shell script. Follow the instructions to set up the benchmark (README.md), then run […]. This is only designed to run on Mac with the clang++ compiler, but it's pretty easy to swap that out. I'm running with XCode clang++, which is also what @wds15 was reporting on. I won't claim that this holds for any other configuration. What I see: […]
If these results hold on other platforms / compilers, it's a wash, right? The thing I'm using to test is the GARCH example. I just made the model using […]. Btw, I'm not convinced I did this completely right. I'm kinda tired right now, so I'm going to give it a rest, but could one of you verify that it looks right? @wds15: in this version, I wasn't able to actually statically link the chainable_stack_inst.o. What am I doing wrong? Every time I try to link it, I get an error: […] [edited for clarity]
|
For fun, I ran those benchmarks. I tried 4 runs: clang vs gcc, and with or without -march=native. Additionally, on Linux, I had to add -pthread. Results:
clang (clang -march=native)
- develop without using threading at all: 21076 ns (19793 ns)
- develop with -DSTAN_THREADS: 21156 ns (19653 ns)
- feature/faster-ad-tls with -DSTAN_THREADS: 21993 ns (23750 ns)
gcc (gcc -march=native)
- develop without using threading at all: 14414 ns (14346 ns)
- develop with -DSTAN_THREADS: 14544 ns (14423 ns)
- feature/faster-ad-tls with -DSTAN_THREADS: 14872 ns (16962 ns)
The CPU was a Ryzen Threadripper 1950x. -march=native triggers a regression on this platform when using faster-ad-tls. Otherwise, I get the best performance when not using threads at all, which is a little disappointing on a 16 core/32 thread "Threadripper", but 14 microseconds isn't much time to amortize the overhead. I'm on recent versions of both compilers: clang 7.0.1 (tags/RELEASE_701/final) and g++ (GCC) 8.2.1 20181127.
|
@chriselrod, thanks for confirming. @wds15, I also included your "too much work" ode example into the benchmark and I'm really seeing no difference. But... I'm compiling it differently than we do on that branch. Notice, everything is stripped down to the minimal set. If this result is replicated across different compilers and platforms, then it would point to a problem in the way we're building and linking. |
|
@seantalts, I'll put together a Jenkins job to run on branches. It'll take me a while, though. |
|
Which benchmarks? I don't think any of them are using map_rect, right? So we wouldn't expect a speedup from threading being enabled, just to address your disappointment, haha. We're just looking at the overhead of thread-local storage in this PR.

On Tue, Feb 26, 2019 at 01:20 chriselrod wrote:

> For fun, I ran those benchmarks. I tried 4 runs: clang vs gcc, and with or without -march=native. Additionally, on Linux, I had to add -pthread.
> Results:
> clang (clang -march=native)
> - develop without using threading at all: 21076 ns (19793 ns)
> - develop with -DSTAN_THREADS: 21156 ns (19653 ns)
> - feature/faster-ad-tls with -DSTAN_THREADS: 21993 ns (23750 ns)
> gcc (gcc -march=native)
> - develop without using threading at all: 14414 ns (14346 ns)
> - develop with -DSTAN_THREADS: 14544 ns (14423 ns)
> - feature/faster-ad-tls with -DSTAN_THREADS: 14872 ns (16962 ns)
> The CPU was a Ryzen Threadripper 1950x. -march=native triggers a regression on this platform when using faster-ad-tls. Otherwise, I get the best performance when not using threads at all. Which is a little disappointing on a 16 core/32 thread "Threadripper", but 14 microseconds isn't much time to amortize the overhead.
> I'm on recent versions of both compilers: clang version 7.0.1 (tags/RELEASE_701/final) and g++ (GCC) 8.2.1 20181127.
|
|
Also, and this is really important, it's STAN_THREADS and not STAN_THREAD. I put in that and a few other fixes to the branch on perf-math. The performance differences are quite obvious now even on the simplest test; for me it's 18800, 33410, 23314 for that first benchmark w/o the ODE system. I also updated the clobbering and escaping to be more fastidious.
|
Thanks for fixing my mistake that completely invalidated everything I had posted before! And for adding clobbering and escaping.

I'm seeing similar, but we should verify on Linux and the compiler we said we'd support too. On my machine, the relative differences are:

benchmark_autodiff_stack: 1.000000 1.912845 1.234384
benchmark_autodiff_stack_coupled_mm: 1.000000 1.435524 1.119483
benchmark_autodiff_stack_coupled_mm_nested: 1.000000 1.453436 1.112097

Honestly, if these results hold, I don't think we take this design change **on the basis of always compiling with STAN_THREADS**. We'll be taking a 10%+ penalty for running on one core. What we now have to do is measure the relative impact of running threaded operations. Does that make sense? If these results hold under threading, we should really consider it, but let's make sure it does first.

On Tue, Feb 26, 2019 at 11:58 AM seantalts wrote:

> Also, and this is really important, it's STAN_THREADS and not STAN_THREAD. I put in that and a few other fixes to the branch on perf-math. The performance differences are quite obvious now even on the simplest test; for me it's 18800, 33410, 23314 for that first benchmark w/o the ODE system. I also updated the clobbering and escaping to be more fastidious.
|
Yep, same page. I'm sad for the loss there. @wds15 is very creative; maybe he can come up with some way to get it even lower and get rid of […]. What if we had end-to-end tests that showed that in real life the difference between […]
Does this mean a new microbenchmark with different threads performing gradients and testing with develop vs. the pointer approach? Makes sense to me. |
|
Uff... finally we seem to converge... I am very glad to see this (and I am going to ask for a beer on this one ;) ). From my perspective we can still consider going with […]. However, we can still seriously consider turning on […]. So for real-world models I would argue that we are good with […]. (I do recall that the performance hit got even smaller once you start using the Intel TBB malloc proxy... but I was never that precise with these measurements.)
I am not sure if I follow this one. What are threaded operations? ala EDIT: @seantalts I don't have more tricks ATM other than using the TBB. Like I say, I think that overall end-to-end speed is affected in a negligible way from my perspective and from what I have seen so far. |
|
Just to give updated Linux benchmarks, the three mean times...
...with gcc (8.2.1)
FWIW, gcc/clang baselines: 0.7875, 0.9444, 0.9653
Sorry, didn't double check that! |
|
@chriselrod Thanks! That now looks like how I would expect this to land. |
|
BTW, there is one figure which I am missing, which is the speed of this PR without […]
|
I am suggesting we close this PR and move to PR #1135, which is a cleaned-up version of this one.
@wds15, definitely! We're going to drink some whiskey next time we're together in person. Thank you for putting together the PR and working through the concerns. I'll close as suggested. For the other PR, I'll try to ask for the right benchmarks this time. I didn't know exactly what needed to be benchmarked when we started this one. As you can tell by now, this is a very big change to the autodiff stack and should not be taken lightly. Sometimes big changes are warranted. |
|
I know... this is a bummer. It's a big price we pay for threading all the time right now - and we need a change right at the "heart of Stan" which is the AD stack handling. |
This PR speeds up access to the AD tape whenever `thread_local` storage (TLS) is needed for the AD tape.
Summary
The main goal of this PR is to reduce the performance penalty implied by turning on the threading capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler, which is not the case with the current threading implementation in develop. So we also fix a bug which affects many users, given the popularity of Windows.
The performance benefits of this PR are documented below; these benchmarks demonstrate that we currently pay about a 20% performance penalty for turning on threading, while this PR reduces that in many instances to nothing (at most a 5% performance degradation).
The key change of this PR is how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared as a `thread_local` object with a dynamic initializer. This PR changes this to a global pointer to the AD tape, which is a TLS variable for threading, and which is initialized to the constant `nullptr` expression. Initialization to a constant expression (as opposed to the dynamic expression right now in develop) allows the compiler to use the TLS global pointer more efficiently. Furthermore, the `thread_local` keyword used to declare TLS can be swapped for the compiler-specific `__thread` keyword whenever the initializer is a constant expression. The `__thread` keyword has been around much longer than the C++11 `thread_local` keyword.
Side-effects
The main side-effect is that the AD tape must be initialized before it can be used... otherwise the program will segfault!
The initialization is done with the new method `init()`. I am proposing to link a call to the `init()` function by default with any program which uses the stan-math makefiles. This will imply that there is no change to any existing program like `cmdstan`.
However, any program not using the stan-math makefiles must initialize the AD tape prior to performing autodiff. Moreover, any program which intends to use AD in a child thread must call the `init()` method within the child thread once before any AD operations take place.
The stan-math library itself creates threads in the `map_rect_concurrent` function. In order to make this function safe to use, I am suggesting to move the definition of the function into the `rev` branch. Doing so allows calling the `init` method inside the `execute_chunk` functor, which is potentially executed in a new child thread. The downside of this is that there will not be a `prim` definition of `map_rect_concurrent` right now - which is not ideal, but not meant to stay like this. I am proposing to include a thread-pool library like the Intel TBB which can be set up to manage AD tape initialization independently. Thus, once we have a thread pool we should revert the change to `map_rect_concurrent` (take out the `init` call and place the function definition into `prim` again, where it belongs).
Tests
`test/unit/math/rev/core/init_test.cpp` is a new test and is written to test matters with and without threading. The test is set up to test different things depending on whether threading is enabled.
STAN_THREADS NOT defined (no threading): in this case we only have a global AD tape. All child threads created by a given program will see the same AD tape. Thus initializing the AD tape on the main thread implies that it is initialized on the child threads, etc.
STAN_THREADS IS defined (threading case): now all new threads created must see their own TLS copy of the AD tape pointer, which is not initialized at the beginning and is independent of the main thread's AD tape when performing AD operations.
In addition I have added to the `map_rect_concurrent` tests code which enforces that these tests are always run with 4 threads regardless of the `STAN_NUM_THREADS` definition (which is often not set, and thus things do not run threaded during testing).
Summary of Performance Impact
Here are timings for the performance cmdstan benchmark suite with 5 replications on the `stat_comp_benchmarks` set of examples, as run on a macOS Mojave, clang++ Apple LLVM version 10.0.0 (clang-1000.11.45.5), 2.9 GHz Intel Core i9 system. All runs are with the Hoard memory allocator loaded.
Current develop, no TLS:
develop, TLS:
This PR with new AD tape and no TLS:
New AD tape of this PR with TLS:
Relative Comparisons
develop TLS vs No TLS (these looked weird to me, TLS was way too slow; will recheck results from this run of 16th January 2019; UPDATE: now these look OK too; Hoard was not used for the TLS case before, but now Hoard is used in all settings shown):
new AD tape NO TLS vs develop NO TLS:
New AD tape TLS vs new AD tape NO TLS:
In summary this suggests:
Benchmarks without Hoard
Develop, NO TLS
Develop, TLS
```
stat_comp_benchmarks/benchmarks/garch/garch.stan,0.389703369141
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan,7.10135984421
stat_comp_benchmarks/benchmarks/sir/sir.stan,96.8512422085
stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan,0.0118281364441
stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan,2.71869359016
stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan,0.0778662204742
stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan,0.166544389725
stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan,0.0303031921387
stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan,0.271680641174
stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan,16.4858063698
stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan,5.91060729027
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.52340941429
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan,2.31667103767
stat_comp_benchmarks/benchmarks/arma/arma.stan,0.540799951553
```
New tape, TLS
Relative comparisons without Hoard
develop TLS vs NO TLS
new tape NO TLS vs develop NO TLS
new tape TLS vs new tape NO TLS
So without the Hoard memory allocator things are comparable. This is using a vanilla MacOS Mojave system as defined above.
Tests
test/unit/math/rev/core/init_test.cppSide Effects
The current design only initializes the AD tape in the main process. Any sub-thread will have to initialize the AD tape by itself, a single time.
Checklist
Math issue Thread performance penalty #1102
Copyright holder: Sebastian Weber
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- the basic tests are passing
  - unit tests pass (`./runTests.py test/unit`)
  - header checks pass (`make test-headers`)
  - docs build (`make doxygen`)
  - code passes the built-in C++ parser (`make cpplint`)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested