use global offset for buffered chunks #8
Conversation
For split matchers that occur rarely in a stream with many chunks, resetting the search offset inside the handler makes for very bad performance (e.g. for a matcher that doesn't occur at all, the runtime is O(N²) where N is the total number of chunks in the stream). Moving the offset outside of the stream-chunk-handler restores linear runtime.
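A minimal sketch of the idea (hypothetical code, not the actual `index.js`, which operates on Buffers via `bops`): the search `offset` lives outside the per-chunk handler, so bytes already known to be match-free are never re-scanned. Resetting `offset` to 0 inside the handler would re-scan the entire buffered data on every chunk, which is the O(N²) behavior described above.

```javascript
// Sketch of a chunk-buffering splitter with a persistent search offset.
// `offset` persists across write() calls -- the fix this PR proposes.
function makeSplitter (matcher) {
  var buf = ''
  var offset = 0 // persists across chunks; resetting it per chunk => O(N^2)
  var out = []
  function write (chunk) {
    buf += chunk
    var idx
    while ((idx = buf.indexOf(matcher, offset)) !== -1) {
      out.push(buf.slice(0, idx))
      buf = buf.slice(idx + matcher.length)
      offset = 0 // only reset after consuming up to the match
    }
    // Everything scanned so far is match-free, except that a match may
    // still straddle the next chunk boundary, so back off accordingly.
    offset = Math.max(0, buf.length - matcher.length + 1)
  }
  function end () {
    if (buf.length) out.push(buf)
    return out
  }
  return { write: write, end: end }
}

// Usage:
var s = makeSplitter('\n')
s.write('foo\nba')
s.write('r\nbaz')
console.log(s.end()) // -> [ 'foo', 'bar', 'baz' ]
```

With a matcher that never occurs, each incoming chunk only scans its own new bytes, keeping total work linear in the stream length.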
index.js (outdated diff):

```js
@@ -37,6 +37,7 @@ function BinarySplit (matcher) {
  } else {
    if (offset >= buf.length) {
```
Honestly, I don't see how this clause can ever be evaluated true (it doesn't in the tests and any scenario I can think of). So, the offset = 0 two lines below is kind of a guess…
Hmm, I'm also not sure whether this block is necessary there. @maxogden is it safe to remove or does it have a special meaning that's not obvious?
Can't remember :P Probably me just being overly defensive
This looks great to me. Unfortunately we don't have any real stress tests in the test suite to find potential regressions in changes like this, but if the change works fine with something like a big Tile-Reduce job, let's merge.
(this one fails before max-mapper#8)
OK, my initial approach turned out not to be good enough, so I've replaced it with something more robust, including some tests.
index.js (outdated diff):

```js
  }
} else if (idx === 0) {
  buf = bops.subarray(buf, offset + matcher.length)
if (typeof idx === 'number' && idx < buf.length) {
```
`idx` can only be either `false` or a number, so the `typeof` check is not necessary here; `idx !== false` suffices. Although for consistency with JS's `indexOf`, I'd return -1 instead of `false`.
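A tiny sketch of the two styles being compared (hypothetical helper names, strings instead of Buffers):

```javascript
// Style under review: returns false when the matcher is absent,
// forcing call sites into `typeof idx === 'number'` checks.
function findReturningFalse (haystack, needle, offset) {
  var idx = haystack.indexOf(needle, offset)
  return idx === -1 ? false : idx
}

// Suggested style: pass the indexOf convention (-1 for "not found")
// straight through, so callers just test `idx !== -1`.
function findReturningMinusOne (haystack, needle, offset) {
  return haystack.indexOf(needle, offset)
}

var idx = findReturningMinusOne('foo\nbar', '\n', 0)
if (idx !== -1) {
  // handle the match at position idx
}
```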
(this one fails before max-mapper#8)
Great, looks good now! Maybe let's squash into one commit?
Rebased and merged.
Nope, had to revert: I ran a local benchmark on a big text file with short lines and it's 60% slower. We need to investigate this.
My benchmark script, using Node 5.8.0:

```js
var fs = require('fs');
var split = require('./');

console.time('split');
fs.createReadStream('../mbtiles/latest.planet-z12.txt')
  .pipe(split())
  .on('data', function () {})
  .on('end', function () {
    console.timeEnd('split');
  });
```

The text file is a 28M list of line-delimited tile numbers (e.g. …).
use global offset for buffered chunks
Performance is great now.
Yeah. I didn't think that adding one additional …
@tyrasd it's O(1) but happens a lot with short lines, in a hot loop. |
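To illustrate that point (a hedged sketch with assumed numbers, not the real benchmark): any per-match O(1) operation executes once per delimiter, so with short lines the number of calls grows with the line count, and the per-call constant, not the asymptotics, sets the runtime.

```javascript
// Sketch: the number of times a per-match O(1) step runs equals the
// number of delimiters. Assuming lines of roughly 8 bytes, a 28M file
// has on the order of 3.5 million of them, so even a cheap extra step
// per match (an allocation, a function call) shows up in a benchmark.
function countMatches (text, matcher) {
  var count = 0
  var idx = text.indexOf(matcher)
  while (idx !== -1) {
    count++ // stand-in for the once-per-line O(1) work in the hot loop
    idx = text.indexOf(matcher, idx + matcher.length)
  }
  return count
}

// Short lines mean many matches per byte of input:
console.log(countMatches('12/34/56\n12/34/57\n12/34/58\n', '\n')) // 3
```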