[SYSTEMDS-3779] Add ColGroupDDCLZW with LZW-compressed MapToData #2398
florian-jobs wants to merge 42 commits into apache:main
Conversation
…extending APreAgg like ColGroupDDC does, for easier implementation. Idea: store only the compressed version of the _data vector plus the necessary metadata. If decompression is needed, we reconstruct the _data vector from the metadata and the compressed _data vector. Decompression takes place at most once. This is just one idea; there are other ways of implementing it.
…ng of Decompress
…and decompress and the data structures they use compatible.
…lgorithms and tried to make them compatible.
…DC test for ColGroupDDCTest. Improved compress/decompress methods in LZW class.
…lemented from its interface.
…mapping
This commit adds an initial implementation of ColGroupDDCLZW, a new column group that stores the mapping vector in LZW-compressed form instead of materializing MapToData explicitly. The design focuses on enabling sequential access directly on the compressed representation, while complex access patterns are intended to fall back to DDC. No cache or lazy decompression mechanism is introduced at this stage.
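The core idea can be sketched as plain LZW over the dictionary-index vector. The class and method names below are illustrative only (not the PR's actual API; the real ColGroupDDCLZW integrates with AMapToData / APreAgg): single dictionary indices are the initial alphabet, phrases are learned during encoding, and the index vector is reconstructed on demand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of LZW over a dictionary-index vector; not the PR's actual classes. */
public class LZWSketch {

	/** Compress a vector of dictionary indices in [0, nUnique). */
	public static List<Integer> compress(int[] data, int nUnique) {
		Map<String, Integer> dict = new HashMap<>();
		for(int s = 0; s < nUnique; s++)
			dict.put(Integer.toString(s), s); // every single symbol starts with its own code
		int nextCode = nUnique;
		List<Integer> out = new ArrayList<>();
		String w = "";
		for(int sym : data) {
			String wk = w.isEmpty() ? Integer.toString(sym) : w + "," + sym;
			if(dict.containsKey(wk))
				w = wk; // extend the current phrase
			else {
				out.add(dict.get(w));
				dict.put(wk, nextCode++); // learn the new phrase
				w = Integer.toString(sym);
			}
		}
		if(!w.isEmpty())
			out.add(dict.get(w));
		return out;
	}

	/** Reconstruct the index vector of known length outLen from the code stream. */
	public static int[] decompress(List<Integer> codes, int nUnique, int outLen) {
		List<int[]> dict = new ArrayList<>();
		for(int s = 0; s < nUnique; s++)
			dict.add(new int[] {s});
		int[] out = new int[outLen];
		int pos = 0;
		int[] w = null;
		for(int code : codes) {
			int[] entry;
			if(code < dict.size())
				entry = dict.get(code);
			else { // classic LZW special case: code not yet in the dictionary
				entry = new int[w.length + 1];
				System.arraycopy(w, 0, entry, 0, w.length);
				entry[w.length] = w[0];
			}
			for(int v : entry)
				out[pos++] = v;
			if(w != null) {
				int[] ne = new int[w.length + 1];
				System.arraycopy(w, 0, ne, 0, w.length);
				ne[w.length] = entry[0];
				dict.add(ne); // mirror the encoder's dictionary growth
			}
			w = entry;
		}
		return out;
	}
}
```

On a repetitive mapping like 0,1,0,1,… the code stream quickly becomes shorter than the input, which is exactly the case where this column group should beat plain DDC.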
…press(). Decompress will now return an empty map if the index is zero.
Thank you for the PR. I left some comments in the code.
In general, please use tabs instead of spaces to make the diff more readable (can be done by importing the codestyle xml). It would be good if we are able to create the column group similar to this:
CompressionSettingsBuilder csb = new CompressionSettingsBuilder().setSamplingRatio(1.0)
.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
.setTransposeInput("false");
CompressionSettings cs = csb.create();
final CompressedSizeInfoColGroup cgi = new ComEstExact(mbt, cs).getColGroupInfo(colIndexes);
CompressedSizeInfo csi = new CompressedSizeInfo(cgi);
AColGroup cg = ColGroupFactory.compressColGroups(mbt, csi, cs, 1).get(0);
So corresponding features / methods to support this should be implemented.
All implemented methods must be covered by tests
Please add some more tests to really verify correctness. For example, do a full compression and then decompress it again, and compare the result to the original data.
…GroupDDCTest back to correct formatting. Added LZWMappingIterator to decompress values on the fly without having to allocate full compression map [WIP]. Added Test class ColGroupDDCLZWTest.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
Update for benchmarks

Addressing the feedback:
I wasn't able to "generate" data that matched a given entropy (percentage), but I added a helper function to calculate the Shannon entropy of the given arrays. It is now displayed in the benchmarks.
I added
I adjusted the sizes to

Remarks
The main difficulty was judging which benchmarks are useful, since most of my entropy values were fairly high to maximal. I also commented out the
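Such an entropy helper is small; a possible shape (names are mine, not the benchmark's actual code). Note that Shannon entropy depends only on the value histogram, not on the ordering of values, which is also why it says little about how well a sequence model like LZW will compress the data.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a Shannon-entropy helper; names are illustrative. */
public class EntropyUtil {

	/** Shannon entropy in bits per symbol of the value distribution in a. */
	public static double shannonEntropy(int[] a) {
		Map<Integer, Integer> counts = new HashMap<>();
		for(int v : a)
			counts.merge(v, 1, Integer::sum); // build the histogram
		double h = 0.0;
		for(int c : counts.values()) {
			double p = (double) c / a.length;
			h -= p * (Math.log(p) / Math.log(2)); // H = -sum p * log2(p)
		}
		return h;
	}
}
```

For example, {0,1,0,1,0,1} and {0,0,0,1,1,1} both have entropy 1.0 bit per symbol, although their run structure, and hence their LZW compressibility, differs.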
…cheme according to guidelines.
… it into the compression pipeline and serialization framework.
… some documentation for non-native DDC methods in the DDCLZW class.
… by IDE. Removed unnecessary comments from the DDCLZW and DDCLZWTest classes. Optimized some tests to use the compression framework.
…create …ngsBuilder
…al decompression and adjusting the function decompress to become decompressFull
…ssCompression_NoRepetition to assertLosslessCompressionNoRepetition in order to pass codestyle test.
Force-pushed from 9241629 to c5cc4c0
Okay, cool progress on the results! However, I'm a bit skeptical about your byte estimates for the sizes. Do you do extra packing based on the number of bits in your implementation? The ideal values for the current DDC implementation are 2, 256, and 65,536 unique values, to avoid bit manipulations on lookup. I'd love to see some results with your idealized input to get a range of what to expect vs. what you get. A recipe for X unique values at length L could be:
I don't know if it's exactly optimal, but it should be pretty good.
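One concrete recipe along these lines (my own sketch toward the stated goal, not necessarily the one the reviewer referred to): cycle through the X unique values, so each value appears equally often and adjacent repetition is minimal, which is close to a worst case for run- and phrase-based coders.

```java
/** Hypothetical data generator for compression benchmarks; names are illustrative. */
public class DataRecipe {

	/** Generate a length-L vector that cycles through X unique values. */
	public static int[] cyclic(int X, int L) {
		int[] out = new int[L];
		for(int i = 0; i < L; i++)
			out[i] = i % X; // 0, 1, ..., X-1, 0, 1, ...
		return out;
	}
}
```

Feeding cyclic(2, L), cyclic(256, L), and cyclic(65536, L) into both DDC and DDCLZW would cover exactly the "ideal" DDC widths mentioned above.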
At the moment the codes are still stored as int values by the LZW logic, but I'm in the process of changing the storage representation. Instead of storing one code per array element, I'm implementing a bit-packed long wordstream, where codes are packed based on a fixed bit width (derived from the maximum emitted code), with the option to extend this to a growing bit-width policy later if needed.
I have just tested it with the highest "optimal" values for DDC in the "distributed" benchmark, so with datasets like:
There is a big jump at the
Nevertheless, whenever
I have also noticed that the entropy doesn't really influence the compression rate that much, since entropy measures "how distributed" the values are and not "how they're arranged". So
… improvements to tests
I also changed the
I am still fairly sure the memory estimates are wrong (but I would be happy to be proven wrong). Please override this method in your DDCLZW file:
Please use the utilities associated with estimating memory size, as the other column groups do.
Yes, I think our memory estimates are indeed wrong @Baunsgaard. We were using getExactSizeOnDisk instead of estimateInMemorySize for memory estimation.
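To illustrate why the two numbers differ (a generic sketch of typical 64-bit HotSpot heap layout with compressed oops, not SystemDS's actual estimation utilities): an on-heap array pays an object header and 8-byte alignment padding that the serialized on-disk format does not.

```java
/** Generic sketch of in-memory vs. on-disk size of a long[]; not SystemDS's utilities. */
public class MemEstimate {

	/** Approximate JVM heap size of a long[] (typical 64-bit HotSpot, compressed oops). */
	public static long longArraySize(long length) {
		long header = 16;                   // mark word + class pointer + array length (padded)
		long data = 8 * length;             // 8 bytes per long element
		return (header + data + 7) / 8 * 8; // object size is 8-byte aligned
	}

	/** On disk the same array needs only a length field plus the raw payload. */
	public static long longArrayDiskSize(long length) {
		return 4 + 8 * length; // int length + payload
	}
}
```

For small arrays the relative gap is large (an empty long[] still occupies 16 heap bytes), which is why reusing getExactSizeOnDisk as a memory estimate systematically undercounts.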
…InMemorySize instead of getExactSizeOnDisk for memory estimation.
We changed the
Observation from the "distributed" benchmark:
Below are the results from
Okay, now the memory numbers sound more plausible! Can you verify on all the tested instances that when we decompress either the DDC or the DDCLZW, they reconstruct equivalent matrices?
…s in benchmark class. Improved testing class.
I added tests to ensure the reconstructed matrices in the benchmarking class are equivalent.
janniklinde left a comment
Thanks for the continued contribution @florian-jobs. I have two more minor comments on the code, but overall I think we're good to merge the new column group.
			c[rowBaseOff + j] = values[rowIndex + j];
		}
	}
	else {
		for(int i = rl, offT = rl + offR; i < ru; i++, offT++) {
			final double[] c = db.values(offT);
			final int off = db.pos(offT) + offC;
			final int dictIdx = it.next();
			final int rowIndex = dictIdx * nCol;

			for(int j = 0; j < nCol; j++) {
				final int colIdx = _colIndexes.get(j);
				c[off + colIdx] = values[rowIndex + j];
Setting instead of using += could be problematic; at least AColIndex.decompressVec uses +=.
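A toy illustration of why the accumulate convention matters (names are illustrative, not SystemDS's signatures): when several column groups contribute to overlapping output cells, plain assignment would silently drop earlier contributions, while += preserves them.

```java
/** Toy sketch of the += decompress convention; names are illustrative. */
public class DecompressAdd {

	/** Accumulate decompressed values into c instead of overwriting. */
	public static void decompressVec(double[] c, int off, double[] values, int rowIndex, int nCol) {
		for(int j = 0; j < nCol; j++)
			c[off + j] += values[rowIndex + j]; // += so prior contributions survive
	}
}
```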
	final double[] aval = sb.values(di);

	for(int j = apos; j < alen; j++) {
		sbr.append(_colIndexes.get(aix[j]), i, aval[apos]);
I think you mean aval[j] and not aval[apos], right?
I implemented decompressToSparseBlockTransposedSparseDictionary directly from ColGroupDDC, so I assumed it's correct. The only difference is that ColGroupDDCLZW uses an iterator to retrieve the data values.
I see, thanks for the clarification. However, it looks incorrect to me (both for DDC and DDCLZW then). @Baunsgaard am I missing something here?
I will change it to use aval[j] instead of aval[apos] @janniklinde; I think the current version is incorrect.
It can very well be that I had a bug in the code.
You can always write a test, rather than guessing :) ?
Testing verified that aval[j] should be used instead of aval[apos]. I pushed the fix together with the corresponding test.
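For reference, a self-contained sketch of the corrected indexing (illustrative names, mirroring the quoted loop): the value paired with column index aix[j] is aval[j], whereas aval[apos] would repeat the first value of the row slice for every appended entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the fixed sparse-row append; types and names are illustrative. */
public class SparseAppendFix {

	/** Collect (col, row, value) triples from a sparse row slice [apos, alen). */
	public static List<double[]> appendRow(int row, int[] aix, double[] aval, int apos, int alen) {
		List<double[]> out = new ArrayList<>();
		for(int j = apos; j < alen; j++)
			out.add(new double[] {aix[j], row, aval[j]}); // aval[j], not aval[apos]
		return out;
	}
}
```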
…oSparseBlockTransposedSparseDictionary.
Summary
This PR introduces a new column group ColGroupDDCLZW that stores the mapping vector in LZW-compressed form.

Key design points
- MapToData is not stored explicitly; only the compressed LZW representation is kept.
- Sequential access iterates directly over _dataLZW without full decompression.

Current status
Feedback on design and integration is very welcome.