packed seq lengths token count correction #348

djsaunde · 2025-11-07T16:35:04Z

Should be merged alongside #3525.

gemini-code-assist · 2025-11-07T16:35:15Z

Summary of Changes

Hello @djsaunde, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the token counting mechanism in the unsloth_zoo library, specifically addressing an overcounting issue that occurred when processing packed sequence lengths. The change ensures that the number of tokens is accurately reflected, which is crucial for correct batch processing. Additionally, the .gitignore file has been updated to exclude common Python build artifacts, contributing to a cleaner repository. This PR is intended to be merged alongside a related pull request, #3525.

Highlights

Token Count Correction: Implemented a correction to the token count calculation within the _unsloth_get_batch_samples function in unsloth_zoo/loss_utils.py to accurately handle packed_seq_lengths.
Gitignore Update: Added __pycache__/ and unsloth_zoo.egg-info/ to the .gitignore file to prevent tracking of generated build and cache files.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a correction for token counting when using packed sequences, which is an important fix for correct loss calculation. The overall logic is sound, but I've identified a potential edge case in the implementation where the token count could be incorrectly modified. I've provided a suggestion to make the calculation more robust against this edge case.

unsloth_zoo/loss_utils.py

djsaunde · 2025-11-21T15:15:45Z

unsloth_zoo/dataset_utils.py

+        and not isinstance(trainer.data_collator, DataCollatorForSeq2Seq)
+        and not packing_enabled
+    ):
        trainer.data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)


oh @danielhanchen I wanted to ask why we do this, and whether my change might break anything? FWIW I haven't encountered any issues with this change when running all the notebooks in the unsloth main repo README.

danielhanchen · 2025-11-27

It's because we must use DataCollatorForSeq2Seq when training on completions only, since DataCollatorForSeq2Seq lets us pass input_ids and labels separately. The other collator, DataCollatorForLanguageModeling, doesn't allow labels.

djsaunde · 2025-11-28

hmm I think we should actually use DataCollatorForLanguageModeling and pass completion_only_loss = True there. I don't think we should use a custom (non-default) collator here; that breaks packing.
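For readers unfamiliar with the term, "training on completions only" means the labels differ from input_ids: prompt positions are masked so they do not contribute to the loss. The sketch below is a library-independent illustration, not the code of either collator. It assumes the common convention that a label of -100 is ignored by the loss; `mask_prompt_labels` and `prompt_length` are hypothetical names introduced here.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_prompt_labels(input_ids, prompt_length):
    """Return labels equal to input_ids, with the prompt positions masked out."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# A toy example: the first four tokens are the prompt, the rest the completion.
example = {"input_ids": [101, 7, 8, 9, 42, 43, 102]}
example["labels"] = mask_prompt_labels(example["input_ids"], prompt_length=4)
```

Whichever collator is used has to preserve this separate labels field (or rebuild an equivalent mask) when it batches examples, which is the point under discussion in this thread.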

danielhanchen · 2025-11-27T07:45:04Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-11-27T07:49:29Z

unsloth_zoo/loss_utils.py

+                count = token_count.sum()
+                seq_lengths = x.get("packed_seq_lengths")
+                if seq_lengths is not None:
+                    count -= torch.count_nonzero(seq_lengths > 0).item() - 1


Avoid undercounting packed tokens in num_items_in_batch

When packed_seq_lengths is present (packing enabled), the token count is reduced by torch.count_nonzero(seq_lengths > 0).item() - 1, but labels[..., 1:] has already removed one token per sample. For batch sizes larger than 1, this double-subtracts batch_size - 1 tokens (the correction is the total number of sequences minus one, rather than minus one per row), so num_items_in_batch becomes too small and the loss is normalized by an artificially low denominator whenever packing is used with multiple examples per batch.
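The arithmetic behind this comment can be checked on a toy batch. The sketch below is plain Python rather than the library code. It assumes a hypothetical layout in which packed_seq_lengths holds one list of sequence lengths per batch row, and that no boundary label is already masked; the function names are illustrative only.

```python
IGNORE_INDEX = -100  # label value conventionally excluded from the loss


def count_target_tokens(labels):
    """Count non-ignored labels after dropping the first position of each row,
    mirroring the labels[..., 1:] shift described in the review."""
    return sum(1 for row in labels for tok in row[1:] if tok != IGNORE_INDEX)


def correction_as_in_diff(packed_seq_lengths):
    """count_nonzero(seq_lengths > 0) - 1, taken over the whole batch."""
    total_sequences = sum(1 for row in packed_seq_lengths for n in row if n > 0)
    return total_sequences - 1


def correction_per_row(packed_seq_lengths):
    """Hypothetical alternative: one boundary per extra sequence in each row."""
    return sum(
        max(sum(1 for n in row if n > 0) - 1, 0)
        for row in packed_seq_lengths
    )


# Two rows, each packing two sequences (lengths 3+2 and 4+1); no label is masked.
labels = [[11, 12, 13, 21, 22], [31, 32, 33, 34, 41]]
packed_seq_lengths = [[3, 2], [4, 1]]

shifted = count_target_tokens(labels)
as_in_diff = shifted - correction_as_in_diff(packed_seq_lengths)
per_row = shifted - correction_per_row(packed_seq_lengths)
```

Under these assumptions the per-row variant yields 6, which matches the number of real next-token targets (2 + 1 in the first row, 3 + 0 in the second), while the expression from the diff yields 5, one fewer for the one extra row. The diff's expression also evaluates to -1 when no length is positive, which would raise the count by one.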

gemini-code-assist bot reviewed Nov 7, 2025
unsloth_zoo/loss_utils.py (resolved)

djsaunde requested a review from danielhanchen November 7, 2025 18:13

djsaunde mentioned this pull request Nov 7, 2025
Uncontaminated Sample Packing unslothai/unsloth#3525 (Closed, 1 task)

djsaunde force-pushed the packed-token-count branch from 0c1e04f to dd7fe22 November 7, 2025 23:49

djsaunde added 2 commits November 19, 2025 12:40
42a0a25 packed seq lengths token count correction
1e5b0b7 update data collator swap logic (?)

djsaunde force-pushed the packed-token-count branch from dd7fe22 to 1e5b0b7 November 19, 2025 17:40

djsaunde commented Nov 21, 2025

chatgpt-codex-connector bot reviewed Nov 27, 2025

danielhanchen requested changes Nov 27, 2025
unsloth_zoo/loss_utils.py (resolved)

800f450 Update unsloth_zoo/loss_utils.py

danielhanchen merged commit 98b20ef into main Nov 27, 2025
2 checks passed
