Skip to content

fix: resolve 18 English ITN cased test failures for 100% pass rate#10

Merged
Alex-Wengg merged 1 commit intomainfrom
fix/en-itn-100
Mar 13, 2026
Merged

fix: resolve 18 English ITN cased test failures for 100% pass rate#10
Alex-Wengg merged 1 commit intomainfrom
fix/en-itn-100

Conversation

@Alex-Wengg
Copy link
Member

@Alex-Wengg Alex-Wengg commented Mar 13, 2026

Summary

  • cardinal: Preserve original casing for "Zero" and "Twelve" in cased mode (2 fixes)
  • electronic: Rewrite domain/email parsing to preserve original casing throughout, add stop-word detection and single-letter+word joining (11 fixes)
  • telephone: Add >11 digit formatting with space-separated groups (2 fixes)
  • time: Support prefix-preserving "X to Y" patterns with am/pm context, e.g. "set alarm at ten to eleven pm" → "set alarm at 10:50 p.m." (2 fixes)
  • Remove 4 dead functions from electronic.rs

Test plan

  • All 1,029 tests pass (0 failures)
  • cardinal_cased: 60/60 (was 58/60)
  • electronic_cased: 59/59 (was 48/59)
  • telephone + telephone_cased: 23/23 + 18/18 (was 21/23 + 17/18)
  • time + time_cased: 29/29 + 33/33 (was 28/29 + 32/33)
  • No regressions in any other language or module

Open with Devin

- cardinal: preserve case for Zero/Twelve in cased mode
- electronic: rewrite domain/email parsing to preserve original casing,
  add stop-word detection, single-letter+word joining, remove dead code
- telephone: add >11 digit formatting (NNN NNNN [middle] NNNN)
- time: support prefix-preserving "X to Y" with am/pm context
@Alex-Wengg Alex-Wengg merged commit 98d0195 into main Mar 13, 2026
2 checks passed
Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines +49 to +51
let at_pos = input.find(" at ")?;
let orig_local = &original[..at_pos];
let orig_domain = &original[at_pos + 4..];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Byte-position mismatch when slicing original with offset from lowercased input in parse_email

The code finds at_pos in input (which is original.to_lowercase()) at src/asr/en/electronic.rs:49, then uses that byte offset to slice original at lines 50-51. For non-ASCII characters whose lowercase form has a different byte length (e.g., Turkish İ (2 bytes) → (3 bytes), or (3 bytes) → ß (2 bytes)), the byte position from the lowercased string does not correspond to the correct character boundary in original. This can cause either a panic (slicing at a non-UTF-8 boundary, e.g., input "testẞ at gmail dot com") or incorrect output (slicing at the wrong position, producing garbled local/domain parts).

Regression from safe primary path in old code

The old code had a safe primary path that split directly on original using original.splitn(2, " at "), only falling back to the byte-position approach when " at " wasn't literally lowercase in the original. The new code always uses the byte-position approach, and also newly slices orig_domain this way (line 51), which the old code didn't do.

Suggested change
let at_pos = input.find(" at ")?;
let orig_local = &original[..at_pos];
let orig_domain = &original[at_pos + 4..];
// Find " at " position in original (case-insensitive)
let at_pos = original.to_lowercase().find(" at ")?;
let orig_local = &original[..at_pos];
let orig_domain = &original[at_pos + 4..];
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant