fix: resolve 18 English ITN cased test failures for 100% pass rate#10
Alex-Wengg merged 1 commit into main
- cardinal: preserve case for Zero/Twelve in cased mode
- electronic: rewrite domain/email parsing to preserve original casing, add stop-word detection, single-letter+word joining, remove dead code
- telephone: add >11 digit formatting (NNN NNNN [middle] NNNN)
- time: support prefix-preserving "X to Y" with am/pm context
    let at_pos = input.find(" at ")?;
    let orig_local = &original[..at_pos];
    let orig_domain = &original[at_pos + 4..];
🟡 Byte-position mismatch when slicing original with offset from lowercased input in parse_email
The code finds at_pos in input (which is original.to_lowercase()) at src/asr/en/electronic.rs:49, then uses that byte offset to slice original at lines 50-51. For non-ASCII characters whose lowercase form has a different byte length (e.g., Turkish İ (2 bytes) → i̇ (3 bytes), or ẞ (3 bytes) → ß (2 bytes)), a byte position in the lowercased string does not map to the same position in original. This can cause either a panic (slicing at an offset that is not a UTF-8 character boundary, e.g., input "testẞ at gmail dot com") or incorrect output (slicing at the wrong position, producing garbled local/domain parts).
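The mismatch can be reproduced in isolation. The following is a minimal sketch, not code from the PR: lowercase_byte_lens and at_offsets are hypothetical helper names introduced only for illustration, and they use nothing beyond the standard library.

```rust
/// Byte length of `s` before and after `str::to_lowercase`.
/// (Hypothetical helper, used only to illustrate the length change.)
fn lowercase_byte_lens(s: &str) -> (usize, usize) {
    (s.len(), s.to_lowercase().len())
}

/// Byte offset of " at " in `original` and in its lowercased copy.
/// (Hypothetical helper; the two offsets differ whenever lowercasing
/// changes the byte length of the text before the separator.)
fn at_offsets(original: &str) -> (Option<usize>, Option<usize>) {
    let lowered = original.to_lowercase();
    (original.find(" at "), lowered.find(" at "))
}
```

For "testẞ at gmail dot com", at_offsets returns (Some(7), Some(6)). Offset 6 falls inside the three-byte ẞ, so original.is_char_boundary(6) is false and &original[..6] would panic.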
Regression from safe primary path in old code
The old code had a safe primary path that split directly on original using original.splitn(2, " at "), only falling back to the byte-position approach when " at " wasn't literally lowercase in the original. The new code always uses the byte-position approach, and also newly slices orig_domain this way (line 51), which the old code didn't do.
Suggested change
    - let at_pos = input.find(" at ")?;
    - let orig_local = &original[..at_pos];
    - let orig_domain = &original[at_pos + 4..];
    + // Find " at " position in original (case-insensitive)
    + let at_pos = original.to_lowercase().find(" at ")?;
    + let orig_local = &original[..at_pos];
    + let orig_domain = &original[at_pos + 4..];
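An offset taken from original.to_lowercase() is still measured in the lowercased copy, so it appears to be subject to the same length mismatch described above. One way to avoid that is to search the bytes of original directly with an ASCII case-insensitive comparison. This is a sketch under stated assumptions, not the project's implementation: split_at_separator is a hypothetical name, and it assumes the separator is always the ASCII string " at " in any letter case.

```rust
/// Hypothetical helper: split `original` around the first " at " separator,
/// matched ASCII-case-insensitively. Offsets are computed on `original`
/// itself, so they are always valid for slicing `original`.
fn split_at_separator(original: &str) -> Option<(&str, &str)> {
    const SEP: &str = " at ";
    let pos = original
        .as_bytes()
        .windows(SEP.len())
        .position(|w| w.eq_ignore_ascii_case(SEP.as_bytes()))?;
    // A match starts with an ASCII space, which is never a UTF-8
    // continuation byte, so `pos` is a char boundary. The match ends
    // with an ASCII space too, so `pos + SEP.len()` is also a boundary.
    Some((&original[..pos], &original[pos + SEP.len()..]))
}
```

With this helper, "testẞ at gmail dot com" splits into "testẞ" and "gmail dot com" without panicking, and "John AT Example dot com" keeps its original casing on both sides.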
Summary
Test plan