Add Korean TN support for cardinal numbers and postprocessing#285
Add Korean TN support for cardinal numbers and postprocessing#285mgrafu merged 6 commits intoNVIDIA:ko_tn_staging_v1from
Conversation
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
for more information, see https://pre-commit.ci
nemo_text_processing/text_normalization/ko/taggers/tokenize_and_classify.py
Fixed
Show fixed
Hide fixed
| digit_except_one = pynini.difference(NEMO_DIGIT, "1") | ||
| digit_except_zero_one = pynini.difference(digit_except_one, "0") | ||
|
|
||
| graph_digit_alt = digit_except_zero_one @ graph_digit |
There was a problem hiding this comment.
let's call this something like graph_digit_no_zero_one to make it more explicit in its 'alt'-ness
There was a problem hiding this comment.
Renamed graph_digit_alt to graph_digit_no_zero_one for clarity as suggested.
|
|
||
| graph_digit_alt = digit_except_zero_one @ graph_digit | ||
| graph_ty = pynini.string_file(get_abs_path("data/number/ty.tsv")) | ||
| graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv")) |
There was a problem hiding this comment.
are the rules for creating teens any different from the ones to create ties? does it make sense to have a separate file for them, or would it be possible to just add 1 십 to the ties file?
There was a problem hiding this comment.
You're right! Since both follow the pattern of combining a base digit with “십”, it makes sense to consolidate them. I’ll go ahead and remove the teen.tsv file and just add 1 십 to the ty.tsv file.
| graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv")) | ||
|
|
||
| # Compose all basic number forms | ||
| graph_all = (graph_ty + (graph_digit | pynutil.delete('0'))) | graph_teen | graph_digit |
There was a problem hiding this comment.
this covers numbers from 1 to 99, right? can we rename the variable for more clarity?
There was a problem hiding this comment.
I’ll rename it to graph_1_to_99 for clarity.
| graph_hundred_thousand = hundred_thousands @ graph_hundred_thousand_component | ||
|
|
||
| millions = NEMO_DIGIT**7 | ||
| graph_million_component = ((NEMO_DIGIT**3 @ graph_hundred_component) + pynutil.insert('만')) + pynini.union( |
There was a problem hiding this comment.
(NEMO_DIGIT**3 @ graph_hundred_component) == graph_hundred, let's try to use the same variable intead of redefining the rule (this applies to other _component graphs too)
There was a problem hiding this comment.
Updated the code to use graph_hundred and graph_thousand instead of (NEMO_DIGIT3 @ graph_hundred_component) and (NEMO_DIGIT4 @ graph_thousand_component).
|
|
||
| # FST | ||
| graph_num = pynini.union( | ||
| graph_thousand_trillions, |
There was a problem hiding this comment.
let's add a test case in test_cases_cardinal.txt for each of these rules
There was a problem hiding this comment.
Sure! I’ll add one test case for each of these rules in test_cases_cardinal.txt
| ).optimize() | ||
|
|
||
| # Sign and final formatting | ||
| optional_sign = pynini.closure(pynutil.insert('negative: "true" ') + pynini.cross("-", ""), 0, 1) |
There was a problem hiding this comment.
let's also add a few test cases in test_cases_cardinal.txt for negative numbers
There was a problem hiding this comment.
I'll add some negative cases as well.
| runtest $input | ||
| } | ||
|
|
||
| #testTNSpecialText() { |
There was a problem hiding this comment.
let's remove the commented out tests for now and add them as we develop each class
There was a problem hiding this comment.
Deleted all the commented out tests
…feedback Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
for more information, see https://pre-commit.ci
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
|
LGTM -- let's just get CI running properly |
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
…#285) * Add Korean TN support for cardinal numbers and postprocessing Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor Korean TN cardinal and postprocessing logic based on review feedback Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add __init__.py to ko/data directory Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * Update KO_TN_CACHE to trigger Korean CI run Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> --------- Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
…#285) * Add Korean TN support for cardinal numbers and postprocessing Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor Korean TN cardinal and postprocessing logic based on review feedback Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add __init__.py to ko/data directory Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> * Update KO_TN_CACHE to trigger Korean CI run Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> --------- Signed-off-by: Jinwoo Bae <bbae7050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
What does this PR do ?
Classify and verbalize grammars
Unit tests with coverage up to 17-digit numbers
Support for spacing around units (억, 만, 조, 경)
Post-processing logic for formatting
Unit tests and Sparrowhawk tests
Before your PR is "Ready for review"
Pre checks:
git commit -sto sign.pytestor (if your machine does not have GPU)pytest --cpufrom the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...pytestand Sparrowhawk here.__init__.pyfor every folder and subfolder, includingdatafolder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.to all newly added Python files?Copyright 2015 and onwards Google, Inc.. See an example here.try import: ... except: ...) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.