Hindi TN 2.0 - Accuracy Enhancements & New Telephone Class Integration#294
ngachchi wants to merge 17 commits into NVIDIA:staging_hi_tn
* Future Implementations for classes - Measure, Money, and Date (NVIDIA#258)
* Future Implementations for classes - Measure, Money, and Date
* Resolved the conflicts with mm_yyyy and date ranges and added the previously removed failing test cases.
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* removed the unused empty string implementation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* minor fixes for the tagger files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* reformatted decimal final graph
* incorporated the suggestion for decimal graph
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Century implementations
* Working on the yyyy format for the date class
* reverted yyyy code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* working on future implementations
* working on improving the date class accuracy
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* added year prefix for the date class
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* working on the comma cases for date class
* minor fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* implemented mixed fractions
* rectified the test case
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* working on quarterly measurements
* reformatted the prefixes and suffixes for date tagger class
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* replaced text tag with era tag for the date class
* Removed the text tag reference from date class verbalizer
---------
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* update jenkins cache
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Potential fix for code scanning alert no. 821: Unused local variable
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
  Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
---------
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: Namrata Gachchi <ngachchi@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
nemo_text_processing/text_normalization/hi/taggers/telephone.py (outdated, resolved)
nemo_text_processing/text_normalization/hi/verbalizers/fraction.py (outdated, resolved)
…e telephone class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
nemo_text_processing/text_normalization/hi/data/telephone/STD_codes.tsv (outdated, resolved)
@@ -0,0 +1,8 @@
२ दो
Is this mapping any different from cardinals (lines 1-4)?
These entries are used to validate landline numbers within India, which start with specific digits.
But is the mapping itself any different from cardinals? If not, please import it from cardinals and restrict the accepted inputs.
I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.
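The suggestion above (reuse the cardinal digit mapping and only restrict which inputs are accepted) can be sketched in plain Python. This is a minimal illustration using dicts, not the grammar itself: the digit table stands in for the cardinal TSV, and the sets of accepted leading digits are assumptions made for the example. In the actual grammar the same restriction would presumably be expressed on FSTs rather than dicts.

```python
# Stand-in for the cardinal single-digit mapping (Devanagari digit -> Hindi word).
# The real data lives in the cardinal TSV files; this table is illustrative only.
CARDINAL_DIGITS = {
    "०": "शून्य", "१": "एक", "२": "दो", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छह", "७": "सात", "८": "आठ", "९": "नौ",
}

# Assumed leading digits for landline and mobile numbers (hypothetical values).
LANDLINE_START = {"२", "३", "४", "६"}
MOBILE_START = {"६", "७", "८", "९"}


def restrict(mapping, accepted):
    """Return the sub-mapping whose keys are in `accepted`."""
    return {digit: word for digit, word in mapping.items() if digit in accepted}


# No duplicated data: both tables are views of the cardinal mapping.
landline_start_map = restrict(CARDINAL_DIGITS, LANDLINE_START)
mobile_start_map = restrict(CARDINAL_DIGITS, MOBILE_START)
```

With this shape, a change to a digit's spelling only has to be made once, in the cardinal data.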
@@ -0,0 +1,8 @@
६ छह
Is this mapping any different from cardinals (lines 1-4)?
These entries are used to validate mobile numbers within India, which start with specific digits.
But is the mapping itself any different from cardinals? If not, please import it from cardinals and restrict the accepted inputs.
I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.
@@ -0,0 +1,20 @@
० शून्य
Is this mapping any different from cardinals?
For Hindi digits, no; it is the same as the cardinal single digits. For English digits, yes; it is a common resource for the telephone class.
Can you please use cardinal for Hindi digits and filter the inputs you need, and add a file only for English digits in that case? Let's avoid repetition.
Yes, sure. I've updated it accordingly.
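A rough sketch of the split agreed on in this thread: Hindi digits come from the cardinal mapping, and only the English (ASCII) digits get their own small table. The tables and the `read_digits` helper are hypothetical and written in plain Python for illustration; the real implementation works on the repository's TSV files and FST graphs.

```python
# Stand-in for the cardinal single-digit mapping, listed in order ० to ९.
HINDI_DIGITS = {
    "०": "शून्य", "१": "एक", "२": "दो", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छह", "७": "सात", "८": "आठ", "९": "नौ",
}

# Assumed content of the separate English-digit file: "0".."9" -> same Hindi words.
ENGLISH_DIGITS = {str(i): word for i, word in enumerate(HINDI_DIGITS.values())}

# The telephone class accepts either script, so the lookup is the union of both.
TELEPHONE_DIGITS = {**HINDI_DIGITS, **ENGLISH_DIGITS}


def read_digits(number):
    """Verbalize a digit string one digit at a time; raises KeyError on non-digits."""
    return " ".join(TELEPHONE_DIGITS[ch] for ch in number)
```

For example, `read_digits("९८")` and `read_digits("98")` both produce the same Hindi reading, while the Hindi words themselves are defined only once.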
@@ -0,0 +1,100 @@
० एक
Is this mapping any different from cardinals (lines 1-4)?
Yes. A value ending in .75 is read as पौने (a quarter less than the next whole number), so for 0.75 zero is mapped to one in paune_mappings.
We don't want a 100-line data file. Please reuse cardinal when applicable or reapply with rules elsewhere.
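One way to read the reviewer's request is that the shifted mapping can be derived by rule instead of being stored: n.75 is verbalized with the cardinal for n + 1. The sketch below assumes that reading and uses a small illustrative cardinal table; the function name `paune` and the exact output format are assumptions, not the PR's implementation.

```python
# Small stand-in for the cardinal mapping (integer -> Hindi word).
CARDINALS = {
    0: "शून्य", 1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच",
    6: "छह", 7: "सात", 8: "आठ", 9: "नौ", 10: "दस",
}


def paune(integer_part):
    """Verbalize integer_part + 0.75 as 'पौने <integer_part + 1>'.

    The shift by one replaces a separate data file that would map each
    number n to the word for n + 1.
    """
    return "पौने " + CARDINALS[integer_part + 1]
```

Under these assumptions `paune(0)` reuses the cardinal for one and `paune(2)` reuses the cardinal for three, so no shifted copy of the cardinal table is needed.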
nemo_text_processing/text_normalization/hi/taggers/telephone.py (outdated, resolved)
    def __init__(self):
        super().__init__(name="telephone", kind="classify")

        mobile_number = generate_mobile(["नंबर", "मोबाइल", "फोन", "कॉल"])
Can these inputs be part of a TSV file instead of being hardcoded here?
Yes, sure. I've removed these inputs and moved them into their respective TSV files.
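As an illustration of moving the hardcoded context words into data, the sketch below reads them from TSV content with the standard `csv` module. The one-word-per-line layout and the helper name `load_context_words` are hypothetical; an in-memory string stands in for the data file so the example is self-contained.

```python
import csv
import io

# Stand-in for a TSV data file holding one context word per line.
# The layout is an assumption, not the layout used in the PR.
CONTEXT_TSV = "नंबर\nमोबाइल\nफोन\nकॉल\n"


def load_context_words(handle):
    """Return the first column of every non-empty TSV row."""
    return [row[0] for row in csv.reader(handle, delimiter="\t") if row]


# The tagger would receive this list instead of a hardcoded literal.
context_words = load_context_words(io.StringIO(CONTEXT_TSV))
```

Reading from a real file would only change the handle, for example `open(path, encoding="utf-8", newline="")`.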
tests/nemo_text_processing/hi/data_text_normalization/test_cases_fraction.txt (resolved)
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
What does this PR do?
This PR introduces Hindi Text Normalization 2.0, which features substantial accuracy improvements across multiple classes and the addition of a new Telephone class. It also integrates culturally relevant linguistic constructs to enhance natural language understanding.
Accuracy Improvements by Class:
Key Enhancements:
New Class: Telephone
Linguistic Enrichment:
Before your PR is "Ready for review"
Pre checks:
* git commit -s to sign.
* pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
* bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
* pytest and Sparrowhawk here.
* __init__.py for every folder and subfolder, including data folder which has .TSV files?
* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
* Copyright 2015 and onwards Google, Inc.. See an example here.
* (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.