-
-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
197 additions
and
168 deletions.
There are no files selected for viewing
365 changes: 197 additions & 168 deletions
365
Notebooks/5_javascript_unicode_pattern_matching.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,170 +1,199 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Unicode Pattern Matching" | ||
] | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "view-in-github", | ||
"colab_type": "text" | ||
}, | ||
"source": [ | ||
"<a href=\"https://colab.research.google.com/github/sualeh/What-a-Character/blob/go/Notebooks/5_javascript_unicode_pattern_matching.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "PHrPaZuXetrR" | ||
}, | ||
"source": [ | ||
"# Unicode Pattern Matching" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "l35sZ-DdetrS" | ||
}, | ||
"source": [ | ||
"## Case Insensitive Matching" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "4bpKPj0yetrT" | ||
}, | ||
"source": [ | ||
"In Greek, the word for dog in lowercase is \"σκύλος\". Notice that the first and last letter are both sigma." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "b0D8XmmUetrT" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"const lower_greek = /σκύλος/iu\n", | ||
"const upper_greek = \"ΣΚΎΛΟΣ\";\n", | ||
"matches = lower_greek.test(upper_greek);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "e4kztbf8etrT" | ||
}, | ||
"source": [ | ||
"When a lower -case character results in more than one uppercase character, there is no match." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "IBI6YatUetrT" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
" const lower_german = /\"straße\"/iu\n", | ||
" const upper_german = \"STRASSE\";\n", | ||
" matches = lower_german.test(upper_german);\n", | ||
"\n", | ||
" console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "N8U7JfYyetrT" | ||
}, | ||
"source": [ | ||
"## Matching Numbers" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "kEyK5CxBetrU" | ||
}, | ||
"source": [ | ||
"A naive match with a range of digits `[0-9]` does not work." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "3hm1wMm7etrU" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"digit = /[0-9]+/\n", | ||
"matches = digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "y9AJrxKFetrU" | ||
}, | ||
"source": [ | ||
"A slightly better regular expression with a `\\d` pattern does not work either." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "dR9XOTsTetrU" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"standard_digit = /\\d+/\n", | ||
"matches = standard_digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "bFFCgcWKetrU" | ||
}, | ||
"source": [ | ||
"The best way to match digits is by matching against the Unicode Decimal Number Category (Nd), using a Unicode Category pattern `\\p{Nd}`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "EHmWCZjyetrU" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"unicode_digit = /\\p{Nd}+/u\n", | ||
"matches = unicode_digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "base", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
}, | ||
"colab": { | ||
"provenance": [], | ||
"include_colab_link": true | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"matches = false;" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Case Insensitive Matching" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In Greek, the word for dog in lowercase is \"σκύλος\". Notice that the first and last letter are both sigma." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"const lower_greek = /σκύλος/iu\n", | ||
"const upper_greek = \"ΣΚΎΛΟΣ\";\n", | ||
"matches = lower_greek.test(upper_greek);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"When a lower -case character results in more than one uppercase character, there is no match." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
" const lower_german = /\"straße\"/iu\n", | ||
" const upper_german = \"STRASSE\";\n", | ||
" matches = lower_german.test(upper_german);\n", | ||
"\n", | ||
" console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Matching Numbers" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A naive match with a range of digits `[0-9]` does not work." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"digit = /[0-9]+/\n", | ||
"matches = digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A slightly better regular expression with a `\\d` pattern does not work either." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"standard_digit = /\\d+/\n", | ||
"matches = standard_digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The best way to match digits is by matching against the Unicode Decimal Number Category (Nd), using a Unicode Category pattern `\\p{Nd}`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%script node\n", | ||
"\n", | ||
"hindiNumber = \"१२३४५६७८९०\";\n", | ||
"\n", | ||
"unicode_digit = /\\p{Nd}+/u\n", | ||
"matches = unicode_digit.test(hindiNumber);\n", | ||
"\n", | ||
"console.log(matches);" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "base", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |