Convert regular expression examples to Go
sualeh committed May 7, 2024
1 parent ec90e53 commit 5aaca82
Showing 1 changed file with 259 additions and 0 deletions.
259 changes: 259 additions & 0 deletions Notebooks/5_go_unicode_pattern_matching.ipynb
@@ -0,0 +1,259 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/github/sualeh/What-a-Character/blob/go/Notebooks/5_go_unicode_pattern_matching.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------\n",
"\n",
"## Google Colab\n",
"\n",
"You can run this notebook in Google Colab. The cell below should be run only once, and then followed by a change of runtime to `Go (gonb)`. Refresh the browser before running any subsequent code. If you are not running the notebook in Google Colab, skip this section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "polyglot-notebook"
}
},
"outputs": [],
"source": [
"#@title Prepare Google Colab for Go Kernel\n",
"\n",
"# Install Go.\n",
"!echo -n \"Installing go ...\"\n",
"!mkdir -p cache\n",
"!wget -q -O cache/go.tar.gz 'https://go.dev/dl/go1.22.2.linux-amd64.tar.gz'\n",
"!tar xzf cache/go.tar.gz\n",
"%env GOROOT=/content/go\n",
"!ln -sf \"/content/go/bin/go\" /usr/bin/go\n",
"!echo \" done.\"\n",
"!go version\n",
"\n",
"# Install gonb, goimports, gopls.\n",
"!echo -n \"Installing gonb ...\"\n",
"!go install github.com/janpfeifer/gonb@latest >& /tmp/output || cat /tmp/output\n",
"!echo \" done.\"\n",
"!ln -sf /root/go/bin/gonb /usr/bin/gonb\n",
"\n",
"!echo -n \"Installing goimports ...\"\n",
"!go install golang.org/x/tools/cmd/goimports@latest >& /tmp/output || cat /tmp/output\n",
"!echo \" done.\"\n",
"!ln -sf /root/go/bin/goimports /usr/bin/goimports\n",
"\n",
"!echo -n \"Installing gopls ...\"\n",
"!go install golang.org/x/tools/gopls@latest >& /tmp/output || cat /tmp/output\n",
"!echo \" done.\"\n",
"!ln -sf /root/go/bin/gopls /usr/bin/gopls\n",
"\n",
"# Install gonb kernel configuration.\n",
"!gonb --install --logtostderr\n",
"!echo \"Done!\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Auhkou2CdVVl"
},
"source": [
"# Unicode Pattern Matching"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ioVqHQLkdVVm"
},
"source": [
"## Case Insensitive Matching"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qoWvPXGJdVVo"
},
"source": [
"In Greek, the word for dog in lowercase is \"σκύλος\". Notice that the first and last letters are both sigma, written in two different forms (σ and word-final ς)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Rvfkcl8JdVVo"
},
"outputs": [],
"source": [
"%%\n",
"// In Go, (?i) already folds case across Unicode; the U flag means ungreedy, not Unicode.\n",
"patternGreek := regexp.MustCompile(\"(?i)σκύλος\")\n",
"matches := patternGreek.MatchString(\"ΣΚΎΛΟΣ\")\n",
"\n",
"fmt.Println(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cjQ2WOGNdVVp"
},
"source": [
"When the uppercase form of a lowercase character is more than one character, as with German \"ß\" and \"SS\", there is no match."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xNnL9Xf7dVVp"
},
"outputs": [],
"source": [
"%%\n",
"patternGerman := regexp.MustCompile(\"(?i)straße\")\n",
"matches := patternGerman.MatchString(\"STRASSE\")\n",
"\n",
"fmt.Println(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "detk1R7DdVVq"
},
"source": [
"## Matching Numbers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Er1YlQy3dVVq"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2xJQaHyxdVVr"
},
"source": [
"A naive match with a range of ASCII digits `[0-9]` does not match the Devanagari digits used to write Hindi numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CVdo8t6ndVVr"
},
"outputs": [],
"source": [
"%%\n",
"hindiNumber := \"१२३४५६७८९०\"\n",
"digitRegex := regexp.MustCompile(\"[0-9]+\")\n",
"matches := digitRegex.MatchString(hindiNumber)\n",
"\n",
"fmt.Println(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kdtZo30XdVVr"
},
"source": [
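"In some regular expression engines, `\\d` also matches Unicode digits. In Go, however, `\\d` is equivalent to `[0-9]` and matches only ASCII digits, so it does not match here either."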
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DsNUXh3jdVVr"
},
"outputs": [],
"source": [
"%%\n",
"hindiNumber := \"१२३४५६७८९०\"\n",
"digitRegex := regexp.MustCompile(\"\\\\d+\")\n",
"matches := digitRegex.MatchString(hindiNumber)\n",
"\n",
"fmt.Println(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ang1w7yGdVVs"
},
"source": [
"The best way to match digits is by matching against the Unicode Decimal Number Category (Nd), using a Unicode Category pattern `\\p{Nd}`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Shx_V-xsdVVs"
},
"outputs": [],
"source": [
"%%\n",
"hindiNumber := \"१२३४५६७८९०\"\n",
"digitRegex := regexp.MustCompile(\"\\\\p{Nd}+\")\n",
"matches := digitRegex.MatchString(hindiNumber)\n",
"\n",
"fmt.Println(matches)"
]
}
],
"metadata": {
"colab": {
"include_colab_link": true,
"provenance": []
},
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
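
Note on running these checks outside the notebook: the cells above rely on gonb's `%%` shorthand, which wraps the cell body in `main` and resolves imports. What follows is a minimal standalone sketch, assuming plain Go and only the standard `regexp` package. The helper names `matchFold` and `hasMatch` are hypothetical and not part of the notebook.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFold reports whether text contains a case-insensitive match for the
// literal word. The name is hypothetical, chosen for this sketch.
func matchFold(word, text string) bool {
	// QuoteMeta escapes regexp metacharacters so that word is matched
	// literally; (?i) turns on case-insensitive matching.
	re := regexp.MustCompile("(?i)" + regexp.QuoteMeta(word))
	return re.MatchString(text)
}

// hasMatch reports whether text contains a match for pattern.
// It panics if pattern does not compile, like regexp.MustCompile.
func hasMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Case-insensitive matching: Greek sigma forms, then German sharp s.
	fmt.Println(matchFold("σκύλος", "ΣΚΎΛΟΣ"))
	fmt.Println(matchFold("straße", "STRASSE"))

	// Digit matching against Devanagari digits.
	hindiNumber := "१२३४५६७८९०"
	fmt.Println(hasMatch(`[0-9]+`, hindiNumber))
	fmt.Println(hasMatch(`\d+`, hindiNumber))
	fmt.Println(hasMatch(`\p{Nd}+`, hindiNumber))
}
```

Raw string literals (backquotes) are used for the digit patterns so that `\d` and `\p{Nd}` need no doubled backslashes. Of the three digit patterns, only `\p{Nd}+` should report a match in Go.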