Commit 9393405

20250426_00-rel Pretty complete and usable now.
- Updated docs
- A shortcut for a macOS "Quick Action"
- Tested on macOS
1 parent 69a11c5 commit 9393405

10 files changed: +163 -41 lines

README.md

+91 -23
@@ -1,59 +1,127 @@
 # UnicodeFix

-Normalizes Unicode to ASCII equivalents
+Normalizes Unicode to ASCII equivalents.
+
+**I'm getting this out quickly as people need it. Updates will follow to polish this up more soon.**
+
+- [UnicodeFix](#unicodefix)
+  - [Installation](#installation)
+  - [Usage](#usage)
+  - [Shortcut for macOS](#shortcut-for-macos)
+    - [To add the shortcut:](#to-add-the-shortcut)
+  - [What's in This Repo:](#whats-in-this-repo)
+  - [Contributing](#contributing)
+  - [Support This and Other Projects I Have](#support-this-and-other-projects-i-have)
+  - [License](#license)

 ## Installation

-Clone the repository somewhere on your system. You will need to pop up a terminal window to do this.
+Clone the repository somewhere on your system. You will need to pop open a terminal window to do this.

-Then copy and paste the following commands into the command window.
+Then copy and paste the following commands into the terminal:

 ```bash
 git clone https://github.com/unixwzrd/UnicodeFix.git
 cd UnicodeFix
 bash setup.sh
 ```

-Setup will create a virtual environment to keep your system python clean.
-It will add the items needed to startup the script into your .bashrc
+Setup will create a virtual environment to keep your system Python clean.
+It will also add the items needed to start the script into your `.bashrc`.

-Look at the [setup.sh](setup.sh) file to see what it does if you like, it's very simple.
+Look at the [setup.sh](setup.sh) file to see exactly what it does if you like it's very simple.

-The .bashrc items are necessary because I will have a shortcut you may use from the macOS context menu to run the script shortly.
+The `.bashrc` items are necessary because I have a Shortcut you may use from the macOS context menu to run the script directly.

 ## Usage

 ```bash
 (python-3.10-PA-dev) [unixwzrd@xanax: UnicodeFix]$ python bin/cleanup-text.py --help
-usage: cleanup-text.py [-h] [-o OUTPUT] [infile]
+usage: cleanup-text.py [-h] [infile ...]

 Clean Unicode quirks from text.

 positional arguments:
-  infile      Input file (or use STDIN)
+  infile      Input file(s)

 options:
-  -h, --help  show this help message and exit
-  -o OUTPUT, --output OUTPUT
-              Output file (default: STDOUT)
+  -h, --help  Show this help message and exit

+Example:
 python bin/cleanup-text.py <input_file>
 ```

-## What's in this repo:
+The output file will be named the same as the input file, but with a `.clean.txt` extension.
+
+You can select multiple files at once.
+
+## Shortcut for macOS
+
+There is a "Shortcut" file in the `macOS/` directory which may be imported into the Shortcuts app.
+It will allow the script to be run as a **Quick Action** from the Finder "Right Click" menu.
+This allows selecting multiple files and scrubbing the Unicode quirks from them in bulk.
+
+### To add the shortcut:
+
+1. Open the "Shortcuts" app.
+
+![Shortcuts App Menu](macOS/Screenshot%202025-04-25%20at%2005.50.57.png)
+
+2. Go to `File -> Import...`
+
+![Import Shortcut](macOS/Screenshot%202025-04-25%20at%2005.51.54.png)
+
+3. Navigate to the `macOS` directory in this repository and select the `Strip Unicode.shortcut` file.
+
+![Select Shortcut File](macOS/Screenshot%202025-04-25%20at%2005.47.51.png)
+
+4. You will need to open the shortcut and change the location path of the `cleanup-text.py` script.
+
+![Edit Shortcut Script Path](macOS/Screenshot%202025-04-25%20at%2005.07.47.png)
+
+5. You may have to restart Finder (use `Command+Option+Esc`, select Finder, and click "Relaunch").
+
+6. Once setup, right-click on a file or multiple files in Finder, go to `Quick Actions`, and select `Strip Unicode`.
+
+This will invoke the script on the selected files and create `.clean.txt` versions.
+
+Strip all the Unicode quirks out of your text files right in the finder using a Quick Action!
+
+If you know a better way for Linux or Windows users, feel free to submit a PR with your improvements.
+
+## What's in This Repo:
+
+- [bin/cleanup-text.py](bin/cleanup-text.py) — The script that cleans up the text.
+- [bin/cleanup-text](bin/cleanup-text) — A symlink without the `.py` extension for prettier usage in scripts.
+- [setup.sh](setup.sh) — A script that sets up the virtual environment.
+- [LICENSE](LICENSE) — The license for the project.
+- [README.md](README.md) — This file.
+- [requirements.txt](requirements.txt) — The dependencies needed to run.
+- [data/](data/) — Sample files full of Unicode issues for testing.
+- [docs/](docs/) — Supporting documentation for the project.
+- [macOS/](macOS/) — The Shortcut file for macOS users.
+
+## Contributing
+
+If you have suggestions, enhancements, or fixes, feel free to open an issue or pull request!
+Testing and feedback are also very welcome.
+
+## Support This and Other Projects I Have
+
+AI and Unix are my passions — but I need to pay the bills too.
+
+If you find this project useful, please tell others, and consider supporting my work:
+
+- [Patreon](https://www.patreon.com/unixwzrd)
+- [Buy me a Ko-Fi](https://ko-fi.com/unixwzrd)
+- [Buy me a Coffee](https://www.buymeacoffee.com/unixwzrd)

-- [bin/cleanup-text.py](bin/cleanup-text.py) - The script that cleans up the text.
-- [setup.sh](setup.sh) - A script that sets up the environment to run the script.
-- [LICENSE](LICENSE) - The license for the project.
-- [README.md](README.md) - This file.
-- [requirements.txt](requirements.txt) - The dependencies for the project.
-- [data/](data/) - A directory with sample files full of unicode to test with.
+Thank you!

-## Coming SOon
-- macSO Shortcut
+## License

-## License
-Copyright 2025 [email protected]
+Copyright 2025
+

 [MIT License](LICENSE)

bin/cleanup-text

+1
@@ -0,0 +1 @@
+cleanup-text.py

bin/cleanup-text.py

+70 -17
@@ -1,38 +1,91 @@
-#!/usr/bin/env python3
-#
+#!/usr/bin/env python
+
+"""
+Unicode Text Cleaner
+
+This script normalizes problematic Unicode characters to their ASCII equivalents.
+It handles common issues like fancy quotes, em/en dashes, and zero-width spaces
+that can cause problems in text processing.
+
+The script takes one or more input files and creates cleaned versions with
+".clean.txt" appended to the original filename. It skips duplicate files
+and handles errors gracefully.
+
+Example:
+    $ python cleanup-text.py file1.txt file2.txt
+    [✓] Cleaned: file1.txt → file1.clean.txt
+    [✓] Cleaned: file2.txt → file2.clean.txt
+"""
+
 import argparse
+import os
 import re
-import sys

 from unidecode import unidecode


-def clean_text(text):
+def clean_text(text: str) -> str:
     """
-    Clean Unicode quirks from text.
+    Normalize problematic or invisible Unicode characters to safe ASCII equivalents.
+
+    This function performs two main operations:
+    1. Converts typographic characters (quotes, dashes) to their ASCII equivalents
+    2. Removes zero-width and invisible Unicode characters
+
+    Args:
+        text (str): The input text containing Unicode characters
+
+    Returns:
+        str: The cleaned text with normalized ASCII characters
+
+    Example:
+        >>> clean_text('"Hello" — World')
+        '"Hello" - World'
     """
     replacements = {
-        '\u2018': "'", '\u2019': "'",
-        '\u201C': '"', '\u201D': '"',
-        '\u2013': '-', '\u2014': '-',
+        '\u2018': "'", '\u2019': "'",  # Smart single quotes
+        '\u201C': '"', '\u201D': '"',  # Smart double quotes
+        '\u2013': '-', '\u2014': '-',  # En and em dashes
     }
     for orig, repl in replacements.items():
         text = text.replace(orig, repl)
-    text = re.sub(r'[\u200B\u200C\u200D\uFEFF]', '', text)
-    return unidecode(text)
+    return re.sub(r'[\u200B\u200C\u200D\uFEFF]', '', text)


 def main():
+    """
+    Main function that handles command-line interface and file processing.
+
+    Parses command line arguments, processes input files, and creates cleaned
+    output files. Handles duplicate files and errors gracefully with informative
+    messages.
+
+    Returns:
+        None
+    """
     parser = argparse.ArgumentParser(description="Clean Unicode quirks from text.")
-    parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin,
-                        help='Input file (or use STDIN)')
-    parser.add_argument('-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
-                        help='Output file (default: STDOUT)')
+    parser.add_argument("infile", nargs="+", help="Input file(s)")
     args = parser.parse_args()

-    input_text = args.infile.read()
-    cleaned = clean_text(input_text)
-    args.output.write(cleaned)
+    seen = set()
+    for infile in args.infile:
+        if infile in seen:
+            print(f"[!] Skipping duplicate: {infile}")
+            continue
+        seen.add(infile)
+
+        try:
+            with open(infile, "r", encoding="utf-8", errors="replace") as f:
+                raw = f.read()
+            cleaned = clean_text(raw)
+
+            base, _ = os.path.splitext(infile)
+            outfile = base + ".clean.txt"
+            with open(outfile, "w", encoding="utf-8") as f:
+                f.write(cleaned)
+            print(f"[✓] Cleaned: {infile} → {outfile}")
+        except Exception as e:
+            print(f"[✗] Failed to process {infile}: {e}")


 if __name__ == '__main__':
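One effect of this hunk is that `clean_text` no longer passes its result through `unidecode`, so only the characters in the replacement table and the zero-width pattern are changed. The sketch below copies that logic into a standalone function to illustrate the consequence. The sample string and the constant names `REPLACEMENTS` and `ZERO_WIDTH` are illustrative and not taken from the repository.

```python
import re

# Same mapping as in the hunk: typographic punctuation to ASCII.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # smart single quotes
    "\u201C": '"', "\u201D": '"',  # smart double quotes
    "\u2013": "-", "\u2014": "-",  # en and em dashes
}

# Zero-width space, non-joiner, joiner, and the byte-order mark.
ZERO_WIDTH = re.compile(r"[\u200B\u200C\u200D\uFEFF]")


def clean_text(text: str) -> str:
    """Apply the table replacements, then drop zero-width characters."""
    for orig, repl in REPLACEMENTS.items():
        text = text.replace(orig, repl)
    return ZERO_WIDTH.sub("", text)


if __name__ == "__main__":
    sample = "\u201cCaf\u00e9\u201d \u2014 na\u00efve\u200b text"
    print(clean_text(sample))
```

Accented letters such as `é` and `ï` pass through unchanged here, whereas the removed `unidecode(text)` call would have transliterated them to ASCII. In the lines shown, the `from unidecode import unidecode` import stays in place even though the function no longer calls it.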
File renamed without changes.

macOS/Strip Unicode.shortcut

21.4 KB
Binary file not shown.

requirements.txt

+1 -1
@@ -1 +1 @@
-Unidecode==1.4.0
+unidecode==1.4.0
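The `requirements.txt` change only alters the case of the package name. Assuming the installer compares names using the normalization rule from PEP 503, as pip does, both spellings refer to the same project. The sketch below applies that rule; the function name `normalize` is illustrative.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503:
    collapse runs of '-', '_' and '.' into '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


if __name__ == "__main__":
    print(normalize("Unidecode") == normalize("unidecode"))
```

On that assumption the rename is cosmetic and should not change which distribution gets installed.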
