Grammarly/Copilot for Handwritten Chinese
Chinese characters are notoriously difficult to master. Unlike English, where words can be "sounded out" phonetically, Chinese characters often lack obvious patterns between one another. Writing them relies strictly on procedural memory rather than logic.
Personally, when writing Chinese essays I often find myself forgetting, misremembering, or simply not knowing how to write a character. This frustration led to a question: can there be a "Grammarly/Copilot" that fixes the errors in a character and autocompletes your ideas seamlessly... in handwritten Chinese?
Neurinese is a real-time handwriting intelligence engine. By combining stroke-level stylistic modeling with semantic language understanding, this project acts as a "copilot" for handwritten Chinese. Here are the two main features:
- **Smart Autocompletion**: recognizes the context of your sentence and automatically generates the next characters in the same handwriting style.
- **Context-Aware Autocorrect**: if you write a character slightly incorrectly, it detects the error, matches it against the context of the sentence, and regenerates the correct character, again in the user's handwriting style.
Currently, this project implements the autocompletion feature.
Writing 帮助 (help) with a deliberate slant: the model predicts and draws 你 (you) in the same handwriting style (slanted, in this case):
Extracting a style vector from the written characters and autocompleting in that style (trained on ~500 samples). Below is a failed attempt:
Note: Generation consistency is still being improved — currently training on a small dataset. See Milestones.
Here's a closer look at cross-character style transfer: extracting a style vector from a single written character and generating three different characters in that style.
Neurinese combines recognition and generation to form an end-to-end handwriting intelligence pipeline:
Unlike standard OCR which sees static images, this model learns from raw stroke data:
```
(dx, dy, p1, p2, p3)
```

- `dx, dy`: relative pen displacement from the previous point
- `p1 = 1`: pen is DOWN (a stroke is being drawn)
- `p2 = 1`: pen is UP (transition/travel between strokes)
- `p3 = 1`: end of character
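As an illustration of this format, a minimal sketch of converting absolute pen coordinates into the 5-dimensional representation (the function name and pen-state convention here are illustrative, not the project's actual preprocessing code):

```python
def to_stroke5(strokes):
    """Convert strokes (each a list of absolute (x, y) points)
    into the (dx, dy, p1, p2, p3) format described above."""
    points = []
    prev = (0, 0)
    for si, stroke in enumerate(strokes):
        for pi, (x, y) in enumerate(stroke):
            dx, dy = x - prev[0], y - prev[1]
            last_point = pi == len(stroke) - 1
            last_stroke = si == len(strokes) - 1
            if last_point and last_stroke:
                pen = (0, 0, 1)   # p3: end of character
            elif last_point:
                pen = (0, 1, 0)   # p2: pen up, travelling to the next stroke
            else:
                pen = (1, 0, 0)   # p1: pen down, stroke being drawn
            points.append((dx, dy, *pen))
            prev = (x, y)
    return points

# Two strokes: a horizontal line, then a vertical line
seq = to_stroke5([[(0, 0), (10, 0)], [(5, -5), (5, 5)]])
# seq[1] == (10, 0, 0, 1, 0): pen lifts after the first stroke ends
```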
By modeling the motion rather than the pixels, the system learns a continuous Latent Space that captures two distinct layers of information:
| Layer | What it captures | Handled by |
|---|---|---|
| Content | Which character is being drawn | `char_id` (explicit integer label) |
| Style | How a specific person writes | Latent style vector $z_{style}$ |
This enables autoregressive handwriting synthesis, where the model can literally draw characters as if you drew them yourself.
When Apple first released their Math Notes feature back in the Summer of 2024, it especially intrigued me with how it could not only solve equations but render the solution in the user's own handwriting style.
To achieve the handwriting aspect, a system must somehow understand the dynamics of writing rather than simply recognizing symbols. This project explores that concept in the context of handwritten Chinese, which inherently lacks predictable patterns, making it a fitting target for memorization-heavy modeling and handwriting synthesis.
Neurinese explores this idea in the context of Chinese handwriting by modeling characters as sequences of pen movements. This project focuses on learning stroke-level embeddings that facilitate autocompletion, synthesis, and style-aware generation.
While most character recognition systems rely on CNN-based image recognition, the crucial question this project seeks to answer is:
Can a model learn how characters are written, not just what they look like?
This project investigates human-centered AI and generative modeling, focusing on learning representations from the dynamics of handwriting rather than from images alone.
- User writes a sequence of characters.
- The CNN model recognizes the most recent character, matching it to an actual character and returning its character ID.
- The CVAE Encoder analyzes the strokes to generate a running User Style Embedding ($z_{style}$).
- The NLP model combines the recognized character with the sentence so far to analyze its meaning. Two pathways emerge:
  - If the user drew the character incorrectly given the context of the sentence, the autocorrection decision can be made.
  - Otherwise, factoring in the current character, the model predicts the next character(s)/phrases if the probability is above a threshold.
    - Input: "天气非常 (The weather is very)..."
    - Prediction: "好 (Good)" with 85% probability, except this happens at the handwriting level.
- The system passes the Style Vector ($z_{style}$) and the predicted next character "好 (Good)" into the CVAE Decoder.
- The system draws the character "好 (Good)" using the user's specific handwriting characteristics.
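The steps above can be sketched as a single inference loop. The component functions below are stubs standing in for the real CNN, CVAE, and NLP models; all names, signatures, and the threshold value are hypothetical:

```python
def autocomplete_step(strokes, sentence_so_far,
                      recognize, encode_style, predict_next, decode,
                      threshold=0.5):
    """One pass of the recognize -> style -> predict -> generate loop."""
    char_id = recognize(strokes)              # CNN: strokes -> character ID
    z_style = encode_style(strokes)           # CVAE encoder: running style embedding
    next_char, prob = predict_next(sentence_so_far + [char_id])  # NLP model
    if prob < threshold:
        return None                           # not confident enough to draw
    return decode(next_char, z_style)         # CVAE decoder draws in the user's style

# Stub components, for illustration only
result = autocomplete_step(
    strokes=[(1, 0, 1, 0, 0)],
    sentence_so_far=[10, 11, 12],
    recognize=lambda s: 13,
    encode_style=lambda s: [0.1] * 64,
    predict_next=lambda ctx: (14, 0.85),
    decode=lambda c, z: f"drawn:{c}",
)
# result == "drawn:14"
```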
| Stage | Shape | Notes |
|---|---|---|
| Raw stroke input | `(seq_len, 5)` | `[dx, dy, p1, p2, p3]` with one-hot pen state |
| Encoder LSTM input | `(batch, seq_len, 37)` | |
| Latent vector | `(batch, 64)` | Sampled via the reparameterization trick |
| Decoder LSTM input | `(batch, seq_len, 160)` | |
| Decoder output | `(batch, seq_len, 123)` | 120 MDN parameters + 3 pen logits |
This model learns the nuances of handwriting.
The system uses a recurrent VAE architecture (similar to SketchRNN), modified for the high-density stroke constraints and multimodality of Chinese characters. Character conditioning via embeddings fed to both the encoder and decoder forces the latent vector to specialize on style alone.
The pipeline consists of two core components:
- **Bi-Directional Encoder**: a bi-directional LSTM processes the input sequence of strokes, compressing them into a fixed-length latent vector $z$ sampled from a Gaussian distribution. This vector acts as a compressed embedding of the character.
- **Autoregressive Decoder**: an autoregressive uni-directional LSTM, conditioned on $z$ from the encoder at each step, predicts the probability distribution of the next state `(dx, dy, pen state)` based on the previous state and global context.

Because the encoder is told what character is being drawn via the assigned character ID, the latent vector is forced to capture only the user's stroke dynamics.
This allows us to extract a style from one character the user drew and generate a completely different character in that same style.
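The Gaussian sampling step relies on the reparameterization trick; a minimal NumPy sketch under the assumption of a diagonal Gaussian posterior (the actual model operates on batched torch tensors):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng=np.random.default_rng(0)):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable with respect to mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

mu = np.zeros(64)               # posterior mean from the bi-LSTM encoder
log_sigma = np.full(64, -2.0)   # small variance for illustration
z = reparameterize(mu, log_sigma)   # 64-dim latent style vector
```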
t-SNE of z_style - three characters form distinct clusters based on handwriting style
| Component | Target | Formula | Purpose |
|---|---|---|---|
| MDN Loss | `(dx, dy)` | Negative log-likelihood under a GMM | Multimodal stroke position distribution |
| Pen Loss | `(p1, p2, p3)` | Cross-entropy on 3-class one-hot | Pen state (down / up / end) classification |
| KL Divergence | Latent space | $D_{KL}(q(z \mid x)\,\Vert\,\mathcal{N}(0, I))$ | Regularise `z_style` toward a standard Gaussian prior |

KL Annealing: the KL weight is ramped up gradually during training; enforcing the prior too early collapses the latent space, leaving the decoder to rely solely on `char_id` and destroying style capacity.
Teacher Forcing with Input Dropout: 20% of decoder input tokens are randomly zeroed during training, preventing the model from over-relying on exact ground-truth context and improving robustness at inference time.
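Both training tricks can be expressed in a few lines. This is an illustrative sketch; the warmup schedule and constants are made up, not the project's actual values (only the 20% dropout rate comes from the text above):

```python
import numpy as np

def kl_weight(step, warmup_steps=10_000, max_weight=0.5):
    """Anneal the KL term's weight from 0 so the decoder learns to
    use z before the Gaussian prior is fully enforced."""
    return max_weight * min(1.0, step / warmup_steps)

def input_dropout(tokens, p=0.2, rng=np.random.default_rng(0)):
    """Randomly zero whole decoder input tokens during teacher forcing,
    so the model cannot over-rely on exact ground-truth context."""
    mask = rng.random(tokens.shape[:1]) >= p   # keep each token with prob 1 - p
    return tokens * mask[:, None]

x = np.ones((10, 5))          # 10 stroke-format tokens
x_dropped = input_dropout(x)  # ~20% of rows zeroed
```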
At any stroke step, multiple next positions are equally valid: a Chinese character contains several pen lifts and fine details. An MDN outputs a Gaussian Mixture Model over (dx, dy) rather than a single point estimate, following the formulation from
Graves (2013) — Generating Sequences with Recurrent Neural Networks.
Per output step:
```
pi        [20]   mixture weights        (softmax)
mu_x      [20]   x means per component
mu_y      [20]   y means per component
sigma_x   [20]   x std devs             (exp-activated)
sigma_y   [20]   y std devs             (exp-activated)
rho       [20]   x-y correlation        (tanh-activated)

MDN params:   20 × 6 = 120
Pen logits:   3 (p1, p2, p3 - cross-entropy target)
Total output: 123 dims per step
```
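Splitting a 123-dim output step into these parameters, with the activations listed above, might look like the following NumPy sketch (the parameter ordering within the 120 MDN dims is an assumption; the real model works on batched torch tensors):

```python
import numpy as np

def split_mdn_params(y, n_mix=20):
    """Split a 123-dim output step into MDN parameters + pen logits,
    applying softmax / exp / tanh activations as listed above."""
    mdn, pen_logits = y[: n_mix * 6], y[n_mix * 6 :]
    # Assumed layout: [pi | mu_x | mu_y | sigma_x | sigma_y | rho]
    pi, mu_x, mu_y, s_x, s_y, rho = mdn.reshape(6, n_mix)
    pi = np.exp(pi - pi.max())
    pi /= pi.sum()                                 # softmax -> mixture weights
    sigma_x, sigma_y = np.exp(s_x), np.exp(s_y)    # positive std devs
    rho = np.tanh(rho)                             # correlation in (-1, 1)
    return pi, mu_x, mu_y, sigma_x, sigma_y, rho, pen_logits

y = np.zeros(123)   # a dummy output step
pi, mu_x, mu_y, sx, sy, rho, pen = split_mdn_params(y)
# zeros -> uniform mixture weights, unit std devs, zero correlation
```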
The bivariate Gaussian probability for each mixture component $j$ is

$$\mathcal{N}(\Delta x, \Delta y \mid \mu_j, \sigma_j, \rho_j) = \frac{1}{2\pi \sigma_{x,j} \sigma_{y,j} \sqrt{1 - \rho_j^2}} \exp\!\left( \frac{-Z_j}{2(1 - \rho_j^2)} \right)$$

where:

$$Z_j = \frac{(\Delta x - \mu_{x,j})^2}{\sigma_{x,j}^2} + \frac{(\Delta y - \mu_{y,j})^2}{\sigma_{y,j}^2} - \frac{2 \rho_j (\Delta x - \mu_{x,j})(\Delta y - \mu_{y,j})}{\sigma_{x,j} \sigma_{y,j}}$$

The final mixture probability is:

$$p(\Delta x, \Delta y) = \sum_{j=1}^{20} \pi_j \, \mathcal{N}(\Delta x, \Delta y \mid \mu_j, \sigma_j, \rho_j)$$

trained by minimising the negative log-likelihood. This is the key change that solved mode collapse.
At inference, a mixture component is sampled rather than taking the argmax. This produces committed stroke paths with natural variability rather than averaged smears.
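Sampling a pen offset then becomes: pick one component by its mixture weight, then draw from that component's bivariate Gaussian. A NumPy sketch with made-up parameters, showing commitment to one mode rather than averaging:

```python
import numpy as np

def sample_offset(pi, mu_x, mu_y, sigma_x, sigma_y, rho,
                  rng=np.random.default_rng(0)):
    """Sample (dx, dy) from the GMM: choose one component, then draw
    from its bivariate Gaussian. No averaging across modes."""
    j = rng.choice(len(pi), p=pi)          # pick a mixture component
    mean = [mu_x[j], mu_y[j]]
    cov = [[sigma_x[j] ** 2, rho[j] * sigma_x[j] * sigma_y[j]],
           [rho[j] * sigma_x[j] * sigma_y[j], sigma_y[j] ** 2]]
    return rng.multivariate_normal(mean, cov)

# Two well-separated modes at dx = -10 and dx = +10: the sample lands
# near ONE of them, never near their mean at 0
pi = np.array([0.5, 0.5])
dx, dy = sample_offset(pi, np.array([-10.0, 10.0]), np.array([0.0, 0.0]),
                       np.array([0.1, 0.1]), np.array([0.1, 0.1]),
                       np.array([0.0, 0.0]))
```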
The reason for using this is the inherent multimodality of handwriting.
MSE mode collapse vs MDN recovery
To understand context.
Work in progress. Currently a simple n-gram model derived from a Markov chain.
This piece is responsible for providing sentence-level context for autocompletion and autocorrect decisions. The NLP model predicts the next char_id given sentence context. Its output feeds directly into the CVAE Decoder.
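A minimal version of such an n-gram predictor (bigram counts forming a first-order Markov chain; this is an illustrative stand-in, not the project's actual model):

```python
from collections import Counter, defaultdict

class BigramModel:
    """Predict the next character from bigram counts (first-order Markov chain)."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sentences):
        for s in sentences:
            for a, b in zip(s, s[1:]):
                self.counts[a][b] += 1

    def predict(self, context):
        """Return (next_char, probability) given the last character of context."""
        nxt = self.counts[context[-1]]
        if not nxt:
            return None, 0.0
        char, n = nxt.most_common(1)[0]
        return char, n / sum(nxt.values())

model = BigramModel()
model.train(["天气非常好", "天气非常好", "天气非常冷"])
char, prob = model.predict("天气非常")
# char == "好", prob == 2/3
```

The real system would operate on character IDs rather than raw strings, and its confidence would be compared against the autocompletion threshold before drawing.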
To recognize characters.
This piece is responsible for recognizing partially-drawn characters by rendering strokes to a 64 x 64 image via PIL and mapping them to a character ID. This bridges raw user input to the CVAE conditioning.
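The actual pipeline renders via PIL; as a dependency-light illustration of the same idea, strokes can be rasterized onto a 64 x 64 grid by densely sampling each pen-down segment (simplified stand-in, not the project's rendering code):

```python
import numpy as np

def rasterize(strokes, size=64):
    """Render absolute-coordinate strokes onto a size x size binary image
    by densely sampling points along each pen-down segment."""
    img = np.zeros((size, size), dtype=np.uint8)
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            for t in np.linspace(0.0, 1.0, num=size):
                x = int(round(x0 + t * (x1 - x0)))
                y = int(round(y0 + t * (y1 - y0)))
                if 0 <= x < size and 0 <= y < size:
                    img[y, x] = 1   # row = y, column = x
    return img

# A horizontal and a vertical stroke forming a "+"
img = rasterize([[(10, 32), (54, 32)], [(32, 10), (32, 54)]])
```

The resulting array would feed the CNN classifier that maps the image to a character ID.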
Handwriting is inherently multimodal, especially in Chinese.
At any point in writing a character, the next movement could be:
- A pen lift to a new position
- A smooth continuation of the current stroke
- A sudden sharp direction change
Initial prototypes used standard regression layers with Mean Squared Error (MSE) loss, which collapsed these possibilities into their mean, causing generation to repeatedly converge into diagonal squiggles. MSE forces the model to minimize the average error across valid options, so it quite literally dodges handwriting nuance and follows the mathematical mean of all valid paths rather than committing to a specific stroke path.
This failure highlights a key insight that I completely missed when approaching the handwriting dynamic.
Handwriting cannot be learned as a deterministic regression problem.
Switching to an MDN with stochastic sampling solved this. The model now samples from a learned distribution, producing strokes that commit to specific paths with natural variation.
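The failure mode is easy to reproduce numerically: when two offsets are equally valid, the MSE-optimal prediction is their mean, a point that matches neither. A tiny NumPy demonstration with made-up offsets:

```python
import numpy as np

# Two equally valid next offsets: continue right, or hook down-left
targets = np.array([[5.0, 0.0], [-3.0, 4.0]])

# MSE is minimized by the mean of the valid targets...
mse_prediction = targets.mean(axis=0)   # [1.0, 2.0]

# ...which is far from BOTH real options: the "averaged smear"
errors = np.linalg.norm(targets - mse_prediction, axis=1)
```

An MDN instead assigns each option its own mixture component and samples one, so generated strokes commit to a real path.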
VAE gives us the style vector, but how do we use it?
The original VAE encoded both what a character looks like and how the user draws it into the same latent vector.
Using a CVAE makes the content explicit via a character ID, leaving the latent vector free to encode style alone.
Writing can get messy, and pixels amplify this noise.
Raw input contains thousands of redundant, near-collinear points per stroke. Training on this causes the model to fit noise rather than learn the underlying structure.
At the network level, this noise hindered the LSTMs through inflated input sizes and increased variance between drawings (some would have much longer sequence lengths than others).
I needed an algorithm that could simplify these strokes into straight lines wherever possible without discarding smaller details such as the corners, hooks, and directional changes present in a Chinese character.
Researching, I came across the Ramer-Douglas-Peucker (RDP) algorithm, which reduces each stroke to its geometrically essential points while preserving these salient features.
RDP Simplification with
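For reference, a compact recursive implementation of the algorithm (the tolerance value here is illustrative, not the project's actual setting):

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: keep the endpoints, recursively keep any
    point farther than epsilon from the chord between them."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    length = math.hypot(x1 - x0, y1 - y0) or 1e-12
    # Perpendicular distance of each interior point to the chord
    dists = [abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / length
             for x, y in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1  # index into points
    if dists[i - 1] <= epsilon:
        return [points[0], points[-1]]   # whole span is "straight enough"
    # Keep the farthest point (a corner or hook) and recurse on both halves
    return rdp(points[: i + 1], epsilon)[:-1] + rdp(points[i:], epsilon)

# A right-angle stroke sampled with redundant collinear points
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
simplified = rdp(stroke, epsilon=0.5)
# -> [(0, 0), (3, 0), (3, 3)]: the corner survives
```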
- Ha & Eck (2017) — A Neural Representation of Sketch Drawings
- Graves (2013) — Generating Sequences With Recurrent Neural Networks
- Joseph (2021) — Mixture Density Networks: Probabilistic Regression for Uncertainty Estimation
- Paul (2020) — Reparameterization Trick in Variational Autoencoders
- Python 3.8+
- PyTorch (preferably with CUDA 13.0)
- Install PyTorch compiled with CUDA via the official site. Check your NVIDIA CUDA version by running this command in a terminal:

  ```
  nvidia-smi
  ```

- Clone this project and install dependencies:

  ```
  git clone https://github.com/dark-sorceror/Neurinese.git
  cd neurinese
  pip install -r requirements.txt
  ```

  Note: this includes all the necessary model weights and training data.

- Run the main file for prototype testing:

  ```
  python main.py
  ```
- Achieve some sort of model inference of the character
- Get autoregressive inference working (feeding on its own outputs rather than observing ground truth)
- Style/content disentanglement for latent style vector utilization
- Generate any character in the user's style
- Scale up to train on the full dataset (no more training on duplicated samples to force overfitting)
- Condition on and encourage style consistency when predicting the writing of unseen characters
- Incorporate some sort of system to recognize stroke differences between the correct character and the user written character
- Integrate NLP and CNN to generate semantic understanding
- Work toward the autocorrect inference pipeline
- Scaling and deployment








