Hi @caow13,
Hope you're doing well.
While going through your code, I noticed that `gru_d.py` has no input decay toward the empirical mean. In the paper by Che et al. here, the decayed input is described as:
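
In expanded form, and transcribed by me from the paper's input decay equation (so please double-check it against the original), this reads:

$$
\hat{x}_t^d = m_t^d \, x_t^d + (1 - m_t^d) \, \gamma_{x_t}^d \, x_{t'}^d + (1 - m_t^d)(1 - \gamma_{x_t}^d) \, \tilde{x}^d
$$

In this notation, $m_t^d$ is the missing-data mask, $\gamma_{x_t}^d$ the input decay rate, $x_{t'}^d$ the last seen observation of variable $d$, and $\tilde{x}^d$ its empirical mean.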

Here, the second term is the decay toward the last seen observation and the third term is the decay toward the empirical mean.
In this implementation, however, the code only applies the decay toward the last seen observation:
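
To make the difference concrete, here is a minimal sketch of the two variants as I understand them. It is not the code from `gru_d.py`: the function and argument names (`decay_to_last`, `decay_to_last_and_mean`, `x_last`, `x_mean`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: these helpers are hypothetical and are not taken
# from gru_d.py. They use plain arithmetic on scalar floats; the same
# expressions should carry over elementwise to array or tensor inputs.


def decay_to_last(x, m, gamma, x_last):
    """Observed value where m == 1, else the last observation scaled by gamma."""
    return m * x + (1 - m) * gamma * x_last


def decay_to_last_and_mean(x, m, gamma, x_last, x_mean):
    """Same as above, plus the (1 - gamma) pull toward the empirical mean."""
    return m * x + (1 - m) * (gamma * x_last + (1 - gamma) * x_mean)


if __name__ == "__main__":
    # A missing input (m = 0) with gamma = 0.5, last observation 2.0, mean 4.0.
    print(decay_to_last(5.0, 0.0, 0.5, 2.0))
    print(decay_to_last_and_mean(5.0, 0.0, 0.5, 2.0, 4.0))
```

As gamma approaches 0 (a long gap since the last observation), the first variant drives a missing input toward 0, whereas the second drives it toward `x_mean`.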

This may have been intentional. Either way, I hope you can find the time to look into it.
Regards,
Noah