I would like to use Zipformer, described as state of the art in speech recognition, to train a sign language recognition model on lower-frame-rate video data, based on the code in icefall/egs/librispeech/ASR/zipformer/train.py and the paper https://arxiv.org/abs/2310.11230.
Problem Statement
The current dataset has a frame rate (sample rate) of 24 frames per second, with skeleton data yielding a 1662-dimensional feature vector per second. The number of tokens ranges from 30,000 to 70,000, which is considerably high. I am looking for recommendations on which parameters to adjust to get better recognition on this lower-frequency data.
Parameters to Handle Lower Frequency in the Dataset
Below is a table listing relevant parameters with their default values.
| Parameter | Default |
| --- | --- |
| --num-encoder-layers | "2,2,3,4,3,2" |
| --downsampling-factor | "1,2,4,8,4,2" |
| --feedforward-dim | "512,768,1024,1536,1024,768" |
| --num-heads | "4,4,4,8,4,4" |
| --encoder-dim | "192,256,384,512,384,256" |
| --query-head-dim | "32" |
| --value-head-dim | "12" |
| --pos-head-dim | "4" |
| --pos-dim | 48 |
| --encoder-unmasked-dim | "192,192,256,256,256,192" |
| --cnn-module-kernel | "31,31,15,15,15,31" |
| --decoder-dim | 512 |
| --joiner-dim | 512 |
| --attention-decoder-dim | 512 |
| --attention-decoder-num-layers | 6 |
| --attention-decoder-attention-dim | 512 |
| --attention-decoder-num-heads | 8 |
| --attention-decoder-feedforward-dim | 2048 |
| --causal | False |
| --chunk-size | "16,32,64,-1" |
| --left-context-frames | "64,128,256,-1" |
| --use-transducer | True |
| --use-ctc | False |
| --use-attention-decoder | False |
| --world-size | 1 |
| --ref-duration | 600 |
| --prune-range | 5 |
| --lm-scale | 0.25 |
| --am-scale | 0.0 |
| --simple-loss-scale | 0.5 |
| --ctc-loss-scale | 0.2 |
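To make the frame-rate concern concrete, here is a rough sketch of the rate each encoder stack would see with the default `--downsampling-factor` at 24 fps input. The helper `stack_frame_rates` and its `frontend_subsampling` argument are hypothetical names for illustration, not part of icefall. The front-end factor of 2 is an assumption taken from the paper's description of the convolutional embedding reducing 100 Hz features to 50 Hz, so it should be checked against the actual model code.

```python
from fractions import Fraction
from typing import List


def parse_int_list(value: str) -> List[int]:
    """Parse a comma-separated option string such as "1,2,4,8,4,2"."""
    return [int(item) for item in value.split(",")]


def stack_frame_rates(
    input_fps: int,
    downsampling_factor: str,
    frontend_subsampling: int = 2,  # assumption: front end halves the frame rate
) -> List[float]:
    """Return the approximate frame rate (Hz) seen by each encoder stack."""
    base = Fraction(input_fps) / frontend_subsampling
    return [float(base / factor) for factor in parse_int_list(downsampling_factor)]


# 24 fps skeleton features with the default per-stack factors from the table.
video_rates = stack_frame_rates(24, "1,2,4,8,4,2")
# 100 Hz audio features (the speech setting) for comparison.
audio_rates = stack_frame_rates(100, "1,2,4,8,4,2")

for name, rates in (("video @ 24 fps", video_rates), ("audio @ 100 Hz", audio_rates)):
    print(name, rates)
```

Under these assumptions the most heavily downsampled stack would run at about 1.5 Hz for 24 fps input versus 6.25 Hz for 100 Hz audio, which is one reason smaller values in `--downsampling-factor` may be worth trying.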
Additional Questions
How can I find the grid recipe in icefall, specifically the one from issue #150 by @luomingshuang, which is no longer available?
I know this is out of scope, but any guidance would be appreciated. Thanks in advance.