Lower Frequency Video Recognition #1695

kerolos · 2024-07-18T17:49:56Z

I would like to use Zipformer, a state-of-the-art speech recognition model, to train a lower-frequency video recognition model for sign language, based on the code in icefall/egs/librispeech/ASR/zipformer/train.py and the paper https://arxiv.org/abs/2310.11230.

Problem Statement
The current dataset has a frame rate of 24 frames per second (the sample rate), with skeleton data yielding a 1662-feature vector per second. The number of tokens ranges from 30,000 to 70,000, which is considerably high. I am looking for recommendations on parameter adjustments to achieve better recognition with lower-frequency data.
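To make the frame-rate concern concrete, here is a minimal sketch of how many encoder output frames a clip would keep. It assumes the default Zipformer front end reduces the length as (T - 7) // 2 and that the encoder output is then halved once more; both formulas are assumptions about the recipe and should be checked against the actual train.py. The function names are hypothetical. A transducer loss generally needs at least as many output frames as target tokens, so the check below may be useful when filtering utterances.

```python
def encoder_output_frames(num_frames: int) -> int:
    """Frames left after the assumed default front end.

    Assumption: the embedding layer maps T -> (T - 7) // 2 and the
    encoder output is then downsampled by 2 as (T' + 1) // 2.
    """
    after_embed = (num_frames - 7) // 2
    return (after_embed + 1) // 2


def fits_transducer(num_frames: int, num_tokens: int) -> bool:
    """True if the clip keeps at least as many output frames as tokens."""
    return encoder_output_frames(num_frames) >= num_tokens


fps = 24
clip_frames = 10 * fps  # a 10-second clip at 24 frames per second

video_out = encoder_output_frames(clip_frames)
audio_out = encoder_output_frames(10 * 100)  # 10 s of 100 fps audio features

print(video_out, audio_out)
print(fits_transducer(clip_frames, 58), fits_transducer(clip_frames, 59))
```

Under these assumptions a 10-second clip at 24 fps keeps far fewer output frames than 10 seconds of 100 fps audio features, so reducing the overall subsampling may matter more than the other parameters.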

Parameters to Handle Lower Frequency in the Dataset
Below is a table listing relevant parameters with their default values.

Parameter                              Default
--num-encoder-layers                   "2,2,3,4,3,2"
--downsampling-factor                  "1,2,4,8,4,2"
--feedforward-dim                      "512,768,1024,1536,1024,768"
--num-heads                            "4,4,4,8,4,4"
--encoder-dim                          "192,256,384,512,384,256"
--query-head-dim                       "32"
--value-head-dim                       "12"
--pos-head-dim                         "4"
--pos-dim                              48
--encoder-unmasked-dim                 "192,192,256,256,256,192"
--cnn-module-kernel                    "31,31,15,15,15,31"
--decoder-dim                          512
--joiner-dim                           512
--attention-decoder-dim                512
--attention-decoder-num-layers         6
--attention-decoder-attention-dim      512
--attention-decoder-num-heads          8
--attention-decoder-feedforward-dim    2048
--causal                               False
--chunk-size                           "16,32,64,-1"
--left-context-frames                  "64,128,256,-1"
--use-transducer                       True
--use-ctc                              False
--use-attention-decoder                False
--world-size                           1
--ref-duration                         600
--prune-range                          5
--lm-scale                             0.25
--am-scale                             0.0
--simple-loss-scale                    0.5
--ctc-loss-scale                       0.2
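As a rough illustration of what --downsampling-factor implies at 24 fps, the sketch below computes the frame rate each encoder stack would operate at. It assumes the front end halves the input frame rate before the stacks (for 100 fps audio features this reproduces the 50, 25, 12.5, 6.25, 12.5 and 25 Hz rates described in the Zipformer paper). The helper name and the embed_subsampling parameter are hypothetical, and the alternative factor string is illustrative arithmetic only, not a tested configuration.

```python
def stack_frame_rates(input_fps, downsampling_factor, embed_subsampling=2):
    """Frame rate (Hz) seen by each encoder stack.

    Assumption: the front end divides the input rate by
    `embed_subsampling` before the stacks apply their own factors.
    """
    factors = [int(f) for f in downsampling_factor.split(",")]
    base = input_fps / embed_subsampling
    return [base / f for f in factors]


default = "1,2,4,8,4,2"
audio_rates = stack_frame_rates(100.0, default)
video_rates = stack_frame_rates(24.0, default)

# Hypothetical gentler setting, shown only to illustrate the arithmetic.
gentler = stack_frame_rates(24.0, "1,1,2,4,2,1")

print(audio_rates)
print(video_rates)
print(gentler)
```

With the default factors the middle stack would run at only 1.5 Hz on 24 fps input, which suggests looking at smaller downsampling factors or a front end that does not subsample in time.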

Additional Questions
Where can I find the GRID recipe in icefall, specifically the one in issue #150 by @luomingshuang, which is no longer available?

I know this is out of scope, but any guidance would be appreciated. Thanks in advance.
