
[WIP] training with codebook loss #138

Open · wants to merge 9 commits into master

Conversation

glynpu (Collaborator) commented on Dec 2, 2021

TODO:
Near future:

  1. dataloader and augmentation
  2. quantizer training
  3. generating manifests with codebook indices (a rough sketch follows this list)

Further experiments:

  1. different numbers of codebooks
  2. different layers of memory embeddings
  3. different weights for the codebook loss
  4. frame-rate mismatch between the teacher model and the student model
  5. use wav2vec2 as a teacher model
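
As a rough illustration of the "generating manifests with codebook indices" item above, here is a minimal sketch that assigns each frame of the teacher's memory embeddings to its nearest codeword in several codebooks. The function name, tensor shapes, and nearest-codeword assignment are illustrative assumptions, not the actual quantizer that will be trained for this PR.

```python
import torch


def extract_codebook_indices(
    memory: torch.Tensor,     # (T, B, D): teacher memory embeddings
    codebooks: torch.Tensor,  # (num_codebooks, codebook_size, D // num_codebooks)
) -> torch.Tensor:
    """Toy quantizer: nearest-codeword assignment per codebook.

    Illustrative only; the real quantizer is trained separately
    (see the "quantizer training" item above).
    """
    T, B, D = memory.shape
    num_codebooks, codebook_size, sub_dim = codebooks.shape
    assert D == num_codebooks * sub_dim

    # Split each D-dim embedding into one sub-vector per codebook.
    sub = memory.reshape(T, B, num_codebooks, sub_dim)

    # Squared Euclidean distance to every codeword of every codebook:
    # shape (T, B, num_codebooks, codebook_size).
    dists = ((sub.unsqueeze(-2) - codebooks.unsqueeze(0).unsqueeze(0)) ** 2).sum(dim=-1)

    # Indices to be stored in the manifest and used as targets
    # for the student's codebook loss: shape (T, B, num_codebooks).
    return dists.argmin(dim=-1)
```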

glynpu changed the title to "[WIP] training with codebook loss" on Dec 2, 2021
glynpu (Collaborator, Author) commented on Dec 2, 2021

@zhu-han Current results show that training with the codebook loss converges faster and reaches a better WER (a sketch of the loss combination follows the configuration below).
Experiment configuration:

Training data: the LibriSpeech clean-100h subset.
Decoding method: ctc-decoding.
Codebook indices are extracted by a model trained with 960h of LibriSpeech.
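
For reference, a minimal sketch of how the codebook loss could be combined with the CTC loss, assuming a per-codebook cross-entropy against the teacher-derived indices and a fixed weight. The names `codebook_logits`, `codebook_indices`, and `codebook_weight` are illustrative (the default weight below is one of the values tried in the later experiments), not the exact code in this PR.

```python
import torch
import torch.nn.functional as F


def combined_loss(
    ctc_loss: torch.Tensor,          # scalar CTC loss of the student model
    codebook_logits: torch.Tensor,   # (N, num_codebooks, codebook_size): student predictions
    codebook_indices: torch.Tensor,  # (N, num_codebooks): teacher-derived targets
    codebook_weight: float = 0.3,    # illustrative weight for the codebook loss
) -> torch.Tensor:
    # Cross-entropy per codebook, summed over codebooks and averaged over frames.
    codebook_loss = F.cross_entropy(
        codebook_logits.reshape(-1, codebook_logits.size(-1)),
        codebook_indices.reshape(-1),
        reduction="sum",
    ) / codebook_logits.size(0)
    return ctc_loss + codebook_weight * codebook_loss
```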

Here is the WER (%) on LibriSpeech test-clean:

| epoch | baseline WER (%) | + codebook loss WER (%) |
|---|---|---|
| 1 | 94.24 | 73.38 |
| 2 | 79.2 | 57.24 |
| 3 | 66.85 | 45.32 |
| 4 | 57.71 | 36.24 |
| 5 | 50.97 | 29.66 |
| 6 | 43.71 | 24.51 |
| 7 | 37.85 | 21.11 |
| 8 | 31.45 | 18.89 |
| 9 | 27.65 | 17.74 |
| 10 | 24.65 | 16.04 |
| 11 | 22.33 | 15.55 |
| 12 | 20.23 | 14.64 |
| 13 | 18.52 | 14.28 |
| 14 | 17.86 | 13.42 |
| 15 | 16.48 | 13.28 |
| 16 | 15.92 | 12.87 |
| 17 | 15.6 | 12.83 |
| 18 | 14.97 | 12.12 |
| 19 | 14.85 | 12.39 |
| 20 | 14.13 | 12.1 |
| 21 | 13.9 | 11.74 |
| 22 | 14.04 | 11.49 |
| 23 | 13.62 | 11.3 |
| 24 | 13.54 | 11.24 |
| 25 | 13.44 | 10.86 |
| 26 | 13.11 | 10.84 |
| 27 | 12.88 | 10.84 |
| 28 | 12.74 | 10.98 |
| 29 | 12.6 | 10.69 |
| 30 | 12.77 | 10.66 |
| 31 | 12.65 | 10.53 |
| 32 | 12.46 | 10.27 |
| 33 | 12.15 | 10.28 |
| 34 | 12.19 | 10.2 |
| 35 | 12.27 | 10.02 |
| 36 | 12.1 | 10.05 |
| 37 | 12.2 | 9.98 |
| 38 | 11.82 | 9.78 |
| 39 | 11.91 | 2.9 |
| 40 | 11.91 | 9.87 |
| 41 | 11.73 | 9.72 |
| 42 | 11.87 | 9.79 |
| 43 | 11.15 | 9.75 |
| 44 | 11.76 | 9.57 |
| 45 | 11.47 | 9.34 |
| 46 | 11.21 | 9.14 |
| 47 | 10.95 | 9.16 |
| 48 | 11.01 | 9.11 |
| 49 | 11.3 | 8.89 |
| 50 | 11.1 | 8.8 |
| 51 | 11.19 | 8.85 |
| 52 | 11.2 | 8.85 |
| 53 | 10.71 | 8.65 |
| 54 | 10.78 | 8.65 |
| 55 | 10.86 | 8.7 |
| 56 | 10.76 | 8.72 |
| 57 | 10.7 | 8.63 |
| 58 | 10.49 | 8.62 |
| 59 | 10.54 | 8.67 |
| 60 | 10.78 | 8.83 |
| 61 | 10.17 | 8.62 |
| 62 | 10.38 | 8.71 |
| 63 | 10.17 | 8.58 |
| 64 | 10.35 | 8.67 |
| 65 | 10.11 | 8.8 |
| 66 | 10.04 | 8.76 |
| 67 | 10.07 | 8.7 |
| 68 | 10.03 | 8.69 |
| 69 | 10.21 | 8.64 |
| 70 | 9.57 | 8.75 |
| 71 | 9.75 | 8.59 |
| 72 | 9.56 | 8.71 |
| 73 | 9.84 | 8.62 |
| 74 | 9.68 | 8.86 |
| 75 | 10.0 | 8.74 |
| 76 | 9.54 | 8.79 |
| 77 | 9.46 | 8.84 |

glynpu (Collaborator, Author) commented on Dec 23, 2021

Now three directions have been tried:

| direction | teacher model | student model | training data |
|---|---|---|---|
| 1 | icefall released model trained with 960h | icefall model | clean-100h |
| 2 | icefall released model trained with 960h | icefall model | full libri 960h |
| 3 | wav2vec2 model | icefall model | clean-100h |

All of the following results are on test-clean with ctc-decoding.

Conclusions for each direction:
Results of direction 1:
The teacher model helps training converge faster and finally reaches a much lower WER (around 5.9% vs. the 9.18% baseline).
Results of direction 2:
The teacher model helps training converge faster, but the final result is not significantly better than the baseline.
Results of direction 3:
Failed to obtain good results so far, even though the wav2vec2 model achieves a much lower WER than the icefall released model with plain ctc-decoding:

| model | wav2vec2 | icefall released model |
|---|---|---|
| WER | 1.85% | 2.93% |

To reproduce the 1.85% result, run:

```
pip install transformers
python conformer_ctc/wav2vec_decode.py
```

Results are:

```
2021-12-23 22:28:07,386 INFO [utils.py:391] [test-clean-ctc_greedy_search] %WER 1.85% [975 / 52576, 95 ins, 67 del, 813 sub ]
2021-12-23 22:29:29,629 INFO [utils.py:391] [test-other-ctc_greedy_search] %WER 3.89% [2036 / 52343, 207 ins, 134 del, 1695 sub ]
```

Link to the wav2vec2 model used in this experiment: https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self
Link to the icefall released model used in this experiment: https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09
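
For reference, a minimal sketch of the kind of CTC greedy search used for the wav2vec2 numbers above, based on the linked Hugging Face checkpoint. The audio path and single-utterance setup are illustrative assumptions; the actual `conformer_ctc/wav2vec_decode.py` script may differ.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).eval()

# Load one 16 kHz LibriSpeech utterance (path is illustrative).
waveform, sample_rate = torchaudio.load("test-clean/1089-134686-0000.flac")
assert sample_rate == 16000

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, num_frames, vocab_size)

# CTC greedy search: frame-wise argmax, then collapse repeats and blanks.
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
```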

Detailed results of direction 1:
The teacher model helps training converge faster and finally reaches a much lower WER (around 5.9% vs. the 9.18% baseline).

Configuration:

| | baseline 1.e | our trained model 2.c.i | our trained model 2.b.v |
|---|---|---|---|
| memory | no | last | last |
| predictor | no codebook | (old) joint predictor | powerful |
| weight | 0.0 | 9.3 | 0.3 |
| max-duration | 100 | 100 | 100 |
| gpus | 3 | 3 | 3 |
| num codebooks | 4 | 4 | 16 |
| status | done | done | done (current best) |

WER (%) on test-clean by epoch:

| epoch | baseline 1.e | our trained model 2.c.i | our trained model 2.b.v |
|---|---|---|---|
| 0 | 90.04 | 100.0 | 99.5 |
| 1 | 64.72 | 64.18 | 63.6 |
| 2 | 47.81 | 41.91 | 44.12 |
| 3 | 36.94 | 28.10 | 29.17 |
| 4 | 28.29 | 21.53 | 20.83 |
| 5 | 25.28 | 17.54 | 16.86 |
| 6 | 21.3 | 15.14 | 14.42 |
| 7 | 19.42 | 14.01 | 13.38 |
| 8 | 18.35 | 13.43 | 12.65 |
| 9 | 17.37 | 12.94 | 12.12 |
| 10 | 16.41 | 12.07 | 11.17 |
| 11 | 16.07 | 11.71 | 10.85 |
| 12 | 15.39 | 11.2 | 10.5 |
| 13 | 15.15 | 11.09 | 10.1 |
| 14 | 14.68 | 10.66 | 9.84 |
| 15 | 14.37 | 10.54 | 9.4 |
| 16 | 14.67 | 10.7 | 9.58 |
| 17 | 13.82 | 10.09 | 9.18 |
| 18 | 13.68 | 9.7 | 8.71 |
| 19 | 13.39 | 9.62 | 8.62 |
| 20 | 13.14 | 9.39 | 8.58 |
| 21 | 12.96 | 9.11 | 8.14 |
| 22 | 12.4 | 8.87 | 8.15 |
| 23 | 12.08 | 8.68 | 7.86 |
| 24 | 11.72 | 8.45 | 7.64 |
| 25 | 11.39 | 8.22 | 7.47 |
| 26 | 11.45 | 8.12 | 7.29 |
| 27 | 11.2 | 7.88 | 7.29 |
| 28 | 11.05 | 7.84 | 6.93 |
| 29 | 10.82 | 7.86 | 6.92 |
| 30 | 10.94 | 7.63 | 6.81 |
| 31 | 10.92 | 7.52 | 6.93 |
| 32 | 10.97 | 7.6 | 6.71 |
| 33 | 10.51 | 7.59 | 6.66 |
| 34 | 10.74 | 7.52 | 6.78 |
| 35 | 10.32 | 7.45 | 6.57 |
| 36 | 10.36 | 7.5 | 6.7 |
| 37 | 10.32 | 7.39 | 6.56 |
| 38 | 10.1 | 7.29 | 6.61 |
| 39 | 10.04 | 7.25 | 6.47 |
| 40 | 10.19 | 7.29 | 6.44 |
| 41 | 10.01 | 7.23 | 6.36 |
| 42 | 10.0 | 7.24 | 6.47 |
| 43 | 9.85 | 7.08 | 6.38 |
| 44 | 10.01 | 7.06 | 6.43 |
| 45 | 9.89 | 7.06 | 6.4 |
| 46 | 9.98 | 6.98 | 6.29 |
| 47 | 9.75 | 6.96 | 6.26 |
| 48 | 9.79 | 7.06 | 6.42 |
| 49 | 9.71 | 6.99 | 6.24 |
| 50 | 9.8 | 7.09 | 6.18 |
| 51 | 9.77 | 7.01 | 6.08 |
| 52 | 9.69 | 7.11 | 5.96 |
| 53 | 9.59 | 7.07 | 6.04 |
| 54 | 9.47 | 7.01 | 6.18 |
| 55 | 9.64 | 7.0 | 6.08 |
| 56 | 9.68 | 6.95 | 6.07 |
| 57 | 9.68 | 6.94 | 6.08 |
| 58 | 9.72 | 6.89 | 6.03 |
| 59 | 9.38 | 6.89 | 5.99 |
| 60 | 9.55 | 6.9 | 6.02 |
| 61 | 9.61 | 6.9 | 5.97 |
| 62 | 9.43 | 6.88 | 6.0 |
| 63 | 9.55 | 6.88 | 5.97 |
| 64 | 9.4 | 6.93 | 6.06 |
| 65 | 9.54 | 7.0 | 5.94 |
| 66 | 9.49 | 6.81 | 6.06 |
| 67 | 9.43 | 6.86 | 6.07 |
| 68 | 9.33 | 6.89 | 5.97 |
| 69 | 9.14 | 6.79 | 5.87 |
| 70 | 9.21 | 6.73 | 5.91 |
| 71 | 9.07 | 6.8 | 5.95 |
| 72 | 9.17 | 6.88 | 5.8 |
| 73 | 9.34 | 6.92 | 5.93 |
| 74 | 9.18 | 6.96 | 5.83 |
| 75 | 9.41 | 6.87 | 5.79 |
| 76 | 9.21 | 6.79 | 5.97 |
| 77 | 9.18 | 6.77 | 5.83 |

Detailed results of direction 2:
The teacher model helps training converge faster, but the final result is not significantly better than the baseline.

Configuration:

| | full libri baseline | full libri 2.e.ii |
|---|---|---|
| memory | no | last |
| predictor | no codebook | powerful |
| weight | 0.0 | 0.3 |
| max-duration | 200 | 270 |
| gpus | 4 | 3 |
| valid batch duration | 800 = 4 * 200 | 810 = 3 * 270 |
| batches per epoch | 13500 | 13200 |
| num codebooks | 0 | 16 |
| status | done (2.5 hours/epoch) | training (4 hours/epoch) |

WER (%) on test-clean by epoch:

| epoch | full libri baseline | full libri 2.e.ii |
|---|---|---|
| 0 | 31.96 | 26.89 |
| 1 | 13.09 | 9.97 |
| 2 | 9.41 | 7.67 |
| 3 | 8.29 | 6.89 |
| 4 | 7.67 | 6.47 |
| 5 | 7.0 | 6.08 |
| 6 | 5.75 | 5.50 |
| 7 | 5.45 | 5.12 |
| 8 | 5.13 | 4.76 |
| 9 | 4.81 | 4.49 |
| 10 | 4.68 | 4.37 |
| 11 | 4.46 | 4.35 |
| 12 | 4.38 | 4.17 |
| 13 | 4.4 | 4.12 |
