
[WIP] training with codebook loss #138

Open · wants to merge 9 commits into master

Conversation

glynpu (Collaborator) commented on Dec 2, 2021

TODO:
Near future:

  1. dataloader and augmentation
  2. quantizer training
  3. generating manifests with codebook indices (a rough sketch follows this list)

Further experiments:

  1. different numbers of codebooks
  2. different layers of memory embeddings
  3. different weights for the codebook loss
  4. frame-rate mismatch between the teacher model and the student model
  5. use wav2vec2 as a teacher model
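
As a rough illustration of the "generating manifests with codebook indices" item above, here is a minimal sketch that assigns each frame of the teacher's memory embeddings to its nearest codeword in several codebooks. The function name, tensor shapes, and nearest-codeword assignment are illustrative assumptions, not the actual quantizer that will be trained for this PR.

```python
import torch


def extract_codebook_indices(
    memory: torch.Tensor,     # (T, B, D): teacher memory embeddings
    codebooks: torch.Tensor,  # (num_codebooks, codebook_size, D // num_codebooks)
) -> torch.Tensor:
    """Toy quantizer: nearest-codeword assignment per codebook.

    Illustrative only; the real quantizer is trained separately
    (see the "quantizer training" item above).
    """
    T, B, D = memory.shape
    num_codebooks, codebook_size, sub_dim = codebooks.shape
    assert D == num_codebooks * sub_dim

    # Split each D-dim embedding into one sub-vector per codebook.
    sub = memory.reshape(T, B, num_codebooks, sub_dim)

    # Squared Euclidean distance to every codeword of every codebook:
    # shape (T, B, num_codebooks, codebook_size).
    dists = ((sub.unsqueeze(-2) - codebooks.unsqueeze(0).unsqueeze(0)) ** 2).sum(dim=-1)

    # Indices to be stored in the manifest and used as targets
    # for the student's codebook loss: shape (T, B, num_codebooks).
    return dists.argmin(dim=-1)
```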

glynpu changed the title to "[WIP] training with codebook loss" on Dec 2, 2021
glynpu (Collaborator, Author) commented on Dec 2, 2021

@zhu-han Current results show that training with the codebook loss converges faster and reaches a better WER (a sketch of the loss combination follows the configuration below).
Experiment configuration:

Training data: the LibriSpeech clean-100h subset.
Decoding method: ctc-decoding.
Codebook indices are extracted by a model trained with 960h of LibriSpeech.
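
For reference, a minimal sketch of how the codebook loss could be combined with the CTC loss, assuming a per-codebook cross-entropy against the teacher-derived indices and a fixed weight. The names `codebook_logits`, `codebook_indices`, and `codebook_weight` are illustrative (the default weight below is one of the values tried in the later experiments), not the exact code in this PR.

```python
import torch
import torch.nn.functional as F


def combined_loss(
    ctc_loss: torch.Tensor,          # scalar CTC loss of the student model
    codebook_logits: torch.Tensor,   # (N, num_codebooks, codebook_size): student predictions
    codebook_indices: torch.Tensor,  # (N, num_codebooks): teacher-derived targets
    codebook_weight: float = 0.3,    # illustrative weight for the codebook loss
) -> torch.Tensor:
    # Cross-entropy per codebook, summed over codebooks and averaged over frames.
    codebook_loss = F.cross_entropy(
        codebook_logits.reshape(-1, codebook_logits.size(-1)),
        codebook_indices.reshape(-1),
        reduction="sum",
    ) / codebook_logits.size(0)
    return ctc_loss + codebook_weight * codebook_loss
```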

Here is the WER (%) on LibriSpeech test-clean:

| epoch | baseline WER (%) | + codebook loss WER (%) |
|---|---|---|
| 1 | 94.24 | 73.38 |
| 2 | 79.2 | 57.24 |
| 3 | 66.85 | 45.32 |
| 4 | 57.71 | 36.24 |
| 5 | 50.97 | 29.66 |
| 6 | 43.71 | 24.51 |
| 7 | 37.85 | 21.11 |
| 8 | 31.45 | 18.89 |
| 9 | 27.65 | 17.74 |
| 10 | 24.65 | 16.04 |
| 11 | 22.33 | 15.55 |
| 12 | 20.23 | 14.64 |
| 13 | 18.52 | 14.28 |
| 14 | 17.86 | 13.42 |
| 15 | 16.48 | 13.28 |
| 16 | 15.92 | 12.87 |
| 17 | 15.6 | 12.83 |
| 18 | 14.97 | 12.12 |
| 19 | 14.85 | 12.39 |
| 20 | 14.13 | 12.1 |
| 21 | 13.9 | 11.74 |
| 22 | 14.04 | 11.49 |
| 23 | 13.62 | 11.3 |
| 24 | 13.54 | 11.24 |
| 25 | 13.44 | 10.86 |
| 26 | 13.11 | 10.84 |
| 27 | 12.88 | 10.84 |
| 28 | 12.74 | 10.98 |
| 29 | 12.6 | 10.69 |
| 30 | 12.77 | 10.66 |
| 31 | 12.65 | 10.53 |
| 32 | 12.46 | 10.27 |
| 33 | 12.15 | 10.28 |
| 34 | 12.19 | 10.2 |
| 35 | 12.27 | 10.02 |
| 36 | 12.1 | 10.05 |
| 37 | 12.2 | 9.98 |
| 38 | 11.82 | 9.78 |
| 39 | 11.91 | 2.9 |
| 40 | 11.91 | 9.87 |
| 41 | 11.73 | 9.72 |
| 42 | 11.87 | 9.79 |
| 43 | 11.15 | 9.75 |
| 44 | 11.76 | 9.57 |
| 45 | 11.47 | 9.34 |
| 46 | 11.21 | 9.14 |
| 47 | 10.95 | 9.16 |
| 48 | 11.01 | 9.11 |
| 49 | 11.3 | 8.89 |
| 50 | 11.1 | 8.8 |
| 51 | 11.19 | 8.85 |
| 52 | 11.2 | 8.85 |
| 53 | 10.71 | 8.65 |
| 54 | 10.78 | 8.65 |
| 55 | 10.86 | 8.7 |
| 56 | 10.76 | 8.72 |
| 57 | 10.7 | 8.63 |
| 58 | 10.49 | 8.62 |
| 59 | 10.54 | 8.67 |
| 60 | 10.78 | 8.83 |
| 61 | 10.17 | 8.62 |
| 62 | 10.38 | 8.71 |
| 63 | 10.17 | 8.58 |
| 64 | 10.35 | 8.67 |
| 65 | 10.11 | 8.8 |
| 66 | 10.04 | 8.76 |
| 67 | 10.07 | 8.7 |
| 68 | 10.03 | 8.69 |
| 69 | 10.21 | 8.64 |
| 70 | 9.57 | 8.75 |
| 71 | 9.75 | 8.59 |
| 72 | 9.56 | 8.71 |
| 73 | 9.84 | 8.62 |
| 74 | 9.68 | 8.86 |
| 75 | 10.0 | 8.74 |
| 76 | 9.54 | 8.79 |
| 77 | 9.46 | 8.84 |

glynpu (Collaborator, Author) commented on Dec 23, 2021

Now three directions have been tried:

| direction | teacher model | student model | training data |
|---|---|---|---|
| 1 | icefall released model trained with 960h | icefall model | clean-100h |
| 2 | icefall released model trained with 960h | icefall model | full libri 960h |
| 3 | wav2vec2 model | icefall model | clean-100h |

All of the following results are on test-clean with ctc-decoding.

Conclusions for each direction:
Results of direction 1:
The teacher model helps training converge faster and finally reaches a much lower WER (around 5.9% vs. the 9.18% baseline).
Results of direction 2:
The teacher model helps training converge faster, but the final result is not significantly better than the baseline.
Results of direction 3:
Failed to obtain good results so far, even though the wav2vec2 model achieves a much lower WER than the icefall released model with plain ctc-decoding:

| model | wav2vec2 | icefall released model |
|---|---|---|
| WER | 1.85% | 2.93% |

To reproduce the 1.85% result, run:

```
pip install transformers
python conformer_ctc/wav2vec_decode.py
```

Results are:

```
2021-12-23 22:28:07,386 INFO [utils.py:391] [test-clean-ctc_greedy_search] %WER 1.85% [975 / 52576, 95 ins, 67 del, 813 sub ]
2021-12-23 22:29:29,629 INFO [utils.py:391] [test-other-ctc_greedy_search] %WER 3.89% [2036 / 52343, 207 ins, 134 del, 1695 sub ]
```

Link to the wav2vec2 model used in this experiment: https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self
Link to the icefall released model used in this experiment: https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09
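
For reference, a minimal sketch of the kind of CTC greedy search used for the wav2vec2 numbers above, based on the linked Hugging Face checkpoint. The audio path and single-utterance setup are illustrative assumptions; the actual `conformer_ctc/wav2vec_decode.py` script may differ.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).eval()

# Load one 16 kHz LibriSpeech utterance (path is illustrative).
waveform, sample_rate = torchaudio.load("test-clean/1089-134686-0000.flac")
assert sample_rate == 16000

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, num_frames, vocab_size)

# CTC greedy search: frame-wise argmax, then collapse repeats and blanks.
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
```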

Detailed results of direction 1:
The teacher model helps training converge faster and finally reaches a much lower WER (around 5.9% vs. the 9.18% baseline).

Configuration:

| | baseline 1.e | our trained model 2.c.i | our trained model 2.b.v |
|---|---|---|---|
| memory | no | last | last |
| predictor | no codebook | (old) joint predictor | powerful |
| weight | 0.0 | 9.3 | 0.3 |
| max-duration | 100 | 100 | 100 |
| gpus | 3 | 3 | 3 |
| num codebooks | 4 | 4 | 16 |
| status | done | done | done (current best) |

WER (%) on test-clean by epoch:

| epoch | baseline 1.e | our trained model 2.c.i | our trained model 2.b.v |
|---|---|---|---|
| 0 | 90.04 | 100.0 | 99.5 |
| 1 | 64.72 | 64.18 | 63.6 |
| 2 | 47.81 | 41.91 | 44.12 |
| 3 | 36.94 | 28.10 | 29.17 |
| 4 | 28.29 | 21.53 | 20.83 |
| 5 | 25.28 | 17.54 | 16.86 |
| 6 | 21.3 | 15.14 | 14.42 |
| 7 | 19.42 | 14.01 | 13.38 |
| 8 | 18.35 | 13.43 | 12.65 |
| 9 | 17.37 | 12.94 | 12.12 |
| 10 | 16.41 | 12.07 | 11.17 |
| 11 | 16.07 | 11.71 | 10.85 |
| 12 | 15.39 | 11.2 | 10.5 |
| 13 | 15.15 | 11.09 | 10.1 |
| 14 | 14.68 | 10.66 | 9.84 |
| 15 | 14.37 | 10.54 | 9.4 |
| 16 | 14.67 | 10.7 | 9.58 |
| 17 | 13.82 | 10.09 | 9.18 |
| 18 | 13.68 | 9.7 | 8.71 |
| 19 | 13.39 | 9.62 | 8.62 |
| 20 | 13.14 | 9.39 | 8.58 |
| 21 | 12.96 | 9.11 | 8.14 |
| 22 | 12.4 | 8.87 | 8.15 |
| 23 | 12.08 | 8.68 | 7.86 |
| 24 | 11.72 | 8.45 | 7.64 |
| 25 | 11.39 | 8.22 | 7.47 |
| 26 | 11.45 | 8.12 | 7.29 |
| 27 | 11.2 | 7.88 | 7.29 |
| 28 | 11.05 | 7.84 | 6.93 |
| 29 | 10.82 | 7.86 | 6.92 |
| 30 | 10.94 | 7.63 | 6.81 |
| 31 | 10.92 | 7.52 | 6.93 |
| 32 | 10.97 | 7.6 | 6.71 |
| 33 | 10.51 | 7.59 | 6.66 |
| 34 | 10.74 | 7.52 | 6.78 |
| 35 | 10.32 | 7.45 | 6.57 |
| 36 | 10.36 | 7.5 | 6.7 |
| 37 | 10.32 | 7.39 | 6.56 |
| 38 | 10.1 | 7.29 | 6.61 |
| 39 | 10.04 | 7.25 | 6.47 |
| 40 | 10.19 | 7.29 | 6.44 |
| 41 | 10.01 | 7.23 | 6.36 |
| 42 | 10.0 | 7.24 | 6.47 |
| 43 | 9.85 | 7.08 | 6.38 |
| 44 | 10.01 | 7.06 | 6.43 |
| 45 | 9.89 | 7.06 | 6.4 |
| 46 | 9.98 | 6.98 | 6.29 |
| 47 | 9.75 | 6.96 | 6.26 |
| 48 | 9.79 | 7.06 | 6.42 |
| 49 | 9.71 | 6.99 | 6.24 |
| 50 | 9.8 | 7.09 | 6.18 |
| 51 | 9.77 | 7.01 | 6.08 |
| 52 | 9.69 | 7.11 | 5.96 |
| 53 | 9.59 | 7.07 | 6.04 |
| 54 | 9.47 | 7.01 | 6.18 |
| 55 | 9.64 | 7.0 | 6.08 |
| 56 | 9.68 | 6.95 | 6.07 |
| 57 | 9.68 | 6.94 | 6.08 |
| 58 | 9.72 | 6.89 | 6.03 |
| 59 | 9.38 | 6.89 | 5.99 |
| 60 | 9.55 | 6.9 | 6.02 |
| 61 | 9.61 | 6.9 | 5.97 |
| 62 | 9.43 | 6.88 | 6.0 |
| 63 | 9.55 | 6.88 | 5.97 |
| 64 | 9.4 | 6.93 | 6.06 |
| 65 | 9.54 | 7.0 | 5.94 |
| 66 | 9.49 | 6.81 | 6.06 |
| 67 | 9.43 | 6.86 | 6.07 |
| 68 | 9.33 | 6.89 | 5.97 |
| 69 | 9.14 | 6.79 | 5.87 |
| 70 | 9.21 | 6.73 | 5.91 |
| 71 | 9.07 | 6.8 | 5.95 |
| 72 | 9.17 | 6.88 | 5.8 |
| 73 | 9.34 | 6.92 | 5.93 |
| 74 | 9.18 | 6.96 | 5.83 |
| 75 | 9.41 | 6.87 | 5.79 |
| 76 | 9.21 | 6.79 | 5.97 |
| 77 | 9.18 | 6.77 | 5.83 |

Detailed results of direction 2:
The teacher model helps training converge faster, but the final result is not significantly better than the baseline.

Configuration:

| | full libri baseline | full libri 2.e.ii |
|---|---|---|
| memory | no | last |
| predictor | no codebook | powerful |
| weight | 0.0 | 0.3 |
| max-duration | 200 | 270 |
| gpus | 4 | 3 |
| valid batch duration | 800 = 4 * 200 | 810 = 3 * 270 |
| batches per epoch | 13500 | 13200 |
| num codebooks | 0 | 16 |
| status | done (2.5 hours/epoch) | training (4 hours/epoch) |

WER (%) on test-clean by epoch:

| epoch | full libri baseline | full libri 2.e.ii |
|---|---|---|
| 0 | 31.96 | 26.89 |
| 1 | 13.09 | 9.97 |
| 2 | 9.41 | 7.67 |
| 3 | 8.29 | 6.89 |
| 4 | 7.67 | 6.47 |
| 5 | 7.0 | 6.08 |
| 6 | 5.75 | 5.50 |
| 7 | 5.45 | 5.12 |
| 8 | 5.13 | 4.76 |
| 9 | 4.81 | 4.49 |
| 10 | 4.68 | 4.37 |
| 11 | 4.46 | 4.35 |
| 12 | 4.38 | 4.17 |
| 13 | 4.4 | 4.12 |
