onucharles/pytorch-speech-commands

This branch is up to date with tugstugi/pytorch-speech-commands:master.

Folders and files

NameName
Last commit message
Last commit date

Latest commit

36f3cd9 · Jan 29, 2018

History

77 Commits
Jan 21, 2018
Jan 22, 2018
Jan 25, 2018
Jan 22, 2018
Jan 21, 2018
Jan 29, 2018
Jan 29, 2018
Jan 22, 2018
Jan 25, 2018
Jan 21, 2018
Jan 22, 2018
Jan 25, 2018
Jan 25, 2018

Repository files navigation

Convolutional neural networks for the Google speech commands data set with PyTorch.

General

We, xuyuan and tugstugi, participated in the Kaggle competition TensorFlow Speech Recognition Challenge and finished in 10th place. This repository contains a simplified and cleaned-up version of our team's code.

Features

  • 1x32x32 mel-spectrogram as network input
  • a single network implementation for both the CIFAR10 and Google speech commands data sets
  • faster audio data augmentation applied directly on the STFT
  • Kaggle private LB scores evaluated on 150,000+ audio files
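The 1x32x32 mel-spectrogram input can be sketched as follows. This is a minimal NumPy illustration, not the repository's exact pipeline: the sample rate, FFT size, hop length, and padding scheme are assumptions chosen so that one second of 16 kHz audio yields a 32x32 feature map.

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=512, n_mels=32, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)   # rising edge of triangle
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)   # falling edge of triangle
    return fb

def mel_spectrogram_32x32(y, sr=16000, n_fft=512, hop=500, n_mels=32):
    """Log mel-spectrogram cropped/padded to a (1, 32, 32) network input."""
    window = np.hanning(n_fft)
    # frame the signal, window each frame, take the power spectrum
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # (T, bins)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T           # (n_mels, T)
    logmel = np.log(mel + 1e-10)
    if logmel.shape[1] < 32:                                    # pad time axis
        logmel = np.pad(logmel, ((0, 0), (0, 32 - logmel.shape[1])))
    return logmel[:, :32][np.newaxis]                           # (1, 32, 32)
```

With these assumed parameters, a one-second 16 kHz clip produces 31 frames, which are zero-padded to 32 along the time axis.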

Results

Due to the time limit of the competition, we trained most of the networks with SGD using ReduceLROnPlateau for 70 epochs. For the training parameters and dependencies, see TRAINING.md. Stopping the training earlier sometimes produces a better Kaggle score.
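The SGD + ReduceLROnPlateau setup above can be sketched in PyTorch as follows. The model and the optimizer hyperparameters here are illustrative stand-ins, not the exact values from TRAINING.md:

```python
import torch
from torch import nn

# tiny stand-in model; the repository trains VGG/ResNet/WRN variants instead
model = nn.Linear(32 * 32, 12)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# drop the LR by 10x after `patience` epochs without validation improvement
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

for epoch in range(70):
    # ... forward/backward/optimizer.step() over the training set goes here ...
    val_loss = 1.0                # stand-in for the real validation loss
    scheduler.step(val_loss)      # plateau detection drives the LR schedule

print(optimizer.param_groups[0]["lr"])  # LR has been reduced below 0.1
```

Because `ReduceLROnPlateau` reacts to the validation metric rather than a fixed schedule, the learning rate only decays once progress stalls, which pairs naturally with the early-stopping observation above.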

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
|---|---|---|---|---|---|---|
| VGG19 BN | 93.56% | 97.337235% | 97.527432% | 0.87454 | 0.88030 | |
| ResNet32 | - | 96.181419% | 96.196050% | 0.87078 | 0.87419 | |
| WRN-28-10 | - | 97.937089% | 97.922458% | 0.88546 | 0.88699 | |
| WRN-28-10-dropout | 96.22% | 97.702999% | 97.717630% | 0.89580 | 0.89568 | |
| WRN-52-10 | - | 98.039503% | 97.980980% | 0.88159 | 0.88323 | another trained model has 97.52%/0.89322 |
| ResNext29 8x64 | - | 97.190929% | 97.161668% | 0.89533 | 0.89733 | our best model during competition |
| DPN92 | - | 97.190929% | 97.249451% | 0.89075 | 0.89286 | |
| DenseNet-BC (L=100, k=12) | 95.52% | 97.161668% | 97.147037% | 0.88946 | 0.89134 | |
| DenseNet-BC (L=190, k=40) | - | 97.117776% | 97.147037% | 0.89369 | 0.89521 | |

Results with Mixup

After the competition, some of the networks were retrained using mixup: Beyond Empirical Risk Minimization by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-Paz.
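The core of mixup is mixing each training example with another example drawn from the same batch, using a Beta-distributed coefficient, for both inputs and (one-hot) labels. A minimal NumPy sketch, with `alpha` as an assumed hyperparameter rather than the value used in the retraining:

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.4, rng=None):
    """Mix a batch with a shuffled copy of itself (mixup augmentation).

    x:        batch of inputs, e.g. shape (N, 1, 32, 32)
    y_onehot: one-hot labels, shape (N, num_classes)
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    idx = rng.permutation(len(x))         # partner example for each sample
    x_mixed = lam * x + (1 - lam) * x[idx]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mixed, y_mixed, lam
```

Training then minimizes the usual cross-entropy against the soft mixed labels; since the labels are convex combinations of one-hot vectors, each mixed label row still sums to 1.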

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
|---|---|---|---|---|---|---|
| VGG19 BN | - | 97.483541% | 97.542063% | 0.89521 | 0.89839 | |
| WRN-52-10 | - | 97.454279% | 97.498171% | 0.90273 | 0.90355 | same score as the 16th place in Kaggle |
