Phonaesthesia

What would it mean to generate images from the sound of speech, rather than from the semantics of the words themselves? I built this around a generative model called a variational autoencoder (VAE) for Fundamentals of Speech Recognition (E6998), a graduate-level speech recognition class at Columbia.
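
For context, a variational autoencoder encodes each image into a small latent vector and decodes samples from that latent space back into images. The sketch below is not the code in vae_gan.py; it is only a minimal TensorFlow/Keras illustration of the encoder / reparameterization / decoder structure, with an assumed latent size and a flattened 32 x 32 RGB input.

    import tensorflow as tf

    latent_dim = 16          # assumed latent code size
    image_dim = 32 * 32 * 3  # flattened 32 x 32 RGB image

    # Encoder: image -> mean and log-variance of a Gaussian latent code.
    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(image_dim,)),
        tf.keras.layers.Dense(2 * latent_dim),  # concatenated [mu, log_var]
    ])

    # Decoder: latent code -> reconstructed (flattened) image.
    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
        tf.keras.layers.Dense(image_dim, activation="sigmoid"),
    ])

    def reconstruct(images):
        """Encode a batch of flattened images, sample a latent code, and decode it."""
        mu, log_var = tf.split(encoder(images), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps  # reparameterization trick
        return decoder(z), mu, log_var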

The project was inspired by this incredible paper by Zach Lieberman: https://github.com/alex-calderwood/phonaesthesia/blob/master/papers/leiberman_paper.pdf

I never really got it to generate anything interesting, but I think that was a finding in itself, and I want to return to it one day.

Run

Requires:

  • spaCy
  • TensorFlow
  • OpenCV (for image processing)
  • requests (for scraping ImageNet data)
  1. To obtain images, run:

    python image_processing.py

Run it until you have enough images. You will probably want to run this script in parallel or on multiple instances.

  2. To transform your data into 32 x 32 pixel images, run (a minimal sketch of this step appears after this list):

    python resize.py

  3. To generate new images based on the phonemes of the filenames in data/imagenet/scaled/, run:

    python vae_gan.py
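
The sketch below illustrates what the resize step (step 2) might look like; it is not the repo's resize.py. It assumes the downloaded images live in a hypothetical data/imagenet/raw/ directory and writes 32 x 32 copies into data/imagenet/scaled/, the directory mentioned in step 3.

    import os
    import cv2

    src_dir = "data/imagenet/raw"     # assumed download location
    dst_dir = "data/imagenet/scaled"  # directory read by vae_gan.py
    os.makedirs(dst_dir, exist_ok=True)

    for name in os.listdir(src_dir):
        image = cv2.imread(os.path.join(src_dir, name))
        if image is None:  # skip anything OpenCV cannot decode
            continue
        small = cv2.resize(image, (32, 32), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(dst_dir, name), small)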

Because the rest of the KALDI code is not a dependency of this project, I've included only the relevant directories from tedlium (lang/ and local/) to reduce the size of the upload. The idea is that this can integrate with KALDI/TEDLIUM by putting this code into the tedlium directory and running it from there.

About

Exploring the information-aesthetic content of speech sounds.
