Published May 30, 2024 | License: CC-BY-NC-ND 4.0
Journal article | Open Access

LipNet: End-to-End Lipreading

  • 1. Department of Computer Science, St. Albert's College, Kochi (Kerala), India.


Description

Abstract: Lipreading is the task of decoding text from the movement of a speaker’s mouth. This research presents the development of an advanced end-to-end lipreading system. Leveraging deep learning architectures and multimodal fusion techniques, the proposed system interprets spoken language solely from visual cues such as lip movements. Through careful data collection, annotation, preprocessing, model development, and evaluation, diverse datasets spanning various speakers, accents, languages, and environmental conditions are curated to ensure robustness and generalization. Conventional methods divided the task into two stages: designing or learning visual features, and prediction; in contrast, most deep lipreading methods are trainable end to end. In the past, lipreading was tackled with tedious and often unsatisfactory techniques that break speech down into smaller units such as phonemes or visemes, but these methods frequently fail under real-world conditions such as contextual factors, accents, and differences in speech patterns. Moreover, prior work on end-to-end trained models performs only word classification; sentence-level sequence prediction is not addressed. LipNet is an end-to-end trained model that uses spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss to map a variable-length sequence of video frames to text. LipNet departs from the traditional paradigm through this holistic, end-to-end approach built on deep learning: convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which excel at processing sequential data and extracting high-level representations, are fundamental to LipNet's architecture. LipNet achieves 95.2% sentence-level accuracy on the GRID corpus overlapped-speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy. The results underscore the transformative potential of the lipreading system in real-world applications, particularly in domains such as assistive technology and human-computer interaction, where it can significantly improve communication accessibility and inclusivity for individuals with hearing impairments.
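For illustration, the following is a minimal sketch of a LipNet-style pipeline in PyTorch: spatiotemporal (3D) convolutions over the video frames, a bidirectional GRU over the resulting per-frame features, and CTC loss to align frame-level predictions with the target character sequence. The layer sizes, input shapes, and 28-symbol vocabulary below are illustrative assumptions, not the exact configuration of the published model.

```python
# Minimal LipNet-style sketch (assumed shapes and layer sizes, not the published configuration).
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):  # e.g. 26 letters + space + CTC blank (assumption)
        super().__init__()
        # Spatiotemporal convolutions extract visual features jointly over space and time.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # A bidirectional GRU models the temporal sequence of per-frame features.
        self.gru = nn.GRU(input_size=64, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, vocab_size)

    def forward(self, video):
        # video: (batch, channels, time, height, width)
        feats = self.frontend(video)                    # (B, C, T, H, W)
        feats = feats.mean(dim=(3, 4)).transpose(1, 2)  # pool space -> (B, T, C)
        seq, _ = self.gru(feats)
        logits = self.classifier(seq)                   # (B, T, vocab)
        # CTC expects (T, B, vocab) log-probabilities.
        return logits.log_softmax(dim=-1).transpose(0, 1)

# CTC aligns the variable-length frame sequence with the unsegmented character targets.
model = LipNetSketch()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
video = torch.randn(2, 3, 75, 64, 128)          # two dummy 75-frame mouth-region clips
log_probs = model(video)                        # (75, 2, 28)
targets = torch.randint(1, 28, (2, 30))         # dummy character indices (blank excluded)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 75),
           target_lengths=torch.full((2,), 30))
```

The CTC loss is what permits training on whole sentences without frame-level alignment: the network emits a per-frame character distribution, and the loss marginalizes over all alignments consistent with the target transcription.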

Files

A163204010524.pdf (359.5 kB)
md5:9595b69559074f355b946e4db70eec7f

Additional details

Identifiers

DOI
10.54105/ijdm.A1632.04010524
EISSN
2582-9246

Dates

Received
2024-03-06
Revised
2024-05-02
Accepted
2024-05-15
Published
2024-05-30

References

  • L. Qu, C. Weber and S. Wermter, "LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading," in IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 2, pp. 2772-2782, Feb. 2024, doi: 10.1109/TNNLS.2022.3191677.
  • G. I. Chiou and Jenq-Neng Hwang, "Lipreading from color video," in IEEE Transactions on Image Processing, vol. 6, no. 8, pp. 1192-1195, Aug. 1997, doi: 10.1109/83.605417.
  • I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox and R. Harvey, "Extraction of visual features for lipreading," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002, doi: 10.1109/34.982900.
  • T. Afouras, J. S. Chung, A. Senior, O. Vinyals and A. Zisserman, "Deep Audio-Visual Speech Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717-8727, 1 Dec. 2022, doi: 10.1109/TPAMI.2018.2889052.
  • F. Xue, Y. Li, D. Liu, Y. Xie, L. Wu and R. Hong, "LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4507-4517, Sept. 2023, doi: 10.1109/TCSVT.2023.3282224.
  • Konduri, R. R., Roopavathi, N., Lakshmi, B. V., & Chaitanya, P. V. K. (2024). Mitigating Peak Sidelobe Levels in Pulse Compression Radar using Artificial Neural Networks. In Indian Journal of Artificial Intelligence and Neural Networking (Vol. 3, Issue 6, pp. 12–20). https://doi.org/10.54105/ijainn.f9517.03061023
  • Kumar, P., & Rawat, S. (2019). Implementing Convolutional Neural Networks for Simple Image Classification. In International Journal of Engineering and Advanced Technology (Vol. 9, Issue 2, pp. 3616–3619). https://doi.org/10.35940/ijeat.b3279.129219
  • Reddy, M. V. K., & Pradeep, Dr. S. (2021). Envision Foundational of Convolution Neural Network. In International Journal of Innovative Technology and Exploring Engineering (Vol. 10, Issue 6, pp. 54–60). https://doi.org/10.35940/ijitee.f8804.0410621
  • Magapu, H., Krishna Sai, M. R., & Goteti, B. (2024). Human Deep Neural Networks with Artificial Intelligence and Mathematical Formulas. In International Journal of Emerging Science and Engineering (Vol. 12, Issue 4, pp. 1–2). https://doi.org/10.35940/ijese.c9803.12040324
  • Razia, Dr. S., Reddy, M. V. D., Mohan, K. J. S., & Teja, D. S. (2019). Image Classification using Deep Learning Framework. In International Journal of Recent Technology and Engineering (IJRTE) (Vol. 8, Issue 4, pp. 10253–10258). https://doi.org/10.35940/ijrte.d4462.118419