1. Introduction

Speech recognition is the technology that allows computers to understand and process human speech. It converts spoken words into text or commands, enabling hands-free control, accessibility features, and new ways to interact with devices.


2. History of Speech Recognition

Early Beginnings (1950s-1970s)

  • 1952: Bell Labs created “Audrey,” which recognized spoken digits (0-9) from a single voice.
  • 1960s: IBM’s “Shoebox” could understand 16 English words, mostly numbers and simple commands.
  • 1970s: Hidden Markov Models (HMMs) introduced statistical methods for recognizing speech patterns.

Key Milestones (1980s-1990s)

  • 1980s: DARPA (U.S. Defense Advanced Research Projects Agency) funded research, leading to breakthroughs in continuous speech recognition.
  • Dragon Systems (founded 1982): Released DragonDictate in 1990, one of the first consumer speech recognition products.
  • 1990s: Large Vocabulary Continuous Speech Recognition (LVCSR) systems developed, capable of recognizing thousands of words.

Modern Era (2000s-present)

  • 2000s: Machine learning and deep learning revolutionized speech recognition accuracy.
  • 2010s: Smartphones integrated voice assistants (e.g., Siri, Google Assistant).
  • 2020s: End-to-end neural networks and transformer models (like Wav2Vec 2.0) set new benchmarks.

3. Key Experiments in Speech Recognition

Dynamic Time Warping (DTW)

  • Used in the 1970s to match spoken words against stored templates by aligning the two signals in time, allowing for variations in speaking speed and pronunciation.
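The idea behind DTW can be shown in a few lines. Below is a minimal sketch of the classic dynamic-programming recurrence on two toy 1-D feature tracks (the sequences and values are illustrative, not real speech features):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Fills a cost matrix where each cell holds the cheapest cumulative
    alignment cost so far, letting one sequence stretch or compress in
    time relative to the other.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A time-stretched version of a word's feature track still matches closely:
template = [1, 2, 3, 2, 1]
spoken   = [1, 1, 2, 3, 3, 2, 1]   # same shape, spoken more slowly
other    = [3, 1, 3, 1, 3, 1, 3]   # a different word
print(dtw_distance(template, spoken))  # small: good match despite stretching
print(dtw_distance(template, other))   # larger: poor match
```

A 1970s recognizer would compare an incoming utterance against one stored template per word and pick the template with the lowest DTW cost.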

Hidden Markov Models (HMMs)

  • Equation:
    • Probability of sequence:
      P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ)
      Where O is the observed sequence, Q is the state sequence, and λ is the model.
  • HMMs model speech as a sequence of states with transition probabilities, enabling recognition of variable speech.
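In practice the sum over all state sequences Q is computed efficiently with the forward algorithm. A minimal sketch on a made-up two-state model (the states, observations, and probabilities are illustrative only, not trained from speech data):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(O | λ), the total probability of an observation
    sequence, summed over all hidden state sequences Q."""
    # alpha[s] = probability of the observed prefix, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy two-state model: is the current frame voiced speech or silence?
states = ["voiced", "silence"]
start_p = {"voiced": 0.6, "silence": 0.4}
trans_p = {"voiced":  {"voiced": 0.7, "silence": 0.3},
           "silence": {"voiced": 0.4, "silence": 0.6}}
emit_p = {"voiced":  {"high": 0.8, "low": 0.2},   # energy level observed
          "silence": {"high": 0.1, "low": 0.9}}

print(forward(["high", "high", "low"], states, start_p, trans_p, emit_p))
```

This replaces an exponential sum over every possible Q with a pass that is linear in the sequence length, which is what made HMMs practical for speech.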

Deep Neural Networks (DNNs)

  • Replaced Gaussian mixture models (GMMs) for acoustic modeling, often combined with HMMs in hybrid DNN-HMM systems.
  • Use multiple layers to learn complex features from audio.

End-to-End Models

  • Wav2Vec 2.0 (2020):
    • Uses transformer architecture to learn directly from raw audio.
    • Achieves state-of-the-art accuracy with less labeled data.

Recent Study

  • Baevski et al. (2020):
    “Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477).
    Demonstrated that large transformer models trained on unlabeled speech data can outperform traditional supervised models.

4. Modern Applications of Speech Recognition

Voice Assistants

  • Examples: Siri, Alexa, Google Assistant.
  • Enable hands-free device control, reminders, and information retrieval.

Accessibility Tools

  • Helps people with disabilities use computers and smartphones.
  • Converts speech to text for people who cannot type.

Automated Transcription

  • Converts meetings, lectures, and interviews into written text.
  • Used in journalism, education, and legal fields.

Call Centers & Customer Service

  • Automates responses and routes calls based on spoken requests.
  • Improves efficiency and customer experience.

Language Learning

  • Provides pronunciation feedback and interactive speaking exercises.

Healthcare

  • Doctors dictate notes directly into electronic health records.
  • Reduces paperwork and improves accuracy.

Drug and Materials Discovery

  • AI-powered speech recognition assists researchers in documenting findings and sharing results hands-free.
  • Speeds up collaboration in labs.

Automotive

  • Voice-controlled navigation, entertainment, and communication systems in cars.

5. Practical Applications

In the Classroom

  • Students use speech recognition to take notes or write essays by speaking.
  • Teachers record and transcribe lessons for review.

At Home

  • Smart speakers control lights, music, and appliances with voice commands.

For Research

  • Scientists use speech recognition to record lab notes and automate data entry.

In Programming

  • Developers use voice commands to write code, debug, and run tests in IDEs like Visual Studio Code.

6. Key Equations in Speech Recognition

Feature Extraction

  • Mel-Frequency Cepstral Coefficients (MFCCs):
    • Converts audio signals into a set of features for recognition.
    • MFCC = DCT(log(Mel(Spectrum)))
      Where Mel(Spectrum) applies a mel-scale filterbank to the power spectrum and DCT is the Discrete Cosine Transform.
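The final step of that pipeline can be sketched directly. Below, the mel filterbank energies for one audio frame are made-up illustrative values (computing them for real would require an FFT and a mel filterbank, omitted here); the code shows only the log-then-DCT step, using a hand-rolled DCT-II:

```python
import math

def dct2(x):
    """Discrete Cosine Transform (type II), the DCT used for MFCCs."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * j * (k + 0.5) / n) for k in range(n))
            for j in range(n)]

def mfcc_from_mel(mel_energies, n_coeffs=4):
    """MFCC = DCT(log(mel filterbank energies)); keep the first few
    coefficients, which summarize the spectral shape of the frame."""
    log_mel = [math.log(e) for e in mel_energies]
    return dct2(log_mel)[:n_coeffs]

# Toy mel filterbank energies for a single frame (illustrative values only)
mel_energies = [12.0, 9.5, 7.1, 4.2, 2.8, 1.9]
print(mfcc_from_mel(mel_energies))
```

Keeping only the first few DCT coefficients compactly describes the overall spectral envelope, which is why MFCCs were the standard input features for HMM- and early DNN-based recognizers.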

Acoustic Modeling

  • Hidden Markov Model (HMM):

    P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ)
    Models the probability of observed features given the speech model.

Neural Network Training

  • Cross-Entropy Loss Function:

    L = -Σ_i y_i log(p_i)
    Where y_i is the true label (1 for the correct class, 0 otherwise) and p_i is the predicted probability for class i.
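This loss can be computed directly for one prediction. A minimal sketch with a toy three-class example (say, three candidate phonemes; the probabilities are illustrative):

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross-entropy loss L = -sum_i y_i * log(p_i) for one prediction.
    y_true is a one-hot label vector; y_pred is a probability distribution
    over the same classes (e.g. candidate phonemes or characters)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y_true    = [0, 1, 0]            # the correct class is index 1
confident = [0.1, 0.8, 0.1]      # model puts high probability on the truth
uncertain = [0.4, 0.3, 0.3]      # model is unsure

print(cross_entropy(y_true, confident))  # lower loss
print(cross_entropy(y_true, uncertain))  # higher loss
```

The loss falls as the model assigns more probability to the correct class, which is exactly the signal gradient descent uses to train the network.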

7. Common Misconceptions

  • Speech recognition is perfect:
    • Even the best systems make mistakes, especially with accents, background noise, or technical vocabulary.
  • Speech recognition understands meaning:
    • It converts speech to text but does not comprehend context or intent unless combined with natural language understanding.
  • Works equally for all languages:
    • Accuracy varies by language, dialect, and available training data.
  • Requires internet connection:
    • Many systems can work offline, though cloud-based models often perform better.
  • Only for smartphones:
    • Used in cars, medical devices, research labs, and more.

8. Recent Research & News

  • Self-Supervised Learning:
    • Wav2Vec 2.0 (Baevski et al., 2020) showed that speech recognition models can be trained on large amounts of unlabeled data, making it easier to support new languages and dialects.
  • AI for Drug Discovery:
    • Speech recognition integrated into laboratory workflows enables researchers to record experimental results hands-free, speeding up drug and materials discovery (Nature News, 2023).

9. Summary

Speech recognition has evolved from simple digit recognizers to advanced AI-powered systems using deep neural networks and transformers. Key experiments such as DTW, HMMs, and end-to-end models have driven progress. Today, speech recognition is used in voice assistants, accessibility tools, healthcare, research, and more. While powerful, it is not perfect and faces challenges with noise, accents, and context. Recent advances, like self-supervised learning, continue to improve accuracy and expand applications. Understanding speech recognition helps students appreciate how AI is transforming communication, research, and daily life.