Study Notes: Natural Language Processing (NLP)

General Science, July 28, 2025

1. Introduction

Natural Language Processing (NLP) is a subfield of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. By bridging human communication and machine computation, it lets machines process text and speech in a meaningful way.

[Figure: NLP Pipeline]

2. Key Concepts

2.1 Tokenization

Breaking text into smaller units (words, sentences, or subwords).
Example: “AI is amazing.” → [“AI”, “is”, “amazing”, “.”]
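
A minimal sketch of word-level tokenization with a regular expression. The pattern and the function name tokenize are illustrative assumptions, not a standard API; production tokenizers also handle contractions, URLs, and subword units.

```python
import re

# Assumed pattern: a run of word characters, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("AI is amazing."))  # ['AI', 'is', 'amazing', '.']
```

Treating punctuation as separate tokens keeps the sentence boundary available to later steps such as tagging or summarization.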

2.2 Part-of-Speech Tagging

Assigning grammatical categories (noun, verb, adjective) to each token.
Useful for syntactic analysis.
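
A toy illustration of tagging, assuming a tiny hand-written lexicon with suffix-based fallbacks. The lexicon entries, the coarse tag names, and the function pos_tag are hypothetical; real taggers use the surrounding context and are trained on annotated corpora.

```python
# Hypothetical mini-lexicon mapping lowercase words to coarse tags.
LEXICON = {
    "the": "DET", "a": "DET", "is": "VERB",
    "dog": "NOUN", "ai": "NOUN", "amazing": "ADJ",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, then by suffix, then default to NOUN."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tag = LEXICON[word]
        elif word.endswith("ly"):
            tag = "ADV"
        elif word.endswith("ing") or word.endswith("ed"):
            tag = "VERB"
        else:
            tag = "NOUN"  # most unknown words are nouns, so this is the fallback
        tagged.append((token, tag))
    return tagged

print(pos_tag(["AI", "is", "amazing"]))
# [('AI', 'NOUN'), ('is', 'VERB'), ('amazing', 'ADJ')]
```

Because this sketch ignores context, it cannot tell whether an ambiguous word such as "run" is a noun or a verb in a given sentence.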

2.3 Named Entity Recognition (NER)

Identifying entities such as people, organizations, and locations in text.
Example: “Microsoft is headquartered in Redmond.” → [“Microsoft” (ORG), “Redmond” (LOC)]
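
A minimal sketch of entity lookup, assuming a small dictionary of known names (a gazetteer). The entries and the function find_entities are hypothetical; modern NER systems are learned models that also recognize names they have never seen.

```python
import re

# Hypothetical gazetteer mapping known names to entity labels.
GAZETTEER = {"Microsoft": "ORG", "Redmond": "LOC", "Paris": "LOC"}

def find_entities(text):
    """Return (name, label) pairs for gazetteer names, in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        # Word boundaries stop "Paris" from matching inside "Parisian".
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, label))
    return [(name, label) for _, name, label in sorted(found)]

print(find_entities("Microsoft is headquartered in Redmond."))
# [('Microsoft', 'ORG'), ('Redmond', 'LOC')]
```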

2.4 Sentiment Analysis

Determining the emotional tone behind text.
Used in product review analysis and social media monitoring.
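
A minimal lexicon-based sketch of sentiment scoring. The word lists and the function sentiment are illustrative assumptions; this approach ignores negation and sarcasm, which is one reason learned classifiers are usually preferred.

```python
import re

# Hypothetical word lists; real sentiment lexicons are far larger.
POSITIVE = {"great", "good", "amazing", "love", "excellent"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def sentiment(text):
    """Label text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(sentiment("Great product!"))  # Positive
```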

2.5 Machine Translation

Automatically translating text from one language to another.
Example: English to Spanish translation.

2.6 Text Summarization

Condensing long documents into concise summaries.
Two types: extractive (selecting key sentences from the source) and abstractive (generating new sentences).
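
The extractive variant can be sketched with word-frequency scoring: sentences containing the document's most frequent content words are kept. The stopword list, the scoring rule, and the function summarize are assumptions for illustration, not a reference method.

```python
import re
from collections import Counter

# Hypothetical stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "also", "between"}

def content_words(text):
    """Lowercase alphabetic words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))
    scores = [sum(freq[w] for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("NLP models process text. "
        "NLP models also translate text between languages. "
        "The weather is nice.")
print(summarize(text))  # NLP models also translate text between languages.
```

Summing frequencies favors longer sentences; dividing each score by sentence length is a common adjustment.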

3. NLP Techniques

3.1 Rule-Based Approaches

Use hand-crafted linguistic rules.
Limited scalability and adaptability.

3.2 Statistical Methods

Utilize probabilistic models (e.g., Hidden Markov Models, Naive Bayes).
Supervised variants typically require large annotated datasets.
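
As a sketch of the statistical approach, here is a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The four training examples and the function names train and predict are made up for illustration; a real model needs far more annotated data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count documents per label and words per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log prior plus log likelihood."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only.
model = train([
    ("great product love it".split(), "pos"),
    ("excellent quality great value".split(), "pos"),
    ("terrible product hate it".split(), "neg"),
    ("poor quality awful value".split(), "neg"),
])
print(predict(model, "great value".split()))    # pos
print(predict(model, "awful product".split()))  # neg
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.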

3.3 Deep Learning

Neural networks (RNNs, CNNs, Transformers) for complex tasks.
Transformers, introduced in 2017, underpin models such as BERT and GPT (both released in 2018) and have dominated NLP since.

[Figure: Transformer Architecture]

4. Applications

Search Engines: Understanding queries, ranking results.
Voice Assistants: Speech recognition, natural conversation.
Healthcare: Extracting information from clinical notes, predicting patient outcomes.
Drug Discovery: Mining scientific literature for new compounds.
Social Media Analysis: Detecting trends, misinformation, and sentiment.

5. Surprising Facts

NLP models can generate realistic synthetic scientific papers that sometimes fool peer reviewers (Nature, 2021).
Language models can predict the properties of molecules by interpreting chemical notation as a language, accelerating drug and material discovery.
NLP is used to revive endangered languages by analyzing historical texts and generating new educational materials.
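
To make the "chemical notation as a language" idea concrete, the sketch below tokenizes a molecule string the way a sentence is tokenized. The notes do not name a notation; SMILES is assumed here, and the simplified pattern and the function tokenize_smiles are illustrative only.

```python
import re

# Simplified, assumed pattern: bracket atoms, the two-letter atoms Br and Cl,
# single-letter atoms, ring-closure digits, then any other symbol.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[^\sA-Za-z\d]")

def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, and bracket tokens."""
    return SMILES_PATTERN.findall(smiles)

# "CC(=O)Cl" is the SMILES string for acetyl chloride.
print(tokenize_smiles("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

Once a molecule is a token sequence, it can be fed to the same kinds of sequence models used for text.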

6. Global Impact

Healthcare: NLP extracts patient data from unstructured records, improving diagnostics and personalized medicine.
Education: Automated essay scoring, personalized feedback, and language learning tools.
Business: Chatbots, customer support automation, and market analysis.
Science: Rapid literature review, hypothesis generation, and data extraction for research.

Story Example

A team of scientists used NLP to analyze millions of published research articles on COVID-19. By automatically extracting relationships between drugs, genes, and symptoms, they identified promising drug candidates in weeks, a process that previously took years. Literature mining of this kind helped speed up the research response to the pandemic.

7. Future Trends

Multimodal NLP: Combining text, images, and audio for richer understanding.
Low-Resource Language Models: Extending NLP to languages with limited data.
Explainable NLP: Making model decisions transparent for trust and safety.
Integration with Robotics: Enabling robots to understand human instructions in natural language.
Real-Time Translation: Seamless communication across languages in video calls and conferences.

8. Recent Research

A 2022 study published in Nature Machine Intelligence demonstrated how transformer-based NLP models can accelerate drug discovery by mining chemical literature and predicting molecular properties. This research highlights the growing synergy between NLP and scientific innovation.

9. Summary Table

Concept	Description	Example
Tokenization	Splitting text into units	“Hello world!” → [“Hello”, “world”, “!”]
POS Tagging	Assigning grammatical categories	“run” in “I run daily” → Verb
NER	Identifying entities	“Paris” → Location
Sentiment Analysis	Detecting emotion or opinion	“Great product!” → Positive
Machine Translation	Translating languages	English → Spanish
Text Summarization	Condensing information	Long article → Short summary

10. References

Nature Machine Intelligence, 2022: “Accelerating drug discovery with transformer-based NLP models”
Wikipedia: Natural Language Processing
Jalammar, Transformer Architecture

11. Visual Summary

[Figure: NLP Applications]

End of Study Notes