[Paper] Natural Language Processing: A Review — IJERT (ISSN 2249-3905)
Natural Language Processing: A Review
Survey Paper · International Journal of Engineering Research and Technology (IJERT) · ISSN 2249-3905
| Field | Detail |
| --- | --- |
| Title | Natural Language Processing: A Review |
| Type | Survey / Review Paper |
| Journal | International Journal of Engineering Research and Technology (IJERT) |
| ISSN | 2249-3905 |
| Scope | NLP fundamentals · core tasks · approaches from rule-based to neural · applications |
Natural Language Processing sits at the intersection of linguistics, computer science, and artificial intelligence. From the early rule-based parsers of the 1950s to today's large language models, the field has moved through three distinct paradigms. This survey maps the full arc — providing a structured entry point for practitioners who need to understand not just what NLP can do today, but why it works the way it does, and where the hard problems remain.
- What Is NLP — Scope and Goals
- Three Paradigms: Rule-Based → Statistical → Neural
- Core NLP Task Taxonomy
- Key Tasks in Depth
- Applications Across Domains
- Challenges and Open Problems
- Assessment & Closing Reflection
📌 (1) What Is NLP — Scope and Goals
Natural Language Processing (NLP) is the subfield of AI concerned with enabling computers to understand, interpret, generate, and interact with human language — in text and speech form. Unlike formal languages (programming languages, logic notation), natural language is inherently ambiguous, context-dependent, and constantly evolving.
- **Understanding:** parsing syntax, resolving semantics, identifying named entities, coreference resolution
- **Transformation:** translation, summarization, paraphrase, style transfer between languages and registers
- **Generation:** fluent text generation, dialogue systems, question answering, story generation
- **Analysis:** sentiment analysis, topic modeling, information extraction, intent classification
🔄 (2) Three Paradigms: Rule-Based → Statistical → Neural
NLP has passed through three distinct paradigms, each supplanting the last while borrowing its insights:
- **Rule-Based:** Hand-crafted grammars, pattern matching, expert-coded linguistic rules. High precision on narrow domains. Brittle — breaks immediately outside the rule scope. Does not scale.
- **Statistical:** n-gram language models, HMMs, CRFs, SVMs trained on large corpora. Replaced rules with probability estimates from data. Enabled machine translation (IBM models, phrase-based SMT) and part-of-speech tagging at scale.
- **Neural:** Word embeddings (Word2Vec, GloVe) → sequence models (LSTM, GRU) → attention mechanisms → Transformers (BERT, GPT, T5). End-to-end learned representations replace hand-engineered features entirely.
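The statistical era's core idea, estimating probabilities from corpus counts, fits in a few lines. The sketch below is an illustration rather than code from the survey (the toy corpus and function names are invented here): a maximum-likelihood bigram language model.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count adjacent word pairs to estimate P(next | previous)
    by maximum likelihood, with <s> and </s> boundary markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Conditional probability of `nxt` following `prev` (0.0 if unseen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

lm = train_bigram_lm(["the cat sat", "the dog sat", "the cat ran"])
# In this toy corpus, "the" is followed by "cat" in two of three sentences
```

Real statistical systems add smoothing (e.g. Kneser-Ney) so unseen bigrams do not get zero probability; that refinement is omitted here for brevity.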
"Attention Is All You Need" (Vaswani et al., 2017) replaced recurrence with self-attention, enabling massive parallelization and scaling. Pre-trained Transformers (BERT, GPT) introduced the fine-tuning paradigm — pre-train on large corpora, fine-tune on task-specific data — which now dominates NLP across virtually every task.
🗂 (3) Core NLP Task Taxonomy
NLP tasks are traditionally organized by linguistic level, from morphology and syntax up through semantics, discourse, and pragmatics. The survey categorizes them along these levels.
🔬 (4) Key Tasks in Depth
**Named Entity Recognition (NER).** Identifies and classifies mentions of named entities (persons, organizations, locations, dates) in text. Evolved from hand-crafted gazetteers → CRF sequence labeling → BiLSTM-CRF → BERT fine-tuning. Now achieves near-human F1 on standard benchmarks (CoNLL-2003).
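Whatever the underlying model, NER systems typically emit per-token BIO tags (B- begins an entity, I- continues it, O is outside) that must be decoded into spans. A minimal decoder, written here as an illustration rather than taken from the survey:

```python
def decode_bio(tokens, tags):
    """Convert parallel token/BIO-tag lists into (entity_text, type) spans.
    Malformed I- tags (no matching B-, or mismatched type) end the span."""
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # flush any open entity
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(token)             # continue the open entity
        else:                                 # O tag or malformed I- tag
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:                               # flush entity at end of sequence
        entities.append((" ".join(current), etype))
    return entities

spans = decode_bio(["Barack", "Obama", "visited", "Paris"],
                   ["B-PER", "I-PER", "O", "B-LOC"])
# spans == [("Barack Obama", "PER"), ("Paris", "LOC")]
```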
**Machine Translation (MT).** The oldest large-scale NLP application. Statistical MT (IBM word-alignment models, then phrase-based systems) dominated until around 2016, when neural MT (seq2seq with attention) became standard. Transformer-based architectures now underpin all major MT systems (Google Translate, DeepL).
**Sentiment Analysis.** Classifies the sentiment polarity (positive/negative/neutral) or emotion of text. Spans document-level classification, sentence-level, and aspect-based sentiment (ABSA) — identifying sentiment toward specific entities or attributes within a document.
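The simplest document-level approach is a lexicon-based score, which also shows why negation makes the task harder than keyword matching. The word lists below are tiny invented examples; real systems learn polarity from labeled data:

```python
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Score text by lexicon hits; an immediately preceding negator
    flips the polarity of the next sentiment-bearing word."""
    score, negated = 0, False
    for word in text.lower().split():
        if word in NEGATORS:
            negated = True
            continue
        if word in POSITIVE:
            score += -1 if negated else 1
        elif word in NEGATIVE:
            score += 1 if negated else -1
        negated = False                 # negation scope is one word here
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Cases like sarcasm ("oh, *great*, another outage") defeat this approach entirely, which is precisely why the field moved to learned, context-sensitive models.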
**Question Answering (QA).** Extractive QA (SQuAD) finds answer spans within a given passage. Open-domain QA retrieves relevant documents first, then extracts answers. Generative QA (as in GPT-style models) synthesizes answers rather than extracting them — enabling responses beyond what any single document contains.
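A crude stand-in for an extractive reader can be built from word overlap alone, which makes clear what learned span-prediction models improve on. This is an invented illustration, not the survey's method:

```python
def best_answer_sentence(question, passage):
    """Return the passage sentence with the greatest word overlap with the
    question -- a bag-of-words baseline for extractive QA."""
    ignore = {"what", "who", "where", "when", "which", "is", "the", "a", "of"}
    q_words = set(question.lower().split()) - ignore
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    # Pick the sentence sharing the most content words with the question
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))
```

A neural reader would instead predict start/end token positions inside the best sentence, and would match "stands in" to "where" despite zero lexical overlap.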
**Text Summarization.** Extractive methods select and combine existing sentences. Abstractive methods generate new sentences capturing the core meaning. Modern neural abstractive summarizers (PEGASUS, BART) approach human-level performance on news summarization benchmarks.
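A frequency-based extractive summarizer in the spirit of Luhn's classic method can be sketched briefly; the stopword list and scoring function here are simplified illustrations:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "and", "to", "in", "it"}

def extractive_summary(text, n=1):
    """Pick the n sentences with the highest summed content-word frequency,
    returned in their original document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    content = [w for w in re.findall(r"[a-z']+", text.lower())
               if w not in STOPWORDS]
    freq = Counter(content)            # stopwords get frequency 0 below

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    return " ".join(sentences[i] for i in sorted(ranked[:n]))

text = "Cats are great. Cats sleep a lot. Dogs bark."
```

Abstractive models like PEGASUS and BART replace this select-and-stitch step with free generation conditioned on the whole document, trading guaranteed faithfulness for fluency.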
🌐 (5) Applications Across Domains
- **Healthcare:** Clinical NLP for EHR analysis, ICD coding, adverse event detection, medical literature mining (PubMed NLP)
- **Legal:** Contract analysis, case law retrieval, regulatory compliance monitoring, legal document summarization
- **Finance:** News sentiment for trading signals, earnings call analysis, regulatory filing extraction, fraud detection in communications
- **Cybersecurity:** Threat intelligence extraction from dark web text, malware report analysis, phishing detection, vulnerability disclosure NLP
- **Maritime:** Port state control report mining, incident log analysis, classification society circular extraction, AIS vessel communication processing
- **Education:** Automated essay scoring, reading comprehension assistance, intelligent tutoring systems, language learning feedback
⚠️ (6) Challenges and Open Problems
- **Ambiguity:** Lexical, syntactic, and pragmatic ambiguity remains difficult. Irony, sarcasm, and metaphor require world knowledge beyond linguistic pattern matching.
- **Low-resource languages:** Most NLP advances are English-centric. The majority of the world's ~7,000 languages lack sufficient training data for neural approaches. Cross-lingual transfer and multilingual models address this partially.
- **Commonsense and grounding:** Language models learn statistical patterns but lack grounded world models. Commonsense inference — reasoning about physical, social, and temporal relationships — remains a fundamental gap.
- **Bias and fairness:** Models trained on internet text inherit and amplify social biases. Gender, racial, and cultural biases in NLP outputs are well-documented and difficult to fully eliminate without compromising model capability.
🎯 (7) Assessment & Closing Reflection
- **Strength (breadth):** Provides a structured map of the NLP landscape in a single accessible document. Valuable as an entry point for engineers and practitioners approaching NLP from adjacent fields.
- **Strength (coverage):** Covers the full pipeline from linguistic preprocessing to application-level systems, allowing readers to understand where specific techniques fit in the broader architecture.
- **Limitation (recency):** Survey papers in fast-moving fields age rapidly. For the neural NLP landscape post-2020 — including instruction-tuned LLMs, RLHF, and chain-of-thought prompting — more recent literature is essential reading.
NLP is now embedded in virtually every digital product that processes or generates text. The progression from brittle rules to probabilistic models to large neural networks is not merely a technical story — it reflects a deeper shift in how we think about encoding human knowledge: less explicit specification, more learned approximation from data.
Whether you are new to NLP or revisiting its foundations to contextualize the LLM era — this survey provides the vocabulary and structural map you need. Start with the task taxonomy. Understand the three paradigm shifts. Then read the Transformer paper. Everything else in modern NLP follows from those anchors.
— Captain Ethan, ShipPaulJobs