Database normalization is the process of organizing a relational schema so each fact is stored in the right place and data anomalies are reduced.
If you only remember one idea, remember this: normalization helps you avoid storing the same business fact in many rows where it can later become inconsistent.
This article follows Relational Data Modeling and prepares you for SQL Index Performance.
What is database normalization?#
Normalization is a design technique used in relational databases to reduce:
- duplicated data;
- inconsistent updates;
- insert problems;
- delete problems;
- hidden dependencies.
It does this by splitting badly structured tables into smaller, well-defined tables connected by keys.
Why normalization matters#
A non-normalized schema often works at the beginning because it looks simple. But once the data grows, the same shortcut creates maintenance problems.
Normalization improves:
- data consistency;
- schema clarity;
- maintainability;
- integrity enforcement;
- long-term evolution of the database.
The three classic anomalies#
Insert anomaly#
You cannot add one fact without inventing another.
Imagine a table that mixes instructors and courses:
| instructor_name | birth_date | course_code |
|---|---|---|
| Alessandro Ferrini | 1986-11-03 | DBMS-101 |
| Donata Bini | 1992-12-09 | PYTH-206 |
If a new instructor has not been assigned a course yet, where do you store them?
Update anomaly#
The same value appears in many rows and must be updated everywhere.
| full_name | hire_date | duration | |
|---|---|---|---|
| Alessandro Ferrini | 2024-10-23 | 3 | im@alessandroferrini.com |
| Alessandro Ferrini | 2025-01-20 | 20 | aferrini@example.com |
Now the same person has two emails in the database.
Delete anomaly#
Removing one row accidentally removes another important fact.
If the only row containing an instructor also contains their only course assignment, deleting that assignment may delete the instructor record entirely.
Key concepts before normal forms#
Candidate key#
A candidate key is any minimal set of attributes that uniquely identifies a row.
Primary key#
The primary key is the candidate key chosen as the official row identifier.
Functional dependency#
A functional dependency means that one attribute determines another.
If city -> region, then knowing the city is enough to know the region.
Partial dependency#
A partial dependency exists when a non-key column depends on only part of a composite key.
Transitive dependency#
A transitive dependency exists when a non-key column depends on another non-key column instead of depending directly on the primary key.
First Normal Form (1NF)#
Rule#
A table is in 1NF when:
- each column stores atomic values;
- there are no repeating groups;
- each row can be uniquely identified.
Bad example#
| order_id | customer | products | order_date |
|---|---|---|---|
| 1 | Alice Rossi | Pen, Notebook | 2026-01-10 |
| 2 | Marco Bianchi | Mouse | 2026-01-11 |
The products column stores a list, not a single value.
Better 1NF version#
| order_id | customer | product | order_date |
|---|---|---|---|
| 1 | Alice Rossi | Pen | 2026-01-10 |
| 1 | Alice Rossi | Notebook | 2026-01-10 |
| 2 | Marco Bianchi | Mouse | 2026-01-11 |
Now each field is atomic.
Second Normal Form (2NF)#
Rule#
A table is in 2NF when:
- it is already in 1NF;
- every non-key attribute depends on the whole primary key.
2NF matters mainly when the primary key is composite.
Bad example#
Suppose the key is (product_id, sale_point_id):
| product_id | sale_point_id | product_name | region | qty |
|---|---|---|---|---|
| 1 | 10 | Caffe Arabica | Lombardia | 300 |
| 1 | 11 | Caffe Arabica | Piemonte | 250 |
| 2 | 10 | Te Verde | Lombardia | 200 |
Problems:
product_namedepends only onproduct_id;regiondepends only onsale_point_id.
That means the table contains partial dependencies.
Better 2NF version#
products
| product_id | product_name |
|---|---|
| 1 | Caffe Arabica |
| 2 | Te Verde |
sale_points
| sale_point_id | region |
|---|---|
| 10 | Lombardia |
| 11 | Piemonte |
sales
| product_id | sale_point_id | qty |
|---|---|---|
| 1 | 10 | 300 |
| 1 | 11 | 250 |
| 2 | 10 | 200 |
Now each non-key attribute depends on the proper key.
Third Normal Form (3NF)#
Rule#
A table is in 3NF when:
- it is already in 2NF;
- it has no transitive dependencies.
Bad example#
| client_id | client_name | city | region |
|---|---|---|---|
| 1 | Alice Rossi | Prato | Toscana |
| 2 | Marco Bianchi | Milano | Lombardia |
If each city belongs to one region, then:
client_id -> city -> region
So region does not depend directly on client_id; it depends on city.
Better 3NF version#
clients
| client_id | client_name | city |
|---|---|---|
| 1 | Alice Rossi | Prato |
| 2 | Marco Bianchi | Milano |
cities
| city | region |
|---|---|
| Prato | Toscana |
| Milano | Lombardia |
Now each table stores one type of fact.
Boyce-Codd Normal Form (BCNF)#
Rule#
A table is in BCNF when every determinant in a non-trivial functional dependency is a candidate key.
Example#
| professor | room | course |
|---|---|---|
| Prof. Bianchi | Aula Magna | DBMS-101 |
| Prof. Bianchi | Aula Magna | DBMS-102 |
| Prof. Rossi | Aula 2 | PYTH-206 |
If room -> professor, but room is not a candidate key for the table, the design violates BCNF.
Better BCNF version#
rooms
| room | professor |
|---|---|
| Aula Magna | Prof. Bianchi |
| Aula 2 | Prof. Rossi |
courses
| course | room |
|---|---|
| DBMS-101 | Aula Magna |
| DBMS-102 | Aula Magna |
| PYTH-206 | Aula 2 |
Do you always need 4NF and 5NF?#
Usually no.
For most transactional systems, getting to 3NF or BCNF is already enough. Fourth and fifth normal forms are useful in more specialized cases involving multivalued dependencies or join dependencies, but they are far less common in everyday application design.
Normalization vs denormalization#
Normalization is about correctness and maintainability.
Denormalization is a deliberate trade-off for performance or reporting convenience.
Examples of denormalization:
- storing a precomputed total;
- duplicating a descriptive field to avoid a join in a read-heavy system;
- using summary tables for dashboards.
The key point is that denormalization should be intentional, documented, and measurable.
Common mistakes when studying normalization#
Treating normalization as only theory#
Normalization is not just for exams. It explains why some schemas become painful in production.
Splitting tables too early without understanding dependencies#
You should split a table because of a dependency problem, not because “more tables feels more relational”.
Confusing 1NF with full normalization#
Atomic columns are only the first step. A table can satisfy 1NF and still be poorly designed.
Ignoring query patterns completely#
A normalized schema is usually the right starting point, but performance tuning still matters later. That is why indexing comes after modeling and normalization, not before.
FAQ#
What is normalization in DBMS?#
It is the process of structuring relational tables so that redundancy and data anomalies are reduced.
What are 1NF, 2NF, and 3NF?#
They are successive normal forms:
- 1NF removes repeating groups and non-atomic values;
- 2NF removes partial dependencies;
- 3NF removes transitive dependencies.
What is the difference between 3NF and BCNF?#
BCNF is stricter. A table can satisfy 3NF but still violate BCNF when a determinant is not a candidate key.
Is normalization always good?#
It is almost always the best design starting point. Later, if a proven performance issue appears, you can denormalize carefully.
Next step#
After normalization, study SQL Index Performance to understand how a clean schema interacts with indexes, query plans, and real workload tuning.