Relational Normalization

This article builds on the foundation in Introduction to Databases. It explains the relational mindset, why normalization keeps the schema clean, and how the model prepares you for the Relational Data Modeling workflow and SQL Index Performance lessons that follow.

Relational data models and the DBMS role
#

Relational modeling turns business concepts into rows, columns, and constraints so a Database Management System can enforce integrity, manage storage, and present consistent interfaces. During analysis you name entities, describe attributes, and define relationships so the DBMS can guard foreign keys, unique constraints, and cardinality rules. This clarity keeps joins predictable, safeguards migrations, and lets the optimizer reason about queries-see how relational algebra operators surface in relational algebra operations.

Why normalization matters
#

Normalization reorganizes tables so each fact lives in exactly one place. That minimizes redundancy, prevents contradictory rows, and makes the dataset easier to understand. Clean tables also reduce storage costs and produce simpler joins, which helps query planners and downstream developers alike.

Insert anomalies
#

When you try to add a new instructor without enrolling them in a course, you either duplicate irrelevant data or introduce NULL placeholders.

full_name	date_of_birth	course_code
Alessandro Ferrini	1986-11-03	DBMS-101
Donata Bini	1992-12-09	PYTH-206
Fabrizio Moneta	2005-04-11	PYTH-206

Splitting personal details from enrollment records lets you add the instructor before their first assignment appears.

Update anomalies
#

When the same person appears in multiple rows, updating an email requires touching all of them.

full_name	hire_date	duration	email
Alessandro Ferrini	2024-10-23	3	im@alessandroferrini.com
Donata Bini	2024-11-09	12	donata.bini@example.com
Fabrizio Moneta	2024-07-11	5	f.moneta@example.com
Alessandro Ferrini	2025-01-20	20	aferrini@example.com

Normalization centralizes personal information, so a single update keeps every row aligned.

Deletion anomalies
#

Deleting the DBMS-101 row currently removes Alessandro’s profile entirely, even if he teaches other courses.

full_name	date_of_birth	course_code
Alessandro Ferrini	1986-11-03	DBMS-101
Donata Bini	1992-12-09	PYTH-206
Fabrizio Moneta	2005-04-11	PYTH-206

Separating enrollments from instructor data makes it safe to delete a course record without losing the person.

Benefits and normal forms
#

Normalization delivers smaller tables with a single source of truth. That results in lower redundancy, stronger data integrity, easier maintenance, and faster planning because the optimizer can reason about focused structures.

Normal forms are the steps that deliver this benefit. 1NF requires atomic values, 2NF removes partial dependencies, 3NF removes transitive dependencies, and BCNF or higher guards more subtle cases such as overlapping candidate keys or multivalued dependencies.

Key definitions
#

Candidate key: The minimal attribute set that uniquely identifies each row (for example, codice_fiscale or email on PERSONE).
Primary key: The chosen candidate key that serves as the table’s official identifier.
Prime attribute: An attribute that participates in at least one candidate key.
Non-prime attribute: An attribute that does not participate in any candidate key and therefore should depend on prime attributes.
Functional dependency: A -> B means that identical values for A guarantee identical values for B (for example, city -> region in a Comuni table).

Normalization - First Normal Form (1NF)
#

1NF insists that every attribute holds atomic data, forbidding repeating groups and lists, and that each row is identifiable via a primary key. When a field stores several values, constraints, indexes, and predictable joins break down.

order_id	client	products	order_date
1	Alessandro Ferrini	ProdottoA, ProdottoB	10-01-2024
2	Donata Bini	ProdottoC	11-01-2024
3	Fabrizio Moneta	ProdottoA, ProdottoD	12-01-2024

Enforcing 1NF reshapes the table so each row contains a single product entry but repeats the order_id and client values:

order_id	client	product	order_date
1	Alessandro Ferrini	ProdottoA	10-01-2024
1	Alessandro Ferrini	ProdottoB	10-01-2024
2	Donata Bini	ProdottoC	11-01-2024
3	Fabrizio Moneta	ProdottoA	12-01-2024
3	Fabrizio Moneta	ProdottoD	12-01-2024

Now the products attribute is atomic and the primary key (for example, (order_id, product)) distinguishes each row without ambiguity.

Normalization - Second Normal Form (2NF)
#

Starting from 1NF, 2NF removes partial dependencies. When the primary key spans multiple columns, every non-key attribute must depend on the whole key rather than just part of it. If an attribute depends on a portion of a composite key, the table suffers from redundancy and update anomalies.

Imagine a fact table whose key is (product_id, sale_point_id) while keeping product and sale point metadata in the same row:

product_id	sale_point	product_name	region	qty	month
1	10	Caffe Arabica	Lombardia	300	01-2024
1	11	Caffe Arabica	Piemonte	250	01-2024
2	10	Te Verde	Lombardia	200	01-2024
2	11	Te Verde	Piemonte	180	01-2024

Here product_name depends only on product_id and region depends only on sale_point_id, so the non-key attributes do not require the full composite key. The cure is to separate the metadata: create a Products table keyed by product_id with the product name, and a SalePoints table keyed by sale_point_id with the region. The fact table then references those tables with foreign keys and retains only the measurable columns (qty, month). Splitting this way eliminates repetition and ensures each column depends on the determinant that defines 2NF.

Normalization - Third Normal Form (3NF)
#

3NF addresses transitive dependencies. If a non-key attribute depends on another non-key attribute rather than on the primary key, the table violates 3NF even if it already satisfies 2NF.

A Client table illustrates the issue:

client_id	client_name	city	region
1	Alessandro Ferrini	Prato	Toscana
2	Donata Bini	Milano	Lombardia
3	Fabrizio Moneta	Torino	Piemonte

If each city belongs to exactly one region, we have the functional chain client_id -> city -> region. To obey 3NF, move city and region into a separate Cities table where city is the key. The Client table then references Cities, and every non-key attribute depends solely on client_id, eliminating transitive redundancy.

Normalization - Boyce-Codd Normal Form (BCNF)
#

BCNF strengthens 3NF by requiring that for every non-trivial functional dependency X -> Y, the determinant X must be a candidate key in that table. BCNF prevents cases where a non-key attribute determines another column despite not being a candidate key itself.

Consider a university schedule where each classroom hosts exactly one professor:

professor	room	course
Prof. Bianchi	Aula Magna	DBMS-101
Prof. Bianchi	Aula Magna	DBMS-102
Prof. Rossi	Aula 2	PYTH-206

Here room -> professor, but room is not the table’s primary key (course might be). This violates BCNF because room determines professor without being a candidate key. The fix is to extract a Rooms table keyed by room that records the professor in charge, while Courses keeps (course, room) and refers to Rooms. Now every dependency resides in a table where the left-hand side is a candidate key, restoring BCNF.

Fourth Normal Form (4NF)
#

4NF prohibits a single key from expressing unrelated multivalued attributes. When a row links Instructor -> Languages and Instructor -> Subjects, and those lists are independent, every row repeats each language with each subject. 4NF splits each multivalued dependency into its own relation so languages and subjects no longer form a Cartesian product.

instructor	languages	subjects
Prof. Bianchi	Inglese	SQL
Prof. Bianchi	Inglese	Database Design
Prof. Bianchi	Francese	SQL
Prof. Rossi	Inglese	Data Mining
Prof. Rossi	Inglese	AI
Prof. Rossi	Spagnolo	Data Mining
Prof. Rossi	Spagnolo	AI

Split the table into dedicated relations for languages, subjects, and instructors. Each relation captures a single multivalued dependency, eliminating redundant pairings and making it possible to add a new language or subject without duplicating the other list.

Fifth Normal Form (5NF)
#

5NF (Projection-Join Normal Form) removes redundancy that appears only in ternary or higher relations. A table that records (Supplier, Product, Customer) belongs in 5NF when every valid triple can be reconstructed from joins of the binary projections: (Supplier, Product), (Product, Customer), and (Supplier, Customer). Working at this level makes join dependencies explicit and prevents the insertion of combinations that cannot be explained by simpler relations. Use 5NF in analytic schemas when the added integrity justifies the extra joins.

Trade-offs and deliberate denormalization
#

Adding 4NF and 5NF tables increases joins and can hurt latency, so many systems stop at BCNF or 3NF unless analytics demand the extra rigor. When joins do become too costly, denormalize intentionally: duplicate descriptive data, precompute aggregates, or maintain reporting tables that trade redundancy for speed. Always document the purpose of denormalization and automate its refresh so the schema remains predictable.

Carry these normalization checks into the modeling lab and the SQL Origins & Standards lesson so you understand how SQL syntax, planner rules, and index costs relate to the clean schema you just built.

Relational data models and the DBMS role#

Why normalization matters#

Insert anomalies#

Update anomalies#

Deletion anomalies#

Benefits and normal forms#

Key definitions#

Normalization - First Normal Form (1NF)#

Normalization - Second Normal Form (2NF)#

Normalization - Third Normal Form (3NF)#

Normalization - Boyce-Codd Normal Form (BCNF)#

Fourth Normal Form (4NF)#

Fifth Normal Form (5NF)#

Trade-offs and deliberate denormalization#