What is database normalization?

This article explains what database normalization is, why it is needed, and what kinds of normalization exist. For a better understanding of the relationship between tables in a normalized database, we will provide practical examples.

When creating a database, there are some rules to keep in mind. Based on the above, you can give the following wording: database normalization is the process of organizing data in a certain way and design guidelines. That is, tables and the links between them (relationships) are created in accordance with the rules. As a result, the desired level of data security is ensured, and the database itself becomes more flexible. Inconsistent dependencies and redundancy are also eliminated.

Pros of

Normalization is not mandatory, but it brings the following advantages

  • The sampling process is simplified. It is about simplifying the work of making queries, that is, the user will be able to get the necessary information with relatively simple queries;
  • The integrity of the data is ensured. We can talk about minimizing the distortion of information and reducing the likelihood of data loss;
  • Scalability is improved. If the normalization rules are observed, the favorable prerequisites for database growth are formed;
  • There is no data redundancy. Redundancy is a well-known problem of unproductive use of free space on the hard disc that hinders database maintenance. In some cases, this problem is exacerbated by the fact that if altering the records of the same type of data stored in several places (tables) is necessary, the user will have to make the required alterations everywhere, which is a very time-consuming task. It’s much easier to make it so that, for example, the data about cities are stored only in the Cities table and nowhere else. To summarize, redundancy implies duplication of data, which not only complicates the database but also increases its size;
  • Absence of inconsistent dependencies. Inconsistent dependencies make it difficult to access data because the path to such information can be wrong and illogical. In the same table Cities it is logical to look for cities, the number of inhabitants, etc., but not addresses and names of inhabitants - for this information you already need another table - Citizens.

Normalization is the process of separating data into separate related tables. Normalization eliminates data redundancy and thus avoids data integrity violations when changing data, i.e. avoids update anomaly.

As a rule, normalization is mainly used in the bottom-up approach of database design, that is, when all the attributes to be stored in the database are grouped by entities, for which the tables are then created. However, in the top-down approach, where entities are identified first, and then their attributes and relationships between them, normalization can also be used, for example, to verify the correctness of designed tables.

In non-normalized form, a table can store information about two or more entities. It may also contain repeating columns. Columns can also store repeating values. In normalized form, however, each table stores information about only one entity.

Normalization involves applying normal forms to the data structure. There are 7 normal forms. Each normal form (with the exception of the first one) implies that the previous normal form has already been applied to the data. For example, before the third normal form can be applied to the data, the second normal form must be applied to the data. And strictly speaking, a database is considered normalized if the third normal form or higher is applied to it.

What is database normalization?

  1. The first normal form (1NF) assumes that the stored data at the intersection of rows and columns must represent a scalar value, and tables must not contain repeating rows.
  2. The second normal form (2NF) assumes that each column that is not a key must depend on a primary key.
  3. The third normal form (3NF) assumes that each non-key column must depend only on the primary key. The Boyce-Codd Normal Form (BCNF) is a slightly stricter version of the third normal form.
  4. The fourth normal form (4NF) is used to eliminate multivalued dependencies - dependencies where a column with a primary key has a one-to-many relationship with a column that is not a key. This normal form eliminates incorrect many-to-many relationships.
  5. The fifth normal form (5NF) splits tables into smaller tables to eliminate data redundancy. The splitting goes on until the original table can be reconstructed by combining small tables.
  6. The sixth key normal form (domain key normal form / 6NF). Each constraint in the relationships between tables must depend only on key constraints and domain constraints, where domain represents the set of valid values for a column. This form prevents the addition of invalid data by setting the constraint at the relationship level between tables, but not at the table or column level. This form is generally not applicable at the RDBMS level, including SQL Server.

Functional dependency

A key concept of normalization is functional dependency. Functional dependency describes the relationship between attributes of a relationship. For example, if attribute B is functionally dependent on attribute A (A → B), then each value of attribute A is associated with only one value of attribute B. And attributes A and B can consist of one or more attributes. That is, if two rows have the same attribute A value, then they necessarily have the same attribute B value. However, for one value of attribute B, there can be several different values of attribute A. Attribute A in this dependency is also called a determinant.

For example, take the following table, which represents university courses:

  • Course Teacher Position
  • Math Smith Professor
  • Algorithms Adams Assistant
  • JavaScript Adams Assistant.

Here, the Teacher attribute is functionally dependent on the Course attribute (Course → Teacher). That is, knowing the name of the course, we can identify its teacher. And in this case we can say that there is a 1:1 relation between Course and Teacher attributes and a 1:N relation between Teacher and Course because there are several courses that one teacher can teach. The Course attribute is functionally independent of the Teacher attribute.

In addition, there are two other functional dependencies that can be traced here. In particular, the attribute Position depends on the attribute Teacher (Teacher → Position). Knowing the name of the teacher we can determine the position of the teacher.

And also the attribute Position is functionally dependent on the attribute Course - knowing the name of the course we can tell the position of the teacher.

In a table in a normalized database the only determinant should be the attribute which is the primary key. And all other attributes should functionally depend on the primary key.

For example, in this case, we can take the course title as the primary key, given that courses can only have unique titles. But the teacher’s position in this case will depend on two attributes at once - Course and Teacher. And such dependencies may indicate that the database and specifically the course table has flaws in the design and may be a source of update anomalies.

How to perform normalization?

To bring the database back to normal form, you need to:

  1. Combine the existing data into groups.
  2. Figure out the logical relationships between the groups. To ensure that the links are correct, the linked fields must be of the same type.

If the table is not normalized, it may store information about several entities and include repeating columns, and they, in turn, may store duplicate values. If normalized, however, each table stores information about only one entity.

When normalized, normal forms are supposed to be used with respect to the structure of the available data. There are several normalization rules. Each of them is called a “normal form” (NF). Each such form, except the first, assumes that the previous normal form has already been applied to the data. If the first rule is satisfied, the database is represented in the first normal form (1NF); if three rules are satisfied, the database is represented in the third normal form (3NF).

There are seven such forms (levels), but in practice for most applications normalizing the database up to the third normal form is enough (strictly speaking, the database will be considered normalized when a 3NF or higher is applied to it).

Yes, it is not always feasible to provide full compliance with the rules and specifications because you will have to create additional tables for normalization and it is not always acceptable or not accepted by the clients. But if rules have to be broken, it should be understood that all related problems, including uncoordinated dependencies and redundancy, will be taken into account, and that this is acceptable for the application and will not break its functionality.