Reading notes for Code Fellows!
The following notes were distilled from Kris Wenzel’s article “Database Normalization (Explained in Simple English)”.
Database Normalization is the process of organizing a database into tables and columns. This process is based on the concept that each table should pertain to a specific topic and should only include data that supports that topic. Limiting a table to a single purpose, reduces the amount of duplicated data within our database. In turn, this helps to eliminate some of the issues which arise from database modification.
There are established rules to accomplish the previously mentioned objectives. As we implement these rules, new labes will be formed. There are three successive forms of normalization that most databases tend to adhere to. As they progress through these tiers, databases become less prone to modification anomalies as they become more tightly focused on specific topics or purposes.
Databases should be normalized for tbree reasons:
Consider the following table:
SalesStaff
EmployeeID | SalesPerson | SalesOffice | OfficeNumber | Customer1 | Customer2 | Customer3 |
---|---|---|---|---|---|---|
1003 | Mary Smith | Chicago | 312-555-1212 | Ford | GM | - |
1004 | John Hunt | New York | 212-555-1212 | Dell | HP | Apple |
1005 | Martin Hap | Chicago | 312-555-1212 | Boeing | - | - |
EmployeeID is the primary key column.
This table Serves many purposes as opposed to a single one:
Because this table serves many purposes, it is prone to data duplication, data update issues and it increses the effort needed to query it.
The table lists both the SalesOffice and the OfficeNumber for each SalesPerson. It is duplicaed information and presents multiple problems:
If the location of an office were to change, all of the entries on this table would have to be updated. This is known as a modification anomaly. There are three types of modification anomalies which can occur:
Update Anomaly - Having the same data in several rows means that if an element changes multiples updates with have to be made. If they aren’t all changed to reflect the new data, they can lead to inconsistencies.
EmployeeID | SalesPerson | SalesOffice | OfficeNumber | Customer1 | Customer2 | Customer3 |
---|---|---|---|---|---|---|
1003 | Mary Smith | Chicago | 312-555-1212 | Ford | GM | - |
1005 | Martin Hap | Chicago | 312-555-1212 | Boeing | - | - |
The final issue with this table is that it makes queries to find data such as specific customers more difficule by requiring a longer or more convoluted query. If all of the customer data were in one column of the table it would make such a query simpler. This table would also make sorting by customer challenging, requiring three separate UNION
queries to complete.
These issues can be mitigated by separating the information an putting it into multiple tables, each serving a more singular purpose.
There are 3 progressive forms of database normaliztion, each successive form building upon the previous form:
First Normal Form - Information is stored in a realtion table. Each column contains an atomis value and there are no repeating groups of columns
Second Normal Form - All of the tables depend on its primary key
Third Normal Form - None of the table’s columns are transitively dependent on the primary key.