[DB] Basics of Functional Dependencies and Normalization for Relational Databases

SUbbb·2021년 11월 30일

Database

DataBase

목록 보기

12/15

Background

ER Diagram $\rarr$ Conceptual
SQL $\rarr$ Logical

Relational Database design의 가장 중요한 목표 2가지

Information preservation
Minimum redundancy

Relational Database Design

Bottom-up Design
- attributes에서부터 시작하여 entities -> relationships 순으로 database를 design
  $\Rightarrow$ ER design
Top-down Design
- relation의 attributes group으로부터 시작하여 점점 더 작은 relations로 분할
  $\Rightarrow$ Normal Forms를 사용한 Normalization of relations

INFORMAL GUIDELINES FOR RELATIONAL SCHEMAS

Informal Guidelines for Relation Schemas

Relation schema design의 quality 측정 지표

Attributes의 semantics(의미)이 알맞은지
tuples에서 redundant information 줄이기
tuples에서 NULL values 줄이기
illogical tuples 생성 방지
위의 가이드라인들은 반드시 독립적이진 않다.

G1) Impart Clear Semantics to Attributes in Relations

GUIDELINE 1: 암묵적으로, relation의 각 tuple은 하나의 entity 또는 relationship instance를 표현해야 한다.(각각의 relations와 그들의 attributes에 적용)

서로 다른 entities에 있는 attributes는 동일한 relation으로 합쳐져서는 안된다.
foreign keys만이 서로 다른 entities를 참조할 수 있어야 한다.
Entity와 relationship attributes는 가능한 한 분리되어야 한다. (이 2개는 엄연히 다른 특징이 존재)

Bottom Line:
관계별로 쉽게 설명이 가능한 schema를 design
attributes의 semantics는 해석이 쉬워야 한다.

G2) Minimize Redundant Information in Tuples and Update Anomalies

schema design의 한 가지 목표는,
base relations과 그에 해당하는 files에 사용되는 storage space를 최소화하는 것
information이 중복적으로 저장되었다면,
- storage를 낭비
- update anomalies(갱신 이상)을 일으킬 수 있다.
  - Modification (or update) anomalies (수정 이상)
  - Insertion anomalies (삽입 이상)
  - Deletion anomalies (삭제 이상)

Example of an MODIFICATION Anomaly

EMP., PROJ., WOR.를 NATURAL JOIN한 EMP_PROJ라는 relation이 있다고 가정
"Modification Anomaly"
- project number 10의 name을 "Computerization"에서 "Customer-Accounting"으로 변경하는 경우
  - project number 10에 종사하는 모든 employees를 update해야 한다. (일관성의 유지를 위해서)

Example of an INSERT Anomaly

"Insert Anomaly"
1) employee가 할당되지 않은 project 정보는 Insert 할 수 없다.
insert하게 되면 Ssn이 NULL이므로, Entity constraint 위반
2) project를 부여받지 않은 employee 정보는 Insert 할 수 없다.
insert하게 되면, Pnumber가 NULL이므로, Entity constraint 위반

Example of an DELETE Anomaly

"Delete Anomaly" $\Rightarrow$ 2개의 서로 다른 정보들이 함께 저장되어 발생하는 문제
1) project가 삭제되면, 해당 project에 참여하고 있는 모든 employees 정보 또한 삭제될 것이다.
2) 어떤 employee가 어떤 project에 참여하는 유일한 직원인 경우, 해당 employee 정보를 삭제하면 그 project 정보 또한 삭제되게 된다.

Another Example Suffering from the Update Anomalies: EMP_DEPT = EMP.*DEPT.

EMP_DEPT와 EMP_PROJ가 base relation으로서 저장될 수도 있다.
- 매번 join하는 것보다 성능상으로 더 효율적인 경우

GUIDELINE 2: insertion, deletion and modification anomalies로 고통받지 않도록 schema를 design해야 한다.

G3) Use NULL Values in Tuples Sparingly

GUIDELINE 3: relaitons의 attributes가 가능한 NULL values를 적게 가지도록 할당

없어서는 안될 NULL values가 많은 경우, 해당 relation의 PK와 함께 다른 relation으로 분리할 수 있다.
- 이후, join으로 접근 가능
NULL 값의 종류
- Not applicable(or valid)
- Unknown(may exist)
- Not avaliable(or absent)

Generation of Spurious Tuples: Poor Design

EMP_PROJ를 EMP_LOCS와 EMP_PROJ1로 분리
- EMP_LOCS * EMP_PROJ1 $\rarr$ 중복 발생, 이상한 tuples 발생 가능
Ssn='123456789'인 employee에 대해 두 relation을 NATURAL JOIN한 결과
- "Inconsistency"
위와 같은 Decomposing은 바람직하지 않다.
- 두 relation의 JOIN이 incorrect original information을 생성하기 때문이다.
- Plocation은 두 relation에서 PK도, FK도 아니다.
  - 이는 곧 적합한 Attribute를 PK나, FK로 지정해야 한다는 것을 의미한다.
[IMPORTANT] 따라서, join operation에 대한 meaningful results를 보장하기 위해
"lossless join" property (join시, 소실되는 정보가 없는 특성)가 사용된다. (Top-down Design에서 충족해야 하는 특성)

G4) Satisfy the Lossless Join Condition

GUIDELINE 4: lossless join condition을 만족하는 relation schemas를 design

즉, relation들은 적합한 (PK, FK) pair로 join되어야 한다.
- 그렇지 않으면, spurious tuples가 생성
어떠한 decomposed relations 간의 NATURAL JOIN도 spurious tuples를 생성해서는 안된다.

Relations decompositions의 두 가지 중요한 특성
1) 해당 join에서의 Non-addictive or lossless $\rarr$ 반드시 충족
2) Functional Dependencies의 보존 $\rarr$ 우선순위가 2등 정도, 타협이 가능

1) 특성은 매우 중요하고, 희생되거나 타협되어선 안된다.
2) 특성은 덜 엄격하고 Normalization(BCNF?)에 의해 희생되어질 수 있다.

FUNCTIONAL DEPENDENCIES

: DB의 두 attributes sets사이의 제약조건, 특성을 나타냄

relational database designs의 "goodness"에 대한 formal 측정 지표로 사용
- Keys는 relations의 normal forms를 정의하는 데 사용
$X$ 의 value가 $Y$ 의 "unique" value를 결정한다면,
"attribute $X$ 의 집합은 attribute $Y$ 의 집합을 functionally determines"
- $X \rarr Y$ 로 표현 ( $X$ , $Y$ 는 $R$ 의 subset)
- $r(R)$ 의 any two tuples $t_1, t_2$ 가 $t_1[X] = t_2[X]$ 이라면, $t_1[Y] = t_2[Y]$ 일 것이다. ( $X$ = key attribute)
- $Y$ 값들은 $X$ 값에 의존하게 된다.

Example of FD Constraints in COMPANY

Ssn determines employee name
$Ssn \rarr Ename$
Project number determines project name & location
$Pnumber \rarr \{Pname, Plocation\}$
Essn and project number determines hours per week
$\{Ssn, Pnumber\} \rarr \{Hours\}$

이러한 제약조건은 "모든" relation instance $r(R)$ 에서 지켜져야 한다.

$K$ 가 $R$ 의 Key라면, $K$ 는 $R$ 의 모든 attributes를 functionally determines
- Key에 대해서는 동일한 값을 갖는 두 개의 "distinct" tuple을 가질 수 없기 때문이다.
- $K$ 가 $R$ 의 candidate key라면, $K \rarr R$
$X \rarr Y$ 이라고 해도, 반드시 $Y \rarr X$ 인 것은 아니다.

Defining FDs from Relation Instances

주어진 relation instance에 대해, FD는 특정 attributes 사이에 존재할 수도 있다.
- in $CAR, \{State, Driver\_license\_number\} \rarr \{Ssn\}$
dependencies이 위반을 보이는 tuples로 인해 특정 FDs는 존재하지 않을 수도 있다.
- in $PROJECT, \{Plocation\} \rarr \{Pnumber\}$
- 동일한 $Plocation$ 에 여러 $Project$ 가 존재할 수도 있기 때문이다.

Ruling Out FDs

FD는 우리가 PK를 선언하는 것처럼 attribute의 semantics를 알고 있는 사람에 의해 명시적으로 정의되어야 한다.
- $Text \rarr Course$ FD는 존재할 수 있다.
  - $Text$ 를 알면 유일한 $Course$ 를 알 수 있다.
- 해당 FDs는 제거할 수 있다.
  - $Teacher \rarr Course,\,Teacher \rarr Text,\,Course \rarr Text$

What FDs May Exist?

FDs
- $B \rarr C,\,C \rarr B,\,\{A, B\} \rarr C,\,\{A, B\} \rarr D$

Digrammatic Notation for FDs

2개의 FD를 가진다.

NORMAL FORMS BASED ON PRIMARY KEYS

Normalization of Relations

Normalization (정규화)

좋지 못한 relations을 작은 relations로 분해하는 과정

minimizing redundancy & minimizing the update anomalies에 목적을 두고 있다.

Denormalization (비정규화)
정규화가 무조건적으로 좋은 것은 아니라는 것을 보여준다.
때때로는 성능상의 이유로 lower normal form에 있는 relation을 base relation으로 사용하기도 한다.
$\rarr$ 더 많은 중복과 갱신 이상의 위험이 존재한다.

Normal Form (NF)

key와 FD를 사용하여 relation의 schema가 특정 normal form에 존재하는지를 의미하는 relation의 condition

degree를 이용하여 정규화의 정도를 나타낼 수 있다.

NF의 level

1NF, 2NF, 3NF, and Boyce-Codd NF(BCNF) 정도로 오른쪽으로 갈수록 정규화의 정도가 엄격해진다. 그리고, 성능상의 이유로 더 높은 수준까지의 정규화는 수행하지 않는다.

First Normal Form (1NF)

가장 basic(flat)한 relational model

허용하지 않는 것

Composite attributes
Multi-valued attributes(MVAs)
Nested relations(NRs)
- 각 tuple의 attribute values가 non-atomic한 경우
- Relations within relations

대부분의 RDBMSs는 1NF에 있는 relation만 정의할 수 있다.

Normalization into 1NF for (1) and MVA

Dlocations Attribute가 MVA이므로 1NF를 만족하지 않는다.
Dnumber 는 유일한 Dlocations 를 식별할 수 있지도 않고, Dlocations 는 유일한 Dnumber 를 식별할 수 있지 않다.

이를 1NF에 두기 위해,

Dlocations 를 Key attribute로 지정하고 단일 value만을 가지게 한다.
하지만 이는 또다른 Key attribute인 Dnumber 의 중복을 발생시킨다.

Three Main Techniques for (1) and MVA

1) 1NF를 위반하는 Dlocations attribute 제거하고 또 다른 relation으로 이동

relation의 PK도 함께 가져옴
이는 다른 attribute의 중복을 최소화할 수 있으므로 best

2) Key attribute를 확장하여 DEPARTMENT 의 각 location에 대해 DEPARTMENT relation에 각각의 tuple들이 존재

$\{Dnumber, Dlocation\}$ : primary key로 지정
PK인 Dnumber 의 중복을 발생시키므로 X

3) 해당 attribute의 값이 m개라면, m개의 attributes를 생성하여 각 attribute가 1개의 value만 가지도록 지정

매우 많은 NULL 을 생성, Storage 낭비

Normalizaiton for (2) Nested Relations

EMP_PROJ relation에 Projs relation이 중첩된 상태
nested relations의 경우도 mutli-valued attributes의 경우와 동일하게 새로운 relation(ex. EMP_PROJ1 , EMP_PROJ2 처럼 PK를 가져와서 사용)을 생성하여 처리

Definitions: (Non) Prime Attributes

Key
: minimal superkey, SK, $t_1[SK] \neq t_2[SK]\,in\,\,r(R)$
Primary key
: candidate keys 중 임의의 하나
Secondary key
: primary key를 제외한 candidate key

Prime attribute
: candidate key의 member인 attribute
Nonprime attribute
: prime attribute가 아닌 attribute

prime & nonprime attribute?

Ssn, Pnumber는 prime, Hours는 nonprime

Second Normal Form(2NF)

full functional dependency에 기반

$X \rarr Y$ 는 $X$ 로부터 어떠한 attribute $A$ 를 제거하는 것이 dependency를 유지하지 못한다면 full functional dependency이다.

Full dependency: $(X-\{A\})$ 면, $X$ 가 functionally하게 $Y$ 를 결정하지 못한다.
Partial dependency: $X$ 에서 특정 attribute를 제거하더라도 dependency가 남아있는 경우

$R$ 에 있는 모든 nonprime attribute $A$ 는 $R$ 의 primary key에 fully functionally dependent해야 2NF를 만족한다.

Examples
- $\{Ssn, Pnumber\} \rarr \{Hours\}$ : Full FD
- $\{Ssn, Pnumber\} \rarr \{Ename\}$ : Partial FD(PK 중 Ssn 만이 functionally determine)

Normalizing into 2NF

EMP_PROJ relation에서 FD1는 Full FD이지만, FD2와 FD3는 Partial FD이다.
2NF Normalization을 통해, EP1, EP2, EP3로 나누었고 모두 Full FD를 만족한다.

Third Normal Form (3NF)

transitive functional dependency에 기반

$R$ 에 있는 FD $X \rarr Z$ 는 $X \rarr Y, Y \rarr Z$ 와 같이 두 FDs로 분해될 수 있는데, 경유지인 $Y$ 는 candidate key도, $R$ 의 어떠한 key의 하위 집합도 아니여야 한다. (즉, nonprime attribute)

prime -> non prime -> non prime $\rarr$ transitive functional dependency가 존재
근데 3NF는 이러한 transitive functional dependency는 없어야 한다. 여기서 Y가 non prime이 아니라 prime이어야 3NF 만족

relation $R$ 이 2NF를 만족하고, $R$ 의 어떠한 nonprime attribute가 $R$ 의 primary key에 transitively depedent하지 않아야 한다. 즉 $Y$ 가 nonprime attribute가 아니어야 한다.

Example
- $\{Ssn\} \rarr \{Dmgrssn\}$ : transitive FD
  - $\{Ssn\} \rarr \{Dnumber\}\, \& \, \{Dnumber\} \rarr \{Dmgrssn\}$ 에서 $Dnumber$ 는 nonprime attribute
- $\{Ssn\} \rarr \{Ename\}$ : non-transitive FD
  - $\{Ssn\} \rarr \{X\} \, \& \, \{X\} \rarr \{Ename\}$ 를 만족하는 $X$ 가 없음

"Normalization of 3NF"

$X$ 가 primary key이고, $X \rarr Y \& Y \rarr Z$ 일 때, $Y$ 가 candidate key가 아닐 때만 3NF를 만족하지 않는 것
- E.g., EMPLOYEE(Ssn $\rarr$ Eno $\rarr$ Salary) 는 3NF를 만족한다. (Eno가 candidate key이므로)

Normalizing into 3NF

EMP_DEPT 는 2NF를 만족하지만, nonprime attribute인 Dnumber 으로 인해 transitive FD가 발생해 3NF를 만족하지는 않는다.
ED1 와 ED2 라는 새로운 relation으로 정규화를 수행, 두 relation의 join으로 원래 relation 복원 가능

Normal Forms Defined Informally

1NF

모든 attribute는 key attribute에 의존 (Flat, key 일부 또는 전체에 대해 의존, FD 없음)

2NF

모든 attribute는 key 전체에 의존 (Fully FD, key 전체에 대해서 의존, 단, transitive property에 의해, 다른 nonprime attribute에 대해 의존성을 가질 수도 있다.)

3NF

모든 attribute는 오직 key attribute에만 의존 (No transitive FD, 모든 attribute가 key 외에 의존하는 것이 없음)

어떠한 FD의 left-hand side는 PK의 일부여야 한다.
어떠한 FD의 left-hand side에 nonkey(nonprime) attribute가 있다면 문제가 된다.

GENERAL DEFINITIONS OF SECOND AND THIRD NORMAL FORMS

General Definition of 2NF (for Multiple Candidate Keys)

relation $R$ 가 2NF 만족

$R$ 의 모든 non-prime attribute $A$ 가 $R$ 의 어떠한 key에 대해서 NOT partially(즉, fully functionally) dependent하다는 의미

FD1, FD2는 Full FD이고, FD3는 Partial FD, FD4는 nonprime $\rarr$ nonprime (FD4는 transitive property: prime -> non prime -> non prime)
FD3로 인해 해당 relation은 1NF는 만족하지만 2NF는 만족하지 못한다.

2NF Normalization을 통해, 2NF를 만족하도록 두 개의 relation으로 분리
하지만, nonprime attribute에 대해 transitive property를 가지는 FD4로 인해 3NF를 만족하지 못한다.

General Definition of 3NF (for Multiple Candidate Keys)

relation $R$ 가 3NF 만족

$R$ 의 nontrivial FD $X \rarr A$ 마다 (a) $X$ 가 $R$ 의 superkey, (b) $A$ 가 $R$ 의 prime attribute 를 만족해야 한다.

FD1, FD2는 (a)를 만족하지만 FD4는 (a), (b) 모두 만족하지 않아 해당 relation은 3NF를 만족하지 못한다.
3NF Normalization을 통해 두 relation으로 나누어, FD4는 (a)를 만족하게 된다.

Interpreting the General Definition of 3NF

3NF의 (a)는 두 type의 violations을 감지한다.

prime attribute가 functionally하게 non-prime attribute를 결정하는 경우
- 이는 partial functional dependencies로 인해 2NF 위반을 감지할 수 있다.
non-prime attribute가 functionally하게 non-prime attribute를 결정하는 경우
- 2NF로 인해 모든 non prime attribute는 기본적으로 prime attribute에 대해 Functionally determine된다.
- 이는 prime attribute $\rarr$ non prime attribute $\rarr$ non prime attribute로, transitive dependency로 인해 3NF 위반을 감지할 수 있다.

3NF의 Alternative definition

relation $R$ 은 $R$ 의 모든 non-prime attribute가 아래 두 경우를 모두 만족하는 경우 3NF를 만족한다.
- (a) $R$ 의 모든 (primary) key에 fully functionally dependent
- (b) $R$ 이 모든 (candidate) key에 대해 non-transitively dependent
어떤 relation이 3NF를 만족한다면, (a) 특성으로 인해 자동으로 2NF를 만족한다.
(b) 특성은 3NF는 만족하지만, Boyce-Codd Normal Form(BCNF)를 만족하지 못하도록 한다.
- BCNF는 candidate key에 의해 발생하는 transitive property 또한 없애기 위한 정규화

BOYCE-CODD NORMAL FORM

Example of BCNF Normalization

FD1, FD2는 2NF를 만족, FD5는 prime -> nonprime attribute -> prime attribute를 결정하므로, 3NF를 만족한다.
BCNF는 이렇게 중복을 발생시키는 transitive property(FD5)를 허용하지 않는다.
BCNF Normalization을 통해 두 relation으로 분해하고, Area 를 통해 join하여 원래의 relation으로 복원이 가능하다.
- 이러한 정규화를 통해 LOTS1A 에 생성되는 수많은 Country_name, Area에 대한 중복 정보를 줄일 수 있다.

BCNF (Boyce-Codd Normal Form)

relation $R$ 은 FD $X \rarr A$ 에 대해, $X$ 가 $R$ 의 super key인 경우만 BCNF를 만족한다. (3NF만족 조건 중에서 1가지가 제외)
*각각의 normal form은 이전 단계를 순차적으로 만족한다.

해당 relation은 3NF는 만족하지만 FD2로 인해 BCNF를 만족하지 못한다.

*relation design의 목표는 각 relation이 BCNF(or 3NF)를 만족하도록 하는 것이다.

Another Example of a Relation in 3NF But Not in BCNF

TEACH relation에는 2개의 FD가 존재
- $\{Student, Course\}, \{Instructor\}$
  - left side는 TEACH 의 candidate key (superkey)
  - right side는 non-prime attribute
  - 따라서 transitive FD가 존재하고, 3NF는 만족( $X\rarr Z$ 에서 $Z$ 가 prime attribute이므로)하지만, BCNF는 만족하지 못한다.

How to Achieve BCNF? Decompostion!

TEACH relation에 대해서 3가지 decompostion이 가능
D1) $R1$ (Student, Instructor) and $R2$ (Student, Course) - Student로 join
D2) $R1$ (Course, Instructor) and $R2$ (Course, Student) - Course로 join
D3) $R1$ (Instructor, Course) and $R2$ (Instructor, Student) - Instructor로 join

위 모든 FD들은 FD1을 잃게 된다. (FD1은 3개의 attribute가 필요)
- lossless property의 만족을 위해서 FD 보존은 희생되어질 수 있다.
D3) 가 NJB를 만족하기에 선택될 것이다.
- Nonaddictive Join Test for Binary Decompostion

NJB (Nonaddictive Join Test for Binary Decompostions)

$F$

FD1: $\{Student,\,Course\} \rarr \{Instructor\}$
FD2: $\{Instructor\} \rarr \{Course\}$

FD (( $R1 \cap R2$ ) $\rarr$ ( $R1 - R2$ )) in $F^+$ or
FD (( $R1 \cap R2$ ) $\rarr$ ( $R2 - R1$ )) in $F^+$ ( $F^+$ 는 모든 FD 포함)

D1) $R1$ (Student, Instructor) and $R2$ (Student, Course)

$Student \rarr Instructor$ and $Student \rarr Course$ 가 false(둘다 $F^+$ 에 존재하지 않는 FD)

D2) $R1$ (Course, Instructor) and $R2$ (Course, Student)

$Course \rarr Instructor$ and $Course \rarr Student$ 가 false(둘다 $F^+$ 에 존재하지 않는 FD)

D3) $R1$ (Instructor, Course) and $R2$ (Instructor, Student)

$Instructor \rarr Course$ (=FD2) $\sube F^+$ 임을 추론 가능
NJB property 만족, D3로 decomposition

BCNF Decomposition Algorithm

relation $R$ 이 BCNF를 만족하지 않는다고 가정, $X \sube R$
FD $X \rarr A$ 가 BCNF의 위반을 발생
relation $R$ 은 아래의 두 relations으로 분해 가능
- (i) $R - A$ , (ii) $XA (= X\cup A)$
분해한 (i) 나 (ii)가 BCNF를 만족하지 않는 경우, 만족하지 않는 relation에 대해 만족할 때까지 반복

SUbbb

배우고 정리하고 공유하기

이전 포스트

[DB] Web Database Programming Techniques Using Tomcat + JSP

다음 포스트

[DB] Disk Storage, Basic File Structures

1개의 댓글

Blue

2023년 12월 12일

감사링

답글 달기

[DB] Basics of Functional Dependencies and Normalization for Relational Databases

DataBase

Background

INFORMAL GUIDELINES FOR RELATIONAL SCHEMAS

Informal Guidelines for Relation Schemas

G1) Impart Clear Semantics to Attributes in Relations

G2) Minimize Redundant Information in Tuples and Update Anomalies

Example of an MODIFICATION Anomaly

Example of an INSERT Anomaly

Example of an DELETE Anomaly

Another Example Suffering from the Update Anomalies: EMP_DEPT = EMP.*DEPT.

G3) Use NULL Values in Tuples Sparingly

Generation of Spurious Tuples: Poor Design

G4) Satisfy the Lossless Join Condition

FUNCTIONAL DEPENDENCIES

FUNCTIONAL DEPENDENCIES

Example of FD Constraints in COMPANY

Defining FDs from Relation Instances

Ruling Out FDs

What FDs May Exist?

Digrammatic Notation for FDs

NORMAL FORMS BASED ON PRIMARY KEYS

Normalization of Relations

Normalization (정규화)

Normal Form (NF)

NF의 level

First Normal Form (1NF)

Normalization into 1NF for (1) and MVA

Three Main Techniques for (1) and MVA

Normalizaiton for (2) Nested Relations

Definitions: (Non) Prime Attributes

Second Normal Form(2NF)

Normalizing into 2NF

Third Normal Form (3NF)

Normalizing into 3NF

Normal Forms Defined Informally

1NF

2NF

3NF

GENERAL DEFINITIONS OF SECOND AND THIRD NORMAL FORMS

General Definition of 2NF (for Multiple Candidate Keys)

General Definition of 3NF (for Multiple Candidate Keys)

Interpreting the General Definition of 3NF

3NF의 Alternative definition

BOYCE-CODD NORMAL FORM

Example of BCNF Normalization

BCNF (Boyce-Codd Normal Form)

Another Example of a Relation in 3NF But Not in BCNF

How to Achieve BCNF? Decompostion!

NJB (Nonaddictive Join Test for Binary Decompostions)

BCNF Decomposition Algorithm

[DB] Web Database Programming Techniques Using Tomcat + JSP

[DB] Disk Storage, Basic File Structures

1개의 댓글

관련 채용 정보