Notes: Database Phase II


Effective and Efficient Data and Databases

Congratulations! You survived the initial database segment!  Before we dive into the details, let's review how this topic and the first topic fit together.  In the last topic we introduced the concept of entities and attributes, which are simply ways of categorizing and organizing data.  You learned some of the important properties of entities and attributes, and you learned about grouping entities into entity sets.  Entity sets are tables in MS-Access.  All of the attributes in an entity have common information and are called fields (or columns) in an MS-Access table. Finally, you learned different ways tables are related using foreign and primary keys.  The determination of what is an entity, what attributes are important, how attributes are grouped into entities, and how entities are related is made by the database designer. 

SDLC Phase 1 - Analyze

Okay, let’s apply this phase to the PIM Project. First, we need to write one or more paragraphs that describe the purpose of the database, i.e. what do you expect to accomplish by building the database?  This creates the general scope of the project and sets the parameters as to what to expect when the project is done. This paragraph should include: 1) How long the project might take; 2) What the expected cost would be for the project; and 3) A reason the Steering committee (the part of a business that would approve the project) should allow this project to go forward. (A sample reason for the class project would be to enhance your knowledge of database construction and you won’t get the grade you desire if you don’t do this)

The first phase in the System Development Life Cycle (SLDC) process is to analyze all the aspects of the project. To do this, ‘Do you home work!’ Find examples of all the data that you will need for your database. This will include input information (phone call collection of data, other databases, screens used to collect data, the mail, etc.) and output information (this could include reports, statements and invoices, screens used to review data, output to other databases, etc.). You need to spend time with everyone who will use, input, receive reports, and maintain the database. Make a list of all the fields you find during this phase. A thorough list would include asking people involved with the final project very direct questions. Questions might be: 1) Will we ever need a 9 digit zip code, or will a 5 digit code work? 2) Will phone numbers need area codes or maybe international codes?  Or 3) Are middle initials important?

The four steps of database design are involved and none of them involve working with any database management software such as MS-Access.  In fact, many designers complete the first four steps without using a computer for anything other than word processing or email.  The point is that these first steps need to be done before typing anything into MS-Access.  You don't start designing a database by starting to create tables, you start designing a database by interviewing people, identifying the problem, figuring out the business rules, drafting the ERD, and then normalizing the entities.  Then you can fire up Access and start creating tables and relationships. The four steps are:

In database design, there is no particular order to accomplish the four major steps; in fact, a good database designer will circulate through the four steps many times before finishing the project. We might create a few rules during the ‘Analyze’ phase, and come up with even more rules as we start developing the other steps in the ‘Design’ phase.  

So, first, we need to create a list of attributes (fields) that we will need for the project. As we create this list, we need to start the first of four necessary steps for effective database design.

  1. Business Rules – Each user community must supply a list of how the data is to be used. These rules will aide the designer in creating the final product.

Now I need to make a few points about the database design process.  Databases are designed for a particular purpose, usually to support some operational function of a business.  The rules of the business determine how the database is designed and operated, and are the ultimate arbiter of design decisions.  You should note that business rules are unique for every business and every situation. 

The first three business rules of any database are often unstated:
          Business rule #1: The database has to perform fast enough to keep the sponsor with the money happy
          Business rule #2: The database can't be too expensive
          Business rule #3: You can't take too long to complete the project

The result of these pesky, real-world considerations is that real-world databases are compromises.  All designers strive to make "pure" databases that follow all of the "correct" design principles, and all designers end up breaking one or more of these design principles for pragmatic considerations.  Choosing the correct compromises is more art than science, and is one of the reasons good database designers make very good wages.  There are some general guidelines to ensure that the compromises made in your database projects are reasonable:

Know the environment.  This means that you should know as much as possible about the problem you are trying to solve, and what makes a "good" or "bad" solution

Understand the rules of the situation.  This means that you need to make sure that you have a good understanding of how the data is used, what transactions are going to be implemented, and the ultimate purpose of building the database.

Understand the constraints.  This means understanding performance, cost, and functionality limitations for your project.  How much money can be spent?  How fast do the transactions need to perform?  What are the "gotcha's" in this project?
Be flexible where possible - you never know how things might change

Two of the major design rules are that databases should be both effective and efficientEffective means that the problem is solved, i.e. the correct tables and relationships have been built, and the tables contain the correct fields and data.  Efficient means that the database is concise and easily processed, i.e. there are no extraneous data or relationships, and processing is minimal.

Effectiveness and efficiency are measures of database quality.  Either may be compromised by external constraints (welcome to the real world).  There are other measures of quality that may be applied (or compromised) as well.  These include

Flexibility - the ability to adapt the database to a new purpose with a minimum amount of changes
Comprehensiveness - the ability to generate new information from existing data
Transportability - the ability to move the database from one machine to another machine

The essential point to remember is that someone is paying to have a database built, and that someone may have unique and unanticipated quality criteria or design guidelines for the finished database.  The designer needs to make sure all important criteria are understood before creating the ERD.

Efficiency and effectiveness are the design guidelines important in this course.  There are many ways to degrade these qualities, and the three most common are data dependencies, data redundancies and excessive nulls.  Data redundancy exists when the same data is duplicated in more than one table in the database.  Data dependence exists when the value of one attribute depends on the value of another, i.e. when you know the value of one then you know the value of the other.  Nulls leave voids as to whether data exists. None of these are bad per se, but their existence may indicate a poor database design.

Following is a list of attributes I came up with when analyzing the project.

Last Name

 

Company Name

 

First Name

 

Job Title

 

Spouse Name

 

Work Address

 

Child Name

 

Work City

 

Date of Birth

 

Work State

 

Home Address

 

Work Zip

 

Home City

 

Work Country

 

Home State

 

Work Main Phone

 

Home Zip

 

Personal Work Phone

 

Home Country

 

Work E-Mail

 

Home Phone

 

Notes

 

Cell Phone

 

Dif Surname

 

Personal E-Mail

 

Category

 

 

As a beginning database designer, it is helpful to add some data so we can see how the data is working. Look at the spreadsheet created (PIM(1)) for this project to see how the attributes suggested might look in a ‘Flat’ file (in this case a spreadsheet). For this project, we will only use Simple, Single Valued attributes. The spreadsheet is for the Sutton family. It includes relatives, neighbors, coworkers, and professionals you might use (eye doctor). At first look, this project seems like it would be very easy. However, quickly we start seeing complexities that show why flat files are cumbersome. Here are some questions that arise:

Many people have the same last name. How can we separate these names in order to create a Primary key? We will establish our next Business Rule:

          Business Rule #4: A family is everyone who lives at the same address.

Other questions surface that will need to be handled throughout the design process. They might include:
How many children would the biggest family have?
How can we list these names and still put birthdays, cell phones and E-Mail addresses attached to each person. The biggest problem here is if I choose a number, like the maximum amount of children is 3, then Murphy’s law will tell us that soon we will have a family with 4 kids. How can we plan ahead?
What if someone is working two jobs? How do we deal with this?
What if the spouse also has a job? Where do I put this information?
What if the family is a broken family where kids live at two different houses?
What if there are different last names living at the same address? (Wife retained her maiden name, kids from previous marriages, roommates) We will create a new field called: ‘DIF SURNAME’.
How do we separate our information into groups of family, friends, etc.? We will set up a ‘CATEGORY’ field to depict the relationship for each person. This field will include either: Family, Friend, Business, or Professional.

SDLC Phase 2 - Design

Now that we have some general components of our database, we need to rework the data and fields so the DBMS can run more efficient and effective. During the design stage we will break the fields down to multiple tables. This decomposition of the fields will seem a little strange at first, however, it should be noted that later in the process we will write queries (will explain later) and create forms (again, will be discussed later) to reconstruct the data for a more pleasing result.

After looking at the PIM(1) spreadsheet we see our fields have multiplied to 59 fields. This is way too cumbersome to work with, so the seasoned database designer needs to break this ‘Table’ down to more usable entities. This decomposition of the data moves us to the second element of creating effective databases:

  1. Normalization – Here we analyze each of the attributes within the entity to search out data redundancy problems, excess nulls, and general problems that have surfaced. Normalization is one of the more intricate steps in designing a database. There are many rules that designers use to find easier ways to handle the data. The scope of this course will not get too deep into these practices. However, we will do some basic work to make our database more useful.

A question that arises in database design is the issue of which attributes should be assigned to which entity.  This is really the question of which fields are included in a particular table.  For example, if your task is to keep track of a list of friends, should all of the information be in one table?  Should friend’s names be stored in one table, addresses in a second table, and miscellaneous information in a third table?  The concepts of data normalization are used to help decide this type of question.  Data normalization does not really tell you what attributes should be in a table, but it does indicate certain combinations of attributes that should not be in the same table. In other words, data normalization is a process that tells you when you should break one large table into several smaller tables.

The normalization process occurs often when designing a database.  Normalization may result in additional tables in the database, which requires a revision to the ERD.  The revisions to the ERD may or may not require additional normalization steps.  The loop may continue indefinitely, although in practice the designer moves on when the design in “good enough” rather than “perfect”.

The first step we need to approach in the PIM(1) spreadsheet will be to give each person their own row (record). Having all persons in the family in one row is what is creating many of the fields we need to reduce. Next, we will give each person in the family a ‘ROLE’. The ‘ROLE’ field will depict what role each person plays in the household. We will give the person we most likely want to talk to the role of “Main Contact’. After that the spouse will be ‘Spouse’ and each kid will have the roll of ‘Child’. By doing this, we don’t have to guess how many kids a family will have. We will only create a new row/record if a person exists. (See the second PIM(2) spreadsheet for these changes). This also allows us to let all people, including kids, the possibility of having a job, their own date of birth, cell phone and E-Mail address.

Now, as we look at the PIM(2) spreadsheet, we want to look for redundancy. There are many areas where we are required to reenter the same information over and over again. One of these problem areas happens when people work for the same company. Notice that every time someone works at Nike, the work address, work city, etc are always the same. So we look for the field that starts this redundancy. In this case it is the “COMPANY NAME’ field. Every time the company name is Nike, the rest of the work fields are the same (they are dependent). Look also at the Portland State Univ. address entries. The same redundancy is happening. So we are going to create a new table we will call it ‘COMPANY’.  By creating a new table, we will only need to enter company information once. Then, whenever a person works for a company, we point the person to the company table, and the rest of the information is already entered there.

The next area of redundancy happens at the home address information. Notice, everyone who lives at a particular address will always have the same address, city, state, etc. It is troublesome to have to reenter this information for every person that lives at a particular address. On the other hand, the personal information like date of birth, personal cell phone and personal E-Mail are matched to the ‘FIRST NAME’ field and are not duplicated. Here, we will break this information into two tables. The first one we will call ‘FAMILY’ (all fields that pertain to whole family) and the second table we will call ‘PERSON’ (all fields that pertains specifically to the person).

So now we have three tables. A FAMILY table, a PERSON table, and a COMPANY table. This takes us to the next step in database design, to build an ERD, using the business rules as a guide.  :

  1.  ERD (Entity Relationship Diagram) – this is a graphical depiction of how the tables and relationships will work together.

There are few hard and fast rules to convert business rules to ERDs, but here are some general rules of thumbs (but beware that they don't always hold true):

For any non-trivial database, one question that quickly arises is the question of how to keep track of all of this information during the design process.  The ERD is designed to record the decisions of the database designer in an easy to read and compact format.  The ERD is a blueprint of the overall database, and shows each table in the database and all of the relationships between all of the tables in the database.

ERD’s are simply pictures that represent the logical structure of a database, i.e. the tables and their relationships.  ERD’s do not contain information about the attributes in each table, other than indicating which tables have primary keys and which have foreign keys.  There are several different and competing standards for creating ER diagrams.  This course will use the conventions typically called Crow’s feet. However, you should be aware that other authors (and other designers) have developed ER diagramming techniques using different symbols and conventions.  The underlying concepts are identical, but the details of expression differ.

There are five basic symbols used in building ER diagrams for this class.  These symbols identify entities, weak entities, composite entities, relationships, and whether the relationship is mandatory or optional.  Labels on the ERD identify the Connectivity (type of relationship: 1:1 or 1:M) between tables.  It also depicts the Cardinality (the minimum and maximum number of connections between tables: generally the minimum is either 0 or 1, and the maximum is either 1 or many). That is all the information contained in an ERD - the tables, the relationships between the tables, and what type of relationship exists.  This information constitutes the logical structure of a relational database.


Entity Relationship Notation

 

Entity relationship diagramming is all about identifying entities and their relationships and then drawing a diagram that accurately depicts the system. This applies equally to the design of a new system or the analysis of an existing one. The end result of entity relationship diagramming should be a clear picture of how information is stored and related within a proposed, or existing, system.

Entity relationship diagramming uses a standard set of symbols to represent each of these defined data groups and then proceeds by establishing the relationships between them. The first of these symbols is the entity. Generally we use a rectangle.

The other main component on a data model is the relationship line. A Relationship is an association between two entities to which all of the occurrences of those entities must conform. The relationship is represented by a line that joins the two entities, to which it refers. This line represents two reciprocal relationships: That of the first entity with respect to the second, and that of the second entity with respect to the first. Many ERD diagrams use diamonds to represent these relationships along with lines.

Note that entities are always named in the singular; for example: COMPANY, FAMILY, and PERSON, and not companies, families, and people. A common method is to use nouns. Entity names are normally single words and the name chosen should be one familiar to the users. The entity name can include a qualifier in order to clarify their meaning. However, if different names are currently used to describe a given entity in different areas of the organization then a new one should be chosen that is original, unique and meaningful to all of the users.

The relationship is the association between two entities to which all of the occurrences of those entities must conform. The entities on data models are linked by relationship lines. Every relationship line shows two reciprocal relationships: That of the first entity with respect to the second and that of the second entity with respect to the first.

There are three distinct properties of the relationship; firstly the relationship link phrase, secondly the degree or cardinality of the relationship and thirdly the participation or optionality of the relationship. These three properties are collectively termed the relationship statement.

Relationship link phrase

The first property of the relationship statement is the relationship link phrase. This should be a short description of the nature of the relationship, typically between three and five words long. It is always read clockwise with respect to the entities that it links. An example would be: Families include persons.

Identifying Relationships

The second property of the relationship statement is the degree, or maximum cardinality, of the relationship. If an entity has a crowsfoot symbol drawn against it, then many occurrences of that entity may relate to the other entity. Conversely if no crowsfoot is drawn against it, at most one occurrence of that entity may relate to the other entity.

There are just two questions that need to be asked, in order to establish the degree of the relationship that exists between any two entities.

In order to identify the degree of the relationship between the entities X and Y the following two questions need to be asked.

Question 1
Can an occurrence of X to be associated with more than one occurrence of Y?

Question 2
Can an occurrence of Y to be associated with more than one occurrence of X?

Each of these questions can be answered 'Yes' or 'No' and both questions must be answered. This means that there are four possible outcomes as shown in the table.
                         data modeling - decision grid

The nature of the relationship associated with each outcome is as follows:

Option 1, Question1 equals Yes, Question2 equals No.
In this case a one-to-many relationship has been identified, represented by the relationship line shown.

Option 2, Question1 equals No, Question2 equals Yes
As in the first case a one-to-many relationship has been identified, represented by the relationship line shown.

Option 3, Question1 equals Yes, Question2 equals Yes
In this case a many-to-many relationship has been identified.

Many-to-many relationships may be shown in the early 'preliminary' data model in order to aid the clarity of these early diagrams. However, such relationships are invalid and are resolved using two steps. We will discuss this later.

Option 4, Question1 equals No, Question2 equals No
In this case a one-to-one link has been identified.

Legitimate one-to-one relationships are rare and it is likely that this relationship is one that needs to be rationalized. The methods used to investigate one-to-one relationships and to re-model them where necessary are explained later in the course.

In a one-to-many relationship the entity at the 'one' end is normally referred to as the master, and the entity at the 'many' end referred to as the detail entity.

                  DBphase2image4

In the first two tables in the example: Each FAMILY has one or more PERSON(s), but Each PERSON can only belong to one FAMILY. This is called a one-to-many relationship. In the second two tables, we say that one PERSON can only work at one company, but a COMPANY may have several PERSON(s). Again, a one to many relationship. (This is explained in more detail below)

Optionality and Cardinality

The third and final property of the relationship statement is the participation or optionality. We use the circle to represent optionality. If there is a mandatory relationship (Like between FAMILY and PERSON, no circle is used, but when a PERSON does not have to work for a company, this, then is optional). A solid line shows that an entity occurrence must be associated with each occurrence of the other entity and a line with a circle on it depicts optionality.

Symbols at the ends of the relationship lines indicate the optionality and the cardinality of each relationship.  “Optionality” expresses whether the relationship is optional or mandatory. “Cardinality” expresses the maximum number of relationships.

As a relationship line is followed from an entity to another, near the related entity two symbols will appear.  The first of those is the optionality indicator.  A circle ( ™ ) indicates that the relationship is optional—the minimum number of relationships between each instance of the first entity and instances of the related entity is zero.  One can think of the circle as a zero, or a letter O for “optional.”  A stroke ( | ) indicates that the relationship is mandatory—the minimum number of relationships between each instance of the first entity and instances of the related entity is one.

The second symbol indicates cardinality.  A stroke ( | ) indicates that the maximum number of relationships is one.  A “crows-foot”  (DBphase2image6 ) indicates that many such relationships between instances of the related entities might exist.
 
The following diagram indicates all of the possible combinations:

DBphase2image8

 

So, applying these ideas to the PIM database, we start with setting three rectangles depicting the three tables (see diagram above).

First, we will look at the relationship between the FAMILY table and the PERSON table. We need to ask the questions:

Question 1
Can the FAMILY table be associated with more than one occurrence of PERSON Table? (Can a family have more than one person?) The answer is yes.

Question 2
Can the PERSON table be associated with more than one occurrence in the FAMILY table? (Can a person belong to more than one family? The answer is also yes. If there is a broken family where the kids are living at both parents home t different times)

This creates an undesirable situation of a Many to Many (M:N) relationship. One technique for resolving a M:N relationship is to create another business rule. This is what we will do here. Remember it is OUR database and we can create whatever rules fit our needs.

          Business Rule #5: People can ONLY be associated with one family.

This resolves our problem and turns ‘Question 2’ above to ‘No” making the relationship One to Many (1:M).

Next we ask what the minimum amount is required between the two tables. The question would be: Can a family exist that does not have a Person?  Our answers should be no (there would not be a reason to have a family if there were no persons). This creates a ‘Mandatory’ relationship and will be reinforced by a new business rule:

          Business Rule #6: No family can be entered into the database that does not have at least one person connected to it.

Next, let’s look at the relationship between the PERSON table and the COMPANY table. Again, we need to ask the following questions:

Question 1
Can the PERSON table be associated with more than one occurrence in the COMPANY Table? (Can a person hold more than one job?) The answer is yes.

Question 2
Can the COMPANY table be associated with more than one occurrence in the PERSON table? (Can a company have more than one employee?) The answer is yes.

Again, this creates an undesirable M:N relationship. If we follow the resolution from above, we can take control of our database and write another business rule. It might read like this:

          Business Rule #7: A person cannot hold more than one job.

This resolves our problem and turns ‘Question 2’ above to ‘No” making the relationship One to Many (1:M).

Next we ask what the minimum amount is required between the two tables. The question would be: Can a person exist that does work for a company?  Our answers should be yes (some people don’t work, particularly kids). This creates an ‘Optional’ relationship and will be reinforced by a new business rule:

          Business Rule #8: A person does not have to have a job.

(In this case, we created a business rule to resolve a M:N problem. Actually, a better solution would be to create a ‘Composite Table’. However, composite tables are beyond the scope of this course).

Now, if we refer back to the ERD diagram above, we see the tables of our database, the relationship lines, and the mandatory/optional lines are shown and understood.

  1.  Relational schema - that show the PK and FK fields and a breakdown of what attributes will be contained in the entity.

The last step is to create relational schema.  A relational schema is composed of the actual tables, attributes and relationships that implement the design reflected by the ERD.  ERDs are pictures on a piece of paper (or maybe in a word-processor file or Visio file), while the relational schema is a collection of tables, fields, and relationships that will be stored in an Access file.
 
The creation of the relational schema is often an epiphany for the database designer, because the database sometimes doesn't work as designed.  Actually, it works as designed, but the designer didn't consider all of the relevant factors.  A beautiful and elegant ERD that perfectly reflects the business rules may result in a relational schema that is unworkable due to speed, expense, software limitations, or some other factor not immediately obvious from the business rules. For your PIM project, there are not any of these complications.

Actually, by now, developing the relational schema is the easiest of all the steps. We have already found our tables and we have listed our attributes. There are two things left to do. Find the Primary and foreign keys and moving the attributes to correct table.

We first create a column for each table we have in our database. Then, as mentioned earlier in the narrative, we will start each table with an AUTO-NUMBERING attribute as our Primary Key (PK).  Our list would look like this:

          Tables:                  FAMILY                          PERSON                         COMPANY
          PK                        FAM-ID                          PER-ID                           COM-ID

Now, we will go through each of the attributes we listed in the ‘Analysis’ stage. We ask the questions:

Which table does attribute ‘X’ best fit?

The first attribute in our list is ‘Last name’. After asking the question above we see it belongs in the FAMILY table. (Everyone in the Family has the same last name)
The second attribute in our list is ‘First Name’. This one fits best in the PERSON table.
If we go through every attribute listed above our Relational Schema will eventually look like this:

 

Tables: FAMILY PERSON COMPANY
Primary Key (PK) FAM-ID PER-ID COM-ID
Attributes Last Name First Name Company Name
  Home Address Role Work Address
  Home City Date of Birth Work City
  Home State Personal Cell Phone Work State
  Home Zip Personal E-Mail Work Zip
  Home Country Job Title Work Country
  Home Phone Work Personal Line Main Work Phone
  Category Different Surname  
    Notes  

 

The last item we need to make this a Relational database would be the addition of foreign keys. There must be a thread connecting all tables in the database and as previously mentioned, the foreign key creates this relationship. For this exercise, we will put foreign keys in the table that has the ‘Many’ side of the relationship. That is the table that has the crows feet side of the relationship line. The PERSON table has both crows feet attached to it. So we will bring the PK from each of the other two tables and add these attributes to the PERSON table. Our final version of the relational schema will look like this:

Tables: FAMILY PERSON COMPANY
Primary Key (PK) FAM-ID PER-ID COM-ID
Attributes Last Name First Name Company Name
  Home Address Role Work Address
  Home City Date of Birth Work City
  Home State Personal Cell Phone Work State
  Home Zip Personal E-Mail Work Zip
  Home Country Job Title Work Country
  Home Phone Work Personal Line Main Work Phone
  Category Different Surname  
    Notes  
    FAM-ID  
    COM-ID  

 

Whew, we have finished the most difficult part of designing a database. We now have four different items that should completely describe our final product. Now we can pass these items on the database builder for phase 3.