In the early 1980s companies began to replace flat file databases and hierarchical databases with relational databases. Relational databases were developed to provide easier access to data, allowing users to approach data from a variety of viewpoints. In a hierarchical database, users must define business questions in a distinct order, starting with the root of the database. In a relational database, users can ask business questions from many points of view.

One way to think about data in a relational database is to think about a cube of data. For example, at The Gap, the CEO, the district managers and the sales managers might think about their business as selling products (jeans, sweaters) in many markets (stores, regions) and measuring their progress over time (days, weeks, months). Many of the questions these managers have about their business can be answered by looking to the intersection of data defined by the edges of the cube pictured in Figure 1.


Figure 1

The sales manager may ask, “How many jeans did we sell yesterday at the store on 23rd Street and 8th Avenue?” The district manager may wonder, “What were sales by store in New York in September?” The CEO may require a report of total sales for all stores for the third quarter. For each of these business questions, the answer can be found at the point of intersection among the dimensions of product, market and time data.

While this example may seem very simple, it illustrates some of the fundamentals about relational databases. A relational database is, at its core, a system for storing and using data based upon the relationships among the elements of data. A cube is one way to illustrate relations among data as it helps to visualize data intersections. While it is easiest for us to picture a three-dimensional cube, a relational database stores data in many dimensions. We can think of dimensions as the entry points into the data or those business concepts we will use to slice and dice our data. In some organizations, dimensions are referred to as entities.

Advances in technology over the past twenty years have enabled relational databases to become very efficient at loading data into the database from transactional systems. Yet, our example illustrates that the real focus of a relational database should be on getting data out of the database – accessing data for use in business decisions.

Tables

All the data in a relational database is stored in tables. A table is simply a set of rows and columns. Every row contains the same columns, but the rows are not in any particular sorted order. Each row does not have to have data in every column, but the column will still exist for that row with a null value – it will be empty. Each row may be called a record and each column may be called a field.

Returning to our example, we may find that the table for the market dimension at The Gap would look something like this:


Figure 2: Market Dimension Table Name

In a relational database, a structured query language (SQL) is used to retrieve a subset of columns or a subset of rows to answer a business question or query. For instance, we might select the columns store_name and store_address if we wanted to send new merchandising materials to all stores. We might select a subset of rows such as all stores where the floor_plan_type = C-3 to determine where to send a particularly large merchandising poster which only fits in stores with lots of space.

Usually, when we are shown “pictures” of a relational database, we do not see the rows and columns illustrated above. We are usually shown a dimensional model – a picture of the tables that make up the database. In a dimensional model, the rows and columns above would look like Figure 3.


Figure 3

Dimensional Modeling: Data Elements and Dimensions

There are some key terms to learn in order to understand a dimensional model. Dimensional modelers think of the world in terms of data elements, i.e., region or store_key in our example above. Each data element will have a domain value which can be either numeric (a metric) or non- numeric (context.) Numeric domain values or metrics are simply numbers and their measurement is taken repeatedly. Units_sold or dollars_cost are examples of numeric domain values. Units are sold in a particular store, by a particular sales associate, at a particular cash register, on a particular day. This context or the data elements describing more about the numeric domain value of units_sold are referred to as non-numeric domain values.

Data elements are usually divided up into logical clumps such as market, product, time, associate, customer and others. We call these clumps of data elements dimensions and usually assume that these dimensions are independent. In our example, a customer may always buy the same style and size jeans so a correlation would, in fact, exist between product and customer, our data will clump more easily and neatly into separate dimensions to access information about sales.

Keep in mind that we can perform mathematical operations on numeric domain values. We can add units_sold across stores in a region or over time in a store. We can multiply units_sold by dollars_cost for information about our product inventory. We use non-numeric domain values in different ways than numeric domain values. Some non-numeric domain values are hierarchical, such as city, state and country. We can “roll up” cities to states and states to country to view data. Some non- numeric domain values are non-hierarchical, like customer names, so we must find other ways to group or segment customers. Some non-numeric domain values may contain numbers, like a purchase date, and, although we can roll-up dates, we do not usually perform mathematical operations with them.

Dimensional models are comprised of fact tables and dimension tables. The dimensional model illustrated in Figure 4 comes from The Data Warehouse Toolkit by Ralph Kimball. It simply describes a part of the relational database for a retail store.


Figure 4: Dimension Model from Ralph Kimball

If we go back to our cube, this diagram or dimensional model is another way of expressing that a business (The Gap) sells products (jeans, sweaters) in markets (stores, regions) and measures performance over time (days, months, years.)

Dimensional Keys: How Data is Organized in a Relational Database

In the illustration above, the sales facts in the fact table are measured repeatedly. Many customers will buy many products in many stores at many times. We will call this type of data relationship a many-to-many relationship. In many-to-many relationships we use dimensional keys to organize the data. Look for the keys in the dimensional model in Figure 4.

Each dimension has a single primary key. The primary key is unique to each row or record in our database and its value should not change over time. If we review the market dimension table on page 2, we find that each store_key is unique; no two stores can have the same store_key. Although it might seem that we could use store_name as a unique means of identifying each store, we have two stores with the same name (Market St.) although they are different stores with different addresses, states, regions, etc.

A primary key is often a consecutive or random number assigned to the record as it enters the database. A primary key can also be made up of components of other fields in the table. For instance, in some customer address dimension tables the primary key may be a long alpha-numeric combination of ZIP code, first three numbers in the address, first five letters in the address, etc. The primary key is used for indexing the table to make it more efficient to search, sort, link and perform other operations on the data.

Go back and take a look at the Sales Fact Table. It contains three keys: time_key, product_key, and store_key. When these keys appear in the fact table, they are referred to as foreign keys. In the Sales Fact Table, the foreign key is no longer unique. It may appear many times or not at all. In the store dimension table, the store at 23rd and 8th appears one time, with the unique store_key (primary key) of 12345. In the Sales Fact Table, the store_key (this time, foreign key) appears every time a unit is sold in that store. So if there were no sales today at 23rd and 8th, the table would not contain any records for 12345 – no rows would be stored with that store_key. If, on the other hand, this was a great day at 23rd and 8th, the table might contain 1000 rows with sales facts for store_key 12345.

No sale can enter the Sales Fact Table without a time_key, a product_key, and a store_key. If a pair of jeans is sold at The Gap, it has to be sold at a known time, with a known description and brand, at a known store. So the foreign keys in the fact table must have counterparts in the dimension tables to which it refers. This requirement of relational databases is called referential integrity.

If you spend a great deal of time talking with data modelers you may come across a few more terms having to do with keys, such as composite keys and concatenated keys. Every fact table in a relational database has a composite key. This is the primary key for the fact table and it is usually made up of a combination of the foreign keys maintained in the fact table. These foreign keys are concatenated (linked together into a single entry) into a primary key for the fact table.

Stars, Snowflakes and Normal Forms: How a Relational Database is Structured

The dimensional model we have been using as our example may be called a star schema or star join schema. The term comes from the idea that the diagram looks like a star with a large central table (usually our fact table) surrounded by smaller (usually dimension) tables radiating out from the center like rays of the sun (actually a star, remember?) Usually it is only the center table, the fact table, which has multiple tables connecting to it. The other tables, the dimension tables may only have a single join connecting them to the central (fact) table.

There are times when a dimension table will “sprout” additional smaller tables to take into account data that does not change frequently and to save space in storing that data. Our star diagram becomes more complex and begins to look more like a snowflake, so this type of data model is called a snowflake schema. In the diagram in Figure 5, the only the product dimension has been “blown out” into a snowflake.


Figure 5: Snowflake Schema

All relational database designs are based on a normalized logical data model. Normalization is simply putting one fact in one appropriate place. When disk space was considered expensive and designers focused on updating data, this approach made a great deal of sense. However, many of the activities described here are about accessing and retrieving data. When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This tends to slow the retrieval process.

With a normalized data model, one fact is stored in one place, related facts about a single entity are stored together, and every column of each entity refers to only the unique identifier for that entity. There are three basic rules that define the three normal forms:

  1. In first normal form: All column values are atomic and non-repeating.
    For instance, rather than storing the entire postal address as a single field (in one column), it is preferable to break the address up into usable parts, such as street_address, city, state, ZIP code and country. These “atomic-level” fields will make it easier for users to search and sort the data.
  2. In second normal form: All column values depend on the value of the primary key.
    Say we decided to maintain customer and product data in the same table because most of our customers buy the same products over and over again. Some fields in the table would depend upon the customer, like their address or birth date, and some fields would depend upon the product, like the description or the brand. Again, splitting tables into logical and distinct sets of fields will ease searching and sorting.
  3. In third normal form: No column value depends on the value of any other column except the primary key.
    The Sales Fact Table in our example contains units_sold and dollars_sold, but it does not contain, say, total_sale_value. Since total_sale_value would depend on both units_sold and dollars_sold, it is not an independent column value. You might want to think of this rule as simply meaning that the tables will have no calculated values. This saves the table from having to update multiple fields when a single field changes.

When the database designers have applied these three rules, the database is said to be in the Third Normal Form (3NF) or simply normalized. Some databases, especially large data warehouses and decision support systems are purposefully built without following the third rule. These databases are said to be denormalized.

Let’s describe these database states a bit more technically. Dimensional models are full-fledged relational models when the fact table is in third normal form and the dimension tables are in second normal form. Note that the dimension tables in second normal form are referred to as denormalized. Quite simply, this means that the dimension tables may maintain repeat data. Denormalization is the process of putting one fact in numerous places. This goes against the “one fact in one place” notion behind normalization, but allows for rapid data retrieval.

Remember that the chief difference between second and third normal forms is that repeated entries are removed from a second normal form table and placed in their own "snowflake." Thus the act of removing the context from a fact record and creating dimension tables places the fact table in third normal form. Dimensions are often left in second normal form to make querying the database and retrieving data more efficient.

Entity Relationship Diagrams: Reading Relationships in a Data Model

The relational databases you work with will be much more complex than the diagrams pictured here. However, you should be able to understand the relationships among the tables in the database by beginning with the central table (a large fact table) and using the foreign keys in the fact table to determine the links to the dimension tables. The primary key and table name of each dimension table will guide you in understanding the type of data contained in the table.

The picture of the relational database you work with may be called an entity relationship diagram or ERD. In many organizations, the ERD is simplified for users by hiding tables that were developed to help with data transformations or calculations or to help speed processing. Although the diagram may seem overwhelming at first, you will find that there are a set of distinct tables you will use in your work, and you can concentrate on learning the contents of these tables and the relationships among the data in these tables first.

To read the relationship between two tables (sometimes called entities), you need to recognize the series of symbols shown in Figure 6


Figure 6: In Preparation for Table Joins

To help us think about joining tables, let’s step back a moment and look at Boolean logic. This system of thinking was developed by George Boole in the late 1800s and is the basis for how the bits and bytes of computers work. Boolean logic introduces us to three very basic, but important, concepts that we will use to access data effectively from a relational database.

Boolean logic is best illustrated using Venn diagrams in simple examples that describe the three basic concepts: “and,” “or” and “not.” See Figure 7.


Figure 7

AND means "I want only records that contain both data elements."

OR means "I want records that contain either data element. I don't care which one."

NOT means “I do not want records that contain this data element.”

The diagrams below will highlight the importance of using these three logic gates appropriately. By using OR when you should have used AND, you will capture and return many records you do not need. By using AND when you should have used OR, you will not capture records that should be included in your search. By using NOT inappropriately, you may eliminate records that should be included in your search.

If you are searching for the shaded area in each of the diagrams below, your logic would be as follows in Figure 8.


Figure 8

These logic gates, AND, OR and NOT, are the basis of SQL, the structured query language behind campaign management tools and many other querying and reporting packages. In addition to these three concepts, we will also use the logic symbols familiar to you from basic grade school math.

Equal to

=

Greater than

>

Greater than or equal to

>=

Less than

<

Less than or equal to

<=

Not equal to

<>

Another important term we will use in joining tables is a WHERE clause. The WHERE clause helps to specify the conditions around the question you have for the database. These conditions help you to choose smaller and more specific sets of data and to use your computing resources more efficiently and effectively.

While we will not actually write SQL code, the tools we use will perform operations on the database using similar constructs. Let’s look at a simple query to understand a common mistake beginners make in joining tables.

SELECT a.cust_id, b.product_id, b.product_name   FROM order a, product b 

In this query, the user has asked the database to look at the order table and the product table and to select all customers, and products. The user may be doing this to select everyone who has ever purchased any product in order to send holiday cards to active customers. The user did not specify any condition – they did not say how the join should be done. So, what will the database do? The database will chug away to return every record in the table order and every record in the table product. If we were a relatively small company with 100 products and 5,000 orders, we would get back 500,000 rows of nonsense. Imagine what would happen in your company in the same situation – millions or even billions of rows! This effect is called a Cartesian join and will often result in a system shut down and/or a very unpleasant call from the database administrator!

Table Joins: Using the Data in a Relational Database

A table join is the process of combining data from two or more tables. The database will use the columns specified in the join (the WHERE clause) to combine the results with like values from two or more tables. As mentioned above, if you do not specify a join condition, all combinations of rows from the tables listed in the FROM clause are returned back, even if the data from the two tables are unrelated. This result is referred to as a “cross product” or a Cartesian join. This concept is repeated to impress upon you the importance of thinking through the WHERE clause and the FROM clause prior to moving forward with your query!

Let’s look at some examples based on two tables. These are simplifications of tables in a real data mart at a financial services firm to illustrate table joins. The actual tables contain many more rows or records and many more columns or fields.

Let’s say that CHECKING_ACCT contains the following card information:

BASE_ACCT_ID

BASIC_NO

FIRST_NAME

LAST_NAME

11111111111

01

JOHN

PUBLIC

22222222222

01

CAROL

CUSTOMER

33333333333

01

SUSAN

SHOPPER

44444444444

01

PAUL

PAYLESS

And, EMAIL_ADDR contains the following email address information:

BASE_ACCT_ID

BASIC_NO

CUSTOMER_ID

INET_ADDR_TXT

11111111111

01

100

johnqpublic@aol.com

33333333333

01

200

shopper@yahoo.com

88888888888

01

300

latepay@msn.com

In order to join the two tables together, we use the BASE_ACCT_ID and BASIC_NO keys as these keys appear in both tables. Notice in this example, John Public and Susan Shopper both have a card and have provided an email address. Carol Customer and Paul Payless have a card but have not provided an email address.

Let’s look at an example where we produce a Cartesian join or receive a Cartesian product of the two tables. In this example, a join condition is not specified (we don’t see a WHERE clause) so every combination of rows is present in our result. Each cardholder is matched with each e-mail address and the results don’t make much sense!

The Query

SELECT c.base_acct_id, c.basic_no, c.first_name, c.last_name,
e.base_acct_id, e.basic_no, e.customer_id, e.inet_addr_txt
FROM CHECKING_ACCT c,
EMAIL_ADDR e

The Results: A Cartesian Product!

11111111111 01 JOHN PUBLIC 11111111111 01 100 johnqpublic@aol.com
22222222222 01 CAROL CUSTOMER 11111111111 01 100 johnqpublic@aol.com
33333333333 01 SUSAN SHOPPER 11111111111 01 100 johnqpublic@aol.com
44444444444 01 PAUL PAYLESS 11111111111 01 100 johnqpublic@aol.com
11111111111 01 JOHN PUBLIC 33333333333 01 200 shopper@yahoo.com
22222222222 01 CAROL CUSTOMER 33333333333 01 200 shopper@yahoo.com
33333333333 01 SUSAN SHOPPER 33333333333 01 200 shopper@yahoo.com
44444444444 01 PAUL PAYLESS 33333333333 01 200 shopper@yahoo.com
11111111111 01 JOHN PUBLIC 33333333333 01 300 latepay@msn.com
22222222222 01 CAROL CUSTOMER 33333333333 01 300 latepay@msn.com
33333333333 01 SUSAN SHOPPER 33333333333 01 300 latepay@msn.com
44444444444 01 PAUL PAYLESS 33333333333 01 300 latepay@msn.com

How do you get results that make sense, results that you can use in your work? There are two main types of table joins you will use – inner joins and outer joins. Inner joins keep only the rows from the tables that meet the join condition. If a row exists in one table, but not the other, the information is not included in the results.

The following example produces the inner join of the two tables. The inner join brings back the cards that have an assigned email address. Carol Customer and Paul Payless are not returned because they do not have an associated email address in theEMAIL_ADDR table. Notice the WHERE clause.

The Query

SELECT c.base_acct_id, c.basic_no, c.first_name, c.last_name,
e.base_acct_id, e.basic_no, e.customer_id, e.inet_addr_txt
FROM CHECKING_ACCT AS c INNER JOIN
EMAIL_ADDR AS e
WHERE c.base_acct_id = e.base_acct_id
AND c.basic_no = e.basic_no

The Results: An Inner Join

11111111111 01 JOHN PUBLIC 11111111111 01 100 johnqpublic@aol.com
33333333333 01 SUSAN SHOPPER 33333333333 01 200 shopper@yahoo.com

Notice that the results of the inner join consist of rows that have matching values based on the base_acct_id and basic_no values in the right (CHECKING_ACCT) and left (EMAIL_ADDR) tables.

Outer joins are a little more complicated than inner joins. Where inner joins eliminate all rows that do not appear in both tables, outer joins enable you to include rows that appear in only one table in addition to those rows which appear in both tables. Let’s see how outer joins work.

Outer joins are a concatenation of the inner join and the missing rows from the left table, right table or both tables. When you perform an outer join on two tables, you arbitrarily assign one table as the left table and the other as the right table. There are three types of outer joins:

  1. Left outer join includes the inner join rows and the rows from the left table that are not included in the inner join;
  2. Right outer join includes the inner join rows and the rows from the right table that are not included in the inner join; and
  3. Full outer join includes the inner join rows and the rows from both the left and right tables that are not included in the inner join.

In your query, in the FROM clause, you will list the name of the first table followed by the keyword LEFT OUTER JOIN, RIGHT OUTER JOIN or FULL OUTER JOIN. Then specify the second table followed by the ON keyword. Following the ON keyword, you specify the join condition to express a relationship between the tables joined.
Let’s try a left outer join. In the following example, CHECKING_ACCT is designated as the right table and EMAIL_ADDR as the left table. By using LEFT OUTER JOIN, we list the all the cards in the CHECKING_ACCT card table and their email address if they have provided one.

The Query

SELECT c.base_acct_id, c.basic_no, c.first_name, c.last_name,
e.base_acct_id, e.basic_no, e.customer_id, e.inet_addr_txt
FROM CHECKING_ACCT AS c LEFT OUTER JOIN
EMAIL_ADDR AS e
ON c.base_acct_id = e.base_acct_id
AND c.basic_no = e.basic_no

The Results: A Left Outer Join

11111111111 01 JOHN PUBLIC 11111111111 01 100 johnqpublic@aol.com
22222222222 01 CAROL CUSTOMER BLANK BLANK BLANK BLANK
33333333333 01 SUSAN SHOPPER 33333333333 01 200 shopper@yahoo.com
44444444444 01 PAUL PAYLESS BLANK BLANK BLANK BLANK

Remember, the inner join results included rows with values in all columns – the John Public and the Susan Shopper rows are the result of the inner join. With the left outer join, the results will also include those rows from the card table where the join condition was not satisfied -- BLANK values are indicated in the appropriate columns of the right table (Carol Customer and Paul Payless do not have email addresses.) Note that all rows from the left table (CHECKING_ACCT) are included in the result set even though they do not have a corresponding row in the (EMAIL_ADDR) table.

Let’s try a right outer join. As in the previous example, CHECKING_ACCT is designated as the right table and EMAIL_ADDR as the left table. For the RIGHT OUTER JOIN example, we want to list all the email addresses from the EMAIL_ADDR table and their First and Last Name if they have one.

The Query

SELECT c.base_acct_id, c.basic_no, c.first_name, c.last_name,
e.base_acct_id, e.basic_no, e.customer_id, e.inet_addr_txt
FROM CHECKING_ACCT AS c RIGHT OUTER JOIN
EMAIL_ADDR AS e
ON c.base_acct_id = e.base_acct_id
AND c.basic_no = e.basic_no

The Results: A Right Outer Join

11111111111 01 JOHN PUBLIC 11111111111 01 100 johnqpublic@aol.com
33333333333 01 SUSAN SHOPPER 33333333333 01 200 shopper@yahoo.com
BLANK BLANK BLANK BLANK 88888888888 01 300 latepay@msn.com

As in the left outer join example, the rows with values in all columns are returned as the result of the inner join. These are rows that satisfy the join condition: People who have a card and have provided an email address (John Public and Susan Shopper.) For the row where the join condition was not satisfied, a BLANK value appears in the appropriate columns. Since the latepay@msn.com email address does not have a corresponding card record in the Card table, the first name and last name columns have BLANK values. Notice that all rows from the right table (EMAIL_ADDR) are included in the result set even if they do not have a corresponding card record.

Finally, let’s try a full outer join. This example uses FULL OUTER JOIN with the two tables. This query will return all the names from the Card and Email Address table. Notice how the full outer join differs from the Cartesian join in our first example.

The Query

SELECT c.base_acct_id, c.basic_no, c.first_name, c.last_name,
e.base_acct_id, e.basic_no, e.customer_id, e.inet_addr_txt
FROM CHECKING_ACCT AS c FULL OUTER JOIN
EMAIL_ADDR AS e
ON c.base_acct_id = e.base_acct_id
AND c.basic_no = e.basic_no

The Results: A Full Outer Join

11111111111 01 JOHN PUBLIC 11111111111 01 100 johnqpublic@aol.com
22222222222 01 CAROL CUSTOMER BLANK BLANK BLANK BLANK
33333333333 01 SUSAN SHOPPER 33333333333 01 200 shopper@yahoo.com
44444444444 01 PAUL PAYLESS BLANK BLANK BLANK BLANK
BLANK BLANK BLANK BLANK 88888888888 01 300 latepay@msn.com

The results of the full outer join include the left outer join, right outer join and inner join rows. Just like the left outer join and right outer join, for values where the join condition was not satisfied, the BLANK value is populated in the appropriate column. Every row from both tables is returned in the result set.

The most effective means of learning about the data in your database is to spend time exploring the data using simple queries and progressing to more complex questions over time. Think of business questions you would like to answer or think about those questions on the mind of the CEO. With some patience and a little practice, even the most techno- phobic among us can read data models, query databases and understand more about the nuts and bolts behind their business.

CELLSPACING="4">

Relational Database Basic Concepts

Relational database – system for storing and using data based upon the relationships among the elements of data

Table – a set of rows and columns; the building block of a relational database

Field – a column in a table containing a single characteristic of a subject

Record – a row in a table; all the fields for one item in a table

Dimensional model – a picture of the tables that make up the database

Facts – numeric metrics collected in a fact table; also known as a numeric domain value

Attributes – non-numeric context around the facts; collected in a dimension table; also known as a non-numeric domain value

Primary key – a column or field that uniquely identifies each row or record in a database; its value does not change over time

Foreign key – primary keys from dimension tables stored in a fact table; foreign keys are not unique and may appear repeatedly.

Denormalization – the process of putting one fact in numerous places optimizes data retrieval

Normalization – the process of putting one fact in one appropriate place, optimizes data maintenance

Star schema – a type of dimensional model with a large central table (usually a fact table) surrounded by smaller (usually dimension) tables radiating out from the center, usually a denormalized database

Snowflake schema – a type of dimensional model in which the dimension tables are related to smaller, satellite tables, usually a normalized database

Table Join – the process of combining data from two or more tables

Inner Join – the process of combining data from two tables where the result is only the rows from the tables that meet the join condition; if a row exists in one table, but not the other, the data elements in that row are not included in the results

Outer Join – the process of combining data from two tables where the result is a combination of the results of the inner join and the missing rows from the left table, right table or both tables

Left Outer Join – results include the inner join rows and the rows from the left table that are not included in the inner join

Right Outer Join – resultsinclude the inner join rows and the rows from the right table that are not included in the inner join

Full Outer Join – results include the inner join rows and the rows from both the left and right tables that are not included in the inner join

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access