Using Joins to Combine Tables

When designing a database, it often makes sense to divide data between two or more tables. For example, if we are maintaining a database of patient records, we would probably want at least four tables: one for the patient's personal information (such as their name date of birth), a second to keep track of their appointments, a third for information about the doctors who are treating them, and a fourth for information about the hospitals those doctors work in (see Figure 15.6, on the following page). We could store all of this in one table, as shown Figure 15.7, on the next page, but then a lot of information would be needlessly duplicated.

If we divide information between tables, though, we need some way to pull that information back together. For example, if we want to know the hospitals at which a patient has had appointments, we need to combine data from all four tables to find out:

• What appointments the patient has had

• Which doctor each appointment was with

• Which hospital that doctor works at

The right way to do this in a relational database is to use a join. As the name suggests, a join combines information from two or more tables to

Figure 15.6: Dividing data between tables

Patient-Doctor-Appointment-Hospital

patient ^lice\

birthday doctor

Rajani date

2008/09/01

hospital

Central y'52 Walnut St

address

Alice

1978/04/02

Nianiaris

Newton

1964/12/15

2008/09/1

East

Newton

2008/09/1

East

Zack

2008/11/0

East V.

8 Elm Sty

Figure 15.7: Abad database design create a new set of records, each of which can contain some or all of the information in the tables involved.

To begin, let's add another table that contains the names of countries, the regions that they are in, and their populations:

Download db/db_add_tables.cmd

>>> cur.execute('CREATE TABLE PopByCountry(Region TEXT, Country TEXT, Population INTEGER)')

and then insert data into the new table:

Download db/db_insert_2.cmd

>>> cur.execute('INSERT INTO PopByCountry VALUES("Eastern Asia", "China", 1285238)')

Inserting data one row at a time like this requires a lot of typing. It is simpler to make a list of tuples to be inserted and write a loop that inserts the values from these tuples one by one:

Download db/db_insert_3.cmd

>>> countries = [("Eastern Asia", "DPR Korea", 24056), ("Eastern Asia", "Hong Kong (China)", 8764), ("Eastern Asia", "Mongolia", 3407), ("Eastern Asia", "Republic of Korea", 41491), ("Eastern Asia", "Taiwan", 1433), ("North America", "Bahamas", 368), ("North America", "Canada", 40876), ("North America", "Greenland", 43), ("North America", "Mexico", 126875), ("North America" , "United States" , 493038)] >>> for c in countries:

... cur.execute( 'INSERT INTO PopByCountry VALUES (?, ?, ?)', (c[0], c[1], c[2])) >>> con.commitO

This time, the call to execute has two arguments. The first is the SQL command with question marks as placeholders for the values we want to insert. The second is a tuple of values, which the database matches up against the question marks from left to right when it executes the command.

Now that we have two tables in our database, we can use joins to combine the information they contain. There are several types of joins; we will begin with inner joins, which involve the following:

1. Constructing the cross product of the tables

2. Discarding rows that do not meet the selection criteria

3. Selecting columns from the remaining rows

These steps are shown graphically in Figure 15.8, on the following page. First, all combinations of all rows in the tables are combined, which makes the cross product. Second, the selection criteria specified by WHERE is applied, and rows that don't match are removed. Finally, the selected columns are kept, and all others are discarded.

In an earlier query, we retrieved the names of regions with projected populations greater than 1 million. Using an inner join, we can get the names of the countries that are in those regions.

Compute cross product PopByRegion PopByCountry

Eastern Asia

1362955

_____M

Eastern Asia

Mongolia

3407

North America

661157

North America

Greenland

43

Keep rows where PopByRegion.Region = PopByCountry.Region

Eastern Asia

1362955

Eastern Asia

Mongolia

3407

North America

661157

North America

Greenland

43

Eastern Asia

1362955

North America

Greenland

43

North America

661157

Eastern Asia

Mongolia

3407

<2^ Keep rows where PopByRegion.Population > 1000000

Eastern Asia

1362955

Eastern Asia

Mongolia

3407

North America

661157

North America

Greenland

43

Keep columns PopByRegion.Region and PopByCountry.Country

Eastern Asia

1362955

Eastern Asia

Mongolia

3407

Figure 15.8: Inner joins in action

The query and its result look like this:

Download db/db_inner_join.cmd

SELECT PopByRegion.Region, PopByCountry.Country FROM PopByRegion INNER JOIN PopByCountry WHERE (PopByRegion.Region = PopByCountry.Region) AND (PopByRegion.Population > 1000000)

[( 'Eastern Asia', 'China'), ('Eastern Asia', 'DPR Korea'), ('Eastern Asia' , 'Hong Kong (China)'), ('Eastern Asia' , 'Mongolia'), ('Eastern Asia' , 'Republic of Korea'), ('Eastern Asia' , 'Taiwan')]

To understand what this query is doing, we can analyze it in terms of the three steps outlined earlier:

1. Combine every row of PopByRegion with every row of PopByCountry. PopByRegion has two columns and twelve rows, while PopByCountry has three columns and eleven rows, so this produces a temporary table with five columns and 132 rows (see Figure 15.9, on the next page).

Central Africa

330993

Eastern Asia

DPR Korea

24056

Southeastern Africa

743112

Eastern Asia

DPR Korea

24056

Northern Africa

1037463

Eastern Asia

DPR Korea

24056

Southern Asia

2051941

Eastern Asia

DPR Korea

24056

Asia Pacific

785468

Eastern Asia

DPR Korea

24056

Middle East

687630

Eastern Asia

DPR Korea

24056

Eastern Asia

1362955

Eastern Asia

DPR Korea

24056

South America

593121

Eastern Asia

DPR Korea

24056

Eastern Europe

223427

Eastern Asia

DPR Korea

24056

North America

661157

Eastern Asia

DPR Korea

24056

Western Europe

387933

Eastern Asia

DPR Korea

24056

Japan

100562

Eastern Asia

DPR Korea

24056

Central Africa

330993

Eastern Asia

Hong Kong (China)

8764

Southeastern Africa

743112

Eastern Asia

Hong Kong (China)

8764

Northern Africa

1037463

Eastern Asia

Hong Kong (China)

8764

Southern Asia

2051941

Eastern Asia

Hong Kong (China)

8764

Asia Pacific

785468

Eastern Asia

Hong Kong (China)

8764

Middle East

687630

Eastern Asia

Hong Kong (China)

8764

Eastern Asia

1362955

Eastern Asia

Hong Kong (China)

8764

South America

593121

Eastern Asia

Hong Kong (China)

8764

Eastern Europe

223427

Eastern Asia

Hong Kong (China)

8764

North America

661157

Eastern Asia

Hong Kong (China)

8764

Western Europe

387933

Eastern Asia

Hong Kong (China)

8764

Japan

100562

Eastern Asia

Hong Kong (China)

8764

Figure 15.9: An inner join in progress

2. Discard rows that do not meet the selection criteria. The join's WHERE clause specifies two of these: the region taken from Pop-ByRegion must be the same as the region taken from PopByCountry, and the region's population must be greater than 1 million. The first criterion ensures that we don't look at records that combine countries in North America with regional populations in East Asia; the second filters out information about countries in region whose populations are less than our threshold.

3. Finally, we select the region and country names from the rows that have survived.

To find the regions where one country accounts for more than 10 percent of the region's overall population, we would also need to join the two tables.

Download db/db_duplicates.cmd

SELECT PopByRegion.Region

FROM PopByRegion INNER JOIN PopByCountry

WHERE (PopByRegion.Region = PopByCountry.Region)

AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)''') >>> print cur.fetchall()

[('Eastern Asia' ,), ('North America' ,), ('North America' ,)]

We use multiplication and division in our WHERE condition to calculate the percentage of the region's population by country as a floatingpoint number. The resulting list contains duplicates, since more than one Eastern Asian country accounts for more than 10 percent of the region's population. To remove the duplicates, we add the keyword DISTINCT to the query:

Download db/db_distinct.cmd

SELECT PopByRegion.Region

FROM PopByRegion INNER JOIN PopByCountry

WHERE (PopByRegion.Region = PopByCountry.Region)

AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)''')

Was this article helpful?

0 0

Post a comment