Summary: when a one-to-many JOIN produces excess “repetition” in a result set, use a UNION.
Example: Articles and Comments
Suppose you have a Web site with articles and comments. The database for this might have one table containing the articles, and another table for the comments on those articles, with a relationship between these tables, linking each comment to its article.
create table articles ( id integer not null primary key , title varchar(100) not null , article text not null )
Each of these columns is NOT NULL because either this is required (for the primary key) or expected by the application (an article must have both a title and article text to be accepted).
The last column,
article, could be any large string datatype, like VARCHAR(5000) or TEXT, which, depending on the database system, might hold up to 2 gigabytes. The exact datatype doesn’t really matter, just that the datalength of the column is very large, because it needs to hold the article text of any article.
create table comments ( id integer not null primary key , article_id integer not null foreign key , title varchar(100) null , comment text not null
comments table has a similar structure, with an additional column being a foreign key to link each comment to the article it’s commenting on. The
title is NULL because it is optional for comments. Finally, there is a large string datatype column for the comment text, which you might think need not be as large as the collumn for the article text, but just to be safe, it is usually TEXT as well.
This is a classical one-to-many relationship. Each article can have one or more comments, and each comment belongs to only one article.
Technically, it’s a one-to-zero-or-many relationship, because an article must be able to exist without any comments, which is the situation that pertains immediately after the article is first added, before the first comment on it is created.
Query to retrieve an article and its comments
To retrieve an article, its title and article text, along with all its comments, you can use the following query:
select articles.id , articles.title , articles.article , comments.id , comments.title , comments.comment from articles left outer join comments on comments_article_id = articles.id where articles.id = 937 order by articles.id , comments.id
The query retrieves a specific article by id, plus its associated comments, if any, and it’s a left outer join because the article might not have comments. The result set will look something like this:
aid atitle article cid ctitle comment 937 Article Title [TEXT] 1013 Comment Title [TEXT] 937 Article Title [TEXT] 1024 Comment Title [TEXT] 937 Article Title [TEXT] 1037 Comment Title [TEXT] 937 Article Title [TEXT] 1042 Comment Title [TEXT]
As the design was the classical one-to-many relationship, so the result set is the classical one-to-many result set. With one minor item of note: the article text is repeated in the result set four times, and it could be 2 gigabytes. The query works, but it wastes bandwidth.
So is the answer to write two queries, one for the article and one for its comments? Since there is overhead associated with any query, and two queries are more than one query, it might be interesting to see whether there is a tradeoff.
But there is another way to combine two queries into one, without the “repetition” of data in the result set that you see with a one-to-many join.
The UNION query
The following UNION query consists of two SELECTs combined with the UNION ALL operator:
select 'article' as rowtype , articles.id , articles.title , articles.article from articles where articles.id = 937 union all select 'comment' as rowtype , comments.id , comments.title , comments.comment from articles inner join comments on comments.article_id = articles.id where articles.id = 937 order by rowtype
While there are two SELECTs, there is only one ORDER BY, which sorts the combined results of both SELECTs into one sorted final result set.
The two SELECTs return the article and its comments, respectively. Each SELECT is driven by an identical WHERE clause, to select a specific article or its comments. The second SELECT’s join must be an inner join instead of a left outer join, because the first SELECT retrieves the article, while the second query retrieves only related comments.
rowtype id title article article 937 Article Title [TEXT] comment 1013 Comment Title [TEXT] comment 1037 Comment Title [TEXT] comment 1042 Comment Title [TEXT]
Optimal bandwidth is achieved, because there is no repetition of any large column value. The
rowtype can be a single-byte column if desired; its values are assigned as literals in the SELECTs.
In the second SELECT, is it necessary to do the join at all? Couldn’t the second SELECT simply retrieve from the
comments table using a similar WHERE condition on
articles_id instead? Yes, but it’s better to leave the join in place, so that, if desired, instead of retrieving a specific article by id, you wanted to return all articles that meet a certain condition, such as a keyword in their title. This requires using the same WHERE clause in each SELECT.
Why does this work so well?
You cannot take this approach with every one-to-many relationship that you meet. In order for you to write this type of UNION query, both SELECTs must return a union compatible result set. This means the same number of columns, with each pair of columns having compatible datatypes (dates with dates, numbers with numbers, strings with strings). Articles and comments are well suited for this type of UNION because the tables share such a similar design.