THE MISSING JOIN !
The
goal of this article is to present a new type of join for SQL. This is
preparatory work towards a more complete article, with emphasis on finding
an adequate relational-algebra tool for expressing such a join.
Par
Frédéric BROUARD
PARIS 2002-02-04
Translated from french By Peter GULUTZAN, author of Ocelot, SQL 3 RDBMS and "SQL 99 complete really" printed by R&D books - With many thanks
In
view of the grave problems that colleagues and Internet users bring to
me, I have tried to resolve, on a case-by-case basis, a family of problems
which are rather difficult to express with simple SQL queries. I have arrived
at the conclusion that it would be interesting to add a new type of join
to SQL -- the "LINEAR" join or ROW JOIN. We are first of all going to show
the various serious problems, and see how we can resolve them by introducing
this new operator.
I
won't venture immediately into the relational algebra to determine whether
the good Doctor Codd missed something in his task. So I reserve to the
theorists a nice prospective battle, to determine what this operation offers,
and the ways of representing it.
1 - Enumerate rows and all the queries that flow from them
Suppose the table
T_CLIENT_CLI (CLI_ID, CLI_NOM) is as follows :
CLI_ID CLI_NOM
------- ------------ 17 DURAND 192 DUPONT 44 DUVAL 11 DUMOULIN 741 DULIN 82 DUPOND 177 DURAND |
To create the test
set:
CREATE TABLE T_CLIENT_CLI
(CLI_ID INTEGER, CLI_NOM VARCHAR(10)) INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
|
The question is:
how does one get, in reply, the names of our clients, in alphabetic order,
with their rank (from 1 to n) ?
1.1 - RESULT 1
If one poses this
question strictly, then the result is:
CLI_NOM
RANG
------------- ----- DULIN 1 DUMOULIN 2 DUPOND 3 DUPONT 4 DURAND 5 DURAND 5 DUVAL 7 |
Thus, the two DURANDs
(which happen to be equal) are in the fifth rank, while DUVAL is in not
the sixth but the seventh rank !
1.2 - RESULT 2
Another possible
solution is :
CLI_NOM
RANG
------------- ----- DULIN 1 DUMOULIN 2 DUPOND 3 DUPONT 4 DURAND 5 DURAND 6 DUVAL 7 |
That is : a straightforward
and direct numeration without taking into account the duplications and
ambiguities of the selected information. It's a bit like what auto-incrementing
columns do in certain RDBMSs.
1.3 - RESULT 3
Finally, one can
refine the last solution by introducing a counter to make the duplicates
disappear :
CLI_NOM
RANG NOMBRE
------------- ----- ------ DULIN 1 1 DUMOULIN 2 1 DUPOND 3 1 DUPONT 4 1 DURAND 5 2 DUVAL 7 1 |
... which has the
advantage of being cleaner -- however it doesn't correspond to the initial
question!
1.4 - RESULT 4
Pushing things to
the extreme, one can demand that the numeration by rank should be strict
and without any "holes", like this :
CLI_NOM
RANG NOMBRE
------------- ----- ------ DULIN 1 1 DUMOULIN 2 1 DUPOND 3 1 DUPONT 4 1 DURAND 5 2 DUVAL 6 1 |
But what queries
are necessary to arrive at the various proposed solutions? ?
1.4 - The Queries for the above results
In principle, the queries for answering this sort of question require an automatic "non-equi" join, in order to make numbers for tuples whose values precede the current tuple.
Query for Result
1 :
SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS
RANG
FROM T_CLIENT_CLI T1 LEFT OUTER JOIN T_CLIENT_CLI T2 ON T1.CLI_NOM > T2.CLI_NOM GROUP BY T1.CLI_ID, T1.CLI_NOM ORDER BY RANG |
Query for Result
2 :
SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS
RANG
FROM T_CLIENT_CLI T1 LEFT OUTER JOIN T_CLIENT_CLI T2 ON T1.CLI_NOM || CAST(T1.CLI_ID AS CHAR(16)) > T2.CLI_NOM || CAST(T2.CLI_ID AS CHAR(16)) GROUP BY T1.CLI_ID, T1.CLI_NOM ORDER BY RANG |
Query for Result
3 :
SELECT T1.CLI_NOM, RANG, NOMBRE
FROM (SELECT DISTINCT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG FROM T_CLIENT_CLI T1 LEFT OUTER JOIN T_CLIENT_CLI T2 ON T1.CLI_NOM > T2.CLI_NOM GROUP BY T1.CLI_ID, T1.CLI_NOM) T1 INNER JOIN (SELECT CLI_NOM, COUNT(CLI_ID) AS NOMBRE FROM T_CLIENT_CLI T1 GROUP BY CLI_NOM) T2 ON T1.CLI_NOM = T2.CLI_NOM ORDER BY RANG |
Query for Result
4 :
SELECT T2.CLI_NOM, COUNT(T1.CLI_NOM) AS RANG,
NOMBRE
FROM (SELECT DISTINCT CLI_NOM FROM T_CLIENT_CLI) T1 LEFT OUTER JOIN (SELECT DISTINCT CLI_NOM FROM T_CLIENT_CLI) T2 ON T1.CLI_NOM <= T2.CLI_NOM INNER JOIN (SELECT CLI_NOM, COUNT(CLI_ID) AS NOMBRE FROM T_CLIENT_CLI GROUP BY CLI_NOM) T3 ON T2.CLI_NOM = T3.CLI_NOM GROUP BY T2.CLI_NOM, T3.NOMBRE ORDER BY RANG |
The least that one
can say is that this type of request doesn't raise a beginner's spirits.
What then could we say about the developer, faced with the problem in a
context where tables are much more densely populated ? It's a good bet
that he/she will simply throw up his/her hands and resort to a procedure
: at best a stored procedure, at worst a procedure on the client machine.
2 - Assigning rows to places
The second family of problems which warrants our attention in this context is the problems of assignment, problems which are dear to the hearts of every pupil at the start of a school year (for example). The question is: starting with a table which constitutes a population, and another table which constitutes a place (with each one place being destined to receive one pupil, one spectator, or the like) how does one assign one place for each element of the population?
Let's take our CLIENTS
table again, and add a table of SEATS, modelling the places in a theatre
T_PLACE_PLC (PLC_REF). As you know, places in a theatre are designated
by letters (the row) and by digits (the seat within the row). For our demonstration
we will limit ourselves to three rows, and 5 seats per row. In other words:
a pocket theatre!
PLC_REF
------- A01 A02 A03 A04 A05 B01 B02 B03 B04 B05 C01 C02 C03 C04 C05 |
Creation of the test
sample :
CREATE TABLE T_PLACE_PLC
(PLC_REF CHAR(3)) INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A01')
|
The problem is quite simple: how to assign the clients to the first seats ?
It seems on the face of it that the simplest thing would be to number the client rows and then number the seat rows, and effect a join over these numbers.
Something like :
CLI_ID CLI_NOM
CLI_NUM
------- ------------ ------- 17 DURAND 1 192 DUPONT 2 44 DUVAL 3 11 DUMOULIN 4 741 DULIN 5 82 DUPOND 6 177 DURAND 7 |
PLC_REF PLC_NUM
-------- ------- A01 1 A02 2 A03 3 A04 4 A05 5 B01 6 B02 7 B03 8 B04 9 B05 10 C01 11 C02 12 C03 13 C04 14 C05 15 |
Given the above,
the solution is easy to see :
SELECT CLI_NOM, PLC_REF
FROM T_CLIENT_CLI JOIN T_PLACE_PLC ON CLI_NUM = PLC_NUM |
Which gives :
CLI_NOM PLC_REF
------------ ------- DURAND A01 DUPONT A02 DUVAL A03 DUMOULIN A04 DULIN A05 DUPOND B01 DURAND B02 |
However we don't
actually have such columns at our disposal. How would we make them ?
All that's necessary
is to apply what we just saw in the previous example, for the clients,
but also for the seats, and to join the whole thing over the sequencing
columns that were thus generated.
I just give you the
result in a rough form, its maintenance is cheery enough as it is !!!
SELECT CLI_NOM, PLC_REF
FROM (SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG FROM T_CLIENT_CLI T1 LEFT OUTER JOIN T_CLIENT_CLI T2 ON T1.CLI_NOM || CAST(T1.CLI_ID AS CHAR(16)) > T2.CLI_NOM || CAST(T2.CLI_ID AS CHAR(16)) GROUP BY T1.CLI_ID, T1.CLI_NOM) C INNER JOIN (SELECT T3.PLC_REF, COUNT(T4.PLC_REF) + 1 AS RANG FROM T_PLACE_PLC T3 LEFT OUTER JOIN T_PLACE_PLC T4 ON T3.PLC_REF > T4.PLC_REF GROUP BY T3.PLC_REF) P ON C.RANG = P.RANG |
And once again we've
realized that the PLC_REF column is a candidate key for the T_PLACE_PLC
table ...
3 - The solution : linear join (or ROW JOIN) !
The original task
is to handle a very simple table consisting of a single column and filled
with the set of integers: T_I_ENT (ENT_I). Of course there will be some
limit, for example a sequence going from 0 to 1000 or more if necessary
:
ENT_I
------- 0 1 2 3 4 5 6 7 8 9 10 ... |
CREATE TABLE T_I_ENT
(ENT_I INTEGER) INSERT INTO T_I_ENT (ENT_I) VALUES (0)
|
Let's note in passing
that it's useless to grab all the numbers from 1 to 100, the first ten
numbers are enough, and a simple INSERT statement will play the role of
complementary insertion :
INSERT INTO T_I_ENT (ENT_I)
SELECT DISTINCT I1.ENT_I + (I2.ENT_I * 10) + (I3.ENT_I * 100) FROM T_I_ENT I1 CROSS JOIN T_I_ENT I2 CROSS JOIN T_I_ENT I3 CROSS JOIN T_I_ENT I4 WHERE I1.ENT_I + (I2.ENT_I * 10) + (I3.ENT_I * 100) BETWEEN 0 AND 1000 |
From there, the juxtaposition
of the projection of the name column (from the client table ordered by
client) with the projection of the table of integers is a response to our
attempt :
CLI_NOM
T_I_ENT
---------- ------------ DULIN 1 DUMOULIN 2 DUPOND 3 DUPONT 4 DURAND 5 DURAND 6 DUVAL 7 |
And that is why I
propose a new operator for linear joins :
LINEAR JOIN (or ROW JOIN),
which makes it possible to relate the row rank number one ("ordinal one")
of the table on the left of the operator, with the row rank number one
("ordinal one") of the table on the right, and so on.
4 - Syntax and Rules of linear joining
Linear joining is
done with the following syntax :
SELECT <select list>
FROM <left table> [LEFT | RIGHT] LINEAR JOIN <right table> [OFFSET <offset value>] |
For our preceding
example, then, it's enough to say:
SELECT CLI_NOM, T_I_ENT
FROM T_CLIENT_CLI LINEAR JOIN T_I_ENT OFFSET 1 /* elimination of the first line, a zero */ ORDER BY CLI_NOM, T_I_ENT |
Some explanations:
Example of a
right outer linear join:
SELECT CLI_NOM, T_I_ENT
FROM T_CLIENT_CLI &nbs p; RIGHT LINEAR JOIN T_I_ENT ORDER BY CLI_NOM, T_I_ENT |
which gives :
CLI_NOM T_I_ENT
---------- ------------ DULIN 0 DUMOULIN 1 DUPOND 2 DUPONT 3 DURAND 4 DURAND 5 DUVAL 6 NULL 7 NULL 8 ... NULL 1000 |
To solve our problem
of assignment of places in a theatre, all we have to do is :
SELECT CLI_NOM, PLC_REF
FROM T_CLIENT_CLI LINEAR JOIN T_I_ENT LINEAR JOIN T_PLACE_PLC ORDER BY CLI_NOM, T_I_ENT, PLC_REF |
I don't know what
you think of it, but I find this script to be simple and easy to understand!
CONCLUSION
These queries are
related to the T-JOINs (theta joins) of Doctor Codd, intended to obtain
an optimal correspondence of unequal values (typically the problem of assignment
of pupils to rooms of a given capacity).
I leave it to you
to figure out how such a join would be represented in relational algebra
!