LE SQL de A à Z - 1ere partie : présentation

NOTICE: the content of these web pages is subject to the French and international legal specifications regarding copyright and the rights of authors.

THE MISSING JOIN !

The goal of this article is to present a new type of join for SQL. This is preparatory work towards a more complete article, with emphasis on finding an adequate relational-algebra tool for expressing such a join.

Par Frédéric BROUARD
PARIS 2002-02-04

Translated from french By Peter GULUTZAN, author of Ocelot, SQL 3 RDBMS and "SQL 99 complete really" printed by R&D books - With many thanks

In view of the grave problems that colleagues and Internet users bring to me, I have tried to resolve, on a case-by-case basis, a family of problems which are rather difficult to express with simple SQL queries. I have arrived at the conclusion that it would be interesting to add a new type of join to SQL -- the "LINEAR" join or ROW JOIN. We are first of all going to show the various serious problems, and see how we can resolve them by introducing this new operator.
I won't venture immediately into the relational algebra to determine whether the good Doctor Codd missed something in his task. So I reserve to the theorists a nice prospective battle, to determine what this operation offers, and the ways of representing it.

1 - Enumerate rows and all the queries that flow from them

Suppose the table T_CLIENT_CLI (CLI_ID, CLI_NOM) is as follows :

CLI_ID CLI_NOM
------- ------------
17      DURAND
192     DUPONT
44      DUVAL
11      DUMOULIN
741     DULIN
82      DUPOND
177     DURAND

To create the test set:

CREATE TABLE T_CLIENT_CLI
(CLI_ID INTEGER,
CLI_NOM VARCHAR(10))
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (17, 'DURAND')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (192, 'DUPONT')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (44, 'DUVAL')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (11, 'DUMOULIN')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (741, 'DULIN')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (82, 'DUPOND')
INSERT INTO T_CLIENT_CLI (CLI_ID, CLI_NOM)
VALUES (177, 'DURAND')

The question is: how does one get, in reply, the names of our clients, in alphabetic order, with their rank (from 1 to n) ?

1.1 - RESULT 1

If one poses this question strictly, then the result is:

CLI_NOM       RANG
------------- -----
DULIN         1
DUMOULIN      2
DUPOND        3
DUPONT        4
DURAND        5
DURAND        5
DUVAL         7

Thus, the two DURANDs (which happen to be equal) are in the fifth rank, while DUVAL is in not the sixth but the seventh rank !

1.2 - RESULT 2

Another possible solution is :

CLI_NOM       RANG
------------- -----
DULIN         1
DUMOULIN      2
DUPOND        3
DUPONT        4
DURAND        5
DURAND        6
DUVAL         7

That is : a straightforward and direct numeration without taking into account the duplications and ambiguities of the selected information. It's a bit like what auto-incrementing columns do in certain RDBMSs.

1.3 - RESULT 3

Finally, one can refine the last solution by introducing a counter to make the duplicates disappear :

CLI_NOM       RANG NOMBRE
------------- ----- ------
DULIN         1     1
DUMOULIN      2     1
DUPOND        3     1
DUPONT        4     1
DURAND        5     2
DUVAL         7     1

... which has the advantage of being cleaner -- however it doesn't correspond to the initial question!

1.4 - RESULT 4

Pushing things to the extreme, one can demand that the numeration by rank should be strict and without any "holes", like this :

CLI_NOM       RANG NOMBRE
------------- ----- ------
DULIN         1     1
DUMOULIN      2     1
DUPOND        3     1
DUPONT        4     1
DURAND        5     2
DUVAL         6     1

But what queries are necessary to arrive at the various proposed solutions? ?

1.4 - The Queries for the above results

In principle, the queries for answering this sort of question require an automatic "non-equi" join, in order to make numbers for tuples whose values precede the current tuple.

Query for Result 1 :

SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG
FROM   T_CLIENT_CLI T1
       LEFT OUTER JOIN T_CLIENT_CLI T2
            ON T1.CLI_NOM > T2.CLI_NOM
GROUP BY T1.CLI_ID, T1.CLI_NOM
ORDER BY RANG

Query for Result 2 :

SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG
FROM   T_CLIENT_CLI T1
       LEFT OUTER JOIN T_CLIENT_CLI T2
            ON T1.CLI_NOM || CAST(T1.CLI_ID AS CHAR(16))
             > T2.CLI_NOM || CAST(T2.CLI_ID AS CHAR(16))
GROUP BY T1.CLI_ID, T1.CLI_NOM
ORDER BY RANG

Query for Result 3 :

SELECT T1.CLI_NOM, RANG, NOMBRE
FROM (SELECT DISTINCT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG
       FROM   T_CLIENT_CLI T1
              LEFT OUTER JOIN T_CLIENT_CLI T2
                   ON T1.CLI_NOM > T2.CLI_NOM
       GROUP BY T1.CLI_ID, T1.CLI_NOM)
       T1
           INNER JOIN (SELECT CLI_NOM, COUNT(CLI_ID) AS NOMBRE
                       FROM   T_CLIENT_CLI T1
                       GROUP BY CLI_NOM) T2
                 ON T1.CLI_NOM = T2.CLI_NOM
ORDER BY RANG

Query for Result 4 :

SELECT T2.CLI_NOM, COUNT(T1.CLI_NOM) AS RANG, NOMBRE
FROM   (SELECT DISTINCT CLI_NOM
        FROM   T_CLIENT_CLI) T1
               LEFT OUTER JOIN (SELECT DISTINCT CLI_NOM
                                FROM   T_CLIENT_CLI) T2
                    ON T1.CLI_NOM <= T2.CLI_NOM
               INNER JOIN (SELECT CLI_NOM, COUNT(CLI_ID) AS NOMBRE
                           FROM   T_CLIENT_CLI
                           GROUP BY CLI_NOM) T3
                    ON T2.CLI_NOM = T3.CLI_NOM
GROUP BY T2.CLI_NOM, T3.NOMBRE
ORDER BY RANG

The least that one can say is that this type of request doesn't raise a beginner's spirits. What then could we say about the developer, faced with the problem in a context where tables are much more densely populated ? It's a good bet that he/she will simply throw up his/her hands and resort to a procedure : at best a stored procedure, at worst a procedure on the client machine.

2 - Assigning rows to places

The second family of problems which warrants our attention in this context is the problems of assignment, problems which are dear to the hearts of every pupil at the start of a school year (for example). The question is: starting with a table which constitutes a population, and another table which constitutes a place (with each one place being destined to receive one pupil, one spectator, or the like) how does one assign one place for each element of the population?

Let's take our CLIENTS table again, and add a table of SEATS, modelling the places in a theatre T_PLACE_PLC (PLC_REF). As you know, places in a theatre are designated by letters (the row) and by digits (the seat within the row). For our demonstration we will limit ourselves to three rows, and 5 seats per row. In other words: a pocket theatre!

PLC_REF
-------
A01
A02
A03
A04
A05
B01
B02
B03
B04
B05
C01
C02
C03
C04
C05

Creation of the test sample :

CREATE TABLE T_PLACE_PLC
(PLC_REF CHAR(3))
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A01')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A02')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A03')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A04')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('A05')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('B01')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('B02')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('B03')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('B04')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('B05')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('C01')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('C02')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('C03')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('C04')
INSERT INTO T_PLACE_PLC (PLC_REF) VALUES ('C05')

The problem is quite simple: how to assign the clients to the first seats ?

It seems on the face of it that the simplest thing would be to number the client rows and then number the seat rows, and effect a join over these numbers.

Something like :

CLI_ID CLI_NOM      CLI_NUM
------- ------------ -------
17      DURAND       1
192     DUPONT       2
44      DUVAL        3
11      DUMOULIN     4
741     DULIN        5
82      DUPOND       6
177     DURAND       7 PLC_REF PLC_NUM
-------- -------
A01      1
A02      2
A03      3
A04      4
A05      5
B01      6
B02      7
B03      8
B04      9
B05      10
C01      11
C02      12
C03      13
C04      14
C05      15

Given the above, the solution is easy to see :

SELECT CLI_NOM, PLC_REF
FROM   T_CLIENT_CLI
       JOIN T_PLACE_PLC
            ON CLI_NUM = PLC_NUM

Which gives :

CLI_NOM      PLC_REF
------------ -------
DURAND       A01
DUPONT       A02
DUVAL        A03
DUMOULIN     A04
DULIN        A05
DUPOND       B01
DURAND       B02

However we don't actually have such columns at our disposal. How would we make them ?
All that's necessary is to apply what we just saw in the previous example, for the clients, but also for the seats, and to join the whole thing over the sequencing columns that were thus generated.

I just give you the result in a rough form, its maintenance is cheery enough as it is !!!

SELECT CLI_NOM, PLC_REF
FROM   (SELECT T1.CLI_NOM, COUNT(T2.CLI_ID) + 1 AS RANG
        FROM   T_CLIENT_CLI T1
               LEFT OUTER JOIN T_CLIENT_CLI T2
                    ON T1.CLI_NOM || CAST(T1.CLI_ID AS CHAR(16))
                     > T2.CLI_NOM || CAST(T2.CLI_ID AS CHAR(16))
        GROUP BY T1.CLI_ID, T1.CLI_NOM) C
               INNER JOIN (SELECT T3.PLC_REF, COUNT(T4.PLC_REF) + 1 AS RANG
                           FROM   T_PLACE_PLC T3
                                  LEFT OUTER JOIN T_PLACE_PLC T4
                                       ON T3.PLC_REF > T4.PLC_REF
                           GROUP BY T3.PLC_REF) P
                     ON C.RANG = P.RANG

And once again we've realized that the PLC_REF column is a candidate key for the T_PLACE_PLC table ...

3 - The solution : linear join (or ROW JOIN) !

The original task is to handle a very simple table consisting of a single column and filled with the set of integers: T_I_ENT (ENT_I). Of course there will be some limit, for example a sequence going from 0 to 1000 or more if necessary :

ENT_I
-------
0
1
2
3
4
5
6
7
8
9
10
...

CREATE TABLE T_I_ENT
(ENT_I INTEGER)
INSERT INTO T_I_ENT (ENT_I) VALUES (0)
INSERT INTO T_I_ENT (ENT_I) VALUES (1)
INSERT INTO T_I_ENT (ENT_I) VALUES (2)
INSERT INTO T_I_ENT (ENT_I) VALUES (3)
INSERT INTO T_I_ENT (ENT_I) VALUES (4)
INSERT INTO T_I_ENT (ENT_I) VALUES (5)
INSERT INTO T_I_ENT (ENT_I) VALUES (6)
INSERT INTO T_I_ENT (ENT_I) VALUES (7)
INSERT INTO T_I_ENT (ENT_I) VALUES (8)
INSERT INTO T_I_ENT (ENT_I) VALUES (9)
INSERT INTO T_I_ENT (ENT_I) VALUES (10)
...
INSERT INTO T_I_ENT (ENT_I) VALUES (1000)

Let's note in passing that it's useless to grab all the numbers from 1 to 100, the first ten numbers are enough, and a simple INSERT statement will play the role of complementary insertion :

INSERT INTO T_I_ENT (ENT_I)
SELECT DISTINCT I1.ENT_I + (I2.ENT_I * 10) + (I3.ENT_I * 100)
FROM   T_I_ENT I1
       CROSS JOIN T_I_ENT I2
       CROSS JOIN T_I_ENT I3
       CROSS JOIN T_I_ENT I4
WHERE I1.ENT_I + (I2.ENT_I * 10) + (I3.ENT_I * 100) BETWEEN 0 AND 1000

From there, the juxtaposition of the projection of the name column (from the client table ordered by client) with the projection of the table of integers is a response to our attempt :

CLI_NOM               T_I_ENT
----------            ------------
DULIN                 1
DUMOULIN              2
DUPOND                3
DUPONT                4
DURAND                5
DURAND                6
DUVAL                 7

And that is why I propose a new operator for linear joins : LINEAR JOIN (or ROW JOIN), which makes it possible to relate the row rank number one ("ordinal one") of the table on the left of the operator, with the row rank number one ("ordinal one") of the table on the right, and so on.

4 - Syntax and Rules of linear joining

Linear joining is done with the following syntax :

SELECT <select list>
FROM <left table>
[LEFT | RIGHT] LINEAR JOIN <right table> [OFFSET <offset value>]

For our preceding example, then, it's enough to say:

SELECT CLI_NOM, T_I_ENT
FROM T_CLIENT_CLI
LINEAR JOIN T_I_ENT OFFSET 1 /* elimination of the first line, a zero */
ORDER BY CLI_NOM, T_I_ENT

Some explanations:

the key word OFFSET permits indication of wich line to start at : to take into account the first line from the right table, associated with the first line from the left table.
The ORDER BY clause operates separately, before the linear joining, for the joined tables
the joining can be outer left (LEFT LINEAR JOIN) or outer right (RIGHT LINEAR JOIN), but not both. By default it is inner.

Example of a right outer linear join:

SELECT CLI_NOM, T_I_ENT
FROM T_CLIENT_CLI
&nbs p; RIGHT LINEAR JOIN T_I_ENT
ORDER BY CLI_NOM, T_I_ENT

which gives :

CLI_NOM     T_I_ENT
---------- ------------
DULIN       0
DUMOULIN    1
DUPOND      2
DUPONT      3
DURAND      4
DURAND      5
DUVAL       6
NULL        7
NULL        8
...
NULL        1000

To solve our problem of assignment of places in a theatre, all we have to do is :

SELECT CLI_NOM, PLC_REF
FROM   T_CLIENT_CLI
            LINEAR JOIN T_I_ENT
            LINEAR JOIN T_PLACE_PLC
ORDER BY CLI_NOM, T_I_ENT, PLC_REF

I don't know what you think of it, but I find this script to be simple and easy to understand!

CONCLUSION

These queries are related to the T-JOINs (theta joins) of Doctor Codd, intended to obtain an optimal correspondence of unequal values (typically the problem of assignment of pupils to rooms of a given capacity).
I leave it to you to figure out how such a join would be represented in relational algebra !