Semi-Joins and Anti-Joins
Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN
by Roger Schrag
Database Specialists, Inc.
About Database Specialists, Inc.
Database Specialists, Inc. provides remote DBA services and onsite database support for your mission critical Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX, and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA, database tuning, and consulting services. Or, call us at 415-344-0500 or 888-648-0500.
Introduction
Optimizing SQL usually gives the most significant results when DBAs are called upon to “make the system run faster.” Using tools like Statspack or Enterprise Manager, it is often easy to find the slow SQL. But how do you make the queries run faster? That is the challenge! In this paper we will discuss the semi-join and the anti-join, two powerful SQL constructs Oracle offers for use in your quest for faster queries. In particular, we will define these two terms, discuss when and why you might want to use the [NOT] EXISTS or [NOT] IN constructs, and demonstrate how you can use optimizer hints and make minor query changes in order to enable Oracle to use some very powerful and efficient access paths.
For certain classes of queries, these features can dramatically reduce logical reads, physical reads, CPU time, and elapsed time. But beware! There are some obscure (and not always easy to find) requirements that must be met in order for Oracle to deploy the semi- and anti-join access paths. If you fail to dot an “I” or cross a “T,” you could be banging your head against the wall for hours. In this paper we will look at SQL from a real application and demonstrate the “before” and “after” versions of queries that ran orders of magnitude faster once semi-joins and anti-joins were implemented correctly.
Before we get started, I want to point out that Oracle Corporation has improved its optimization techniques significantly over the years, specifically in the area of semi-joins and anti-joins. This paper was written using Oracle 9i release 9.2.0.4, and all examples were validated in such an environment. If you are using a newer or older release of Oracle, it is quite possible that you will see different results. Please do not take anything you read in this paper—or in any paper for that matter—for granted. Validate every hypothesis in your environment before relying upon it.
Semi-Join Defined
A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned. Semi-joins are written using the EXISTS or IN constructs.
Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least one employee. You could write the query with a conventional join:
SELECT D.deptno, D.dname FROM dept D, emp E WHERE E.deptno = D.deptno ORDER BY D.deptno;
Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a conventional join:
SELECT D.deptno, D.dname FROM dept D WHERE EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;
The above query will list the departments that have at least one employee. Whether a department has one employee or 100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department as soon as it finds the first employee in a department, instead of finding all of the employees in each department.
Anti-Join Defined
An “anti-join” between two tables returns rows from the first table where no matches are found in the second table. An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they handle nulls—a subtle but very important distinction which we will discuss later.
Suppose you want a list of empty departments—departments with no employees. You could write a query that finds all departments and subtracts off the department of each employee:
SELECT D1.deptno, D1.dname FROM dept D1 MINUS SELECT D2.deptno, D2.dname FROM dept D2, emp E2 WHERE D2.deptno = E2.deptno ORDER BY 1;
The above query will give the desired results, but it might be clearer to write the query using an anti-join:
SELECT D.deptno, D.dname FROM dept D WHERE NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;
The above query is more efficient because Oracle can employ an anti-join access path. The difference in efficiency here is akin to the difference between a nested loops join and a hash join when you are joining every row in one table to another.
The EXISTS and IN Constructs
In this section we will discuss how Oracle evaluates EXISTS and IN clauses, prerequisites for using Oracle’s semi-join access paths, and hints that influence semi-join query optimization. Then we’ll look at a few sample queries where semi-join access paths can be used to make the query more efficient.
How Oracle Evaluates EXISTS and IN Clauses
The SQL language as implemented in Oracle does not include a syntax for explicitly performing semi-joins, so you use an EXISTS or IN clause with a subquery to indicate to Oracle that a semi-join should be performed instead of a conventional join. Oracle will typically transform. the subquery into a join wherever possible.
An interesting characteristic of Oracle’s query optimizer is that (according to Metalink document 144967.1), in Oracle 8i and 9i the decision of whether or not to transform. a query by merging its subqueries into the main body is a heuristic one. Basically, Oracle will merge subqueries if it can, regardless of impact on cost. The assumption is that merging is always a good thing. In fact, Oracle performs query transformations (merging subqueries into the main body) before making a list of possible execution plans and evaluating their cost. This bit of trivia will come into play when we discuss the second example in this section.
The Oracle Performance Tuning Guide in the Oracle 8i documentation set explains that EXISTS and IN are processed differently by the query optimizer. The manual explains a rule of thumb for when to use each, and provides an example to demonstrate. The Oracle 9i Performance Tuning Guide provides the same rule of thumb with a slightly more polished example. The Oracle 10g Performance Tuning Guide contains substantially the same text as the Oracle 9i manual on this subject.
The rule of thumb goes like this: If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.
The example in the Oracle 8i Performance Tuning Guide shows a query where the main body is not selective, while the target of the semi-join is highly selective. The example demonstrates that Oracle chooses a poor execution plan when an EXISTS clause is used, and a much more efficient execution plan when an IN clause is used instead. I was easily able to reproduce this example in an Oracle 8i environment with the same results as shown in the Performance Tuning Guide.
However, when I reproduced this same example in an Oracle 9i environment, Oracle chose an identical (and efficient) execution plan for both versions of the query. Furthermore, the example in the Oracle 9i Performance Tuning Guide gave an identical (and efficient) execution plan for both versions of the query in my Oracle 9i environment. These tests echo my previous experience—that Oracle 9i is smarter than Oracle 8i about EXISTS and IN clauses, and you don’t typically need to worry about when to use EXISTS versus IN.
The upshot is that if you are writing queries for an Oracle 8i (or earlier) system, then you should think about the selectivity of the predicates in the main query versus the subquery and choose whether to use an EXISTS or an IN clause wisely. If you are using Oracle 9i or later, it doesn’t hurt to follow this rule of thumb, but you may not need to worry as much.
Prerequisites and Hints that Impact Semi-Joins
There are a few situations in which Oracle cannot use a semi-join access path. If the query contains a DISTINCT operation (explicitly or implicitly through a set operator such as UNION) then Oracle cannot transform. EXISTS or IN clauses into something that could use a semi-join access path. The same is true if the EXISTS or IN clause is part of an OR branch. (Semi-join access paths can be used in queries that contain ORs in the WHERE clause, just as long as the EXISTS or IN is not part of the OR.)
Assuming that a semi-join access path is possible, Oracle chooses the semi-join algorithm to use while evaluating execution plans. In Oracle 9i a semi-join can be performed using the nested loops, hash join, or merge join algorithms. As with a conventional join, Oracle 9i will choose the join algorithm with the lowest cost. In Oracle 8i a semi-join can be performed using the hash join or merge join algorithms, but Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_semi_join instance parameter dictates whether or not an Oracle 8i database should perform. semi-joins and, if so, which algorithm to use. The always_semi_join instance parameter is obsolete in Oracle 9i.
If a semi-join access path is not possible (because of a DISTINCT operation or an OR predicate), or if instance parameters specify a semi-join should not be used, then Oracle will choose an alternate data access path. This usually amounts to performing a conventional join and sorting out the duplicates, or scanning the first table and using the second table as a filter.
Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself. In my experience Oracle does not need you to provide a hint in order to consider the efficient semi-join access paths. However, in Oracle 9i or 10g, where you have a wider choice of semi-join algorithms, you might want to use hints if Oracle chooses a hash semi-join when a nested loops semi-join would in fact be better, or vice versa.
Semi-Join Example #1
Suppose you want a list of your gold-status customers who have placed an order within the last three days. You might start with a query that looks like:
SELECT DISTINCT C.short_name, C.customer_id FROM customers C, orders O WHERE C.customer_type = 'Gold' AND O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ORDER BY C.short_name;
The execution plan (from a TKPROF report) for this query is:
Rows Row Source Operation ------- --------------------------------------------------- 2 SORT UNIQUE (cr=33608 r=30076 w=0 time=6704029 us) 20 HASH JOIN (cr=33608 r=30076 w=0 time=6703101 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=31718 us) 2990 TABLE ACCESS FULL ORDERS (cr=33570 r=30040 w=0 time=6646420 us)
You can see that Oracle did a (conventional) hash join between the CUSTOMERS and ORDERS tables, and then performed a sort for uniqueness to filter out the duplicate records. This sort was necessary because the query is supposed to show a customer only once, even if they have more than one order placed within the last three days. (A sort to satisfy the ORDER BY clause is also necessary, but Oracle tackles this at the same time as the sort for uniqueness and thus it does not appear as a separate step in the execution plan.)
Using what we have discussed so far, you might want to rewrite the query using a semi-join:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_type = 'Gold' AND EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name;
The execution plan for this query becomes:
Rows Row Source Operation ------- --------------------------------------------------- 2 SORT ORDER BY (cr=33608 r=29921 w=0 time=6422770 us) 2 HASH JOIN SEMI (cr=33608 r=29921 w=0 time=6422538 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=61290 us) 2990 TABLE ACCESS FULL ORDERS (cr=33570 r=29921 w=0 time=6345754 us)
The “SEMI” that you see on the second line indicates that a semi-join access path is being used instead of a conventional join. A semi-join is more efficient here for two reasons: (1) The search through the ORDERS table for a given customer can stop as soon as the first qualifying order is found, and (2) There is no need to sort the results to remove duplicate customer records in the case where a customer had multiple qualifying orders. (You can see that the sort for uniqueness, which took 928 microseconds, has been replaced by a sort to satisfy the ORDER BY clause, which took only 232 microseconds.)
Rewriting the query to use a semi-join access path has perhaps been a novel experience, but, so far, it hasn’t done us a substantial amount of good. We brought the elapsed time down about 4%, but that doesn’t seem too significant, and it could just be the result of chance variation in our test environment. But suppose you are aware of the fact that very few of your customers are gold-status. Perhaps a nested loops semi-join would be more efficient. You can add an NL_SJ hint and see the results for yourself:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_type = 'Gold' AND EXISTS ( SELECT /*+ NL_SJ */ 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name; Rows Row Source Operation ------- --------------------------------------------------- 2 SORT ORDER BY (cr=833 r=725 w=0 time=358431 us) 2 NESTED LOOPS SEMI (cr=833 r=725 w=0 time=358232 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=2210 us) 2 TABLE ACCESS BY INDEX ROWID ORDERS (cr=795 r=725 w=0 time=355822 us) 780 INDEX RANGE SCAN ORDERS_N1 (cr=15 r=13 w=0 time=5601 us)(object id 34176)
Indeed, the nested loops semi-join brought the consistent reads down to 833 from 33608 and reduced the execution time by over 90%. So it appears that Oracle chose a hash semi-join when a nested loops semi-join was more efficient. Once we get Oracle to see the light with the NL_SJ hint, we end up with a much more efficient query than we started with. We achieved this performance gain by leveraging Oracle’s nested loops semi-join access path.
Semi-Join Example #2
I recently worked on a project where applying the concepts presented in this section brought the execution time for a key query in my client’s application down from 46 minutes to 1.1 seconds. A screen in the application allowed a user to enter a project owner and up to five participants, and a list of assignments involving all of the specified people would be displayed. The users found that the response time was tolerable if they only selected one participant, but it went downhill from there as they added more participants. The original version of the query looked like this:
SELECT DISTINCT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 FROM relationships R, assignments A, format_items FI, relationships R1, relationships R2, relationships R3, relationships R4, relationships R5 WHERE R.user_id = 134546 AND R.account_id = 134545 AND R.type_code = 0 AND A.item_id = R.item_id AND FI.item_id = A.item_id AND R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279 AND R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555 AND R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546 AND R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355 AND R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556 ORDER BY A.name ASC;
(I have changed the names of tables and columns, stripped away extraneous columns and predicates, and eliminated bind variables in order to both protect my client’s confidentiality and make the example easier to understand.) The execution plan looked like this:
Rows Row Source Operation ------- --------------------------------------------------- 642 SORT UNIQUE (cr=23520269 r=34 w=0 time=2759937104 us) 64339104 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=23520269 r=34 w=0 time=1838051782 us) 95184881 NESTED LOOPS (cr=7842642 r=23 w=0 time=907238095 us) 2710288 NESTED LOOPS (cr=2266544 r=23 w=0 time=103840003 us) 317688 NESTED LOOPS (cr=484734 r=11 w=0 time=23494451 us) 50952 NESTED LOOPS (cr=43280 r=10 w=0 time=2688237 us) 4146 NESTED LOOPS (cr=19016 r=3 w=0 time=988374 us) 1831 NESTED LOOPS (cr=13353 r=0 w=0 time=608296 us) 4121 HASH JOIN (cr=7395 r=0 w=0 time=399488 us) 2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=299181 us) 17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=81158 us)(object id 34528) 3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=0 w=0 time=25536 us) 1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=163252 us) 1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=115113 us)(object id 34316) 4146 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5663 r=3 w=0 time=349554 us) 4264 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3678 r=0 w=0 time=224390 us)(object id 34345) 50952 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=24264 r=7 w=0 time=1538051 us) 70976 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=8428 r=0 w=0 time=630831 us)(object id 34345) 317688 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=441454 r=1 w=0 time=19584539 us) 1631032 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=108302 r=0 w=0 time=7213989 us)(object id 34345) 2710288 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=1781810 r=12 w=0 time=72015925 us) 4237104 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=643298 r=1 w=0 time=29018764 us)(object id 34345) 92474592 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=5576098 r=0 w=0 time=503165776 us)(object id 34345)
It is no wonder that this query took so long when five participants were selected. With each join to the RELATIONSHIPS table, the Cartesian product gets more and more out of hand. By the final sort to eliminate the duplicates, we have over 64 million rows to reckon with.
This query struck me right away as a good candidate for a nested loops semi-join. We want a list of assignments that meet certain criteria. Whether a specific user has one qualifying relationship record or 10, it makes no difference. We are only concerned with the existence of at least one qualifying relationship record for each of the specified users.
Rewriting the query with EXISTS clauses to semi-join to the RELATIONSHIPS table five times is pretty easy. The challenge here is that the query does have a DISTINCT operator, and this defeats Oracle’s semi-join access paths. Not to worry—an inline view with a NO_MERGE hint saved the day. The revised query looked like:
SELECT /*+ NO_MERGE (M) */ DISTINCT M.name, M.code, M.description, M.item_id, M.assignment_id, M.string0, M.string1 FROM ( SELECT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 FROM relationships R, assignments A, format_items FI WHERE R.user_id = 134546 AND R.account_id = 134545 AND R.type_code = 0 AND A.item_id = R.item_id AND FI.item_id = A.item_id AND EXISTS (SELECT 1 FROM relationships R1 WHERE R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279) AND EXISTS (SELECT 1 FROM relationships R2 WHERE R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555) AND EXISTS (SELECT 1 FROM relationships R3 WHERE R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546) AND EXISTS (SELECT 1 FROM relationships R4 WHERE R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355) AND EXISTS (SELECT 1 FROM relationships R5 WHERE R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556) ) M ORDER BY M.name ASC;
The execution plan was as follows:
Rows Row Source Operation ------- --------------------------------------------------- 642 SORT UNIQUE (cr=36315 r=89 w=0 time=1107054 us) 1300 VIEW (cr=36315 r=89 w=0 time=1085116 us) 1300 NESTED LOOPS SEMI (cr=36315 r=89 w=0 time=1082232 us) 1314 NESTED LOOPS SEMI (cr=32385 r=89 w=0 time=1002330 us) 1314 NESTED LOOPS SEMI (cr=28261 r=89 w=0 time=904654 us) 1314 NESTED LOOPS SEMI (cr=22822 r=89 w=0 time=737705 us) 1322 NESTED LOOPS SEMI (cr=18730 r=89 w=0 time=651196 us) 1831 NESTED LOOPS (cr=13353 r=89 w=0 time=530670 us) 4121 HASH JOIN (cr=7395 r=89 w=0 time=347584 us) 2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=186820 us) 17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=43770 us)(object id 34528) 3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=89 w=0 time=91899 us) 1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=141416 us) 1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=100207 us)(object id 34316) 1322 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5377 r=0 w=0 time=92046 us) 2472 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3664 r=0 w=0 time=61077 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4092 r=0 w=0 time=63478 us) 1582 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2647 r=0 w=0 time=40433 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5439 r=0 w=0 time=133620 us) 11011 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2639 r=0 w=0 time=65312 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4124 r=0 w=0 time=75657 us) 2234 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=49373 us)(object id 34345) 1300 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=3930 r=0 w=0 time=58651 us) 1300 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=38832 us)(object id 34345)
The execution plan starts out the same as before, with a hash join between RELATIONSHIPS and ASSIGNMENTS followed by a nested loops join to FORMAT_ITEMS. At this point there are 1831 rows in the candidate result set. Whereas the original query does five conventional joins to the RELATIONSHIPS table and the 1831 rows balloon to over 64 million, the revised query does semi-joins instead and whittles the candidate rows down to 1300. Joining 1300 rows is a lot faster than joining 64 million rows, so you can see how the semi-join access path really speeds up this query.
The NO_MERGE hint and the inline view (aliased as “M”) causes Oracle to separate the DISTINCT operation from the bulk of the query. This is how we get around the fact that a DISTINCT operation defeats semi-join access paths. In the second version of the query, Oracle executes the query as if the DISTINCT keyword were not there. Then, as a final step, it applies the DISTINCT operation to the interim result set that it has stored in memory (the output of the VIEW operator).
If you are wondering why the NO_MERGE hint is necessary, recall what we discussed earlier about the heuristic Oracle uses when deciding whether or not to merge a subquery into the main body of a query. Oracle assumes that merging is a good thing and will do it whenever possible. Thus without the NO_MERGE hint Oracle would merge the inline view into the DISTINCT operation (eliminating the inline view), and the semi-join access path would be defeated by the presence of the DISTINCT operation.
This example demonstrates the value of the semi-join access paths. Your gains may not be as spectacular as we have seen here, but hopefully you can see from this discussion how proper use of semi-joins can make a certain class of queries much more efficient. This example also underscores the importance of knowing how to properly use semi-joins. If you did not remember that semi-join access paths are defeated by the DISTINCT operation, you might have changed the example query to use EXISTS clauses and wondered why the query still ran slowly.
The NOT EXISTS and NOT IN Constructs
In this section we will discuss how Oracle evaluates NOT EXISTS and NOT IN clauses, prerequisites for using Oracle’s anti-join access paths, and hints that influence anti-join query optimization. Then we’ll look at a few examples of how anti-join access paths can be used to make some queries more efficient.
How Oracle Evaluates NOT EXISTS and NOT IN Clauses
NOT EXISTS and NOT IN clauses are used to retrieve rows from a table for which no match is found in another table. In the early days, Oracle would use the subquery of a NOT EXISTS or NOT IN clause as a filter. For each candidate row retrieved by the main body of the query, Oracle would execute the subquery. If the subquery retrieved one or more rows, then the candidate row would be discarded. Conceptually this approach seems similar to a nested loops anti-join.
Somewhere around Oracle 7.3, the merge anti-join and hash anti-join access paths were added. These enabled Oracle to leverage the merge and hash join algorithms when performing anti-joins. However, Oracle would only use these algorithms if the always_anti_join instance parameter was set appropriately or if a hint was added to the query. In Oracle 8i this still appears to be true. Oracle 9i, however, does away with the always_anti_join instance parameter and instead the optimizer evaluates the costs of the different algorithms and chooses the one with the lowest cost.
In an earlier section of this paper we discussed when the EXISTS clause should be used versus when the IN clause should be used. The two are functionally equivalent, and actively choosing one over the other is usually done in the hopes that this will help the optimizer choose a better execution plan. When talking about NOT EXISTS and NOT IN, however, it is important to note that these two constructs are not functionally equivalent—there is a significant difference between the two in how null values are handled.
If the subquery of a NOT IN clause returns at least one null value, then the NOT IN predicate evaluates to false. The NOT EXISTS construct, meanwhile, concerns itself only with whether a row is returned or not, and thus does not do anything differently if a null value is returned. This may seem a bit abstract, so let’s clarify the point with an example.
Suppose you want a list of empty departments—departments that have no employees. You could write this query with a NOT EXISTS clause or a NOT IN clause. We saw the NOT EXISTS case earlier:
SELECT D.deptno, D.dname FROM dept D WHERE NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;
Written with a NOT IN clause, the query becomes:
SELECT D.deptno, D.dname FROM dept D WHERE D.deptno NOT IN ( SELECT E.deptno FROM emp E ) ORDER BY D.deptno;
The queries look like they would be functionally equivalent, but if you are using the EMP and DEPT tables from the SCOTT schema of an Oracle 9i database, then the two queries are in fact different. Insert a row into the EMP table with a null DEPTNO, run the two versions of the query, and you will see the difference. The NOT EXISTS version will show you that department 40 has no employees. However, the NOT IN version of the query will say there are no empty departments. Why? Because the subquery of the NOT IN clause retrieves a row with a null value, causing the NOT IN clause to always evaluate to false. As strange as this NOT IN nuance may seem, it is fully documented in Metalink document 28934.1.
The difference between NOT EXISTS and NOT IN with respect to handling of nulls is an important one—not just from the standpoint of differing functionality, but also from a performance perspective as well. To evaluate a NOT IN subquery that is capable of returning nulls, Oracle cannot scan an index because an implicit NVL() function must be applied and this would defeat the index. (Null values aren’t stored in single-column non-unique indexes, anyway.) We’ll see the impact of this implicit NVL() function in the examples later in this section.
If the subquery of a NOT IN clause is physically incapable of returning a null value (because the columns involved have NOT NULL constraints or because predicates have been added to the WHERE clause requiring non-null values), then the query with the NOT IN clause will return the same results as a query with an equivalent NOT EXISTS clause—and indexes won’t be defeated in the process.
If we wanted to write the query that shows empty departments with a NOT IN clause that gave the same results as the NOT EXISTS version, we would need to add a predicate to the subquery that rules out null values. The resulting query is:
SELECT D.deptno, D.dname FROM dept D WHERE D.deptno NOT IN ( SELECT E.deptno FROM emp E WHERE E.deptno IS NOT NULL ) ORDER BY D.deptno;
To choose whether to use NOT EXISTS or NOT IN in a query, therefore, you must first consider how you want null values to be handled. If you need the presence of a null value in the subquery to cause the expression to always evaluate to false, then you need to use the NOT IN construct. Otherwise you have a choice between NOT EXISTS and NOT IN (with the extra WHERE predicate if the subquery would otherwise be capable of returning a null value).
If you don’t need the special null value semantics of NOT IN, then your choice of whether to use NOT EXISTS or NOT IN hinges on what sort of access path you think will be best for the query. In a conventional join, the nested loops algorithm is desirable when the predicates on the first table are very selective and the join columns in the second table are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective or the join columns in the second table are not selectively indexed. A similar rule of thumb applies to anti-joins. That is to say, merge and hash anti-joins are most valuable in situations where the first table has a lot of candidate rows or the join columns are not selectively indexed.
If you think your query is going to be better off with a merge or hash anti-join than a nested loops anti-join or a filter operation, then you might want to code your query with a NOT IN clause instead of a NOT EXISTS clause. Whenever the Oracle documentation discusses anti-join access paths, it always seems to do so in the context of the NOT IN construct with no mention of NOT EXISTS. While sometimes Oracle 9i will choose an anti-join access path for a query with a NOT EXISTS clause, it does seem that Oracle 9i might be more likely to do so if NOT IN is used instead of NOT EXISTS. In Oracle 8i, I believe the query optimizer will never choose an anti-join access path for a query with a NOT EXISTS clause (even if always_anti_join is set and a hint appears in the subquery).
Meanwhile, if you don’t think your query will benefit from a merge or hash anti-join, then you might be best off with the good old filter technique instead of the nested loops anti-join access path. It appears that, at least in Oracle 9i, the nested loops anti-join might not be as efficient as it could be. In the example of the query for empty departments that we have been discussing, the filter technique will rule out a department as soon as the first employee in that department is located. The nested loops anti-join, meanwhile, appears to locate all employees in the department before ruling it out. This is unnecessary work. Whether a department has one employee or 100, it does not qualify as an empty department. So Oracle should cut to the chase and move on to the next department as soon as the first employee is found. The filter technique does just that, but the nested loops anti-join (at least in the cases I have seen) does not. The second example in this section will demonstrate this.
The upshot is that you must first consider how null values should be handled when deciding whether to use NOT EXISTS or NOT IN. If you need the special semantics provided by NOT IN, then your decision has been made. Otherwise, you should next consider whether or not the query might benefit from a merge or hash anti-join. If so, then you probably ought to choose NOT IN. If you decide to go with NOT IN but do not want the expression to evaluate to false if a null value is found, then make sure the subquery cannot return a null value. If there is no chance that the query will benefit from a merge or hash anti-join and the special semantics of NOT IN are not desired, then you might want to select the NOT EXISTS construct so that there is a better chance Oracle will perform. an efficient filter instead of an inefficient nested loops anti-join.
Prerequisites and Hints that Impact Anti-Joins
The key prerequisite for the anti-join access paths is that the subquery of a NOT IN clause cannot be capable of returning a null value. As mentioned before, this means that either the columns being selected must have NOT NULL constraints or there must be predicates in the WHERE clause of the subquery to ensure there are no nulls. Even if every row in the table has a non-null value in a nullable column, you must still add the extra predicate to the WHERE clause or else Oracle will refuse to use the merge and hash anti-join access paths.
Assuming that an anti-join access path is possible, Oracle chooses the anti-join algorithm to use while evaluating execution plans. In Oracle 9i an anti-join can be performed using the nested loops, merge, or hash join algorithms. As with a conventional join, Oracle will choose the join algorithm with the lowest cost, or it may choose to use the filter technique instead of an anti-join altogether. In Oracle 8i an anti-join can be performed using the merge or hash join algorithms, but as with semi-joins, Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_anti_join instance parameter dictates which approach Oracle 8i should use. (Oracle 8i does not implement the nested loops anti-join access path, so if the always_anti_join parameter is set to nested_loops, the filter technique is used instead.) The always_anti_join instance parameter is obsolete in Oracle 9i.
Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if you need to. As with the semi-join hints, the hint is applied to the subquery and not the main body of the query itself.
Anti-Join Example #1
Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a query that looks like:
SELECT C.short_name, C.customer_id FROM customers C WHERE NOT EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 10 ) ORDER BY C.short_name;
The execution plan looks like:
Rows Row Source Operation ------- --------------------------------------------------- 11 SORT ORDER BY (cr=18707 r=301 w=0 time=22491917 us) 11 FILTER (cr=18707 r=301 w=0 time=22491555 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=15277 us) 989 VIEW (cr=18669 r=265 w=0 time=22365647 us) 989 HASH JOIN (cr=18669 r=265 w=0 time=22347234 us) 100000 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2207 r=208 w=0 time=277584 us)(object id 34555) 5338680 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16462 r=57 w=0 time=9036516 us)(object id 34556)
You can see that the query took over 22 seconds to run and performed 18,707 logical reads. Oracle did not use an anti-join access path in this execution plan, instead opting to read every row in the CUSTOMERS table and use the subquery as a filter by executing it once for each customer. Although Oracle found a clever way to hash join two indexes on the ORDERS table and thereby make the subquery very efficient, the query overall is still resource intensive because the subquery must execute 1000 times—once for each customer.
This query seems like a good candidate for an anti-join access path because executing the subquery once and performing a merge or hash anti-join might be more efficient than performing the subquery once for each customer as a filter operation. We could rewrite the query using a NOT IN clause instead of NOT EXISTS to see if Oracle chooses an anti-join access path:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 ) ORDER BY C.short_name;
Unfortunately, the results are quite disappointing—elapsed time shot up to 695 seconds! The execution plan looks like:
Rows Row Source Operation ------- --------------------------------------------------- 11 SORT ORDER BY (cr=5360749 r=4870724 w=0 time=695232973 us) 11 FILTER (cr=5360749 r=4870724 w=0 time=695232475 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=129 w=0 time=61614 us) 989 TABLE ACCESS BY INDEX ROWID ORDERS (cr=5360711 r=4870595 w=0 time=695088325 us) 5359590 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16520 r=0 w=0 time=22299946 us)(object id 34556)
You can see that Oracle did not choose an anti-join access path. Instead, Oracle is still reading every row from the CUSTOMERS table and executing the subquery once for each of the 1000 customers. Moreover, Oracle is no longer hash joining the two ORDERS indexes to make the subquery very efficient. For each customer, Oracle is now grabbing all orders from the last 10 days and sifting through them to find those belonging to the customer currently being processed. It turns out that the CUSTOMER_ID column in the ORDERS table is nullable, and so Oracle conceptually rewrites the query in order to implement the special null value semantics of NOT IN. The rewritten query looks like:
SELECT C.short_name, C.customer_id FROM customers C WHERE NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE – 10 AND NVL (O.customer_id, C.customer_id) = C.customer_id ) ORDER BY C.short_name;
You can see how the implicit NVL() function that Oracle inserts during the transformation would be necessary in order to cause the predicate to evaluate to false if a null value is returned. You can also see how this defeats the index on the CUSTOMER_ID column of the ORDERS table.
You might wonder why the CUSTOMER_ID column in the ORDERS table can be null. Let’s suppose the developers tell you that this handles a special case and you should not worry about it. The developers want the query to ignore nulls. Thus you can add an extra predicate to the WHERE clause of the subquery in order to meet the prerequisite for anti-join access paths. The resulting query is:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name;
The execution plan becomes:
Rows Row Source Operation ------- --------------------------------------------------- 11 SORT ORDER BY (cr=311 r=132 w=98 time=1464035 us) 11 HASH JOIN ANTI (cr=311 r=132 w=98 time=1463770 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=34 w=0 time=37976 us) 20910 VIEW (cr=273 r=98 w=98 time=1318222 us) 20910 HASH JOIN (cr=273 r=98 w=98 time=1172207 us) 20910 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=58 r=0 w=0 time=43870 us)(object id 34556) 100000 INDEX FAST FULL SCAN ORDERS_N1_CUST_ID (cr=215 r=0 w=0 time=196953 us)(object id 34555)
You can see that Oracle now chooses to perform. a hash anti-join. The CUSTOMERS table gets a full table scan as before, but the critical difference is that now the subquery executes just once. The two indexes on the ORDERS table are hash joined once in order to develop a list of order dates for each customer, and this is hash anti-joined to the CUSTOMERS table in order to yield the list of customers with no orders in the last 10 days. The result is a query that completes in 1.46 seconds and performs just 311 logical reads.
In this example we did not need to use hints to get Oracle to find the most efficient execution plan. All we had to do was write the query in a way that made it eligible for the anti-join access paths. (This entailed using NOT IN instead of NOT EXISTS, and adding a predicate to the subquery’s WHERE clause to make it impossible for the subquery to return a null value.)
Anti-Join Example #2
Along the lines of the previous example, suppose you needed to find out if a specific customer has not placed an order within the last 10 days. Your query might look like:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id = 22 AND NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id = C.customer_id ) ORDER BY C.short_name;
The execution plan looks like:
Rows Row Source Operation ------- --------------------------------------------------- 0 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=6 r=0 w=0 time=341 us) 0 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=6 r=0 w=0 time=333 us)(object id 34552) 1 TABLE ACCESS BY INDEX ROWID ORDERS (cr=4 r=0 w=0 time=170 us) 2 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=65 us)(object id 34555)
As you can see, Oracle uses a filter operation instead of an anti-join. This is not surprising, because we used the NOT EXISTS construct instead of NOT IN. The query performed 6 logical reads and completed in 341 microseconds. Customer 22 has 110 orders in the system. By looking at the row counts in the execution plan, you can see that Oracle stopped the filter process as soon as it found the first order placed by the customer within the last 10 days. There is no need to look at all of the orders placed by this customer because we’ve already found one that excludes the customer from the result set.
This is an example of a query with an anti-join where the main query is extremely selective—we are only looking at one customer record. This begs the question: Would this be a good opportunity to use the nested loops anti-join access path? It is hard to imagine an execution plan for this query that is more efficient than what we have just seen, but let’s try. The query rewritten with a NOT IN clause and an NL_AJ hint looks like:
SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id = 22 AND C.customer_id NOT IN ( SELECT /*+ NL_AJ */ O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name;
The execution plan is:
Rows Row Source Operation ------- --------------------------------------------------- 0 SORT ORDER BY (cr=115 r=86 w=0 time=16824 us) 0 NESTED LOOPS ANTI (cr=115 r=86 w=0 time=16724 us) 1 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=3 r=2 w=0 time=467 us) 1 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=2 r=2 w=0 time=429 us)(object id34552) 10 TABLE ACCESS BY INDEX ROWID ORDERS (cr=112 r=84 w=0 time=16213 us) 110 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=308 us)(object id 34555)
You can see that Oracle obeyed our hint and used the nested loops anti-join access path. Unfortunately, the number of logical reads went up from 6 to 115 and the elapsed time went up by a factor of 50 to 16824 microseconds. Why? The row counts in the execution plan tell the story. Oracle looked at all 110 orders from the customer and found that 10 were placed less than 10 days ago. Oracle did a lot of unnecessary work. In this regard the filter access path seems much smarter than the nested loops anti-join access path. This behavior. was seen on an Oracle 9.2.0.4 database. Let’s hope Oracle fixes this in a future release. Until then, I can’t think of a case where the nested loops anti-join access path would be useful.
By the way, it turns out that Oracle would have chosen the nested loops anti-join access path even without the NL_AJ hint. So it turns out that this query was more efficient when written with a NOT EXISTS than a NOT IN. Again, hopefully this is just an anomaly that will be fixed in a future release of Oracle.
Anti-Join Example #3
I recently used the concepts presented in this section to bring a client’s query down from 33 minutes (where one CPU on the server was pegged) to less than four seconds. It turned out to be another case of the implicit NVL() function in a NOT IN clause defeating an index. This time it was a query that showed a count of calls to the customer service center placed by users who did not belong to corporate customers. The original query looked like this:
SELECT COUNT(*) FROM calls C WHERE C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM );
(Once again names have been changed to protect the innocent.) The execution plan looked like this:
Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=12784272 r=1864678 w=0 time=1978321835 us) 0 FILTER (cr=12784272 r=1864678 w=0 time=1978321817 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=1370 w=0 time=979769 us) 61032 TABLE ACCESS FULL CORP_MEMBERS (cr=12780684 r=1863308 w=0 time=2948544268 us)
It appears that Oracle is reading rows from the CALLS table one at a time and, for each row, is scanning the CORP_MEMBERS table until it finds a row with a matching MEMBER_USER_ID. Since there are over 180,000 calls in the system, it takes over 180,000 partial scans of the CORP_MEMBERS table to complete this query. (The CORP_MEMBERS table has 584 blocks below its high water mark.)
A quick check of the CORP_MEMBERS table shows that the MEMBER_USER_ID column is indexed. However, the column is also nullable and so Oracle cannot use the index because of the implicit NVL() function associated with the NOT IN clause.
Since the client saw records in the CORP_MEMBERS table with a null MEMBER_USER_ID as an unrelated special case, they wanted the query to ignore such records. So my first step was to rewrite the query with the extra WHERE predicate in the subquery that makes the NOT IN clause ignore nulls:
SELECT COUNT(*) FROM calls C WHERE C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM WHERE CM.member_user_id IS NOT NULL );
The resulting execution plan was:
Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=3790 r=3906 w=420 time=5450615 us) 0 HASH JOIN ANTI (cr=3790 r=3906 w=420 time=5450591 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=3480 w=0 time=646554 us) 81945 INDEX FAST FULL SCAN CORP_MEMBERS_USER_ID (cr=202 r=6 w=0 time=113178 us)(object id 34580)
Now not only did Oracle use the index on the MEMBER_USER_ID column when accessing the CORP_MEMBERS table, but more importantly Oracle also switched to a hash anti-join access path. This means that Oracle only had to access the CORP_MEMBERS table once instead of more than 180,000 times. The execution time dropped by 99% to 5.45 seconds.
The users were thrilled with the improvement and the problem was considered resolved. But to better understand Oracle’s data access paths and join methods, I benchmarked two more versions of the query. First I added a MERGE_AJ hint to cause a merge anti-join instead of a hash anti-join. This version of the query performed the same number of logical reads as the hash anti-join but used a little more CPU time. In this situation it looked like the merge anti-join used up more CPU time on sorting than did the hash anti-join.
I also tried rewriting the query with a NOT EXISTS clause to see how it would perform. using a filter operation instead of an anti-join access path. This version of the query looked like:
SELECT COUNT(*) FROM calls C WHERE NOT EXISTS ( SELECT 1 FROM corp_members CM WHERE CM.member_user_id = C.requestor_user_id );
The execution plan was:
Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=125652 r=3489 w=0 time=3895569 us) 0 FILTER (cr=125652 r=3489 w=0 time=3895547 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=3489 w=0 time=906699 us) 61032 INDEX RANGE SCAN CORP_MEMBERS_USER_ID (cr=122064 r=0 w=0 time=1842121 us)(object id 34580)
This version of the query performed a lot more logical reads than the hash anti-join version, but actually used less CPU time. This query appeared to hammer on the blocks of the CORP_MEMBERS_USER_ID index very heavily. Most likely the whole index segment was sitting in the buffer cache and thus the logical reads were relatively inexpensive. Meanwhile, the large sort required by a hash anti-join was avoided and the result was a net reduction in CPU time.
These examples demonstrate the value of the anti-join access paths and the significance of the way NOT IN treats null values. Again, your gains may not be as spectacular as those shown here, but hopefully you can see from this discussion how the proper use of anti-joins and NOT EXISTS and NOT IN clauses can make a certain class of queries much more efficient. These examples also demonstrate the importance of the difference between NOT EXISTS and NOT IN, and the apparent behavior. (although not fully documented as far as I can tell) that Oracle typically will not apply anti-join access paths to a NOT EXISTS clause.
Conclusion
The semi-join and anti-join are two special ways of joining data from multiple tables in a query. Although SQL does not have a special symbol or keyword to designate a semi- or anti-join, we use the EXISTS, IN, NOT EXISTS, and NOT IN constructs to indicate when one of these joins is desired. Over the years Oracle has gotten better at optimizing semi-joins and anti-joins, and new data access paths make it possible to write very efficient queries that use these two types of joins. However, there are certain prerequisites to be met before Oracle can use these access paths, and sometimes hints or specific coding styles are necessary in order for the query optimizer to find the best execution plan possible.
Understanding how semi-joins and anti-joins work—and how Oracle implements the relevant data access paths—will enable you to write very efficient SQL and dramatically speed up an entire class of queries. This in turn will make your employer (or client) very happy.
About the Author
Roger Schrag, OCP, has been an Oracle DBA and application architect for over 15 years. He began his career at Oracle Corporation on the Oracle Financials development team. In 1995, he founded Database Specialists, Inc., a boutique consulting firm specializing in Oracle database technology, remote administration, and performance tuning. Since that time, Roger has focused his expertise in the area of performance optimization. Roger is a frequent speaker at Oracle OpenWorld and the International Oracle Users Group (IOUG) Live conferences, where he has frequently been voted in the Top 10% of speakers. Roger has been an Oracle Masters Class instructor for the IOUG and is the current president of the Northern California Oracle Users Group (NoCOUG). He can be reached at rschrag@dbspecialists.com.
Still Looking for Help on this Subject? Get a Consultation
We would be happy to talk with you about our services and how our senior-level database team might help you. Call Database Specialists at 415-344-0500 or 888-648-0500 or fill out a free consultation request form.
Complimentary Newsletter
If you'd like to receive our complimentary monthly newsletter with database tips and new white paper announcements, sign up for The Specialist.
Copyright © 2005 Database Specialists, Inc. http://www.dbspecialists.com