LeetCode 182 - Duplicate Emails¶
Database Language: SQL Server
Difficulty:
Problem Description¶
Input¶
Table: Person¶
Column Name | Type |
---|---|
id | int |
varchar |
id
is the primary key (column with unique VALUES) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Requirement¶
Write a solution to report all the duplicate emails. Note that it's guaranteed that the email field is not NULL.
Return the result table in any order.
The result format is in the following example.
Examples¶
Example 1¶
Input¶
Person table:
id | |
---|---|
1 | a@b.com |
2 | c@d.com |
3 | a@b.com |
Output¶
a@b.com |
Explanation¶
a@b.com is repeated two times.
SQL Schema¶
CREATE TABLE [dbo].[Person] (id INT PRIMARY KEY, email VARCHAR(255));
TRUNCATE TABLE [dbo].[Person];
INSERT INTO [dbo].[Person] (id, email) VALUES ('1', 'a@b.com');
INSERT INTO [dbo].[Person] (id, email) VALUES ('2', 'c@d.com');
INSERT INTO [dbo].[Person] (id, email) VALUES ('3', 'a@b.com');
Solutions¶
Solution #1 - Using GROUP BY ... HAVING¶
To be able to determine duplicate emails, we have to be able to count how many times each email occurs and based on the result, if the number of times an email occurs is more than 1, then that means that the email is duplicated. To count the number of times an email exists in a table, the COUNT aggregate function will be used together with the GROUP BY:
SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email
EmailCount | |
---|---|
a@b.com | 2 |
c@d.com | 1 |
Now that we have the number of times each email occurs, we are just interested on those emails where the email count is more than 1. We cannot just add a WHERE clause either before the GROUP BY email
as this is will generate an error. Adding the WHERE EmailCount > 1
before the GROUP BY email
generates the following error:
SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
WHERE EmailCount > 1
GROUP BY email;
Query 1 ERROR: Msg: 207, Line 3, State: 1, Level: 16
Invalid column name 'EmailCount'.
Even if the EmailCount
column is replaced in the WHERE
clause with the COUNT(email)
expression, an error is still generated:
SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
WHERE COUNT(email) > 1
GROUP BY email;
Query 1 ERROR: Msg: 147, Line 3, State: 1, Level: 15
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
On the other hand, adding the WHERE EmailCount > 1
after the GROUP BY email
generates a different error:
SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email
WHERE EmailCount > 1
Query 1 ERROR: Msg: 156, Line 4, State: 1, Level: 15
Incorrect syntax near the keyword 'WHERE'.
The correct way to filter out rows based on the result of an aggregate function such as the COUNT() aggregate function is to use the HAVING
clause after the GROUP BY
.
SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email
HAVING EmailCount > 1;
EmailCount | |
---|---|
a@b.com | 2 |
But the requirement only wants the email in the result set and not include the number of times the email occurs in the table. In this case, the COUNT(email) AS EmailCount
needs to be removed. Now that the EmailCount
column has been removed, this column cannot be referenced in the HAVING
clause anymore. So instead of using the EmailCount
column in the HAVING
clause, the aggregate function used to create the EmailCount
column will be used in the HAVING
clause:
# Final Solution 1 Query - Using GROUP BY ... HAVING
SELECT email AS Email
FROM Person
GROUP BY email
HAVING COUNT(email) > 1
The query plan for this query is as follows:
|--Filter(WHERE:([Expr1002]>(1)))
|--Compute Scalar(DEFINE:([Expr1002]=CONVERT_IMPLICIT(int,[Expr1005],0)))
|--Stream Aggregate(GROUP BY:([leetcode].[dbo].[Person].[email]) DEFINE:([Expr1005]=COUNT([leetcode].[dbo].[Person].[email])))
|--Sort(ORDER BY:([leetcode].[dbo].[Person].[email] ASC))
|--Clustered Index Scan(OBJECT:([leetcode].[dbo].[Person].[PK_Person]))
The fastest runtime for this query is as follows:
- Runtime: 383ms
- Beats: 88.51% as of July 16, 2024
Solution #2 - Using INNER JOIN¶
Another way of finding duplicate emails is to self join the Person
to itself using the INNER JOIN
and joining on the email
column. To identify if an email exists more than once in the Person
table using a self join, then the id
column of the joined tables must not match. Here's how the query will look like:
# Final Solution 2 Query - Using INNER JOIN
SELECT DISTINCT P1.Email
FROM Person P1 INNER JOIN Person P2
ON P1.Email = P2.Email AND
P1.id != P2.id
The DISTINCT
clause is added because if an email
exists more than once, it will appear (n * (n - 1)) times in the output (where n is the number of times the email is duplicated).
Here's the query plan for this query:
|--Stream Aggregate(GROUP BY:([P1].[email]))
|--Merge Join(Inner Join, MANY-TO-MANY MERGE:([P1].[email])=([P2].[email]), RESIDUAL:([leetcode].[dbo].[Person].[email] as [P1].[email]=[leetcode].[dbo].[Person].[email] as [P2].[email] AND [leetcode].[dbo].[Person].[id] as [P1].[id]<>[leetcode].[dbo].[Person].[id] as [P2].[id]))
|--Sort(ORDER BY:([P1].[email] ASC))
| |--Clustered Index Scan(OBJECT:([leetcode].[dbo].[Person].[PK_Person] AS [P1]))
|--Sort(ORDER BY:([P2].[email] ASC))
|--Clustered Index Scan(OBJECT:([leetcode].[dbo].[Person].[PK_Person] AS [P2]))
And here's the fastest runtime of the INNER JOIN
solution:
- Runtime: 406ms
- Beats: 78.33% as of July 16, 2024
Solution Runtime Comparison¶
Here's the comparison of the fastest runtime for each of the solutions.
Solution # | Runtime | Beats |
---|---|---|
1 - Using GROUP BY ... HAVING | 383ms | 88.51% |
2 - Using INNER JOIN | 406ms | 78.33% |
As expected, the runtime of using an INNER JOIN
is much slower than the GROUP BY ... HAVING
solution because the INNER JOIN
involves reading the Person
twice and joining to itself while the GROUP BY ... HAVING
solution involves a single pass on the Person
table.