LeetCode 182 - Duplicate Emails¶

Database Language: PostgreSQL

Difficulty:

Problem Description¶

Input¶

Table: Person¶

Column Name	Type
id	int
email	varchar

id is the primary key (column with unique VALUES) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.

Requirement¶

Write a solution to report all the duplicate emails. Note that it's guaranteed that the email field is not NULL.

Return the result table in any order.

The result format is in the following example.

Examples¶

Example 1¶

Input¶

Person table:

id	email
1	a@b.com
2	c@d.com
3	a@b.com

Output¶

Email
a@b.com

Explanation¶

a@b.com is repeated two times.

SQL Schema¶

CREATE TABLE IF NOT EXISTS Person (id INT PRIMARY KEY, email VARCHAR(255));

TRUNCATE TABLE Person;
INSERT INTO Person (id, email) VALUES ('1', 'a@b.com');
INSERT INTO Person (id, email) VALUES ('2', 'c@d.com');
INSERT INTO Person (id, email) VALUES ('3', 'a@b.com');

Solutions¶

Solution #1 - Using GROUP BY ... HAVING¶

To be able to determine duplicate emails, we have to be able to count how many times each email occurs and based on the result, if the number of times an email occurs is more than 1, then that means that the email is duplicated. To count the number of times an email exists in a table, the COUNT aggregate function will be used together with the GROUP BY:

SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email

Email	EmailCount
a@b.com	2
c@d.com	1

Now that we have the number of times each email occurs, we are just interested on those emails where the email count is more than 1. We cannot just add a WHERE clause either before the GROUP BY email as this is will generate an error. Adding the WHERE EmailCount > 1 before the GROUP BY email generates the following error:

SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
WHERE COUNT(email) > 1
GROUP BY email;

ERROR:  aggregate functions are not allowed in WHERE
LINE 3: WHERE COUNT(email) > 1
              ^ 

SQL state: 42803
Character: 69

On the other hand, adding the WHERE EmailCount > 1 after the GROUP BY email generates a different error:

SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email
WHERE COUNT(email) > 1

ERROR:  syntax error at or near "WHERE"
LINE 4: WHERE EmailCount > 1;
        ^ 

SQL state: 42601
Character: 78

The correct way to filter out rows based on the result of an aggregate function such as the COUNT() aggregate function is to use the HAVING clause after the GROUP BY.

SELECT email AS Email, COUNT(email) AS EmailCount
FROM Person
GROUP BY email
HAVING COUNT(email) > 1;

Email	EmailCount
a@b.com	2

But the requirement only wants the email in the result set and not include the number of times the email occurs in the table. In this case, the COUNT(email) AS EmailCount simply needs to be removed:

# Final Solution 1 Query - Using GROUP BY ... HAVING
SELECT email AS Email
FROM Person
GROUP BY email
HAVING COUNT(email) > 1

The query plan for this query is as follows:

HashAggregate  (cost=12.10..13.85 rows=47 width=516)
  Group Key: email
  Filter: (count(email) > 1)
  ->  Seq Scan on person  (cost=0.00..11.40 rows=140 width=516)

The fastest runtime for the query above is as follows:

Runtime: 204ms
Beats: 94.28% as of July 15, 2024

Solution #2 - Using INNER JOIN¶

Another way of finding duplicate emails is to self join the Person to itself using the INNER JOIN and joining on the email column. To identify if an email exists more than once in the Person table using a self join, then the id column of the joined tables must not match. Here's how the query will look like:

# Final Solution 2 Query - Using INNER JOIN
SELECT DISTINCT P1.Email
FROM Person P1 INNER JOIN Person P2
  ON P1.Email = P2.Email AND
     P1.id != P2.id

The DISTINCT clause is added because if an email exists more than once, it will appear (n * (n - 1)) times in the output (where n is the number of times the email is duplicated).

Here's the query plan for this query:

HashAggregate  (cost=27.17..28.56 rows=139 width=516)
  Group Key: p1.email
  ->  Hash Join  (cost=13.15..26.83 rows=139 width=516)
        Hash Cond: ((p1.email)::text = (p2.email)::text)
        Join Filter: (p1.id <> p2.id)
        ->  Seq Scan on person p1  (cost=0.00..11.40 rows=140 width=520)
        ->  Hash  (cost=11.40..11.40 rows=140 width=520)
              ->  Seq Scan on person p2  (cost=0.00..11.40 rows=140 width=520)

And here's the fastest runtime of the INNER JOIN solution:

Runtime: 201ms
Beats: 96.69% as of July 15, 2024

Solution Runtime Comparison¶

Here's the comparison of the fastest runtime for each of the solutions.

Solution #	Runtime	Beats
1 - Using GROUP BY ... HAVING	204ms	94.28%
2 - Using INNER JOIN	201ms	96.69%