187. Repeated DNA Sequences - Explanation

Problem Link



Prerequisites

Before attempting this problem, you should be comfortable with:

  • Sliding Window - Maintaining a fixed-size window that slides across the string to extract substrings
  • Hash Set / Hash Map - Tracking seen sequences and counting occurrences for duplicate detection
  • String Manipulation - Extracting substrings efficiently for comparison and storage

1. Hash Set

Intuition

We need to find all 10-letter sequences that appear more than once. A hash set naturally tracks what we have seen before. As we slide a window of length 10 across the string with index l, we check if the current substring cur was already encountered. If so, it is a repeated sequence. Using two sets (one for seen sequences and one for res results) avoids adding duplicates to our answer.

Algorithm

  1. Initialize two sets: seen to track encountered sequences and res to store repeated ones.
  2. Iterate through the string with a sliding window of size 10 using index l.
  3. For each position, extract the 10-character substring cur.
  4. If it exists in seen, add it to res.
  5. Add the current substring to seen.
  6. Return res as a list.
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        seen, res = set(), set()

        for l in range(len(s) - 9):
            cur = s[l: l + 10]
            if cur in seen:
                res.add(cur)
            seen.add(cur)
        return list(res)

Time & Space Complexity

  • Time complexity: O(n)O(n)
  • Space complexity: O(n)O(n)

2. Hash Map

Intuition

Instead of using two sets, we can use a hash map mp to count occurrences of each sequence. This lets us add a sequence to the res exactly when its count reaches 2, ensuring we only report it once regardless of how many additional times it appears.

Algorithm

  1. If the string length is less than 10, return an empty list.
  2. Initialize a hash map mp to count occurrences and a result list res.
  3. Slide a window of size 10 across the string using index l.
  4. For each substring cur, increment its count in mp.
  5. When the count becomes exactly 2, add the substring to res.
  6. Return the res list.
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        if len(s) < 10:
            return []

        mp = defaultdict(int)
        res = []
        for l in range(len(s) - 9):
            cur = s[l: l + 10]
            mp[cur] += 1
            if mp[cur] == 2:
                res.append(cur)

        return res

Time & Space Complexity

  • Time complexity: O(n)O(n)
  • Space complexity: O(n)O(n)

3. Rabin-Karp Algorithm (Double Hashing)

Intuition

Storing full 10-character strings as keys can be memory intensive. The Rabin-Karp algorithm computes a rolling hash for each window, allowing us to represent each sequence as a number instead. Double hashing (using two different hash bases) minimizes collision probability, making numeric comparisons reliable. As the window slides with index i, we efficiently update the hashes hash1 and hash2 by removing the contribution of the outgoing character and adding the incoming one.

Algorithm

  1. If the string length is less than 10, return an empty list.
  2. Precompute the power values power1 and power2 for both hash bases (raised to the 9th power for the leading digit).
  3. Initialize two rolling hashes hash1 and hash2 to zero.
  4. Iterate through the string with index i, updating both hashes for each character.
  5. Once the window reaches size 10, combine both hashes into a single key.
  6. Use a hash map cnt to count occurrences of each key.
  7. When a key appears the second time, add the corresponding substring to res.
  8. Slide the window by subtracting the outgoing character's hash contribution.
  9. Return the res list.
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        n = len(s)
        if n < 10:
            return []

        cnt = defaultdict(int)
        res = []
        base1, base2 = 31, 37
        hash1 = hash2 = 0
        power1, power2 = 1, 1
        MOD1, MOD2 = 685683731, 768258391

        for i in range(9):
            power1 *= base1
            power2 *= base2
            power1 %= MOD1
            power2 %= MOD2

        for i in range(n):
            hash1 = (hash1 * base1 + ord(s[i])) % MOD1
            hash2 = (hash2 * base2 + ord(s[i])) % MOD2

            if i >= 9:
                key = (hash1 << 31) | hash2
                cnt[key] += 1
                if cnt[key] == 2:
                    res.append(s[i - 9 : i + 1])

                hash1 = (hash1 - power1 * ord(s[i - 9]) % MOD1 + MOD1) % MOD1
                hash2 = (hash2 - power2 * ord(s[i - 9]) % MOD2 + MOD2)

        return res

Time & Space Complexity

  • Time complexity: O(n)O(n)
  • Space complexity: O(n)O(n)

4. Bit Mask

Intuition

DNA sequences use only four characters (A, C, G, T), each of which can be encoded with just 2 bits. A 10-character sequence therefore fits in 20 bits, well within a single integer. By treating each sequence as a mask, we avoid storing strings entirely and get fast integer operations for comparisons and hashing.

Algorithm

  1. If the string length is less than 10, return an empty list.
  2. Map each nucleotide to a 2-bit value: A=0, C=1, G=2, T=3 in map mp.
  3. Initialize a mask to zero and a hash map cnt for counting.
  4. For each character at index i, shift the mask left by 2 bits and OR with the character's value.
  5. Mask the result with 0xFFFFF to keep only the rightmost 20 bits (10 characters).
  6. Once the window reaches size 10, increment the count for this mask.
  7. When a mask appears the second time, add the corresponding substring to res.
  8. Return the res list.
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        if len(s) < 10:
            return []

        mp = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
        seen, res = set(), set()
        mask = 0
        for i in range(len(s)):
            mask = ((mask << 2) | mp[s[i]]) & 0xFFFFF
            if i >= 9:
                if mask in seen:
                    res.add(s[i - 9: i + 1])
                else:
                    seen.add(mask)

        return list(res)

Time & Space Complexity

  • Time complexity: O(n)O(n)
  • Space complexity: O(n)O(n)

Common Pitfalls

Adding Duplicates to the Result

When a sequence appears three or more times, it should only be added to the result once. Using a single set without tracking whether the sequence was already added to the result causes duplicates in the output.

Off-by-One Errors in Window Boundaries

The sliding window must extract exactly 10 characters. Common mistakes include using s[l:l+9] instead of s[l:l+10], or iterating with range(len(s) - 10) instead of range(len(s) - 9), which skips valid windows or causes index errors.

Incorrect Bit Mask Width in the Optimized Approach

When using bit manipulation, each nucleotide requires 2 bits, so a 10-character sequence needs 20 bits. Using the wrong mask value (such as 0xFFFF for 16 bits instead of 0xFFFFF for 20 bits) causes hash collisions between different sequences.