We need to find all 10-letter sequences that appear more than once. A hash set naturally tracks what we have seen before. As we slide a window of length 10 across the string with index l, we check if the current substring cur was already encountered. If so, it is a repeated sequence. Using two sets (one for seen sequences and one for res results) avoids adding duplicates to our answer.
seen to track encountered sequences and res to store repeated ones.l.cur.seen, add it to res.seen.res as a list.Instead of using two sets, we can use a hash map mp to count occurrences of each sequence. This lets us add a sequence to the res exactly when its count reaches 2, ensuring we only report it once regardless of how many additional times it appears.
mp to count occurrences and a result list res.l.cur, increment its count in mp.res.res list.Storing full 10-character strings as keys can be memory intensive. The Rabin-Karp algorithm computes a rolling hash for each window, allowing us to represent each sequence as a number instead. Double hashing (using two different hash bases) minimizes collision probability, making numeric comparisons reliable. As the window slides with index i, we efficiently update the hashes hash1 and hash2 by removing the contribution of the outgoing character and adding the incoming one.
power1 and power2 for both hash bases (raised to the 9th power for the leading digit).hash1 and hash2 to zero.i, updating both hashes for each character.key.cnt to count occurrences of each key.key appears the second time, add the corresponding substring to res.res list.class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
n = len(s)
if n < 10:
return []
cnt = defaultdict(int)
res = []
base1, base2 = 31, 37
hash1 = hash2 = 0
power1, power2 = 1, 1
MOD1, MOD2 = 685683731, 768258391
for i in range(9):
power1 *= base1
power2 *= base2
power1 %= MOD1
power2 %= MOD2
for i in range(n):
hash1 = (hash1 * base1 + ord(s[i])) % MOD1
hash2 = (hash2 * base2 + ord(s[i])) % MOD2
if i >= 9:
key = (hash1 << 31) | hash2
cnt[key] += 1
if cnt[key] == 2:
res.append(s[i - 9 : i + 1])
hash1 = (hash1 - power1 * ord(s[i - 9]) % MOD1 + MOD1) % MOD1
hash2 = (hash2 - power2 * ord(s[i - 9]) % MOD2 + MOD2)
return resDNA sequences use only four characters (A, C, G, T), each of which can be encoded with just 2 bits. A 10-character sequence therefore fits in 20 bits, well within a single integer. By treating each sequence as a mask, we avoid storing strings entirely and get fast integer operations for comparisons and hashing.
mp.mask to zero and a hash map cnt for counting.i, shift the mask left by 2 bits and OR with the character's value.0xFFFFF to keep only the rightmost 20 bits (10 characters).mask.mask appears the second time, add the corresponding substring to res.res list.class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) < 10:
return []
mp = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
seen, res = set(), set()
mask = 0
for i in range(len(s)):
mask = ((mask << 2) | mp[s[i]]) & 0xFFFFF
if i >= 9:
if mask in seen:
res.add(s[i - 9: i + 1])
else:
seen.add(mask)
return list(res)When a sequence appears three or more times, it should only be added to the result once. Using a single set without tracking whether the sequence was already added to the result causes duplicates in the output.
The sliding window must extract exactly 10 characters. Common mistakes include using s[l:l+9] instead of s[l:l+10], or iterating with range(len(s) - 10) instead of range(len(s) - 9), which skips valid windows or causes index errors.
When using bit manipulation, each nucleotide requires 2 bits, so a 10-character sequence needs 20 bits. Using the wrong mask value (such as 0xFFFF for 16 bits instead of 0xFFFFF for 20 bits) causes hash collisions between different sequences.