Repeated DNA Sequences (M)
Problem
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
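Problem Summary
Given a string s, find every substring of length 10 that occurs at least twice.
Approach
The most direct method uses two HashSets: one holds the substrings already seen during the scan, the other holds the substrings that belong in the answer.
This can be optimized with bit manipulation. Encode 'A', 'C', 'G', and 'T' as the 2-bit values 00, 01, 10, and 11, so that a 10-letter string fits in a 20-bit integer. To get the next substring, shift the previous value left by 2 bits, keep only the low 20 bits, and set the lowest two bits to the code of the newly added character, much like a sliding window. The remaining steps are the same as in the HashSet approach.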
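The encoding described above can be tried out in isolation. The following is a minimal sketch, not part of the original solution: the class name `DnaEncodingSketch` and the helpers `code`, `encode`, and `slide` are hypothetical names chosen for illustration. It packs a 10-letter window into the low 20 bits of an int and shows that sliding the window by one letter gives the same value as encoding the shifted window from scratch.

```java
class DnaEncodingSketch {
    // 2-bit codes from the text: A=00, C=01, G=10, T=11 (hypothetical helper).
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Pack a window of up to 10 letters into the low 20 bits of an int.
    static int encode(String window) {
        int key = 0;
        for (int i = 0; i < window.length(); i++) {
            key = (key << 2) | code(window.charAt(i));
        }
        return key;
    }

    // Slide by one letter: shift left 2 bits, drop everything above bit 19,
    // then put the new letter's code into the lowest two bits.
    static int slide(int key, char next) {
        return ((key << 2) & 0xFFFFF) | code(next);
    }

    public static void main(String[] args) {
        int key = encode("AAAAACCCCC");
        int slid = slide(key, 'A');
        // Both values below should be equal: sliding matches re-encoding.
        System.out.println(slid);
        System.out.println(encode("AAAACCCCCA"));
    }
}
```

Because each window maps to a distinct integer below 2^20, the integer can serve as the set key in place of the 10-character string.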
Implementation
Java
HashSet
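import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> one = new HashSet<>(); // substrings seen at least once
        Set<String> two = new HashSet<>(); // substrings seen at least twice (the answer)
        // i is the start index of each 10-letter window
        for (int i = 0; i < s.length() - 9; i++) {
            String t = s.substring(i, i + 10);
            if (two.contains(t)) {
                continue; // already recorded as a repeat
            } else if (one.contains(t)) {
                two.add(t); // second occurrence
            } else {
                one.add(t); // first occurrence
            }
        }
        return new ArrayList<>(two);
    }
}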
Bit manipulation optimization
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) {
            return new ArrayList<>();
        }
        Set<String> two = new HashSet<>();
        Set<Integer> one = new HashSet<>(); // key type changed to an integer
        // Map each nucleotide letter to its 2-bit code
        int[] hash = new int[26];
        hash['A' - 'A'] = 0;
        hash['C' - 'A'] = 1;
        hash['G' - 'A'] = 2;
        hash['T' - 'A'] = 3;
        int cur = 0;
        // Build the encoding of the initial 9-letter prefix
        for (int i = 0; i < 9; i++) {
            cur = cur << 2 | hash[s.charAt(i) - 'A'];
        }
        // i is the end index of each 10-letter window
        for (int i = 9; i < s.length(); i++) {
            // Only the low 20 bits need to be kept each time
            cur = (cur << 2) & 0xfffff | hash[s.charAt(i) - 'A'];
            if (one.contains(cur)) {
                two.add(s.substring(i - 9, i + 1));
            } else {
                one.add(cur);
            }
        }
        return new ArrayList<>(two);
    }
}
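Since a 20-bit key determines its window uniquely, the answer set could also hold integers and be decoded into strings only at the end. The following is a sketch of that variation, not the approach used in the solution above; the class name `DnaDecodeSketch` and the helpers `code`, `decode`, and `findRepeated` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DnaDecodeSketch {
    private static final char[] LETTERS = {'A', 'C', 'G', 'T'};

    // Same 2-bit codes as in the text: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Turn a 20-bit key back into its 10-letter window, lowest bits last.
    static String decode(int key) {
        char[] out = new char[10];
        for (int i = 9; i >= 0; i--) {
            out[i] = LETTERS[key & 3];
            key >>>= 2;
        }
        return new String(out);
    }

    static List<String> findRepeated(String s) {
        Set<Integer> seen = new HashSet<>();
        Set<Integer> repeated = new HashSet<>();
        int cur = 0;
        for (int i = 0; i < s.length(); i++) {
            cur = ((cur << 2) & 0xFFFFF) | code(s.charAt(i));
            // The window is full once 10 letters have been consumed;
            // Set.add returns false when the key was already present.
            if (i >= 9 && !seen.add(cur)) {
                repeated.add(cur);
            }
        }
        List<String> result = new ArrayList<>();
        for (int key : repeated) {
            result.add(decode(key));
        }
        return result;
    }
}
```

This variation avoids creating a substring for every repeated window, at the cost of one extra decoding pass over the answer set. The order of the returned list is unspecified, as with the HashSet-based versions.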