All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].
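Solution:
There are only 4 letters (A, C, G, T) and the length is fixed at 10 characters, so the number of distinct sequences is bounded: there are only 4^10 = 1048576 possibilities. Each 10-character sequence can therefore be hashed to a small integer, using 2 bits per letter: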
A = 00
C = 01
G = 10
T = 11
Concatenating these codes yields a single number, value.
For example: AAACCC = 00 00 00 01 01 01 (binary) = 21 (decimal)
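The encoding above can be checked with a few lines of Java. This is a minimal sketch, not the code from this post: the class name `DnaEncodingSketch` and the helper `code` are hypothetical, and rejecting letters other than A, C, G, T is an assumption of this sketch (the solution further down treats any other character as T).

```java
final class DnaEncodingSketch {
    // Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
    // Assumption of this sketch: any other character is rejected.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Concatenate the 2-bit codes, first letter in the most significant position.
    static int encode(String s) {
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * 4 + code(s.charAt(i));
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode("AAACCC"));                         // 21
        System.out.println(Integer.toBinaryString(encode("AAACCC"))); // 10101
    }
}
```

Leading zero bits are dropped by `Integer.toBinaryString`, which is why the second line prints 10101 rather than 000000010101.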
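Then traverse the whole sequence and generate a value for every window of 10 characters. Only 10 characters need to be encoded and each takes 2 bits, so 20 bits are enough, which fits comfortably in a 32-bit int.
In the code, sum*4 is equivalent to shifting left by 2 bits.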
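Because multiplying by 4 is a 2-bit left shift, the value of each window can also be derived from the previous one instead of being re-encoded from scratch: shift in the new letter and mask off everything above the low 20 bits. The sketch below illustrates that rolling variant. It is not the solution given in this post; the names `RollingDnaSketch`, `LEN`, `MASK`, and `findRepeated` are hypothetical, and it assumes the input contains only A, C, G, and T.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RollingDnaSketch {
    static final int LEN = 10;                    // window length in letters
    static final int MASK = (1 << (2 * LEN)) - 1; // keeps the low 20 bits

    // 2-bit code per letter: A=00, C=01, G=10, T=11.
    // Assumption: the input holds only these four letters.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;
        }
    }

    static List<String> findRepeated(String s) {
        Set<Integer> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>(); // keeps first-detected order
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            // value << 2 equals value * 4; the mask drops the letter leaving the window.
            value = ((value << 2) | code(s.charAt(i))) & MASK;
            // From index LEN - 1 on, value encodes the full window ending at i.
            // Set.add returns false when the value was already present.
            if (i >= LEN - 1 && !seen.add(value)) {
                repeated.add(s.substring(i - LEN + 1, i + 1));
            }
        }
        return new ArrayList<>(repeated);
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        // [AAAAACCCCC, CCCCCAAAAA]
    }
}
```

With this variant each step costs a constant amount of arithmetic, and a substring is built only when a repeat is actually found.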
/**
 * This code is provided by the editors of 九章算法 (Jiuzhang). It is not copyrighted; feel free to share it.
 * - For more details, see the official website: http://www.jiuzhang.com/
 */
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Solution {
    // Encode a DNA string as an integer, 2 bits per letter:
    // A = 00, C = 01, G = 10, anything else (T) = 11.
    public int encode(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == 'A') {
                sum = sum * 4;
            } else if (s.charAt(i) == 'C') {
                sum = sum * 4 + 1;
            } else if (s.charAt(i) == 'G') {
                sum = sum * 4 + 2;
            } else {
                sum = sum * 4 + 3;
            }
        }
        return sum;
    }

    public List<String> findRepeatedDnaSequences(String s) {
        // Codes of every window seen so far.
        HashSet<Integer> hash = new HashSet<Integer>();
        // Windows seen more than once; a set, so each is reported only once.
        HashSet<String> dna = new HashSet<String>();
        // i is the index of the last character of the current 10-letter window.
        for (int i = 9; i < s.length(); i++) {
            String subString = s.substring(i - 9, i + 1);
            int encoded = encode(subString);
            if (hash.contains(encoded)) {
                dna.add(subString);
            } else {
                hash.add(encoded);
            }
        }
        List<String> result = new ArrayList<String>();
        for (String d : dna) {
            result.add(d);
        }
        return result;
    }
}