Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808 60.919540
方法一:
# -*- coding: utf-8 -*- # to open FASTA format sequence file: s=open('Computing_GC_Content.txt','r').readlines() # to create two lists, one for names, one for sequences name_list=[] seq_list=[] data='' # to put the sequence from several lines together for line in s: line=line.strip() for i in line: if i == '>': name_list.append(line[1:]) if data: seq_list.append(data) #将每一行的的核苷酸字符串连接起来 data='' # 合完后data 清零 break else: line=line.upper() if all([k==k.upper() for k in line]): #验证是不是所有的都是大写 data=data+line seq_list.append(data) # is there a way to include the last sequence in the for loop? GC_list=[] for seq in seq_list: i=0 for k in seq: if k=="G" or k=='C': i+=1 GC_cont=float(i)/len(seq)*100.0 GC_list.append(GC_cont) m=max(GC_list) print name_list[GC_list.index(m)] # to find the index of max GC print "{:0.6f}".format(m) # 保留6位小数
方法二:
# -*- coding: utf-8 -*- def parse_fasta(s): results = {} strings = s.strip().split('>') # Python split()通过指定分隔符对字符串进行切片,如果参数num 有指定值,则仅分隔 num 个子字符串 for s in strings: if len(s) == 0: continue # 如果字符串长度为0,就跳出循环。 parts = s.split() label = parts[0] bases = ''.join(parts[1:]) results[label] = bases return results def gc_content(s): n = len(s) m = 0 for c in s: if c == 'G' or c == 'C': m += 1 return 100 * (float(m) / n) if __name__ == "__main__": small_dataset = """ >Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC >Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT """ #large_dataset = open('datasets/rosalind_gc.txt').read() results = parse_fasta(small_dataset) results = dict([(k, gc_content(v)) for k, v in results.iteritems()]) # 这里iteritem()和item()功能是一样的 # 前一个results输出,名称+序列,后一个results输出,名称+百分比 highest_k = None highest_v = 0 for k, v in results.iteritems(): if v > highest_v: highest_k = k highest_v = v # 输出GC含量高的 print highest_k print '%f%%' % highest_v
方法三:
# -*- coding: utf-8 -*- ### 5. Computing GC Content ### from operator import itemgetter from collections import OrderedDict seqTest = OrderedDict() gcContent = OrderedDict() with open('Computing_GC_Content.txt', 'rt') as f: for line in f: line = line.rstrip() if line.startswith('>'): seqName = line[1:] seqTest[seqName] = '' continue seqTest[seqName] += line.upper() for ke, val in seqTest.items(): totalLength = len(val) gcNumber = val.count('G') + val.count('C') gcContent[ke] = (float(gcNumber) / totalLength)*100 sortedGCContent = sorted(gcContent.items(), key=itemgetter(1)) largeName = sortedGCContent[-1][0] largeGCContent = sortedGCContent[-1][1] print ('most GC ratio gene is %s and it is %s ' % (largeName, largeGCContent))