题目:
- Find the best globe alignment of sequences TTCGGGGATG and TATATGGGTCG using daynamic programming.
- Find the local alignments of sequences ATGGTTCCTTGGTA and GGAGTATATTTATGTAC using dynamic programming.
思路分析:
匹配的优劣主要有匹配的得分来衡量,如匹配正确‘GG’为1分,匹配错误如‘GC’为-1分,而一方缺失如‘G~’为-2分,‘~~’非法。总得分为最后匹配完成后所有碱基对的总和。
直接使用暴力搜索显然是时间复杂性不允许的,所以可以考虑使用动态规划法,而匹配的结果可能不止一个,这又涉及到图论中有向图两点间路径的遍历问题,可以考虑采用回溯法,数据结构方面,使用MATLAB结构体数组表示图,使用栈来实现回溯法。
而输入和结果输出方面使用简单的MATLAB图形界面编程即可,不必过分在意。
实现步骤:
总函数为Align,可以将所有函数文件放在一个文件Align.m里,Align函数只要包含输入部分程序,然后调用GlobeAlignment和LocalAlignment函数即可。
GlobeAlignment的功能是实现碱基链的总体匹配,调用了下面四个函数:
Vmat = getStruct(Squence_1,Squence_2); Vmat= globeStruct(Vmat,Squence_1,Squence_2); [GlobeResult,Vmat] = globeSolve(Vmat); printAlignResult(Vmat,GlobeResult);
其中,getStruct函数主要构建二维结构体数组,每个数组元素的结构体元素如下:
alignValue:分值,如x(1,1).alignValue = 1;
alignChar:匹配元素,类型为二维数组,如[A,C];
maxDrec:可联通元素位置,初始为[],如对于第(1,1)个元素,maxDrec可能是[ ];
isGetted:是否访问过的记录,但不是只要访问过就改记录,而是出栈之后记录,而只要当连续入栈出栈时才依据其作判断。
globeStruct函数主要是对矩阵的前三个量进行赋值,alignValue和alignDrec的赋值规则及最优轨迹的寻找如下图1至图6所示:
GlobeSolve函数为核心程序,是在以上步骤完成后,进行的回溯法遍历起点到终点的轨迹,从而输出所有的最佳匹配。整个算法流程如下(以上图为例):
1.栈内元素(坐标)为(1,1)(栈顶元素),(2,2),(3,3),(4,3),(5,4);
2.(1,1)出栈,检查(2,2)点除(1,1)点外的可达点,没有则退栈;
3.现在栈顶为(3,3),再次退栈,栈顶为(4,3);
4.(3,2)入栈,为栈顶,两可通点任选一个比如(2,2)入栈;
5.(1,1)入栈,输出整个栈,(1,1)再次出栈;
6.(2,1)入栈,(1,1)入栈,输出栈;
7.这一步很关键,(1,1),(1,2)出栈后,按照上述规则(2,2)又会入栈,这样不只是影响起点出栈和遍历结束,加入初始最优路径是(1,1),(2,1),(3,2),(4,3),(5,4),则程序将反复在这两条轨迹之间遍历,不能遍历所以路径,所以这一步要加控制语句,即当紧接出栈后的入栈时,要检查一下准备入栈的元素的是否入过栈的记录,比如图中(2,1)出栈后,检查(2,2)已入过栈,则不再入栈,直接在退栈。这样做的依据是程序的递归性决定了某个访问过的分支起点(图中为(2,1))不可能再次访问。
其他部分程序思想较为简单,且程序中已有注释,不再解释。
由于LocalAlignment的思想与GlobeAligment类似,由于时间因素,没有实现。
结果表现:
直接运行Align命令或者函数程序得到下面选择对话框:
选择GlobeAlignment,得到输入对话框,输入两个碱基链:
之后得到结果:
对于题目中的序列,得到如下结果:
所以这两个序列的最佳匹配唯一。
结果分析与程序反思:
值得注意的是一般的回溯法的入栈和入栈时考虑不够周全,所以有可能不能全部遍历;在空间复杂性和时间复杂上,可能使用C语言、Java等会更加理想;代码的简洁性还可以提高。
程序:Align.m
1 function Align
2 %author@Tiger Zhang
3 %To Find the best globe alignment of sequences
4 %Using dynamic programming
5 %% function used:
6 % GlobeAlignment(Squence_1,Squence_2)
7 % [GlobeResult,VmatNew] = globeSolve(Vmat)
8 % MatNew = cleanVector(Mat,Vec)
9 % printAlignValue(Vmat); 测试用
10 % printAlignResult(Vmat,ResultCell);
11 % VmatNew = globeStruct(Vmat,Squence_1,Squence_2)
12 % Vmat = getSturct(Squence_1,Squence_2)
13 % Score = fScore(Squence_1,Squence_2)
14 % score = fAlign(a,b)
15 % [char_a,char_b] = posiToChar(baseListPosi, j,Posi_a,Posi_b)
16
17 %% 输入部分:使用GUI进行输入
18 [Method_Select,isok]=listdlg('liststring',{'GlobeAlignment','LocalAlignment'},...
19 'listsize',[250 120],'OkString','OK','CancelString','Cancel',...
20 'promptstring','Alignment Method','name','Choose the Method', ...
21 'selectionmode','multiple');
22 %Method_Select = 'red';
23 Squence = inputdlg({'The first Squence','The second Squence'}, ...
24 'Input and Alignent',[1 50;1 50]);
25 % options.position = [100 100 300 200];
26 Squence_1 = Squence{1};
27 Squence_2 = Squence{2};
28 if length(Squence_1) < length(Squence_2)
29 Temp_squence = Squence_1;
30 Squence_1 = Squence_2;
31 Squence_2 = Temp_squence;
32 end
33 %% 动态规划匹配程序
34 if Method_Select == 2
35 %LocalAlignment
36 LocalAlignment(Squence_1,Squence_2);
37 else
38 %GlobeAlignment
39 GlobeAlignment(Squence_1,Squence_2);
40 end
41
42 %% 显示结果部分:
43 % h=waitbar(0,'开始绘图');
44 % pause(1); %延迟1秒
45 % ha=get(h,'children');
46 % hac=get(ha,'children');
47 % hapa=findall(hac,'type','patch');
48 % set(hapa,'Edgecolor','g','FaceColor','b');
49 % for i=1:100
50 %
51 % waitbar(i/100,h,['已完成' num2str(i) '%']);
52 % pause(0.1);
53 % end
54
55 end
56 %------------------------------------------------------------------------------
57
58 function GlobeAlignment(Squence_1,Squence_2)
59 %% 说明(little long):The Dynamic Programming consists of the
60 % following three essential components:
61 % 1.The recursive argument;
62 % 2.The tabular computation;
63 % 3.The traceback:reconstruct the best alignment
64 % 大体实现思路是根据字符串大小建立二维结构体矩阵,每个结构体包含两个元素,分别是:
65 % alignValue:分值,如x(1,1).alignValue = 1;
66 % alignChar:匹配元素,类型为二维数组,如[A,C];
67 % maxDrec:可联通元素位置,初始为[],如对于第(1,1)个元素,maxDrec可能是[0,0;];
68 % 之后根据递归规则填充矩阵alignValue,直到完毕;然后是记录从终点回溯过程中
69 % 增加分值的联通路线,即改变maxDrec的值和维数;
70 % 最后从终点开始在可联通路线中遍历,相当于在一个三叉树中寻找两结点之间的路径。
71
72 %初始化规则:
73 % V[0,0] = 0;
74 % V[i+1,0] = V[i,0] + fAlign(s[i+1],-);
75 % V[0,i+1] = V[0,j] + fAlign(-,t[j+1]);
76 % Squence_1 ='AAAG';%'TTCGGGGATG';%
77 % Squence_2 ='ACG';% 'TATATGGGTCG';%
78 if length(Squence_1) <length(Squence_2)
79 temp_s = Squence_2;
80 Squence_2 = Squence_1;
81 Squence_1 = temp_s;
82 end
83 Vmat = getStruct(Squence_1,Squence_2);
84 Vmat= globeStruct(Vmat,Squence_1,Squence_2);
85 [GlobeResult,Vmat] = globeSolve(Vmat);
86 printAlignResult(Vmat,GlobeResult);
87
88 end
89 %-------------------------------------------------------------------
90 function LocalAlignment(Squence_1,Squence_2)
91 %% 局部匹配算法其实和全局匹配类似,这里直接省略了
92 t1 =sprintf(' The Local Alignment method is just similiar to the
');
93 t2 = sprintf(' Globe Alignment Method,so I did not implement it.');
94 alignPrint=dialog('name','AlignResult','position', ...
95 [300 300 620 200]','Resize','on');
96 uicontrol('parent',alignPrint,'style','text','string',t1, ...
97 'position',[10 130 600 25],'fontsize',15);
98 uicontrol('parent',alignPrint,'style','text','string',t2, ...
99 'position',[10 90 600 25],'fontsize',15);
100 uicontrol('parent',alignPrint,'style','pushbutton','position',...
101 [260 30 40 25],'string','exit','callback','delete(gcbf)','fontsize',14);
102 end
103
104 %--------------------------------------------------------------------
105 function [GlobeResult,VmatNew] = globeSolve(Vmat)
106 %% 求解全局最佳匹配
107 %思路:在标记好可连通性后,采用有向图求两点间所有
108 %路径的回溯方法
109 [length_1,length_2] = size(Vmat);
110 for i = 1:length_1
111 for j = 1:length_2
112 Vmat(i,j).maxDrec = unique(Vmat(i,j).maxDrec,'rows');
113 end
114 end
115 i = length_1;j = length_2;
116 %这部分将之前的那条路径可连通性赋予
117 GlobeResult = cell(1);
118 %% 以下是使用回溯法求解路径
119 % 要使用到栈,这里用一个可变维数的数组代替栈
120 %大体上先找到一条路线,打印栈,然后让重点出栈,这时看看栈顶元素有没有
121 % 通往其他点的路径,如果没有再出栈,如果有则入栈,直到到达终点
122 % (终点出栈前要打印栈)或者端点这时再让端点出栈,如此反复,直到栈为空
123 % 结束程序,这时所有的结果都已经找到了;
124 [a, b] = size(Vmat);
125 alignStack = [a, b];%初始栈为空;
126 %找到某条路径
127 while (b +a) > 2 %注意matlab的while时刻检查者变量大小
128 while size(Vmat(a,b).maxDrec,1)==0
129 alignStack(1,:) = [];
130 a = alignStack(1,1);
131 b = alignStack(1,2);
132 end
133 m = Vmat(a,b).maxDrec(1,1);
134 n = Vmat(a,b).maxDrec(1,2);
135 alignStack = [m, n; alignStack];
136 a = m;
137 b = n;
138 end
139 GlobeResult{1} = alignStack; %保存路径,最后输出
140 %% 开始回溯
141 Temp = alignStack(1,:);
142 alignStack(1,:) =[];
143 i_1 = alignStack(1,1);
144 j_1 = alignStack(1,2);
145 Count = 1;
146 isOutInStack = 0;
147 while size(alignStack,1) > 1
148
149 judgeVia = cleanVector(Vmat(i_1,j_1).maxDrec,Temp);
150 if size(judgeVia,1) == 0
151 Temp = alignStack(1,:);
152 alignStack(1,:) = [];
153 Vmat(Temp(1),Temp(2)).isGetted = 0;
154 isOutInStack =1;
155
156 else
157 if isOutInStack==0;
158 alignStack = [judgeVia(1,:);alignStack];
159 elseif isOutInStack == 1
160 Temp = alignStack(1,:);
161 length_ju = size(judgeVia,1);
162 popCount = 0;
163 for k = 1:length_ju
164 if Vmat(judgeVia(k,1),judgeVia(k,2)).isGetted==1
165 alignStack = [judgeVia(k,:);alignStack];
166 popCount = 1;
167 isOutInStack = 0;
168 end
169 end
170 if popCount ==0
171 Temp = alignStack(1,:);
172 alignStack(1,:) = [];
173 Vmat(Temp(1),Temp(2)).isGetted = 0;
174 isOutInStack =1;
175 end
176 end
177 end
178 i_1 = alignStack(1,1);
179 j_1 = alignStack(1,2);
180
181 if i_1==length_1&&length_2==j_1
182 break;
183 end
184
185 if i_1 == 1&&j_1==1
186 align_temp = GlobeResult{end};
187 if Count==5
188 break;
189 end
190 if Count >=2
191 if sum(sum(align_temp-alignStack))==0;
192 break;
193 end
194 end
195 Count = Count +1;
196 GlobeResult{Count} =alignStack;
197 end
198 VmatNew = Vmat;
199 end
200 end
201 %------------------------------------------------------------------
202 function MatNew = cleanVector(Mat,Vec)
203 length_Mat = size(Mat,1);
204 if length_Mat==0
205 MatNew = [];
206 else
207 for i = 1:length_Mat
208 if (Mat(length_Mat +1 - i,:) - Vec).^2==0
209 Mat(length_Mat +1 - i,:) = [];
210 end
211 end
212 MatNew = Mat;
213 end
214
215 end
216 %-------------------------------------------------------------------
217 function printAlignResult(Vmat,ResultCell)
218 %% 打印输出结果
219 lengthOfRes = length(ResultCell);
220 baseList_up = [];
221 baseList_down = [];
222 %% 使用gui来表现结果
223 alignPrint=dialog('name','AlignResult','position', ...
224 [300 200 400 500]','Resize','on');
225 %options.Resize='on';
226 start_set = 430;
227
228 for i = 1:lengthOfRes
229 baseListPosi = ResultCell{i};
230 length_list = size(baseListPosi,1);
231 for j = 1:length_list
232 a = baseListPosi(j,1);
233 b = baseListPosi(j,2);
234 Posi_a = Vmat(a,b).alignChar(1);
235 Posi_b = Vmat(a,b).alignChar(2);
236 [char_a,char_b] = posiToChar(baseListPosi, j,Posi_a,Posi_b);
237 baseList_up = [baseList_up,char_a];
238 baseList_down = [baseList_down,char_b];
239 end
240
241 t = sprintf('The %dth best alignment:
',i);
242 uicontrol('parent',alignPrint,'style','text','string', ...
243 t,'position',[50 start_set 200 20],'fontsize',12);
244 uicontrol('parent',alignPrint,'style','text','string', ...
245 baseList_up,'position',[50 start_set-25 200 20],'fontsize',12);
246 uicontrol('parent',alignPrint,'style','text','string', ...
247 baseList_down,'position',[50 start_set-45 200 20],'fontsize',12);
248
249 start_set = start_set - 80;
250
251 baseList_up = [];
252 baseList_down = [];
253 end
254 uicontrol('parent',alignPrint,'style','pushbutton','position',...
255 [80 10 50 20],'string','exit','callback','delete(gcbf)');
256
257 end
258
259 %--------------------------------------------------------
260 function [char_a,char_b] = posiToChar(baseListPosi, j,Posi_a,Posi_b)
261 if j ==1
262 char_a = '';
263 char_b = '';
264 elseif j > 1
265 if baseListPosi(j,1) == baseListPosi(j-1,1)
266 char_a = '~';
267 char_b = Posi_b;
268 elseif baseListPosi(j,2) == baseListPosi(j-1,2)
269 char_a = Posi_a;
270 char_b = '~';
271 else
272 char_a = Posi_a;
273 char_b = Posi_b;
274 end
275
276 end
277 end
278
279 %-----------------------------------------------------------------%
280
281 function printAlignValue(Vmat)
282 %% 打印结构体矩阵的alignValue
283 [length_1,length_2] = size(Vmat);
284 for i = 1:length_1
285 for j = 1:length_2
286 fprintf('%d ',Vmat(i,j).alignValue);
287 end
288 fprintf('
');
289 end
290 end
291
292 function VmatNew = globeStruct(Vmat,Squence_1,Squence_2)
293 %% GlobeAlignment时的矩阵赋值操作
294 Squence_1 = ['~',Squence_1];
295 Squence_2 = ['~',Squence_2];
296 [length_1,length_2] = size(Vmat);
297 %----边界赋值--------------------------------------------------
298 for i = 1:length_1-1
299 Vmat(i+1,1).alignValue = Vmat(i,1).alignValue + ...
300 fAlign(Squence_1(i+1),Squence_2(1));
301 Vmat(i+1,1).maxDrec = [i,1;Vmat(i+1,1).maxDrec];
302 end
303 for j = 1:length_2-1
304 Vmat(1,j+1).alignValue = Vmat(1,j).alignValue + ...
305 fAlign(Squence_1(1),Squence_2(j+1));
306 Vmat(1,1+j).maxDrec = [1,j; Vmat(1,1+j).maxDrec];
307 end
308 %----内部赋值-------------------------------------------------
309 % Vmat(length_1,length_2).alignValue = 1000;
310 i_temp = 2;j_temp =2;
311 Vmat(1,1).alignValue = 0;
312 Vmat(1,1).maxDrec = [];
313 while i_temp<=length_1&&j_temp<=length_2
314 Vmat = getValues(Vmat, i_temp, j_temp);
315 for i = i_temp+1:length_1
316 Vmat = getValues(Vmat, i, j_temp);
317 end
318
319 for j = j_temp+1:length_2
320 Vmat = getValues(Vmat, i_temp, j);
321 end
322 if j_temp == length_2-1&&i_temp < length_1
323 i_temp = i_temp +1;
324 else
325 if i_temp <= length_1
326 i_temp = i_temp + 1;
327 end
328 if j_temp <= length_2
329 j_temp = j_temp + 1;
330 end
331 end
332
333
334 end
335 VmatNew = Vmat;
336 %%
337 end
338
339 function VmatNew = getValues(Vmat, i, j)
340 aValue = Vmat( i - 1, j - 1).alignValue + ...
341 fAlign(Vmat(i, j).alignChar(1),Vmat(i,j).alignChar(2));
342 bValue = Vmat(i - 1, j).alignValue + ...
343 fAlign(Vmat(i, j).alignChar(1),'~');
344 cValue = Vmat(i, j - 1).alignValue + ...
345 fAlign('~',Vmat(i, j).alignChar(2));
346 AlignValue = max([aValue,bValue,cValue]);
347 if aValue == AlignValue&&i - 1>0&&j -1>0
348 Vmat(i,j).maxDrec = [i-1,j-1;Vmat(i,j).maxDrec];
349 end
350 if bValue == AlignValue&&i-1>0
351 Vmat(i,j).maxDrec = [i-1,j;Vmat(i,j).maxDrec];
352 end
353 if cValue == AlignValue&&j-1>0
354 Vmat(i,j).maxDrec = [i,j-1;Vmat(i,j).maxDrec];
355 end
356
357 Vmat(i,j).alignValue = AlignValue;
358 VmatNew = Vmat;
359
360 end
361
362 function Vmat = getStruct(Squence_1,Squence_2)
363 %% 产生并初始化结构体
364 Squence_1 = ['~',Squence_1];
365 Squence_2 = ['~',Squence_2];
366 length_1 = length(Squence_1);
367 length_2 = length(Squence_2);
368 for i = 1:length_1
369 for j = 1:length_2
370 Vmat(i,j).alignValue = 0;
371 Vmat(i,j).alignChar = [Squence_1(i),Squence_2(j)];
372 Vmat(i,j).maxDrec = [];
373 Vmat(i,j).isGetted = 1;
374 end
375 end
376 end
377
378 function Score = fScore(Squence_1,Squence_2)
379 %% 计算两匹配模式总得分
380 lengthTwo = length(Squence_1);
381 Score = 0;
382 for i = 1:lengthTwo
383 Score = Score + fAlign(Squence_1(i),Squence_2(i));
384 end
385 end
386
387 function score = fAlign(a,b)
388 %% 用于计算单个碱基比对分数
389 if a =='-'&&b =='~'
390 error('Error:~ align ~ is not allowed');
391 elseif a=='~'||b=='~'
392 score = -2;
393 elseif a==b
394 score = 1;
395 else
396 score = -1;
397 end
398 end