Filter FASTA files

Use a regular expression to filter sequences by id from a FASTA file, e.g. to keep just certain chromosomes from a genome. Other tools exist, but they are either part of bigger packages that have to be installed (and have no regex support), mostly awk-based awkward (sorry for the pun) bash solutions, or scripts that depend on extra packages and still lack support for regular expressions. This, however, is a simple, straightforward little Python script for a simple task. It doesn't do anything else and doesn't need anything but a stock Python installation. It is based on the FASTA reader snippet.

Download here.

Usage:

python FASTAfilter.py [-h] regex infile outfile

From a FASTA-file with multiple >entries, filter by sequence ids using a
regex.

positional arguments:
  regex       Regex to filter entry ids, e.g. 'chr[1-4]'. Note that the id
              does not contain the initial > character.
  infile      A FASTA input file, usually with multiple entries.
  outfile     The new file with only the matching entries.

optional arguments:
  -h, --help  show this help message and exit
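
For orientation, here is a minimal sketch of how a filter like this could work. It is not the code of FASTAfilter.py itself: the function names read_fasta and filter_fasta are made up for illustration, and it assumes that the id is the first whitespace-separated token of the header line and that the regex is applied with re.search (it may match anywhere in the id).

```python
import re


def read_fasta(lines):
    """Yield (header, sequence_lines) per entry; header excludes the '>'."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def filter_fasta(lines, pattern):
    """Return the output lines of all entries whose id matches the regex."""
    regex = re.compile(pattern)
    out = []
    for header, seq in read_fasta(lines):
        # Assumption: the id is the first token of the header line.
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if regex.search(seq_id):
            out.append(">" + header)
            out.extend(seq)
    return out


if __name__ == "__main__":
    example = [">chr1 test\n", "ACGT\n", ">chr5\n", "GGCC\n"]
    print("\n".join(filter_fasta(example, "chr[1-4]")))
```

Under these assumptions, 'chr[1-4]' would also keep an entry named chr10, because the pattern only has to match somewhere in the id. Anchoring it, e.g. 'chr[1-4]$', avoids that.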

INSTALL:

cd /data/software
wget http://dm516.user.srcf.net/fastafilter/FASTAfilter.zip
unzip FASTAfilter.zip
easy_install argparse

USAGE:

python FASTAfilter.py '[1-9,10,11,12,13,14,15,16,17,18,X]' /dat2/INPUT.fa /dat2/OUTPUT.fa

Quote the regex so that the shell does not treat the brackets as a glob pattern.
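
Note that a bracket expression is a character class: it matches one character out of the set, not the listed numbers. The sketch below compares it with an anchored alternation on a made-up list of ids. It assumes the ids are bare chromosome names and that matching is done with re.search; how FASTAfilter.py applies the regex is not stated here, so treat the stricter pattern as a suggestion to test, not a drop-in replacement.

```python
import re

# Hypothetical ids, as they might appear after the '>' in a genome FASTA.
ids = ["1", "9", "10", "18", "19", "X", "Y", "MT"]

# The pattern from the usage line: a set of single characters
# (digits 0-9, the comma, and X).
char_class = re.compile(r"[1-9,10,11,12,13,14,15,16,17,18,X]")

# A stricter alternative: exactly 1-18 or X, anchored at both ends.
alternation = re.compile(r"^([1-9]|1[0-8]|X)$")

class_hits = [i for i in ids if char_class.search(i)]
alt_hits = [i for i in ids if alternation.search(i)]

print(class_hits)  # ['1', '9', '10', '18', '19', 'X']
print(alt_hits)    # ['1', '9', '10', '18', 'X']
```

The character class keeps every id that contains a digit or an X, so 19 slips through. If the input only holds chromosomes 1-18, X and ids without digits, both patterns give the same result.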

Error:

Traceback (most recent call last):
  File "FASTAfilter.py", line 3, in <module>
    import argparse
ImportError: No module named argparse

Solution:

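Run "easy_install argparse" as the root user. This is only needed on older interpreters such as Python 2.6; argparse has been part of the standard library since Python 2.7 and 3.2.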
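
To check up front whether the interpreter that will run the script can import argparse, a quick test along these lines may help. The function name argparse_available is made up for this example.

```python
import sys


def argparse_available():
    """Return True if this interpreter can import argparse."""
    try:
        import argparse  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("Python %d.%d, argparse available: %s"
          % (sys.version_info[0], sys.version_info[1], argparse_available()))
```

Run it with the same python binary that is used to call FASTAfilter.py, since several Python versions may be installed side by side.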

http://dm516.user.srcf.net/?p=314

Original article: https://www.cnblogs.com/emanlee/p/4574884.html