The csb.bio.sequence
module defines the base interfaces of our sequence
and sequence alignment objects: AbstractSequence
and AbstractAlignment
.
This module provides also a number of useful enumerations, like
SequenceTypes
and SequenceAlphabets
.
AbstractSequence
has a number of implementations. These are of course
interchangeable, but have different intents and may differ significantly
in performance. The standard Sequence
implementation is what you are
after if all you need is high performance and efficient storage (e.g.
when you are parsing big files). Sequence
objects store their underlying
sequences as strings. RichSequence
-s on the other hand will store their
residues as ResidueInfo
objects, which have the same basic interface as
the csb.bio.structure.Residue
objects. This of course comes at the
expense of degraded performance. A ChainSequence
is a special case of a
rich sequence, whose residue objects are actually real
csb.bio.structure.Residue
-s.
Basic usage:
>>> seq = RichSequence('id', 'desc', 'sequence', SequenceTypes.Protein)
>>> seq.residues[1](1)
<ResidueInfo [1](1): SER>
>>> seq.dump(sys.stdout)
>desc
SEQUENCE
See AbstractSequence
in the API docs for details.
AbstractAlignment
defines a table-like interface to access the data
in an alignment:
>>> ali = SequenceAlignment.parse(">a\nABC\n>b\nA-C")
>>> ali[0, 0](0,-0)
<SequenceAlignment> # a new alignment, constructed from row #1, column #1
>>> ali[0, 1:3](0,-1_3)
<SequenceAlignment> # a new alignment, constructed from row #1, columns #2..#3
which is just a shorthand for using the standard 1-based interface:
>>> ali.rows[1](1)
<AlignedSequenceAdapter: a, 3> # row #1 (first sequence)
>>> ali.columns[1](1)
(<ColumnInfo a [1](1)(1): ALA>, <ColumnInfo b [1](1)(1): ALA>) # residues at column #1
See AbstractAlignment
in our API docs for all details and more examples.
There are a number of AbstractAlignment
implementations defined here.
SequenceAlignment
is the default one, nothing surprising. A3MAlignment
is a more special one: the first sequence in the alignment is a master
sequence. This alignment is usually used in the context of HHpred. More
important is the StructureAlignment, which is an alignment of
csb.bio.structure.Chain
objects. The residues in every aligned sequence
are really the csb.bio.structure.Residue
objects taken from those chains.
CSB provides parsers and writers for sequences and alignments in FASTA
format, defined in csb.bio.io.fasta
. The most basic usage is:
>>> parser = SequenceParser()
>>> parser.parse_file('sequences.fa')
<SequenceCollection> # collection of L{AbstractSequence}s
This will load all sequences in memory. If you are parsing a huge file, then you could efficiently read the file sequence by sequence:
>>> for seq in parser.read('sequences.fa'):
... # seq is an L{AbstractSequence}
BaseSequenceParser
is the central class in this module, which defines
a common infrastructure for all sequence readers. SequenceParser
is a
standard implementation, and PDBSequenceParser
is specialized to read
FASTA sequences with PDB headers.
For parsing alignments, have a look at SequenceAlignmentReader
and
StructureAlignmentFactory
.
Finally, this module provides a number of OutputBuilder
-s, which know
how to write AbstractSequence
and AbstractAlignment
objects to FASTA
files:
>>> with open('file.fa', 'w') as out:
builder = OutputBuilder.create(AlignmentFormats.FASTA, out)
builder.add_alignment(alignment)
builder.add_sequence(sequence)
...
or you could instantiate any of the OutputBuilder
-s directly.