The csb.bio.sequence module defines the base interfaces of our sequence
and sequence alignment objects: AbstractSequence and AbstractAlignment.
This module also provides a number of useful enumerations, such as
SequenceTypes and SequenceAlphabets.
AbstractSequence has a number of implementations. These are of course
interchangeable, but have different intents and may differ significantly
in performance. The standard Sequence implementation is what you are
after if all you need is high performance and efficient storage (e.g.
when you are parsing big files). Sequence objects store their underlying
sequences as strings. RichSequence-s, on the other hand, store their
residues as ResidueInfo objects, which have the same basic interface as
the csb.bio.structure.Residue objects. This comes at the expense of
degraded performance. A ChainSequence is a special case of a
rich sequence, whose residue objects are actually real
csb.bio.structure.Residue-s.
Basic usage:
>>> seq = RichSequence('id', 'desc', 'sequence', SequenceTypes.Protein)
>>> seq.residues[1]
<ResidueInfo [1]: SER>
>>> seq.dump(sys.stdout)
>desc
SEQUENCE
See AbstractSequence in the API docs for details.
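The trade-off between string-backed and object-backed storage can be illustrated with a minimal sketch. The classes PlainSeq, ResidueRecord and RichSeq below are hypothetical stand-ins written for this example; they are not the CSB classes and do not reproduce their API.

```python
import sys


class PlainSeq:
    """Hypothetical string-backed sequence: compact, minimal per-residue data."""

    def __init__(self, residues):
        self._residues = residues.upper()

    def residue(self, rank):
        # rank is 1-based, mirroring the convention used in the text
        return self._residues[rank - 1]


class ResidueRecord:
    """Hypothetical per-residue object: room for extra data, but heavier."""

    def __init__(self, rank, code):
        self.rank = rank
        self.code = code


class RichSeq:
    """Hypothetical object-backed sequence: one ResidueRecord per residue."""

    def __init__(self, residues):
        self._residues = [ResidueRecord(rank, code)
                          for rank, code in enumerate(residues.upper(), start=1)]

    def residue(self, rank):
        return self._residues[rank - 1].code


plain = PlainSeq("sequence")
rich = RichSeq("sequence")

# Both classes answer the same question through the same method
same = all(plain.residue(i) == rich.residue(i) for i in range(1, 9))

# Shallow sizes only: the string versus the list plus its record objects
plain_size = sys.getsizeof(plain._residues)
rich_size = sys.getsizeof(rich._residues) + sum(
    sys.getsizeof(record) for record in rich._residues)
```

Because both classes expose the same residue() method, calling code can swap one for the other; the object-backed variant simply pays more memory per residue.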
AbstractAlignment defines a table-like interface to access the data
in an alignment:
>>> ali = SequenceAlignment.parse(">a\nABC\n>b\nA-C")
>>> ali[0, 0]
<SequenceAlignment> # a new alignment, constructed from row #1, column #1
>>> ali[0, 1:3]
<SequenceAlignment> # a new alignment, constructed from row #1, columns #2..#3
which is just a shorthand for using the standard 1-based interface:
>>> ali.rows[1]
<AlignedSequenceAdapter: a, 3> # row #1 (first sequence)
>>> ali.columns[1]
(<ColumnInfo a [1]: ALA>, <ColumnInfo b [1]: ALA>) # residues at column #1
See AbstractAlignment in our API docs for all details and more examples.
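The relationship between 0-based table slicing and 1-based row and column access can be sketched with a toy class. MiniAlignment and its methods are hypothetical names invented for this illustration; the sketch assumes non-negative integer indices and is not how CSB implements its alignments.

```python
def _as_slice(key):
    # Turn a non-negative integer index into an equivalent one-item slice
    return slice(key, key + 1) if isinstance(key, int) else key


class MiniAlignment:
    """Hypothetical toy alignment; rows are equal-length gapped strings."""

    def __init__(self, names, rows):
        if len({len(row) for row in rows}) > 1:
            raise ValueError("all rows must have the same length")
        self.names = list(names)
        self._rows = list(rows)

    @classmethod
    def parse(cls, text):
        names, rows = [], []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                names.append(line[1:])
                rows.append("")
            elif rows:
                rows[-1] += line
        return cls(names, rows)

    def __getitem__(self, key):
        # 0-based table access, ali[row, column]; always returns a new alignment
        row_key, col_key = key
        rows, cols = _as_slice(row_key), _as_slice(col_key)
        return MiniAlignment(self.names[rows],
                             [row[cols] for row in self._rows[rows]])

    def row(self, number):
        # 1-based access to a whole row
        if number < 1:
            raise IndexError("row numbers start at 1")
        return self._rows[number - 1]

    def column(self, number):
        # 1-based access to a whole column, one character per row
        if number < 1:
            raise IndexError("column numbers start at 1")
        return tuple(row[number - 1] for row in self._rows)


ali = MiniAlignment.parse(">a\nABC\n>b\nA-C")
cell = ali[0, 0]        # row #1, column #1
block = ali[0, 1:3]     # row #1, columns #2..#3
```

Here ali[0, 1:3] covers the same cells as columns 2 and 3 of ali.row(1); the slice notation only shifts the numbering by one and wraps the result in a new alignment.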
There are a number of AbstractAlignment implementations defined here.
SequenceAlignment is the default, general-purpose one. A3MAlignment
is more specialized: the first sequence in the alignment is a master
sequence. This alignment is usually used in the context of HHpred. More
important is the StructureAlignment, which is an alignment of
csb.bio.structure.Chain objects. The residues in every aligned sequence
are really the csb.bio.structure.Residue objects taken from those chains.
CSB provides parsers and writers for sequences and alignments in FASTA
format, defined in csb.bio.io.fasta. The most basic usage is:
>>> parser = SequenceParser()
>>> parser.parse_file('sequences.fa')
<SequenceCollection> # collection of L{AbstractSequence}s
This will load all sequences in memory. If you are parsing a huge file,
you can instead read it efficiently, one sequence at a time:
>>> for seq in parser.read('sequences.fa'):
...     pass  # seq is an L{AbstractSequence}
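The idea behind reading sequence by sequence can be sketched with a plain Python generator. The function read_fasta below is a hypothetical helper written for this example, not part of CSB; it assumes simple FASTA input and yields (header, residues) tuples rather than sequence objects.

```python
import io


def read_fasta(stream):
    """Yield (header, residues) pairs one record at a time.

    Only the record currently being read is held in memory.
    """
    header, chunks = None, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the last record once the stream is exhausted
    if header is not None:
        yield header, "".join(chunks)


# An in-memory stream stands in for an open file in this example
records = list(read_fasta(io.StringIO(">a\nABC\nDEF\n>b\nA-C\n")))
```

Wrapping the generator in list() defeats the purpose on a huge file; it is done here only to show the result. In real use, iterate over the generator directly.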
BaseSequenceParser is the central class in this module, which defines
a common infrastructure for all sequence readers. SequenceParser is a
standard implementation, and PDBSequenceParser is specialized to read
FASTA sequences with PDB headers.
For parsing alignments, have a look at SequenceAlignmentReader and
StructureAlignmentFactory.
Finally, this module provides a number of OutputBuilder-s, which know
how to write AbstractSequence and AbstractAlignment objects to FASTA
files:
>>> with open('file.fa', 'w') as out:
        builder = OutputBuilder.create(AlignmentFormats.FASTA, out)
        builder.add_alignment(alignment)
        builder.add_sequence(sequence)
        ...
or you could instantiate any of the OutputBuilder-s directly.
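The builder-plus-factory pattern described above can be sketched as follows. FastaBuilder, BUILDERS and create_builder are hypothetical names made up for this illustration, and the 60-character line width is an assumption; none of this mirrors the actual CSB implementation.

```python
import io


class FastaBuilder:
    """Hypothetical builder that writes FASTA records to a stream."""

    def __init__(self, out, width=60):
        self._out = out
        self._width = width   # assumed line width for wrapped residues

    def add_sequence(self, header, residues):
        self._out.write(">" + header + "\n")
        # Wrap long sequences at a fixed width
        for start in range(0, len(residues), self._width):
            self._out.write(residues[start:start + self._width] + "\n")

    def add_alignment(self, entries):
        # An alignment is written as consecutive gapped sequences
        for header, residues in entries:
            self.add_sequence(header, residues)


# Registry mapping a format name to the builder class that handles it
BUILDERS = {"fasta": FastaBuilder}


def create_builder(fmt, out):
    """Factory: pick a builder by format name, or fail loudly."""
    if fmt not in BUILDERS:
        raise ValueError("unsupported format: " + repr(fmt))
    return BUILDERS[fmt](out)


buffer = io.StringIO()
builder = create_builder("fasta", buffer)
builder.add_alignment([("a", "ABC"), ("b", "A-C")])
builder.add_sequence("c", "SEQUENCE")
output = buffer.getvalue()
```

The factory keeps calling code independent of the concrete builder class, while direct instantiation, FastaBuilder(buffer) in this sketch, stays available when the format is known in advance.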