Subpackage: bbcflib.track

Documentation here.

bbcflib.track.track(path, format=None, **kwargs)[source]

Guess file format and return a Track object of the corresponding subclass (e.g. BedTrack).

Parameters:
  • path – (str) name of/path to a track-like file. If the file does not exist yet, a new track-like file of the requested format will be created at this location on closure, if data is added to the track (using write()).
  • format – (str) format of the file. If not provided, the format is set according to the file’s extension.
  • **kwargs – (dict) parameters of the Track subclass’ constructor. Typically assembly or chrmeta.

bbcflib.track.convert(source, target, chrmeta=None, info=None, mode='write', clip=False)[source]

Converts a file from one format to another. Format can be explicitly specified:

convert(('file1','bed'), ('file2','sql')) ,

otherwise it is guessed first from file extension:

convert('file1.bed', 'file2.sql')

or in the worst case, by reading the first lines of the file.

Parameters:
  • source – (str or tuple) path to the source file, or tuple of the form (path, format).
  • target – (str or tuple) path to the target file, or tuple of the form (path, format).
  • chrmeta – (dict) to specify manually ‘chrmeta’ for both input and output tracks. [None]
  • info – (dict) info that will be available as an attribute of the output track. [None]
  • mode – (str) writing mode: either ‘write’, ‘append’ or ‘overwrite’. [‘write’]
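
The first two steps of the format lookup described above (explicit format, then file extension) can be sketched as follows. This is only an illustration: `resolve_format` is a hypothetical helper, not part of bbcflib, the lowercasing of the extension is an assumption, and the fallback that reads the first lines of the file is left out.

```python
import os

def resolve_format(spec):
    """Return (path, format) from a path or a (path, format) tuple.

    Hypothetical helper: explicit format first, then file extension.
    The content-based fallback mentioned in the text is not implemented.
    """
    if isinstance(spec, tuple):
        path, fmt = spec                  # format given explicitly
        return path, fmt
    ext = os.path.splitext(spec)[1]       # e.g. '.bed', or '' if none
    fmt = ext.lstrip('.').lower() or None # assumption: case-insensitive
    return spec, fmt

source = resolve_format(('file1', 'bed'))
target = resolve_format('file2.sql')
unknown = resolve_format('file3')         # no extension: format stays None
```

A result of None for the format is the case where, according to the description above, the first lines of the file would have to be read.
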
bbcflib.track.strand_to_int(strand='')[source]

Convert +/- into 1/-1 notation for DNA strands.

bbcflib.track.int_to_strand(num=0)[source]

Convert 1/-1 into +/- notation for DNA strands.
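
The two strand conversions are inverse mappings. A minimal stand-in, assuming that any unrecognized value maps to 0 or '.' respectively (the documentation does not state what the library returns in that case), could look like this. The `_sketch` names are hypothetical.

```python
def strand_to_int_sketch(strand=''):
    """Map '+' to 1 and '-' to -1."""
    if strand == '+':
        return 1
    if strand == '-':
        return -1
    return 0      # assumption: fallback for anything else

def int_to_strand_sketch(num=0):
    """Map 1 to '+' and -1 to '-'."""
    num = int(num)
    if num == 1:
        return '+'
    if num == -1:
        return '-'
    return '.'    # assumption: fallback for anything else

round_trip = [int_to_strand_sketch(strand_to_int_sketch(s)) for s in '+-']
```
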

bbcflib.track.format_float(f=0.0)[source]

Return a formatted string from a float or a string representing a float, limited to 4 decimal places.
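
One way to obtain such a string is fixed-point formatting with four decimals. This is a sketch under that assumption; the library may format differently (for example by dropping trailing zeros), and `format_float_sketch` is a hypothetical name.

```python
def format_float_sketch(f=0.0):
    """Format a float, or a string holding a float, with 4 decimals."""
    return "%.4f" % float(f)

from_float = format_float_sketch(3.14159265)
from_string = format_float_sketch("2.5")
```
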

bbcflib.track.format_int(i=0)[source]

Return a formatted string from an integer or a string representing an integer.

bbcflib.track.ucsc_to_ensembl(stream)[source]

Shifts start coordinates 1 base to the right, to map UCSC to Ensembl annotation.

bbcflib.track.ensembl_to_ucsc(stream)[source]

Shifts start coordinates 1 base to the left, to map Ensembl to UCSC annotation.
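
Both conversions amount to adding a fixed offset to the start field of every item in a stream. The sketch below works on plain tuples and a list of field names instead of a FeatureStream; `shift_start` is a hypothetical helper, not the library's implementation.

```python
def shift_start(features, fields, offset):
    """Yield each feature with its 'start' value shifted by offset."""
    i = fields.index('start')
    for feat in features:
        yield feat[:i] + (feat[i] + offset,) + feat[i + 1:]

fields = ['chr', 'start', 'end']
ucsc = [('chr1', 0, 10), ('chr1', 20, 30)]

ensembl = list(shift_start(ucsc, fields, +1))   # UCSC to Ensembl
back = list(shift_start(ensembl, fields, -1))   # Ensembl to UCSC
```

Only the start coordinate changes; the end coordinate is left untouched, as in the descriptions above.
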

class bbcflib.track.Track(path, **kwargs)[source]

Bases: object

Base class grouping the track properties. Subclasses for each specific format are in track/text.py, track/bin.py and track/sql.py, and are instantiated when track.track() is called on a file.

path

Path to the file the Track was generated from.

filehandle

The open Python file object for the file found at self.path. It supports read() and write().

format

Format of the file the track was generated from.

fields

Fields defining the info contained in the track items.

assembly

GenRep assembly ID.

chrmeta

A dictionary with information about the species’ chromosomes, or a GenRep assembly name.

info

A dictionary with meta-data about the track, e.g. data type, such as:

{'datatype': 'signal'}

column_by_name(fields=[], num=True)[source]

Finds a column whose name is in fields. Returns its index (if num is True) or its name.

name[source]

Returns an appropriate name for the track.

class bbcflib.track.FeatureStream(data, fields=None)[source]

Bases: object

Contains an iterator yielding features, and an extra fields attribute. It can be constructed from either an iterator, a cursor, a list or a tuple.

Example:

stream = FeatureStream([('chr',1,2),('chr',3,4)])
stream = FeatureStream((('chr',1,2),('chr',3,4)))
stream = FeatureStream(iter([('chr',1,2),('chr',3,4)]))

def gen():
    for k in range(2):
        yield ('chr',2*k+1,2*k+2)

stream = FeatureStream(gen())

Example of usage:

>>> stream = FeatureStream([('chr',1,2),('chr',3,4)], fields=['chromosome','start','end'])
>>> stream.next()
('chr', 1, 2)
>>> stream.next()
('chr', 3, 4)
>>> stream.data
<listiterator object at 0x10183b650>

>>> stream = FeatureStream([('chr',1,2),('chr',3,4)], fields=['chromosome','start','end'])
>>> for s in stream: print s
('chr', 1, 2)
('chr', 3, 4)
data

An iterator, cursor, list or tuple. Each item is a tuple with as many members as the number of fields.

fields

The list of field names.

__iter__()[source]

iter(self) returns self.data, which is an iterator itself.

next()[source]

Iterating over the stream is iterating over its data.
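
The behaviour described for `data`, `fields`, `__iter__()` and `next()` can be reproduced with a small stand-in class. This is a sketch written for Python 3 (hence `__next__`), not the library's source; `FeatureStreamSketch` is a hypothetical name.

```python
class FeatureStreamSketch(object):
    """Iterator over feature tuples with an extra `fields` attribute."""

    def __init__(self, data, fields=None):
        # Lists and tuples are wrapped in an iterator; iterators,
        # generators and cursors are kept as they are.
        if isinstance(data, (list, tuple)):
            data = iter(data)
        self.data = data
        self.fields = fields

    def __iter__(self):
        # iter(stream) hands back the underlying iterator
        return self.data

    def __next__(self):
        return next(self.data)

    next = __next__  # Python 2 spelling of the same method

stream = FeatureStreamSketch([('chr', 1, 2), ('chr', 3, 4)],
                             fields=['chromosome', 'start', 'end'])
first = next(stream)
rest = list(stream)   # the stream is consumed as it is read
```
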

class bbcflib.track.sql.SqlTrack(path, **kwargs)[source]

Bases: bbcflib.track.Track

Track class for sqlite3 files (extension ”.sql” or ”.db”). Additional attributes:

readonly

If True, tables will not be updated to reflect, e.g. the chrmeta or info attributes.

connection

The sqlite3 file connection.

cursor

The sqlite3 connection cursor.

types

The field types as defined in the sqlite3 tables.

tables[source]

Returns the complete list of SQL tables.

get_range(selection=None, fields=None)[source]

Returns the range of values for the given selection. If fields is None, returns min and max positions, otherwise min and max field values.

read(selection=None, fields=None, order='start, end', **kw)[source]
Parameters:
  • selection – list of dict of the type [{'chr':'chr1','start':(12,24)},{'chr':'chr3','end':(25,45)},...], where tuples represent ranges, or a FeatureStream.
  • fields – (list of str) list of field names (columns) to read.
  • order – (str, comma-separated) fields with respect to which the result must be sorted. [‘start,end’]
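
To illustrate how a selection of this shape could map onto an SQL query, here is a sketch using the standard sqlite3 module on an in-memory database. Several things are assumptions made for the example: the single `features` table, the inclusive treatment of range tuples (BETWEEN), and the helper name `selection_to_where`. The actual table layout and range semantics of SqlTrack are not described here.

```python
import sqlite3

def selection_to_where(selection):
    """Build a WHERE clause and its parameters from a list of dicts.

    Dicts are OR-ed together; conditions inside a dict are AND-ed.
    Field names are trusted here and only values are parameterized.
    """
    clauses, params = [], []
    for sel in selection:
        parts = []
        for field, value in sorted(sel.items()):
            if isinstance(value, tuple):      # a (low, high) range
                parts.append('"%s" BETWEEN ? AND ?' % field)
                params.extend(value)
            else:                             # an exact value
                parts.append('"%s" = ?' % field)
                params.append(value)
        clauses.append('(' + ' AND '.join(parts) + ')')
    return ' OR '.join(clauses), params

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE features (chr TEXT, start INTEGER, "end" INTEGER)')
conn.executemany('INSERT INTO features VALUES (?,?,?)',
                 [('chr1', 30, 40), ('chr1', 15, 20),
                  ('chr3', 10, 30), ('chr2', 1, 5)])

where, params = selection_to_where([{'chr': 'chr1', 'start': (12, 24)},
                                    {'chr': 'chr3', 'end': (25, 45)}])
rows = conn.execute('SELECT chr, start, "end" FROM features WHERE %s '
                    'ORDER BY start, "end"' % where, params).fetchall()
conn.close()
```
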
class bbcflib.track.text.TextTrack(path, **kwargs)[source]

Bases: bbcflib.track.Track

Generic Track class for text files (extension ”.txt” or ”.text”). Additional attributes:

separator

Character separating fields in the file (default ” ”).

intypes

Dictionary with keys field names and values functions that will be called on each item when reading the file (e.g. check the entry type).

outtypes

Dictionary with keys field names and values functions that will be called on each item when writing the file (e.g. format numerics to strings).

header

Indicates the presence of a header.
  • None to skip all consecutive lines starting with “browser”, “track” or “#” (default).
  • False if there is no header.
  • True if it is made of a standard unique line with the same number of fields as the rest of the file (R-like). Then the header will be used to guess the track fields.
  • An int N to indicate that the first N lines of the file should be skipped.
  • A string to indicate that all header lines start with this string.
written

Boolean indicating whether self.filehandle has already been written to. If it has, the default writing mode changes from ‘write’ to ‘append’ (used after writing a header, for instance).

When reading a file, all lines beginning with “browser”, “track” or “#” are skipped. The info attribute will be filled with “key=value” pairs found on a “track” line at the top of the file. The open method takes the argument mode, which can be ‘read’ (default), ‘write’, ‘append’ or ‘overwrite’. Path can also be a URL or a gzipped file.
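
The header handling described above (skip leading “browser”, “track” and “#” lines, collect “key=value” pairs from a “track” line) can be sketched as below. `split_header` is a hypothetical helper; the use of shlex to honour quoted values is an assumption about how such lines are best tokenized, not a statement about the library.

```python
import shlex

HEADER_PREFIXES = ('browser', 'track', '#')

def split_header(lines):
    """Return (info, body): track-line key/values and the data lines."""
    info, body = {}, []
    in_header = True
    for line in lines:
        line = line.rstrip('\n')
        if in_header and line.startswith(HEADER_PREFIXES):
            if line.startswith('track'):
                # shlex keeps quoted values such as name="my track" together
                for token in shlex.split(line)[1:]:
                    if '=' in token:
                        key, value = token.split('=', 1)
                        info[key] = value
            continue
        in_header = False          # first data line ends the header
        body.append(line)
    return info, body

lines = ['browser position chr1:1-100\n',
         'track type=bedGraph name="my track"\n',
         'chr1\t0\t10\t1.5\n']
info, body = split_header(lines)
```
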

open(mode='read')[source]

Unzip (if necessary) and open the file with the given mode.

Parameters:mode – (str) one of ‘read’, ‘write’, ‘append’ or ‘overwrite’
read(selection=None, fields=None, skip=False, **kw)[source]
Parameters:
  • selection – list of dict of the type [{'chr':'chr1','start':(12,24)},{'chr':'chr3','end':(25,45)},...], where tuples represent ranges, or a FeatureStream.
  • fields – (list of str) list of field names (columns) to read.
  • skip – (bool) assuming that lines are grouped by chromosome name, increases reading speed when looping over selections of several/all chromosomes. The first time lines corresponding to a chromosome are read, their position in the file is recorded (self.index). In the next iterations, only lines corresponding to chromosomes either yet unread or present in selection will be read. [False]
write(source, fields=None, mode='write', chrom=None, **kw)[source]

Add data to the track. Effectively writes in the related file.

Parameters:
  • source – (FeatureStream) data to be added to the track.
  • fields – list of field names.
  • mode – (str) file opening mode - one of ‘write’,’overwrite’,’append’. [‘write’]
  • chrom – (str) a chromosome name.
make_header(*args, **kw)[source]

If self is an empty track, this function can be used to write a header in place of the first line of its related file. Info can be given as a dictionary info, as keyword arguments to the function, or as a string. If info is a string, it will be written as it is. In other cases, the header line will start with ‘track’ and each pair of key/value in info is added as ‘key=value’. Example:

make_header(type='bedGraph',name='aaa')
# or
make_header(info={'type':'bedGraph','name':'aaa'})

writes on top of the file:

track type=bedgraph name=aaa
Parameters:
  • info – (dict) information to be written. Keys can be: ‘name’,’description’,’visibility’,’color’,’itemRgb’.
  • mode – (str) writing mode - one of ‘write’,’overwrite’,’append’.
get_range(selection=None, fields=None)[source]

Returns the range of values for the given selection. If fields is None, returns min and max positions, otherwise min and max field values.

class bbcflib.track.text.BedTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for Bed files (extension ”.bed”).

Default fields are:

['chr','start','end','name','score','strand',
'thick_start','thick_end','item_rgb',
'block_count','block_sizes','block_starts']

This list will be shortened depending on the number of items found in the first line of the file.
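
The shortening of the default field list can be pictured as a simple truncation to the number of columns found in the first line. The sketch assumes a tab separator and uses a hypothetical helper name, `bed_fields_for`.

```python
BED_FIELDS = ['chr', 'start', 'end', 'name', 'score', 'strand',
              'thick_start', 'thick_end', 'item_rgb',
              'block_count', 'block_sizes', 'block_starts']

def bed_fields_for(first_line, separator='\t'):
    """Keep as many default fields as there are items in first_line."""
    n_items = len(first_line.rstrip('\n').split(separator))
    return BED_FIELDS[:n_items]

bed6 = bed_fields_for('chr1\t0\t10\tgeneA\t0\t+\n')
bed3 = bed_fields_for('chr1\t0\t10\n')
```
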

class bbcflib.track.text.BedGraphTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for BedGraph files (extension ”.bedGraph” or ”.bedgraph”).

Fields are:

['chr','start','end','score']
class bbcflib.track.text.SgaTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for SGA files (extension ”.sga”).

Fields are:

['chr','start','end','name','strand','score'] (when read)

['chr','name','end','strand','score']         (when written)

Scores are rounded to the upper integer when written (but are supposed to be integer originally).

class bbcflib.track.text.WigTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for Wig files (extension ”.wig”).

Fields are:

['chr','start','end','score']
class bbcflib.track.text.GffTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for GFF files (extension ”.gff” or ”.gtf”).

Fields are:

['chr','source','name','start','end','score','strand','frame','attributes']

with 9th field optional.

class bbcflib.track.text.SamTrack(path, **kwargs)[source]

Bases: bbcflib.track.text.TextTrack

TextTrack class for SAM files (extension ”.sam”).

Fields are:

['name','flag','chr','start','end','mapq','cigar','rnext','pnext','tlen','seq','qual']

according to the SAM specification (http://samtools.sourceforge.net/SAM1.pdf). Here, ‘name’ is the read name (QNAME); ‘chr’ stands for the reference sequence name (RNAME); ‘start’ is the leftmost mapping position (POS); ‘end’ is ‘start’ plus the length of the read. These can be followed by a few optional tags that can be specified as follows:

track("myfile.sam", tags=['XA','MD','NM'])
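
The mapping from the eleven mandatory SAM columns to the twelve fields listed above can be sketched as follows. Assumptions: POS is used unchanged as ‘start’, the read length is taken from the SEQ column, and optional tags are ignored. `parse_sam_line` is a hypothetical helper, not the library's reader.

```python
def parse_sam_line(line):
    """Turn one SAM alignment line into a 12-field tuple."""
    cols = line.rstrip('\n').split('\t')
    (qname, flag, rname, pos, mapq, cigar,
     rnext, pnext, tlen, seq, qual) = cols[:11]   # mandatory columns
    start = int(pos)            # assumption: POS taken as is
    end = start + len(seq)      # 'start' plus the length of the read
    return (qname, int(flag), rname, start, end, int(mapq), cigar,
            rnext, int(pnext), int(tlen), seq, qual)

sam_line = 'read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0\n'
record = parse_sam_line(sam_line)
```
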
class bbcflib.track.bin.BinTrack(path, **kwargs)[source]

Bases: bbcflib.track.Track

Generic Track class for binary files.

class bbcflib.track.bin.BigWigTrack(path, **kwargs)[source]

Bases: bbcflib.track.bin.BinTrack

BinTrack class for BigWig files (extension ”.bigWig”, ”.bigwig” or ”.bw”).

Fields are:

['chr','start','end','score']

Uses bedGraphToBigWig (for writing) and bigWigToBedGraph (for reading), and relies on the BedGraphTrack class.

read(selection=None, fields=None, **kw)[source]
Parameters:
  • selection – list of dict of the type [{'chr':'chr1','start':(12,24)},{'chr':'chr3','end':(25,45)},...], where tuples represent ranges.
  • fields – (list of str) list of field names.
class bbcflib.track.bin.BamTrack(path, **kwargs)[source]

Bases: bbcflib.track.bin.BinTrack

BinTrack class for Bam files (extension ”.bam”).

Fields are:

['chr','start','end','score','name','strand','flag','seq','qual','cigar','tags','paired','positions']

‘score’: mapping quality (MAPQ). ‘name’: read ID. ‘qual’: Phred-scaled read quality (ASCII+33, same as in fastq). ‘cigar’: CIGAR string (match / mismatch / indel etc.). ‘tags’: dictionary of tags, e.g. {‘NH’:12, ...}. ‘paired’: 0 if unpaired, 1 if first read of a pair, 2 if second. ‘positions’: list of positions the read mapped to.
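
In the SAM/BAM flag field, bit 0x1 marks a paired read, 0x40 the first read of a pair and 0x80 the second. A value like the ‘paired’ field could be derived from those bits as sketched below; whether the library computes it exactly this way is an assumption, and `paired_status` is a hypothetical helper.

```python
def paired_status(flag):
    """Return 0 if unpaired, 1 if first read of a pair, 2 if second."""
    if not flag & 0x1:     # 0x1: read is paired
        return 0
    if flag & 0x40:        # 0x40: first read of the pair
        return 1
    if flag & 0x80:        # 0x80: second read of the pair
        return 2
    return 0               # assumption: paired but position unknown

statuses = [paired_status(f) for f in (16, 99, 147)]
```
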

Uses pysam to read the binary bam file and extract the relevant fields. Write is not implemented in this class.

read(selection=None, fields=None, **kw)[source]
Parameters:
  • selection – list of dict of the type [{'chr':'chr1','start':(12,24)},{'chr':'chr3','end':(25,45)},...], where tuples represent ranges.
  • fields – (list of str) list of field names.
count(regions, on_strand=False, strict=True, readlen=None)[source]

Counts the number of reads falling in a given set of regions. Returns a FeatureStream with one element per region, its score being the number of reads overlapping (even partially) this region.

Parameters:
  • regions – any iterable of tuples of the type (chr,start,end).
  • on_strand – (bool) restrict to reads on same strand as region.
  • strict – (bool) restrict to reads entirely contained in the region.
  • readlen – (int) set readlen if strict == True.
Return type:

FeatureStream with fields (at least) [‘chr’,’start’,’end’,’score’].
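
The difference between counting overlapping reads and counting reads entirely contained in a region (the strict option) can be shown on plain tuples. The sketch assumes half-open intervals and reads given as (chr,start,end) tuples; it does not use pysam and `count_reads` is a hypothetical helper.

```python
def count_reads(regions, reads, strict=True):
    """Yield (chr, start, end, score), score being the read count."""
    for chrom, start, end in regions:
        n = 0
        for r_chr, r_start, r_end in reads:
            if r_chr != chrom:
                continue
            if strict:
                # read entirely contained in the region
                hit = start <= r_start and r_end <= end
            else:
                # read overlapping the region, even partially
                hit = r_start < end and r_end > start
            if hit:
                n += 1
        yield (chrom, start, end, n)

reads = [('chr1', 5, 15), ('chr1', 12, 18), ('chr1', 18, 30), ('chr2', 0, 10)]
regions = [('chr1', 10, 20)]

strict_counts = list(count_reads(regions, reads, strict=True))
loose_counts = list(count_reads(regions, reads, strict=False))
```
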

coverage(region, strand=None)[source]

Calculates the number of reads covering each base position within a given region. Returns a FeatureStream where the score is the number of reads overlapping this position.

Parameters:
  • region – tuple (chr,start,end). chr has to be present in the BAM file’s header. start and end are 0-based coordinates, counting from the beginning of feature chr.
  • strand – if not None, computes a strand-specific coverage (‘+’ or 1 for forward strand, ‘-’ or -1 for reverse strand).
Return type:

FeatureStream with fields [‘chr’,’start’,’end’,’score’].
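
Per-base coverage over a region can be sketched on plain (chr,start,end) read tuples as follows. Assumptions: half-open 0-based intervals, one output item per base (the library may merge runs of equal score), and no strand filtering. `coverage_sketch` is a hypothetical name.

```python
def coverage_sketch(region, reads):
    """Yield (chr, pos, pos+1, depth) for every base of the region."""
    chrom, start, end = region
    depth = [0] * (end - start)
    for r_chr, r_start, r_end in reads:
        if r_chr != chrom:
            continue
        # only the part of the read inside the region contributes
        for pos in range(max(r_start, start), min(r_end, end)):
            depth[pos - start] += 1
    for offset, score in enumerate(depth):
        yield (chrom, start + offset, start + offset + 1, score)

cov = list(coverage_sketch(('chr1', 0, 5), [('chr1', 0, 3), ('chr1', 2, 5)]))
scores = [item[3] for item in cov]
```
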
PE_fragment_size(region, midpoint=False, end=False)[source]

Retrieves fragment sizes from paired-end data, and returns a bedgraph-style track:

(chr,start,end,score) = genomic coordinates, average fragment size covering the coordinate
Parameters:
  • region – tuple (chr,start,end). chr has to be present in the BAM file’s header. start and end are 0-based coordinates, counting from the beginning of feature chr and can be omitted.
  • midpoint – attribute length to fragment midpoint (as opposed to all positions within fragment)
  • end – attribute length to fragment left or right end (by setting end="left" or end="right")
Return type:

FeatureStream with fields [‘chr’,’start’,’end’,’score’].
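
The default behaviour described above, where the score at a coordinate is the average size of the fragments covering it, can be sketched on plain (chr,start,end) fragment tuples. Assumptions: half-open 0-based intervals, one output item per covered base, and uncovered positions skipped. The midpoint and end options are not covered, and `fragment_size_profile` is a hypothetical name.

```python
def fragment_size_profile(region, fragments):
    """Yield (chr, pos, pos+1, mean size of fragments covering pos)."""
    chrom, start, end = region
    total = [0] * (end - start)   # summed fragment sizes per position
    count = [0] * (end - start)   # number of fragments per position
    for f_chr, f_start, f_end in fragments:
        if f_chr != chrom:
            continue
        size = f_end - f_start
        for pos in range(max(f_start, start), min(f_end, end)):
            total[pos - start] += size
            count[pos - start] += 1
    for i in range(end - start):
        if count[i]:
            yield (chrom, start + i, start + i + 1, total[i] / float(count[i]))

profile = list(fragment_size_profile(('chr1', 0, 6),
                                     [('chr1', 0, 4), ('chr1', 2, 8)]))
sizes = [item[3] for item in profile]
```
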
