Subpackage: bbcflib.gfminer

This package provides algorithms that operate on FeatureStream objects. It is divided into three major groups (the stream, numeric and figure submodules) depending on the algorithm’s return type.

Algorithms can be used via a direct import:

from bbcflib.gfminer import stream
for chrom in track1.chrmeta.keys():
    catstream = stream.concatenate([track1.read(chrom),track2.read(chrom)])

or via the global run function. The latter takes file names as input parameters, while a direct call requires the FeatureStream objects to be created beforehand.

Most functions in gfminer take one or more lists of FeatureStream as parameters, plus additional algorithm-specific parameters.

bbcflib.gfminer.run(**kwargs)[source]

Wrapper function to execute any operation contained in this package, directly from file inputs. Arguments are:

Parameters:
  • operation – (str) the name of the function to be called.
  • output – (str) a filename or a directory to write the results into.
  • assembly – (str) a genome assembly identifier if needed.
  • chromosome – (str) a chromosome name if operation must be restricted to a single chromosome.
  • ... – additional parameters passed to operation.

Example:

run(operation="score_by_feature",
    output="score_output.bed", chromosome="chr1",
    trackScores="density_file.sql", trackFeatures="genes.sql")
bbcflib.gfminer.common.add_name_field(stream)[source]

Adds a unique name to each record in the stream.

bbcflib.gfminer.common.apply(stream, fields, functions)[source]

Applies custom transformations to the respective fields.

Parameters:
  • stream – FeatureStream object.
  • fields – (list of str) list of fields to transform in the output.
  • functions – list of functions to apply to the respective fields.
Return type:

FeatureStream, or list of FeatureStream objects

bbcflib.gfminer.common.cobble(*args, **kwargs)[source]

Fragments overlapping features in stream and applies aggregate[f] function to each field f in common fragments. stream has to be sorted w.r.t. ‘chr’ (if any), ‘start’ and ‘end’.

Example:

[('chr1',10,15,'A',1),('chr1',13,18,'B',-1),('chr1',18,25,'C',-1)]

yields

('chr1', 10, 13, 'A', 1)
('chr1', 13, 15, 'A|B', 0)
('chr1', 15, 18, 'B', -1)
('chr1', 18, 25, 'C', -1)

This is to avoid having overlapping coordinates of features from both DNA strands, which some genome browsers cannot handle for quantitative tracks.

Parameters:
  • stream – FeatureStream object.
  • stranded – (bool) if True, only features of the same strand are cobbled. [False]
  • scored – (bool) if True, each fragment will be attributed a fraction of the original score, based on its length. [False]
Return type:

FeatureStream
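
The fragmentation shown in the example can be sketched on plain tuples, independently of bbcflib. This is only an illustration under stated assumptions: `cobble_sketch` is a hypothetical name, all features are on one chromosome, names are joined with '|', and strands are merged to 0 when they disagree. It ignores the stranded and scored options.

```python
def cobble_sketch(features):
    """Fragment overlapping (chr, start, end, name, strand) tuples.

    Hypothetical stand-in for the behaviour described above; it works on a
    plain list rather than a FeatureStream and assumes a single chromosome.
    """
    # Every feature boundary is a potential fragment boundary.
    cuts = sorted({pos for f in features for pos in (f[1], f[2])})
    out = []
    for left, right in zip(cuts, cuts[1:]):
        # Features that fully cover the fragment [left, right).
        covering = [f for f in features if f[1] <= left and right <= f[2]]
        if not covering:
            continue  # gap between features
        name = "|".join(f[3] for f in covering)
        strands = {f[4] for f in covering}
        strand = strands.pop() if len(strands) == 1 else 0
        out.append((covering[0][0], left, right, name, strand))
    return out


feats = [('chr1', 10, 15, 'A', 1), ('chr1', 13, 18, 'B', -1), ('chr1', 18, 25, 'C', -1)]
for fragment in cobble_sketch(feats):
    print(fragment)
```

The quadratic scan over all features keeps the sketch short; a streaming implementation would rely on the sort order instead.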

bbcflib.gfminer.common.concat_fields(stream, infields, outfield='name', separator='|', as_tuple=False)[source]

Concatenate fields of a stream. Ex.:

('chr1', 12, 'aa', 'bb') -> ('chr1', 12, 'aa|bb')        # as_tuple=False
('chr1', 12, 'aa', 'bb') -> ('chr1', 12, ('aa','bb'))    # as_tuple=True

Parameters:
  • stream – FeatureStream object.
  • infields – (list of str) list of fields to concatenate.
  • outfield – (str) name of the new field created by concatenation of infields (can be an already existing one). [‘name’]
  • separator – (str) char to add between entries from concatenated fields. [‘|’]
  • as_tuple – (bool) join concatenated field entries in a tuple instead of a separator in a single string. [False]
Return type:

FeatureStream object.
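
The two cases above can be reproduced on a single tuple with a few lines of plain Python. This is a sketch, not the library code: `concat_columns` is a hypothetical helper that takes column indices instead of field names and always places the merged value last.

```python
def concat_columns(row, indices, separator="|", as_tuple=False):
    """Merge the columns at `indices` into one trailing column (illustrative only)."""
    values = tuple(row[i] for i in indices)
    # Either keep the values as a tuple or join them into one string.
    merged = values if as_tuple else separator.join(str(v) for v in values)
    kept = tuple(v for i, v in enumerate(row) if i not in indices)
    return kept + (merged,)


print(concat_columns(('chr1', 12, 'aa', 'bb'), [2, 3]))
print(concat_columns(('chr1', 12, 'aa', 'bb'), [2, 3], as_tuple=True))
```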

bbcflib.gfminer.common.copy(stream, n=2)[source]

Return n independent copies of stream. Has to be called before iterating over stream, otherwise it will copy only the remaining items of stream. Loads the whole stream into memory at once.

bbcflib.gfminer.common.duplicate(stream, infield, outfields)[source]

Duplicate one of stream's fields. If outfields has more than one element, the field is copied as many times.

Parameters:
  • stream – FeatureStream object.
  • infield – (str) name of the field to be duplicated.
  • outfields – (str, or list of str) the new field(s) to be created.
bbcflib.gfminer.common.fusion(*args, **kwargs)[source]

Fuses overlapping features in stream and applies aggregate[f] function to each field f. stream has to be sorted w.r.t. ‘chr’ (if any), ‘start’ and ‘end’.

Example:

[('chr1',10,15,'A',1),('chr1',13,18,'B',-1),('chr1',18,25,'C',-1)]

yields

('chr1', 10, 18, 'A|B', 0)
('chr1', 18, 25, 'C', -1)
Parameters:
  • stream – FeatureStream object.
  • stranded – (bool) if True, only features of the same strand are fused. [False]
Return type:

FeatureStream
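
A minimal sketch of the fusing step on plain tuples, assuming sorted input, names joined with '|', and a strand of 0 whenever fused features disagree. `fusion_sketch` is a hypothetical name and the stranded option is not modelled. Features that merely touch (as 'B' and 'C' do at position 18) are left separate, as in the example above.

```python
def fusion_sketch(features):
    """Fuse overlapping, sorted (chr, start, end, name, strand) tuples."""
    out = []
    for chrom, start, end, name, strand in features:
        # Overlap means the new start lies strictly before the current end.
        if out and out[-1][0] == chrom and start < out[-1][2]:
            _, s0, e0, n0, st0 = out[-1]
            merged_strand = st0 if st0 == strand else 0
            out[-1] = (chrom, s0, max(e0, end), n0 + "|" + name, merged_strand)
        else:
            out.append((chrom, start, end, name, strand))
    return out


feats = [('chr1', 10, 15, 'A', 1), ('chr1', 13, 18, 'B', -1), ('chr1', 18, 25, 'C', -1)]
print(fusion_sketch(feats))
```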

bbcflib.gfminer.common.generic_merge(x)[source]

Sum numeric values; concatenate str values; stack tuples; None & None returns None.

bbcflib.gfminer.common.map_chromosomes(stream, chromosomes, keep=False)[source]

Translate the chromosome identifiers in stream into chromosome names of the type ‘chr5’.

Parameters:
  • stream – FeatureStream object.
  • chromosomes – a dictionary of chromosomes, such as genrep.Assembly.chromosomes.
  • keep – (bool) keep all features (True) or only those whose chromosome identifier is recognized (False). [False]
bbcflib.gfminer.common.no_merge(x)[source]

Assuming all elements of x are identical (chr) or irrelevant, return the first non-null element.

bbcflib.gfminer.common.normalize(M, method)[source]

Normalize the vectors of a matrix M using the given method. To apply it to streams, use gfminer.stream.normalize.

Parameters:
  • M – (list of lists, or numpy array) matrix M to normalize.
  • method – normalization method:

    • 'total' divides every score vector by its sum (total number of reads) x 10^7.
    • 'deseq' applies DESeq’s normalization (“size factors”), considering every track as belonging to a different group.
    • 'quantile' applies quantile normalization.
bbcflib.gfminer.common.ordered(fn)[source]

Decorator. Keeps the original order of fields for a stream passing through one of gfminer functions that take and return a FeatureStream, or a list of FeatureStream objects.

bbcflib.gfminer.common.reorder(stream, fields, last=False)[source]

Reorders stream.fields so that fields come first.

Parameters:
  • stream – FeatureStream object.
  • fields – list of field names.
  • last – (bool) if True, reorders fields so that fields come last.
Return type:

FeatureStream

bbcflib.gfminer.common.score_threshold(stream, threshold=0.0, lower=False, strict=False, fields='score')[source]

Filter the features of a track whose score is above or below a certain threshold.

Parameters:
  • stream – FeatureStream, or list of FeatureStream objects.
  • threshold – (float) threshold above/below which features are retained
  • lower – (bool) higher (False) or lower (True) bound.
  • strict – (bool) strictly above/below threshold.
  • fields – (str or list of str) names of the fields to apply the filter to.
Return type:

FeatureStream, or list of FeatureStream objects
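
The interplay of lower and strict can be sketched as a choice between four comparison operators. This is an illustration only: `score_threshold_sketch` is a hypothetical function working on plain tuples with a column index, and it assumes that the non-strict variants keep scores equal to the threshold.

```python
import operator


def score_threshold_sketch(rows, score_index, threshold=0.0, lower=False, strict=False):
    """Yield the rows whose score passes the threshold (illustrative only)."""
    if lower:
        keep = operator.lt if strict else operator.le
    else:
        keep = operator.gt if strict else operator.ge
    return (row for row in rows if keep(row[score_index], threshold))


rows = [(1, 2, 0.5), (3, 4, 2.0), (5, 6, 1.0)]
print(list(score_threshold_sketch(rows, 2, threshold=1.0)))
print(list(score_threshold_sketch(rows, 2, threshold=1.0, lower=True, strict=True)))
```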

bbcflib.gfminer.common.select(stream, fields=None, selection={})[source]

Keeps only specified fields from a stream, and/or only elements matching selection.

Parameters:
  • stream – FeatureStream.
  • fields – (list of str) list of fields to keep in the output.
  • selection – (dict {field:val}) keep only lines such that field has a value equal to val, or is an element of val. E.g. select(f,None,{'chr':['chr1','chr2']}). val can also be a function returning True or False when applied to an element of the field; if True, the element is kept.
Return type:

FeatureStream, or list of FeatureStream objects.

bbcflib.gfminer.common.sentinelize(stream, sentinel=9223372036854775807)[source]

Append sentinel at the end of iterable (avoid StopIteration error).
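
The idea can be sketched as a small generator; `sentinelize_sketch` is a hypothetical name, and `sys.maxsize` is used as a default on the assumption that it plays the role of the large integer in the signature above.

```python
import sys


def sentinelize_sketch(iterable, sentinel=sys.maxsize):
    """Yield every item of `iterable`, then one final sentinel value."""
    for item in iterable:
        yield item
    yield sentinel


stream = sentinelize_sketch([(10, 15), (20, 25)], sentinel=None)
print(next(stream), next(stream), next(stream))
```

A consumer that calls next() in a loop can then test for the sentinel instead of catching StopIteration.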

bbcflib.gfminer.common.shuffled(*args, **kwargs)[source]

Return a stream of randomly located features of the same length and annotation as those of the original stream.

Parameters:
  • stream – FeatureStream object.
  • chrlen – (int) chromosome length. [9223372036854775807]
  • repeat_number – (int) repeat_number random features are yielded per input feature. [1]
  • sorted – (bool) whether or not to sort the output stream. [True]
Return type:

FeatureStream

bbcflib.gfminer.common.sorted_stream(stream, chrnames=[], fields=['chr', 'start', 'end'], reverse=False)[source]

Sorts a stream according to fields values. Will load the entire stream in memory. The order of names in chrnames is used to sort the ‘chr’ field if available.

Parameters:
  • stream – FeatureStream object.
  • chrnames – list of chromosome names.
  • fields – list of field names. [[‘chr’,’start’,’end’]]
  • reverse – reverse order. [False]
Return type:

FeatureStream

bbcflib.gfminer.common.split_field(stream, outfields, infield='name', separator=';', header_split=None, strip_input=False)[source]

Split one field of a stream containing multiple information, into multiple fields. Ex.:

(‘chr1’, 12, ‘aa;bb;cc’) -> (‘chr1’, 12, ‘aa’, ‘bb’, ‘cc’)

('chr1', 12, 'name=aa;strand="+";score=143;additional="X"') -> ('chr1', 12, 'aa', 143, '+', 'additional="X"')

Parameters:
  • stream – FeatureStream object.
  • outfields – (list of str) list of new fields to be created.
  • infield – (str) name of the field to be split. [‘name’]
  • separator – (str) char separating the information in infield‘s entries. [‘;’]
  • header_split – if split entries are field_name/field_value pairs, provides the separator to split them.
  • strip_input – (bool) if True for a field of name/value pairs, will remove from the original field the values that have been succesfully parsed.
bbcflib.gfminer.common.strand_merge(x)[source]

Return 1 (resp.-1) if all elements in x are 1 (resp.-1), 0 otherwise.
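
Read literally, the rule is a three-way case on the set of strands. A minimal sketch, with `strand_merge_sketch` as a hypothetical name:

```python
def strand_merge_sketch(x):
    """Return 1 if all elements are 1, -1 if all are -1, and 0 otherwise."""
    strands = set(x)
    if strands == {1}:
        return 1
    if strands == {-1}:
        return -1
    return 0


print(strand_merge_sketch([1, 1]), strand_merge_sketch([-1, -1]), strand_merge_sketch([1, -1]))
```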

bbcflib.gfminer.common.unroll(stream, regions, fields=['score'])[source]

Creates a stream of end-start items with appropriate fields values at every base position. For example, unroll([(10,12,0.5,'a'), (14,15,1.2,'b')], regions=(9,16)) returns:

FeatureStream([(0,),(0.5,'a'),(0.5,'a'),(0,),(0,),(1.2,'b'),(0,)])
                9      10        11      12   13     14      15
Parameters:
  • stream – FeatureStream object.
  • regions – either a pair (start,end) or an ordered list of such pairs or a FeatureStream interpreted as bounds of the region(s) to return.
  • fields – list of field names in addition to ‘start’,’end’. [[‘score’]]
Return type:

FeatureStream
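
The example above can be reproduced with a short generator over plain tuples. This is a sketch under simplifying assumptions: `unroll_sketch` is a hypothetical name, it accepts a single (start, end) region, features are (start, end, field values...) tuples with half-open coordinates, and uncovered positions yield (0,).

```python
def unroll_sketch(features, region):
    """Yield one tuple of field values per base position in `region`."""
    start, end = region
    for pos in range(start, end):
        # First feature covering this position, if any.
        hit = next((f for f in features if f[0] <= pos < f[1]), None)
        yield hit[2:] if hit else (0,)


feats = [(10, 12, 0.5, 'a'), (14, 15, 1.2, 'b')]
print(list(unroll_sketch(feats, (9, 16))))
```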

BedTools

Binding of the entire collection of BedTools.

The generic binding is:

def bedtools(tool, args=None)

with parameters tool, the name of the tool, and args, a string ("-i file"), a list (["-i","file"]) or a dictionary ({"i": "file"}) of command-line options passed to tool.

Each individual tool has its own call, like:

annotateBed(ex,bedfile,files,wait=True,via='local',**kw)

with obligatory arguments bedfile and files (see the BedTools documentation), and any additional optional arguments via **kw. If wait is True, then the function will wait for completion and return the output filename, otherwise it runs a nonblocking job (with the parameter via) and returns a tuple (bein.Future, filename).
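
For the generic binding, the three accepted forms of args presumably end up as the same command-line token list. The sketch below shows one way to do that normalization; `normalize_args` is a hypothetical helper, not part of bbcflib, and how the real binding treats valueless flags in a dictionary is not covered.

```python
import shlex


def normalize_args(args=None):
    """Turn a string, list or dict of options into a list of CLI tokens."""
    if args is None:
        return []
    if isinstance(args, str):
        # "-i file" -> ["-i", "file"], honouring shell-style quoting.
        return shlex.split(args)
    if isinstance(args, dict):
        tokens = []
        for key, value in args.items():
            tokens.extend(["-" + key, str(value)])
        return tokens
    return [str(a) for a in args]


print(normalize_args("-i file"))
print(normalize_args(["-i", "file"]))
print(normalize_args({"i": "file"}))
```
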

bbcflib.gfminer.bedtools.annotateBed(ex, bedfile, files, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bamToBed(ex, bamfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bamToFastq(ex, bamfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bed12ToBed6(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bedToBam(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bedToIgv(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.bedpeToBam(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.closestBed(ex, afile, bfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.clusterBed(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.complementBed(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.coverageBed(ex, bfile, afile=None, bamfile=None, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.expandCols(ex, bedfile, column, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.fastaFromBed(ex, bedfile, fastafile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.flankBed(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.genomeCoverageBed(ex, genomefile, bedfile=None, bamfile=None, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.getOverlap(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.groupBy(ex, bedfile, groupcol, opcol, operation, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.intersectBed(ex, bfile, afile=None, bamfile=None, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.linksBed(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.mapBed(ex, afile, bfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.maskFastaFromBed(ex, bedfile, fastafile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.mergeBed(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.multiBamCov(ex, bedfile, bamfiles, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.multiIntersectBed(ex, bedfiles, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.nucBed(ex, bedfile, fastafile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.pairToBed(ex, bfile, afile=None, bamfile=None, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.pairToPair(ex, afile, bfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.randomBed(ex, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.shuffleBed(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.slopBed(ex, bedfile, genomefile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.sortBed(ex, bedfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.subtractBed(ex, afile, bfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.tagBam(ex, bedfiles, labels, bamfile, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.unionBedGraphs(ex, files, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.windowBed(ex, bfile, afile=None, bamfile=None, wait=True, via='local', **kw)[source]
bbcflib.gfminer.bedtools.windowMaker(ex, bedfile=None, genomefile=None, wait=True, via='local', **kw)[source]

Submodule: bbcflib.gfminer.stream

The stream module contains algorithms which produce FeatureStream out of one or more FeatureStream objects.

bbcflib.gfminer.stream.intervals.concatenate(*args, **kwargs)[source]

Returns one stream containing all features from a list of tracks, ordered by fields.

Parameters:
  • trackList – list of FeatureStream objects.
  • fields – (list of str) list of fields to keep in the output (at least [‘start’,’end’]).
  • remove_duplicates – (bool) whether to remove items that are identical in several of the tracks in trackList. [False]
  • group_by – (list of str) if specified, elements having all values for these fields in common will be merged into a single element. Other fields are merged according to aggregate if specified, or common.generic_merge by default.
  • aggregate – (dict) for each field name given as a key, its value is the function to apply to the vector containing all different values for this field in order to merge them. E.g. {'score': lambda x: sum(x)} will return the sum of all scores in the output.

Return type:

FeatureStream

bbcflib.gfminer.stream.intervals.selection(trackList, selection)[source]

For each stream in trackList, keep only items satisfying the selection's filters. A selection is entered as a dictionary whose keys are field names, and whose values are the scope of possible entries for each field. Example:

sel = {'chr':['chrI','chrII'], 'start':(1,10000), 'end':(5000,15000), 'count':range(30), ...}
selection(stream, selection=sel)

All filters in a selection must be satisfied for an item to pass through it (AND operator). To give alternative conditions (OR operator), one must give several such selections in a list:

sel = [{'chr':'chrI', 'start':(1,10000)}, {'chr':'chrI', 'end':(1000000,1500000)}]

Values can be tuples (range of values), lists (of possible values), or a single element.

Parameters:
  • trackList – FeatureStream, or list of FeatureStream objects.
  • selection – (dict, or list of dict) the filter described above.
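
The AND/OR semantics described above can be sketched as a predicate on one item. This is an illustration, not the library code: `matches` is a hypothetical function, items are represented as dictionaries, and tuple bounds are assumed to be inclusive.

```python
def matches(item, selection):
    """Return True if `item` (dict field -> value) passes the selection."""
    if isinstance(selection, dict):
        selection = [selection]  # a single dict is one AND-group

    def in_scope(value, scope):
        if isinstance(scope, tuple):           # range of values
            return scope[0] <= value <= scope[1]
        if isinstance(scope, (list, range)):   # list of possible values
            return value in scope
        return value == scope                  # single element

    # OR over the dictionaries, AND over the fields of each dictionary.
    return any(
        all(in_scope(item[field], scope) for field, scope in sel.items())
        for sel in selection
    )


item = {'chr': 'chrI', 'start': 500, 'end': 2000}
print(matches(item, {'chr': ['chrI', 'chrII'], 'start': (1, 10000)}))
print(matches(item, [{'chr': 'chrII'}, {'end': (1000, 3000)}]))
```
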
bbcflib.gfminer.stream.intervals.overlap(*args, **kwargs)[source]

For each stream in trackList, keep only items overlapping at least one element of trackFeatures. The input streams need to be ordered w.r.t ‘chr’, ‘start’ and ‘end’. To be applied chromosome by chromosome. If several tracks are given in either trackList or trackFeatures, they will be concatenated into one.

Parameters:
  • trackList – FeatureStream - the elements to be filtered. If a list of streams is provided, they will be merged (using concatenate).
  • trackFeatures – FeatureStream - the filter. If a list of streams is provided, they will be merged (using concatenate).
  • strict – (bool) if True, only score regions from trackList that entirely contain a feature region of trackFeatures will be returned. [False]
  • annotate – (bool) if True, supplementary annotation (and the corresponding fields) from trackFeatures will be added to the result. [False]
  • flatten – (func) one of None, common.fusion or common.cobble. Function to be applied to trackFeatures before all. [common.cobble]
Return type:

FeatureStream

bbcflib.gfminer.stream.intervals.neighborhood(*args, **kwargs)[source]

Given streams of features and four integers before_start, after_end, after_start and before_end, this will return one or two features for every input feature:

  • Only before_start and after_end are given:

    (start, end, ...) -> (start-before_start, end+after_end, ...)
    
  • Only before_start and after_start are given:

    (start, end, ...) -> (start-before_start, start+after_start, ...)
    
  • Only after_end and before_end are given:

    (start, end, ...) -> (end-before_end, end+after_end, ...)
    
  • If all four parameters are given, a pair of features is generated:

    (start, end, ...) -> (start-before_start, start+after_start, ...)
                         (end-before_end, end+after_end, ...)
    
  • If the boolean parameter on_strand is set to True, then start and end are understood relative to orientation:

    (start, end, -1, ...) -> (start-after_end, start+before_end, -1, ...)
                             (end-after_start, end+before_start, -1, ...)
    (start, end, +1, ...) -> (start-before_start, start+after_start, +1, ...)
                             (end-before_end, end+after_end, +1, ...)
    
Parameters:
  • trackList – list of FeatureStream objects.
  • before_start – (int) number of bp before the feature start.
  • after_end – (int) number of bp after feature end.
  • after_start – (int) number of bp after the feature start.
  • before_end – (int) number of bp before the feature end.
  • on_strand – (bool) True to respect strand orientation. [False]
Return type:

FeatureStream
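
The four unstranded cases listed above reduce to simple coordinate arithmetic. A minimal sketch on a single (start, end) pair, with `neighborhood_sketch` as a hypothetical name; it does not model on_strand or any clipping of negative coordinates.

```python
def neighborhood_sketch(start, end, before_start=None, after_end=None,
                        after_start=None, before_end=None):
    """Return the one or two (start, end) windows derived from a feature."""
    given = lambda *values: all(v is not None for v in values)
    if given(before_start, after_start, before_end, after_end):
        return [(start - before_start, start + after_start),
                (end - before_end, end + after_end)]
    if given(before_start, after_start):
        return [(start - before_start, start + after_start)]
    if given(before_end, after_end):
        return [(end - before_end, end + after_end)]
    if given(before_start, after_end):
        return [(start - before_start, end + after_end)]
    raise ValueError("unsupported combination of parameters")


print(neighborhood_sketch(100, 200, before_start=10, after_end=20))
print(neighborhood_sketch(100, 200, before_start=10, after_start=5,
                          before_end=5, after_end=20))
```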

bbcflib.gfminer.stream.intervals.combine(*args, **kwargs)[source]

Applies a custom function to a list of tracks, such as union, intersection, etc., and return a single result track. The input streams need to be ordered w.r.t ‘chr’, ‘start’ and ‘end’. To be applied chromosome by chromosome.

Only fields of the first track are kept. Values for a common field are merged by default according to common.strand_merge, common.no_merge and common.generic_merge, respectively for strand, chromosome and all others.

Parameters:
  • trackList – list of FeatureStream objects.
  • fn – boolean function to apply, such as bbcflib.gfminer.stream.union.
  • win_size – (int) window size, in bp.
  • aggregate – (dict) for each field name given as a key, its value is the function to apply to the vector containing all trackList's values for this field in order to merge them. E.g. {'score': lambda x: sum(x)/len(x)} will return the average of all trackList's scores in the output.
Return type:

FeatureStream

bbcflib.gfminer.stream.intervals.exclude(x, indexList)[source]

Returns True if x[n] is False for all n in indexList and x[n] is True for at least another n; returns False otherwise.

bbcflib.gfminer.stream.intervals.require(x, indexList)[source]

Returns True if x[n] is True for all n in indexList and x[n] is True for at least another n; returns False otherwise.

bbcflib.gfminer.stream.intervals.disjunction(x, indexList)[source]

Returns True if either all True elements of x are in indexList or none of them.

bbcflib.gfminer.stream.intervals.intersection(x)[source]

Boolean ‘AND’.

bbcflib.gfminer.stream.intervals.union(x)[source]

Boolean ‘OR’.
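
The five boolean functions above can be re-implemented from their one-line descriptions. The sketch below does so on a list x holding one boolean per track (True where that track covers the current position). It is written from the descriptions only, and it assumes that "at least another n" means an index outside indexList.

```python
def intersection(x):
    return all(x)


def union(x):
    return any(x)


def require(x, index_list):
    others = [v for n, v in enumerate(x) if n not in index_list]
    return all(x[n] for n in index_list) and any(others)


def exclude(x, index_list):
    others = [v for n, v in enumerate(x) if n not in index_list]
    return not any(x[n] for n in index_list) and any(others)


def disjunction(x, index_list):
    true_idx = {n for n, v in enumerate(x) if v}
    inside = true_idx & set(index_list)
    # Either every True element is in index_list, or none of them is.
    return inside == true_idx or not inside


flags = [True, True, False]          # tracks 0 and 1 cover this position
print(intersection(flags), union(flags))
print(require(flags, [0]), exclude(flags, [2]), disjunction(flags, [0, 1]))
```

Functions of this shape are what the fn parameter of combine expects; the two-argument ones would need their index list bound first, for example with functools.partial.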

bbcflib.gfminer.stream.intervals.intersect(trackList, **kw)[source]

Return all regions covered by an item of every track in trackList, returning only fields that are common to all tracks in trackList. It is a short name for calling combine with fn=intersection and its other optional keyword arguments. Example:

X1: ___[     chr,A,+,6     ]________[     chr,C,+,3     ]_________
X2: ________[     chr,B,+,8     ]________[     aaa,C,-,4     ]____
R:  ________[ chr,A|B,+,14 ]_____________[ chr,C|C,0,7 ]__________
bbcflib.gfminer.stream.intervals.segment_features(trackList, nbins=10, upstream=None, downstream=None)[source]

Split every feature of a track into nbins equal segments, and optionally adds upstream and downstream flanks. Flanks are specified as a pair (distance, number_of_bins). If the distance is < 1, it is interpreted as a fraction of the feature’s length. A new field ‘bin’ giving the index of each fragment produced from a feature is added to the output track. If the track shows no field ‘strand’, all features are considered as being on the forward strand.

Parameters:
  • trackList – FeatureStream, or list of FeatureStream objects.
  • nbins – (int) number of bins. [10]
  • upstream – (tuple (float,int)) upstream flank.
  • downstream – (tuple (float,int)) downstream flank.
Return type:

FeatureStream, or list of FeatureStream objects

bbcflib.gfminer.stream.scores.merge_scores(*args, **kwargs)[source]

Creates a stream with per-base average of several score tracks:

X1: __________666666666______
X2: _____2222222222__________
R:  _____11111444443333______
Parameters:
  • trackList – list of FeatureStream objects.
  • method – (str) type of average: one of ‘arithmetic’,’geometric’, or ‘sum’ (no average).
Return type:

FeatureStream
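
The diagram can be checked with a per-base sketch. It is illustrative only: `merge_scores_sketch` is a hypothetical name, tracks are lists of (start, end, score) tuples that do not overlap within a track, the result is a position-to-score dictionary rather than a stream, and only the 'arithmetic' and 'sum' methods are covered. Positions missing from a track count as 0 in the average.

```python
from collections import defaultdict


def merge_scores_sketch(tracks, method="arithmetic"):
    """Return {position: merged score} for a list of score tracks."""
    per_base = defaultdict(list)
    for track in tracks:
        for start, end, score in track:
            for pos in range(start, end):
                per_base[pos].append(score)
    n_tracks = len(tracks)
    merged = {}
    for pos, scores in per_base.items():
        total = sum(scores)
        merged[pos] = total if method == "sum" else total / n_tracks
    return merged


x1 = [(10, 19, 6)]
x2 = [(5, 15, 2)]
merged = merge_scores_sketch([x1, x2])
print(merged[5], merged[12], merged[16])
```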

bbcflib.gfminer.stream.scores.filter_scores(trackScores, trackFeatures, method='sum', strict=False, annotate=False, flatten=common.cobble)[source]

Extract from trackScores only the regions overlapping the regions of trackFeatures. Warning: both score and features streams must be sorted! (use common.sorted_stream if necessary). Example:

X: _____#########__________#############_______
Y: __________666666666___2222776_444___________
R: __________6666__________22776_444___________

Note: trackFeatures is cobbled by default (to avoid score duplications). An alternative is fusion, or nothing. If strand information is present in both trackScores and trackFeatures, only scores inside a region of the same strand are kept.

Parameters:
  • trackScores – (FeatureStream) one -sorted- score track. If a list of streams is provided, they will be merged (using merge_scores).
  • trackFeatures – (FeatureStream) one -sorted- feature track. If a list of streams is provided, they will be merged (using concatenate).
  • method – (str) merge_scores method argument, in case trackScores is a list. [‘sum’]
  • strict – (bool) if True, only score regions from trackScores that are strictly contained in a feature region of trackFeatures will be returned. [False]
  • annotate – (bool) if True, supplementary annotation (and the corresponding fields) from trackFeatures will be added to the result. [False]
  • flatten – (func) one of None, common.fusion or common.cobble. Function to be applied to trackFeatures before all. [common.cobble]
Return type:

FeatureStream

bbcflib.gfminer.stream.scores.score_by_feature(trackScores, trackFeatures, method='mean')[source]

For every feature from trackFeatures, get the list of all scores it contains and apply an operation method on this list (by default, scores are averaged). Warning: both score and feature streams must be sorted! (use common.sorted_stream if necessary). The output is a stream similar to trackFeatures but with an additional score field for each stream in trackScores:

method = 'mean':

X: ------##########--------------##########------
Y: ___________666666666__________6666666666______
R: ______[   3.   ]______________[   6.   ]______


method = 'sum':

X : ------##########--------------##########------
Y1: ___________666666666__________6666666666______
Y2: ___222222_____________________333_____________
R : ______[  30,6  ]______________[  60,9  ]______
Parameters:
  • trackScores – (list of) one or several -sorted- score track(s) (FeatureStream).
  • trackFeatures – (FeatureStream) one -sorted- feature track.
  • method – (str or function) operation applied to the list of scores from one feature. Can be one of 'sum', 'mean', 'median', 'min', 'max', or a custom function.
Return type:

FeatureStream
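
The numbers in the first diagram (mean 3 for a 10 bp feature half covered by a score of 6, or sum 30) follow from a per-base computation in which uncovered bases count as 0. A sketch for one score track, with `score_by_feature_sketch` as a hypothetical name and plain tuples instead of streams:

```python
import statistics


def score_by_feature_sketch(scores, features, method="mean"):
    """Append one aggregated score to each (start, end) feature."""
    ops = {"sum": sum, "mean": statistics.mean, "median": statistics.median,
           "min": min, "max": max}
    fn = ops[method] if isinstance(method, str) else method
    out = []
    for fstart, fend in features:
        per_base = [0.0] * (fend - fstart)   # bases without a score stay at 0
        for sstart, send, score in scores:
            for pos in range(max(sstart, fstart), min(send, fend)):
                per_base[pos - fstart] = float(score)
        out.append((fstart, fend, fn(per_base)))
    return out


features = [(6, 16)]
scores = [(11, 20, 6)]
print(score_by_feature_sketch(scores, features, method="mean"))
print(score_by_feature_sketch(scores, features, method="sum"))
```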

bbcflib.gfminer.stream.scores.window_smoothing(*args, **kwargs)[source]

Given a (list of) signal track(s) trackList, a window_size L (in base pairs by default, or in number of features if featurewise is True), and a step_size, return as many signal tracks with, at each position p (multiple of step_size), the average score in the window [p-L/2, p+L/2]:

X: __________666666666666____________
R: ______12345666666666654321________ (not exact scores here)
Parameters:
  • trackList – FeatureStream, or list of FeatureStream objects.
  • window_size – (int) window size in bp.
  • step_size – (int) step length (one score returned per step_size positions). [1]
  • stop_val – (int) sequence length. [sys.maxint]
  • featurewise – (bool) bp (False), or number of features (True). [False]
Return type:

FeatureStream

Example of windows, window_size=9, step_size=3:

[0,1,2,3,4,5,6,7,8,9), [3,4,5,6,7,8,9,10,11,12), ...

bbcflib.gfminer.stream.scores.normalize(trackList, method='total', field='score')[source]

Normalizes the scores in every stream from trackList using the given method. It assumes that each of the streams represents the same features, i.e. the n-th element of one stream corresponds to the n-th element of another.

[!] This function will temporarily store everything in memory.

Parameters:
  • trackList – FeatureStream, or list of FeatureStream objects.
  • method – normalization method:

    • 'total' divides every score vector by its sum (total number of reads) x 10^7.
    • 'deseq' applies DESeq’s normalization (“size factors”), considering every track as belonging to a different group.
    • 'quantile' applies quantile normalization.
  • field – (str) name of the field containing the scores (must be the same for all streams).
bbcflib.gfminer.stream.annotate.getNearestFeature(*args, **kwargs)[source]

For each element of features, searches the nearest element of annotations and returns a stream similar to features, with additional annotation fields, e.g.:

('chr5',12,14) -> ('chr5',12,14,'geneId|geneName','location_type','distance').

If there are several genes, they are separated by ‘_’: geneId1|geneName1_geneId2|geneName2. For each gene, location_type is one of:

  • ‘Intergenic’ if there are no genes within a distance thresholdInter,
  • ‘Included’ if the feature is included in the gene,
  • ‘Promot’ if the feature is upstream and within thresholdInter of the gene start,
  • ‘Upstream’ if the feature is upstream and beyond the promoter of the gene,
  • ‘3UTR’ if the feature is downstream and within thresholdUTR % of the distance to the next downstream gene,
  • ‘Downstream’ otherwise.

These annotations can be concatenated with ‘_’ as well. The distance to each gene is negative if the feature is included, positive otherwise.

Parameters:
  • features – (FeatureStream) features track.
  • annotations – (FeatureStream) gene annotation track (e.g. as obtained with assembly.gene_track()).
  • thresholdPromot – (int) associates the promoter of each gene whose promoter is within this distance of the feature. Above the threshold, associates only the closest. [2000]
  • thresholdInter – (int) no gene beyond this distance will be considered. [100000]
  • thresholdUTR – (int) in case the feature is surrounded by two eligible genes on the same strand: if the distance to gene1’s 3’UTR upstream is less than thresholdUTR % of the distance between gene1 and gene2, the feature is associated to the 3’UTR of gene1, else to the promoter of gene2. [10]
Return type:

FeatureStream (..., str, str, str).

          <--                   feat                    -->
      ______| thresholdPromot  ++++++   thresholdPromot |______
-----|______|-------------------------------------------|______|----------
      gene 1                                             gene 2

                                  feat
      ______  thresholdInter     ++++++        thresholdInter   ______
-----|______|----------...------------------...----------------|______|---
      gene 1                                                    gene 2

                   feat
     -->          ++++++               -->
     |______  10%             90%     |______
-----|______|------|------------------|______|-----  (attributed to gene1)
      gene 1      thresholdUTR         gene 2

Submodule: bbcflib.gfminer.numeric

The numeric module contains algorithms which return numeric objects (typically numpy.arrays) from one or more FeatureStream objects.

bbcflib.gfminer.numeric.signal.score_array(trackList, fields=['score'])[source]

Returns a numeric array with the fields columns from each input track and a vector of row labels, taken from the name field which must match in all tracks.

bbcflib.gfminer.numeric.signal.vec_reduce(x)[source]

Subtracts the average and divides by the standard deviation.

bbcflib.gfminer.numeric.signal.correlation(trackList, regions, limits=(-1000, 1000), with_acf=False)[source]

Calculates the cross-correlation between two streams and returns a vector containing the correlation at each lag in this order (L/R for resp. limits[0], limits[1]): [L,L+1,...,R-1,R]. If more than two tracks are given in trackList, returns a list of correlation vectors, one for every distinct pair of tracks. If with_acf is True, self-correlations will also be included in the list.

A negative lag indicates that track 2 is shifted to the right w.r.t. track 1; a positive lag, to the left. So to get the correlation at lag +4, one has to look at the (4-L)-th element of the array; for lag -4, at the (-4-L)-th element.

Example:

|_____ /^\ _________|         lag 0
|______________/^\__|


|_____ /^\ _________|         lag -8
   ->   |______________/^\__|


        |_____ /^\ _________| lag +8
|______________/^\__|  <-
Parameters:
  • trackList – list of FeatureStream objects
  • regions – a tuple (start,end) or a FeatureStream with the bounds of the regions to consider (see unroll). In the latter case, all regions will be concatenated.
  • limits – (tuple (int,int)) maximum lag to consider. [-1000,1000]
  • with_acf – (bool) include auto-correlations. [False]
Return type:

list of floats, or list of lists of floats.
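
Since the returned vector is ordered [L, L+1, ..., R-1, R], the position of a given lag is simply lag - L. A small sketch of that lookup, with `lag_to_index` as a hypothetical helper name:

```python
def lag_to_index(lag, limits=(-1000, 1000)):
    """Return the index of `lag` in a correlation vector ordered [L, ..., R]."""
    low, high = limits
    if not low <= lag <= high:
        raise ValueError("lag outside of limits")
    return lag - low


print(lag_to_index(4), lag_to_index(-4))
```

With the default limits the vector has 2001 elements, and lag 0 sits at index 1000.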

bbcflib.gfminer.numeric.regions.feature_matrix(trackScores, trackFeatures, segment=False, method='mean', **kw)[source]

Return an array with as many lines as there are features in trackFeatures, and as many columns as there are score tracks in trackScores. Each element in the matrix thus corresponds to the (average) score of some genomic feature.

If segment is True, each feature will be segmented into bins using bbcflib.gfminer.stream.intervals.segment_features (additional parameters in **kw will be passed to this function). Then each element of the array is itself an array with nbins lines and one column for each track in trackScores.

If segment is False, then each element of the array is an array with one element for each track in trackScores.

Example:

              gene1                 gene2
X: -----#####|#####|#####--------###|###|###-----  (features)
Y: _____________666|66666________666|666|666_____  (scores1)
Z: _____22222|22222|22222________________________  (scores2)

With segment=True, nbins=3:

      Y   Z
R: [[[0.  2.],    # bin0 \
     [2.  2.],    # bin1  } gene1
     [6.  2.]],   # bin2 /
    [[6.  0.],    # bin0 \
     [6.  0.],    # bin1  } gene2
     [6.  0.]]]   # bin2 /

With segment=False:

      Y   Z
R:  [[3.  2.]
     [6.  0.]]

Note: the whole segmented features track will be loaded in memory.

Parameters:
  • trackScores – (FeatureStream, or list of FeatureStream objects) score track(s).
  • trackFeatures – (FeatureStream) feature track.
  • segment – (bool) segment each feature into bins.[False]
  • method – (str) Operation applied to the list of scores for one feature. It is the method argument to stream.score_by_feature - one of ‘sum’,’mean’,’median’,’min’,’max’.
  • **kw

    arguments to pass to segment_features (nbins, upstream, downstream).

Return type:

tuple (numpy.ndarray of strings, numpy.ndarray of floats)

bbcflib.gfminer.numeric.regions.summed_feature_matrix(trackScores, trackFeatures, method='mean', **kw)[source]

Each feature in trackFeatures is segmented into bins using bbcflib.gfminer.stream.segment_features (with parameters passed from **kw). This creates a matrix with a column for each track in trackScores and a row for each bin in the segmented features. The value of a matrix entry is the score from one track in trackScores in one bin, summed over all features.

Example:

              gene1                 gene2
X: -----#####|#####|#####--------###|###|###-----  (features, nbins=3)
Y: _____________666|66666________666|666|666_____
Z: _____22222|22222|22222________________________

     Y   Z
R: [[3.  1.],   # bin 0
    [4.  1.],   # bin 1
    [6.  1.]]   # bin 2

Note: the whole segmented features track will be loaded in memory.

Parameters:
  • trackScores – (FeatureStream, or list of FeatureStream objects) score track(s).
  • trackFeatures – (FeatureStream) feature track.
  • method – (str) Operation applied to the list of scores for one feature. It is the method argument to stream.score_by_feature - one of ‘sum’,’mean’,’median’,’min’,’max’.
  • **kw

    arguments to pass to segment_features (nbins, upstream, downstream).

Return type:

numpy.ndarray, int (number of features)

Submodule: bbcflib.gfminer.figure