Module: bbcflib.genrep

This module provides an interface to GenRep repositories through two classes. The Assembly class represents a particular entry in GenRep; the GenRep class lets you switch to any GenRep repository and handles all queries. To retrieve an Assembly named ce6, we write:

from bbcflib import genrep
a = genrep.Assembly( assembly='ce6' )

To switch to another instance of genrep:

g = genrep.GenRep( url=my_url, root=my_path )
if g.assemblies_available( 'ce6' ):
    a = genrep.Assembly( assembly='ce6', genrep=g )

Assemblies in GenRep are also assigned unique integer IDs. The unique integer ID for assembly ce6 is 14. We can use these IDs anywhere we would use the name, so the third line in the previous code could equally well be written:

a = genrep.Assembly(14)

class bbcflib.genrep.Assembly(assembly=None, genrep=None, intype=0, fasta=None, annot=None, ex=None, via='local', bowtie2=False)[source]

A representation of a GenRep assembly. To get an assembly from the repository, call the Assembly constructor with either the integer assembly ID or the string assembly name. This returns an Assembly object:

a = Assembly(3)
b = Assembly('mm9')

An Assembly has the following fields:

id

An integer giving the assembly ID in GenRep.

name

A string giving the name of the assembly in GenRep.

index_path

The absolute path to the bowtie/bowtie2/SOAPsplice index for this assembly.

chromosomes

A dictionary of chromosomes in the assembly. The dictionary keys are tuples of the form (chromosome id, RefSeq locus, RefSeq version), and the values are dictionaries with the keys ‘name’ and ‘length’.

bbcf_valid

Boolean.

updated_at
created_at

datetime objects.

nr_assembly_id
genome_id
source_id
intype

All integers. intype is ‘0’ for genomic data, ‘1’ for exons, ‘2’ for transcripts, ‘3’ for junctions.

source_name
md5

annot_track(annot_type='gene', chromlist=None, biotype=['protein_coding'])[source]

Return an iterator over all annotations of a given type in the genome.

Parameters:
  • annot_type – (str) one of ‘gene’, ‘transcript’, ‘exon’, ‘CDS’.
  • chromlist – (list of str) return only features in the specified chromosomes.
  • biotype – (list of str, or None) return only features with the specified biotype(s). If None, all biotypes are selected.
Return type:

track.FeatureStream

annotations_path[source]

Return the path to an annotation file if available (e.g. microbiome).

build_assembly(ex, assembly, fasta, annot, via, bowtie2=False)[source]

Build an Assembly object from files.

chrmeta[source]

Return a dictionary of chromosome meta data of the type {'chr1': {'length': 249250621},'chr2': {'length': 135534747},'chr3': {'length': 135006516}}
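To make the relation between this shape and the chromosomes field concrete, here is a minimal sketch in plain Python. It does not call bbcflib: the helper name chrmeta_from_chromosomes, the chromosome names, and the lengths are assumptions made up for the example, and the real property may be computed differently.

```python
def chrmeta_from_chromosomes(chromosomes):
    """Map each chromosome name to {'length': ...} (hypothetical helper)."""
    return {info["name"]: {"length": info["length"]}
            for info in chromosomes.values()}


# Keys use the documented (chromosome id, RefSeq locus, RefSeq version) form.
# The names and lengths below are illustrative values, not real GenRep data.
chromosomes = {
    (2701, "NC_001135", 4): {"name": "chrIII", "length": 1000},
    (2508, "NC_001137", 2): {"name": "chrV", "length": 2500},
}

chrmeta = chrmeta_from_chromosomes(chromosomes)
chrnames = sorted(chrmeta)  # a chrnames-like list of names
```

With this layout, a chromosome length can be looked up by name, for example chrmeta['chrV']['length'].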

chrnames[source]

Return a list of chromosome names.

create_exome_gtf()[source]

Creates a GTF file representing the exonic structure of the genome (similar to the Ensembl one except only “exon” types are present - not “CDS”, “start codon”, etc.). This file is required to run programs such as rnacounter or HTSeq. It is based on GenRep’s data, so that SQL and GTF data are always consistent, which is not the case if one downloads the GTF directly from Ensembl. Returns the name of the newly created file.

exon_track(chromlist=None, biotype=['protein_coding'])[source]

Return a FeatureStream over all coding exon annotations in the genome: (‘chr’, start, end, ‘exon_id|gene_id|gene_name’, strand, phase).
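The name field of each feature packs three identifiers separated by ‘|’. The following sketch unpacks one such tuple in plain Python. It assumes only the tuple layout given above; the ExonFeature type, the parse_exon_feature helper, and the sample values (including how strand and phase are encoded) are invented for the example.

```python
from collections import namedtuple

# Hypothetical record type, not part of bbcflib.
ExonFeature = namedtuple(
    "ExonFeature",
    "chrom start end exon_id gene_id gene_name strand phase")


def parse_exon_feature(feature):
    """Split the 'exon_id|gene_id|gene_name' field of one exon tuple."""
    chrom, start, end, name, strand, phase = feature
    exon_id, gene_id, gene_name = name.split("|")
    return ExonFeature(chrom, start, end, exon_id, gene_id, gene_name,
                       strand, phase)


# Made-up sample feature in the documented field order.
sample = ("chr1", 100, 250, "EXON_A|GENE_A|geneA", 1, 0)
parsed = parse_exon_feature(sample)
```

The same approach applies to gene_track and transcript_track, whose name fields hold two identifiers instead of three.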

fasta_by_chrom[source]

Returns a dictionary of single chromosome fasta files.

fasta_from_regions(regions, out=None, path_to_ref=None, chunk=50000, shuffled=False, ex=None, intype=0)[source]

Get a fasta file with sequences corresponding to the features in the bed or sqlite file.

Returns a tuple (out,size) where out is the name of the output file (or a dict) and size is the total size of the extracted sequence.

Parameters:
  • regions – (str or dict or list) bed or sqlite file name, or sequence of features. If regions is a dictionary {‘chr’: [[start1,end1],[start2,end2]]} or a list [[‘chr’,start1,end1],[‘chr’,start2,end2]], will simply iterate through its items instead of loading a track from file.
  • out – (str, filehandle or dict) output file name or filehandle. If out is a (possibly empty) dictionary, will return the updated dictionary.
  • path_to_ref – (str or dict) path to a fasta file containing the whole reference sequence, or a dictionary {chr_name: path} as returned by Assembly.untar_genome_fasta.
  • chunk – (int) buffer size (length of the sequence kept in memory before writing).
  • intype – (int) if 2, only transcribed sequences are returned (slices of mature RNAs). In this case, the fasta headers have the form “>assembly_id|transcript_id|genomic_coordinates”. For each of the given regions, one sequence per intersecting cDNA sequence is reported. Defaults to 0.
Return type:

(str,int) or (dict,int)
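The two in-memory forms accepted for regions carry the same information. As a sketch of how either form can be walked uniformly, the helper below (hypothetical, written in plain Python without bbcflib) yields (chr, start, end) triples from a dictionary or a list. The total length computed at the end assumes that end - start is the length of a region, which the text above does not state.

```python
def iter_regions(regions):
    """Yield (chrom, start, end) from either documented in-memory form."""
    if isinstance(regions, dict):
        # {'chr': [[start1, end1], [start2, end2]]}
        for chrom, intervals in regions.items():
            for start, end in intervals:
                yield chrom, start, end
    else:
        # [['chr', start1, end1], ['chr', start2, end2]]
        for chrom, start, end in regions:
            yield chrom, start, end


as_dict = {"chr1": [[10, 20], [30, 45]]}
as_list = [["chr1", 10, 20], ["chr1", 30, 45]]

# Assumption: a region's length is end - start.
total = sum(end - start for _, start, end in iter_regions(as_dict))
```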

fasta_path(intype=None, chromosome=None)[source]

Return the path to the compressed fasta file, for the whole assembly or for a single chromosome.

gene_coordinates(id_list)[source]

Creates a BED-style stream from a list of gene ids.

gene_track(chromlist=None, biotype=['protein_coding'])[source]

Return a FeatureStream over all protein-coding gene annotations in the genome: (‘chr’, start, end, ‘gene_id|gene_name’, strand).

get_exon_mapping()[source]

Return a dictionary {exon_id: Exon instance}

get_features_from_gtf(h, chr=None, method='dico')[source]

Return a dictionary data of the form {key:[[values],[values],...]} containing the result of an SQL request whose parameters are given as a dictionary h. Each [values] list corresponds to one row of the SQL result.

Parameters:
  • chr – (str, or list of str) chromosomes on which to perform the request. By default, every chromosome is searched.
  • method – “dico” or “boundaries”: ?

Available keys for h, and possible values:

  • “keys”: “$,$,...” (fields to SELECT and pass as a key of data)
  • “values”: “$,$,...” (fields to SELECT and pass as respective values of data)
  • “conditions”: “$:#,$:#,...” (filter (SQL WHERE))
  • “uniq”: “whatever” (SQL DISTINCT if specified, no matter what the -string- value is)
  • “at_pos”: “12,36,45,1124,...” (to select only features overlapping this list of positions)

where

  • $ holds for any column/field name in the database
  • # holds for any value in the database

Available database fields:

biotype, type, start, end, strand, frame, gene_id, gene_name, transcript_id, exon_id, exon_number.

Note: giving several field names to “keys” selects unique combinations of these fields. The corresponding keys of data are the concatenation of these fields, joined by ‘;’.
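To illustrate how the entries of h translate into a query and into the shape of data, here is a self-contained sketch built on Python's sqlite3 module and an in-memory table. It is not the bbcflib implementation: the function query_features, the table name features, and the sample rows are assumptions, several conditions are assumed to be combined with AND, and “at_pos” and the method argument are left out.

```python
import sqlite3

# Field names listed under "Available database fields".
FIELDS = {"biotype", "type", "start", "end", "strand", "frame", "gene_id",
          "gene_name", "transcript_id", "exon_id", "exon_number"}


def quoted(field):
    """Validate a field name and quote it ("end" is an SQL keyword)."""
    if field not in FIELDS:
        raise ValueError("unknown field: %r" % field)
    return '"%s"' % field


def query_features(conn, h):
    """Hypothetical stand-in that mimics the documented h semantics."""
    keys = h["keys"].split(",")
    values = h["values"].split(",")
    columns = ", ".join(quoted(f) for f in keys + values)
    distinct = "DISTINCT " if "uniq" in h else ""
    sql = "SELECT %s%s FROM features" % (distinct, columns)
    params = []
    if h.get("conditions"):
        clauses = []
        for condition in h["conditions"].split(","):
            field, value = condition.split(":", 1)
            clauses.append("%s = ?" % quoted(field))
            params.append(value)
        # Assumption: several conditions must all hold.
        sql += " WHERE " + " AND ".join(clauses)
    data = {}
    for row in conn.execute(sql, params):
        # Several key fields are joined with ';' as described in the note.
        key = ";".join(str(item) for item in row[:len(keys)])
        data.setdefault(key, []).append(list(row[len(keys):]))
    return data


# Made-up sample table holding a subset of the documented fields.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE features (gene_id TEXT, type TEXT, '
             'start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", [
    ("geneA", "exon", 10, 50),
    ("geneA", "exon", 80, 120),
    ("geneA", "CDS", 20, 50),
    ("geneB", "exon", 200, 260),
])

h = {"keys": "gene_id", "values": "start,end", "conditions": "type:exon"}
data = query_features(conn, h)
```

Here data maps each gene id to the list of [start, end] pairs of its exons. Passing "keys": "gene_id,type" instead produces combined keys such as 'geneA;exon'.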

get_gene_mapping()[source]

Return a dictionary {gene_id: Gene instance}. Note that the gene’s length is not the sum of the lengths of its exons.

get_links(params)[source]

Returns URLs to features. Example:

assembly.get_links({'name':'ENSMUSG00000085692', 'type':'gene'})

returns the dictionary {"Ensembl":"http://ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000085692"}. If params is a string, then it is assumed to be the name parameter with type=gene.

get_sqlite_url()[source]

Return the url of the sqlite file containing gene annotations.

get_transcript_mapping()[source]

Return a dictionary {transcript_id: Transcript instance}

gtf_to_sql(gtf_path, sql_path=None)[source]

Generate an SQL database based on a GTF file from Ensembl, with some additional treatment (feature start, missing exon IDs, etc.)

map_chromosome_names(names)[source]

Finds the keys in the chromosomes dictionary that correspond to the names or ids given as names. Returns a dictionary, such as:

assembly.map_chromosome_names([3,5,6,47])

{'3': (2701, u'NC_001135', 4),
 '47': None,
 '5': (2508, u'NC_001137', 2),
 '6': (2580, u'NC_001138', 4)}

set_assembly(assembly)[source]

Reset the Assembly attributes to correspond to assembly.

Parameters:
  • assembly – an integer giving the assembly ID, or a string giving the assembly name.

sqlite_path[source]

Return the path to the sqlite file containing gene annotations.

statistics(output=None, frequency=False, matrix_format=False, ex=None)[source]

Return (di-)nucleotide counts or frequencies for an assembly, and write them to the file output if provided. Example of result:

{
    "TT": 13574667,
    "GG": 3344762,
    "CC": 3365555,
    "AA": 13571722,
    "A": 32370285,
    "TA": 6362526,
    "GT": 4841536,
    "AC": 4846697,
    "N": 0,
    "C": 17781115,
    "TC": 6228639,
    "GA": 6231575,
    "CG": 3131283,
    "GC": 3340219,
    "CT": 5079814,
    "AG": 5075950,
    "G": 17758095,
    "TG": 6206098,
    "CA": 6204462,
    "AT": 8875914,
    "T": 32371931
}

Total = A + T + G + C
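As a worked example of the total above, the sketch below turns the single-nucleotide counts from the example result into frequencies by dividing each count by A + T + G + C. It is plain Python and only an assumption about what frequency=True returns; how dinucleotide frequencies are normalized is not stated here, so they are left out. The function name nucleotide_frequencies is hypothetical.

```python
# Single-nucleotide counts copied from the example result above.
counts = {"A": 32370285, "T": 32371931, "G": 17758095, "C": 17781115, "N": 0}


def nucleotide_frequencies(counts):
    """Divide each base count by Total = A + T + G + C (N is ignored)."""
    total = sum(counts[base] for base in "ATGC")
    freqs = {base: counts[base] / total for base in "ATGC"}
    return freqs, total


freqs, total = nucleotide_frequencies(counts)
```

By construction the four frequencies sum to 1, and N does not contribute to the total.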

If matrix_format is True, output is like:

>Assembly: sacCer2
1   0.309798640038793   0.308714120881750   0.190593944221299   0.190893294858157

transcript_track(chromlist=None, biotype=['protein_coding'])[source]

Return a FeatureStream over all protein-coding transcript annotations in the genome: (‘chr’, start, end, ‘transcript_id|gene_name’, strand).

untar_genome_fasta(path_to_ref=None, convert=True)[source]

Untar reference sequence fasta files. Returns a dictionary {chr_name: file_name}

Parameters:
  • path_to_ref – (str) path to the fasta file of the reference sequence (possibly .tar).
  • convert – (bool) True if chromosome names need conversion from id to chromosome name.

class bbcflib.genrep.GenRep(url=None, root=None, config=None, section='genrep')[source]

Create an object to query a GenRep repository.

GenRep is the in-house repository for sequence assemblies for the BBCF in Lausanne. This is an object that wraps its use in Python in an idiomatic way.

Create a GenRep object with the base URL to the GenRep system, and the root path of GenRep’s files. For instance:

g = GenRep('genrep.epfl.ch', '/path/to/genrep/indices')

To get an assembly from the repository, call the assembly method with either the integer assembly ID or the string assembly name. This returns an Assembly object:

a = g.assembly(3)
b = g.assembly('mm9')

You can also pass this to the Assembly call directly:

a = Assembly(assembly='mm9',genrep=g)

assemblies_available(assembly=None, filter_valid=True)[source]

Returns a list of tuples (assembly_name, species) available on GenRep. If assembly is given, tells instead whether an assembly with that name is available.

assembly(assembly, intype=0)[source]

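Kept for backward compatibility. Returns an Assembly object for the given integer assembly ID or string assembly name, as described above.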

get_genrep_objects(url_tag, info_tag, filters=None, params=None)[source]

Get a list of GenRep objects.

Parameters:
  • url_tag – the GenrepObject type (plural)
  • info_tag – the GenrepObject type (singular)
  • filters – a dict that is used to filter the response
  • params – to add some parameters to the query from GenRep.

Example:

To get the genomes related to ‘Mycobacterium leprae’ species:

species = g.get_genrep_objects('organisms', 'organism', {'species':'Mycobacterium leprae'})[0]
genomes = g.get_genrep_objects('genomes', 'genome', {'organism_id':species.id})

get_motif_PWM(genome_id, motif_name, output=None)[source]

Retrieves a motif PWM from its genome_id and name, and saves it in the file named output if output is not None.

get_sequence(chr_id, coord_list, path_to_ref=None, chr_name=None, ex=None)[source]

Parse a slice request to the repository.

Parameters:
  • chr_id – tuple of the type (3066, u'NC_003279', 6) (keys of Assembly.chromosomes).
  • coord_list – (list of (int,int)) sequences’ (start,end) coordinates.
  • path_to_ref – (str) path to a fasta file containing the whole reference sequence.
  • ex – an optional bein execution to use the sam_faidx program.

Fasta headers are assumed to be of the form “>3066_NC_003279.6 (...)”. If path_to_ref is given and the header is different, give any arbitrary value to chr_id and set chr_name to the fasta header, e.g. chr_name=‘chrI’ if the fasta has “>chrI”.
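The assumed header form can be read directly off the chr_id tuple. The short sketch below shows that correspondence in plain Python. The helper name default_fasta_header is hypothetical, and how bbcflib itself builds or matches the header is not shown here.

```python
def default_fasta_header(chr_id):
    """Build the '>id_locus.version' prefix from a chr_id tuple."""
    chrom_id, locus, version = chr_id
    return ">%d_%s.%d" % (chrom_id, locus, version)


# chr_id taken from the parameter description above.
header = default_fasta_header((3066, "NC_003279", 6))
```

A fasta file whose headers do not start with this prefix is the case where chr_name has to be given explicitly.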

is_up()[source]

Check whether the GenRep web service is available.

motifs_available(genome_id=None)[source]

List the motifs available in GenRep. Returns a list such as the following (the first number is the genome id):

[('6 ABF1', 'Saccharomyces cerevisiae S288c - ABF1'),
 ('6 ABF2', 'Saccharomyces cerevisiae S288c - ABF2'),
 ('6 ACE2', 'Saccharomyces cerevisiae S288c - ACE2'), ...]
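
Since the first element of each tuple combines the genome id and the motif name, it can be split to obtain the two arguments that get_motif_PWM expects. The sketch below does this in plain Python on the list shown above; the helper split_motif_key is hypothetical and assumes a single space separates the id from the name.

```python
def split_motif_key(key):
    """Split '6 ABF1' into (6, 'ABF1'); assumes one separating space."""
    genome_id, motif_name = key.split(" ", 1)
    return int(genome_id), motif_name


# Entries copied from the example list above.
motifs = [
    ("6 ABF1", "Saccharomyces cerevisiae S288c - ABF1"),
    ("6 ABF2", "Saccharomyces cerevisiae S288c - ABF2"),
    ("6 ACE2", "Saccharomyces cerevisiae S288c - ACE2"),
]

# Group motif names by genome id.
by_genome = {}
for key, _label in motifs:
    genome_id, motif_name = split_motif_key(key)
    by_genome.setdefault(genome_id, []).append(motif_name)
```
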
class bbcflib.genrep.GenrepObject(info, key)[source]

Class which references all the different objects used by GenRep. In general, you should never instantiate GenrepObject directly, but call a method of the GenRep object instead.

__repr__()[source]
