Module: bbcflib.mapseq

This module provides functions for mapping raw reads with bowtie. The most common workflow first uses map_reads, which takes the following arguments (a minimal usage sketch follows the list):

  • 'ex': an execution environment to run jobs in,
  • 'fastq_file': the raw reads as a fastq file,
  • 'chromosomes': a dictionary with keys ‘chromosome_id’ as used in the bowtie indexes and values a dictionary with usual chromosome names and their lengths,
  • 'bowtie_index': the file prefix to the bowtie index,
  • 'maxhits': the maximum number of hits per read to keep in the output (default 5),
  • 'antibody_enrichment': an approximate enrichment ratio of protein-bound sequences over controls (default 50),
  • 'name': sample name to be used in descriptions and file names,
  • 'remove_pcr_duplicates': whether to remove probable PCR amplification artifacts based on a Poisson confidence threshold (default True),
  • 'bwt_args': a list of specific arguments for bowtie (in addition to ["-Sam", str(max(20,maxhits)), "--best", "--strata"]).
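
For orientation, here is a minimal, self-contained sketch of calling map_reads directly; the MiniLIMS path, the fastq path and the bowtie index prefix are placeholders, and the 'Assembly' object is only used here to supply the 'chromosomes' dictionary:

from bein import execution, MiniLIMS
from bbcflib import genrep
from bbcflib.mapseq import map_reads

M = MiniLIMS('/path/to/minilims')                     # placeholder MiniLIMS repository
g_rep = genrep.GenRep('http://bbcftools.vital-it.ch/genrep/', '/db/genrep/')
assembly = genrep.Assembly(assembly='mm9', genrep=g_rep)
with execution(M, description='map one sample') as ex:
    mapped = map_reads(ex, '/path/to/sample.fastq',   # placeholder fastq file
                       assembly.chromosomes,
                       '/db/genrep/mm9/bowtie/mm9',   # placeholder bowtie index prefix
                       maxhits=5, remove_pcr_duplicates=True, via='lsf')
    # 'mapped' holds the full and filtered bam files and the mapping statistics dictionary.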

The function map_groups takes a collection of samples as described in a job object from the frontend module, fetches the fastq files for each of them using a daflims object, uses an 'Assembly' object from genrep to get the bowtie indexes, and runs map_reads for each sample.

The second step in the workflow consists of generating genome-wide density maps with densities_groups, which needs a 'bein' execution environment, the same job and assembly as above, and a dictionary 'file_dict' as returned by map_groups. The function then runs parallel_density_sql on each 'bam' file to obtain a normalized read density profile as a 'sqlite' file.

Below is the script used by the frontend:

from bbcflib import daflims, genrep, frontend, gdv, common
from bbcflib.mapseq import *
from bein import execution, MiniLIMS

limspath = '/path/to/mapseq/minilims'   # local MiniLIMS repository
M = MiniLIMS( limspath )
working_dir = '/path/to/scratch/on/cluster'
hts_key = 'test_key'
via = 'lsf'
gl = { 'hts_mapseq': {'url': 'http://htsstation.epfl.ch/mapseq/'},
       'genrep_url': 'http://bbcftools.vital-it.ch/genrep/',
       'bwt_root': '/db/genrep/',
       'script_path': '/srv/chipseq/lib',
       'lims': {'user': 'alice',
                'passwd': {'lgtf': 'bob123',
                           'gva': 'joe456'}},
       'gdv': {'url': 'http://svitsrv25.epfl.ch/gdv',
               'email': 'alice.ecila@somewhere.edu',
               'key': 'xxxxxxxxxxxxxxxxxxxxxxxx'} }
assembly_id = 'mm9'
htss = frontend.Frontend( url=gl['hts_mapseq']['url'] )
job = htss.job( hts_key )
g_rep = genrep.GenRep( gl['genrep_url'], gl['bwt_root'] )
assembly = genrep.Assembly( assembly=assembly_id, genrep=g_rep )
daflims1 = dict((loc,daflims.DAFLIMS( username=gl['lims']['user'],
                                      password=gl['lims']['passwd'][loc] ))
                for loc in gl['lims']['passwd'].keys())
with execution( M, description=hts_key, remote_working_directory=working_dir ) as ex:
    job = get_fastq_files( ex, job )
    run_fastqc( ex, job, via=via )
    mapped_files = map_groups( ex, job, assembly )
    pdf = add_pdf_stats( ex, mapped_files,
                         dict((k,v['name']) for k,v in job.groups.iteritems()),
                         gl['script_path'] )
    density_files = densities_groups( ex, job, mapped_files, assembly.chromosomes )
    gdv_project = gdv.new_project( gl['gdv']['email'], gl['gdv']['key'],
                                   job.description, assembly.id, gl['gdv']['url'] )
    add_pickle( ex, gdv_project, description='py:gdv_json' )
print ex.id
allfiles = get_files( ex.id, M )
print allfiles
bbcflib.mapseq.bowtie()

Run bowtie with args to map reads against index. See http://bowtie-bio.sourceforge.net/index.shtml.

Returns the filename of bowtie’s output file. args gives the command line arguments to bowtie, and may be either a string or a list of strings.

bbcflib.mapseq.bowtie2()

Run bowtie2 with args to map reads against index. Returns the name of bowtie2's output file. See http://bowtie-bio.sourceforge.net/bowtie2/index.shtml. A hedged usage sketch follows the parameter list.

Parameters:
  • index – (str) path to the bowtie2 index.
  • reads – (list or tuple) if unpaired, a list of paths to each of the fastq files; if paired, a tuple (path_to_R1, path_to_R2).
  • args – (str or list) command line arguments to bowtie2, either a string ("-k 20 ...") or a list of strings (["-k", "20", ...]).
  • as_single_end – (bool) ignore the paired-end nature of the fastq files (map as independent reads).
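
A hedged usage sketch for paired-end reads, assuming (as for the other wrappers in this module) that a bein execution 'ex' is passed as the first argument; the index prefix and fastq paths are placeholders:

# Assumption: 'ex' (a bein execution) is passed first, as for the other wrapped programs.
sam = bowtie2(ex,
              '/db/genrep/mm10/bowtie2/mm10',              # placeholder bowtie2 index prefix
              ('/path/to/R1.fastq', '/path/to/R2.fastq'),  # paired reads as a (R1, R2) tuple
              args=['-k', '20'],                           # report up to 20 alignments per read
              as_single_end=False)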

bbcflib.mapseq.bamstats()

Wrapper to the bamstat program.

This program computes read mapping statistics on a bam file. The output will be parsed and converted to a dictionary.

bbcflib.mapseq.plot_stats()

Wrapper to the pdfstats.R script which generates a pdf report of the mapping statistics.

The input is the dictionary returned by the bamstats call. It is passed as a json file to the R script. Returns the pdf file created by the script.

bbcflib.mapseq.bam_to_density()

Runs the bam2wig program on a bam file and normalizes for the total number of reads provided as argument 'nreads'.

Returns the name of the output wig or sql file(s) (if 'sql' is True).

Use 'convert'=False if the bam already uses chromosome names instead of ids.

Parameters:
  • bamfile – input BAM file
  • output – basename of the output file
  • chromosome_accession – globally unique chromosome identifier
  • chromosome_name – chromosome name in the species context
  • nreads – if 0, normalize by total tag count * 1e-7; if > 0, use 1e-7 * nreads as the factor; by default no normalization
  • merge – value for bam2wig's '-p' option if that option is not already given in 'args'
  • read_extension – bam2wig argument 'Tags (pseudo-)size'
  • se – consider paired ends as single ends when generating the density
  • convert – whether or not to convert chromosome labels
  • sql – whether or not to create an SQL database
  • args – bam2wig arguments given directly by the user

bbcflib.mapseq.add_and_index_bam(ex, bamfile, description='', alias=None, via='local')

Indexes bamfile and adds it to the repository.

The index created is properly associated to bamfile in the repository, so when you use the BAM file later, the index will also be copied into place with the correct name.
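
For example, given an execution 'ex' and a BAM file produced in that execution (the file name and description below are illustrative):

# Index the BAM and register both the file and its index in the repository.
add_and_index_bam(ex, 'sample.bam', description='[bam] sample alignments', via='lsf')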

bbcflib.mapseq.add_bowtie_index(execution, files, description='', alias=None, index=None, bowtie2=False)

Adds an index built from a list of FASTA files to the repository and returns the prefix of the bowtie index. A hedged usage sketch follows the parameter list.

Parameters:
  • files – (list of str) a list of file names of FASTA files. The files are indexed with bowtie-build, then a placeholder is written to the repository, and the six files of the bowtie index are associated to it. Using the placeholder in an execution will properly set up the whole index.
  • description – (str) MiniLIMS description for the index files.
  • alias – (str) an optional alias to give to the whole index so it may be referred to by name in future.
  • index – (str) lets you set the actual name of the index created.
  • bowtie2 – (bool) if False, use bowtie-build instead of bowtie2-build.
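
A hedged sketch of building and registering an index, given a MiniLIMS 'M' as in the script above (file names and alias are illustrative):

with execution(M) as ex:
    # Build a bowtie index from two FASTA files and register it under an alias.
    index_prefix = add_bowtie_index(ex, ['chr1.fa', 'chr2.fa'],
                                    description='test genome index',
                                    alias='test_genome',
                                    bowtie2=False)   # use bowtie-build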

bbcflib.mapseq.add_nh_flag(samfile, out=None)

Adds NH (Number of Hits) flag to each read alignment in samfile.

Scans a BAM file ordered by read name, counts the number of alternative alignments reported and writes them to a BAM file with the NH tag added.

If out is None, a random name is used.
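
For example (the input file name is illustrative; the BAM must be sorted by read name):

# With out=None a random output name is generated and returned.
bam_with_nh = add_nh_flag('aligned_sorted_by_name.bam')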

bbcflib.mapseq.add_pdf_stats(ex, processed, group_names, script_path, description='mapping_report.pdf', via='local')

Runs the ‘plot_stats’ function and adds its pdf output to the execution’s repository.

Arguments are the output of ‘map_groups’ (‘processed’), a dictionary of group_id to names used in the display, the path to the script used by ‘plot_stats’, and the ‘description’ to use in the repository.

Returns the name of the pdf file.

bbcflib.mapseq.densities_groups(ex, job_or_dict, file_dict, chromosomes, via='lsf')

Arguments are:

  • 'ex': a ‘bein’ execution environment to run jobs in,
  • 'job_or_dict': a ‘Frontend’ ‘job’ object, or a dictionary with keys ‘groups’,
  • 'file_dict': a dictionary of files,
  • 'chromosomes': a dictionary with keys ‘chromosome_id’ as used in the bowtie indexes and values a dictionary with usual chromosome names and their lengths,

Returns a dictionary with keys group_id from the job object and values the files of each group ('bam' and 'wig').

bbcflib.mapseq.get_bam_wig_files(ex, job, minilims=None, hts_url=None, suffix=['fwd', 'rev'], script_path='', fetch_unmapped=False, via='lsf')

Replaces file references in the 'job' object with actual file paths. These references are either 'mapseq' keys or urls.

bbcflib.mapseq.get_fastq_files(ex, job, set_seed_length=True)

Replaces file references in the 'job' object with actual file paths. These references are either 'dafl' run descriptions or urls. The 'dafl' argument is a dictionary of 'Daflims' objects (keys are the facility names). If 'set_seed_length' is True, a dictionary job.groups[gid]['seed_lengths'] of seed lengths is constructed, with values corresponding to 70% of the read length.

bbcflib.mapseq.map_groups(ex, job_or_dict, assembly, map_args=None, bowtie2=False, logfile=sys.stdout, debugfile=sys.stderr)

Fetches fastq files and bowtie indexes, and runs the ‘map_reads’ function for a collection of samples described in a ‘Frontend’ ‘job’.

Parameters:
  • ex – a ‘bein’ execution environment to run jobs in,
  • job_or_dict – a ‘frontend.Job’ object, or a dictionary with keys ‘groups’,
  • assembly – a ‘genrep.Assembly’ object.
  • map_args – a dictionary of arguments passed to map_reads.
  • bowtie2 – if False, bowtie will be used instead of bowtie2. The paths to both indexes are expected to differ only by the name of the last containing directory, namely bowtie or bowtie2 respectively.

Returns a dictionary with keys group_id from the job object and values dictionaries mapping run_id to the corresponding return value of the ‘map_reads’ function.

bbcflib.mapseq.map_reads(ex, fastq_file, chromosomes, bowtie_index, bowtie_2=False, maxhits=5, antibody_enrichment=50, keep_unmapped=True, remove_pcr_duplicates=True, bwt_args=None, as_single_end=False, via='lsf')

Runs bowtie in parallel over lsf for the fastq_file input. Returns the full bamfile, its filtered version (see ‘remove_duplicate_reads’) and the mapping statistics dictionary (see ‘bamstats’).

The input file will be split into subfiles if it contains more than 10M lines. The ‘add_nh_flag’ function will be called to add the number of hits per read in the bowtie output. If ‘remove_pcr_duplicates’ is True, the ‘chromosomes’ and ‘maxhits’ arguments are passed to the ‘remove_duplicate_reads’ function and the ‘antibody_enrichment’ will be used as input to the ‘poisson_threshold’ function to compute its ‘pilesize’ argument.

The mapping statistics dictionary is pickled and added to the execution’s repository, as well as both the full and filtered bam files.

bbcflib.mapseq.parallel_bowtie(ex, index, reads, unmapped=None, n_lines=16000000, bowtie_args=[], add_nh_flags=False, bowtie_2=False, as_single_end=False, via='local')

Run bowtie in parallel on pieces of reads.

Splits reads into chunks n_lines long, then runs bowtie with arguments bowtie_args to map each chunk against index. The results are converted to BAM and merged. The name of the single, merged BAM file is returned.

Bowtie does not set the NH flag on its SAM file output. If the add_nh_flags argument is True, this function calculates and adds the flag before merging the BAM files.

The via argument determines how the jobs will be run. The default, 'local', runs them on the same machine in separate threads. 'lsf' submits them via LSF.
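
A hedged sketch, given an execution 'ex'; the index prefix and fastq path are placeholders:

# Map a large fastq file in 4M-line chunks on LSF, adding NH flags before merging.
merged_bam = parallel_bowtie(ex, '/db/genrep/mm9/bowtie/mm9', '/path/to/big.fastq',
                             n_lines=4000000,
                             bowtie_args=['-Sam', '20', '--best', '--strata'],
                             add_nh_flags=True, via='lsf')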

bbcflib.mapseq.parallel_density_sql(ex, bamfile, chromosomes, nreads=1, merge=-1, read_extension=-1, se=False, convert=True, b2w_args=None, via='lsf')

Runs ‘bam_to_density’ for every chromosome in the ‘chromosomes’ list.

Generates 1 or 2 files depending on whether 'merge'>=0 (shift and merge strands into one track) or 'merge'<0 (keep separate tracks for each strand), and returns their basename.
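
A hedged sketch, given an execution 'ex' and reusing the 'assembly' object from the script above; the bam path and read count are placeholders:

# Keep separate strand tracks (merge < 0); 'nreads' is used for normalization (see bam_to_density).
density_prefix = parallel_density_sql(ex, 'sample_filtered.bam', assembly.chromosomes,
                                      nreads=10000000, merge=-1,
                                      read_extension=100, convert=True, via='lsf')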

bbcflib.mapseq.parallel_density_wig(ex, bamfile, chromosomes, nreads=1, merge=-1, read_extension=-1, se=False, convert=True, description='', alias=None, b2w_args=None, via='lsf')

Runs ‘bam_to_density’ in parallel for every chromosome in the ‘chromosomes’ list with ‘sql’ set to False. Returns a single text wig file.

bbcflib.mapseq.poisson_threshold(mu, cutoff=0.95, max_terms=100)

Calculate confidence threshold for Poisson distributions.

Returns the largest integer k such that, for a Poisson distribution random value X with mean ‘mu’, P(X <= k) <= ‘cutoff’. It will calculate no farther than k = ‘max_terms’. If it reaches that point, it raises an exception.
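
As a worked example of the definition above: for mu = 2 the cumulative probabilities P(X <= k) are about 0.135, 0.406, 0.677, 0.857, 0.947 and 0.983 for k = 0..5, so with the default cutoff of 0.95 the largest k with P(X <= k) <= 0.95 is 4:

# Expected to return 4 for mu = 2 under the definition above.
k = poisson_threshold(2.0, cutoff=0.95)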

bbcflib.mapseq.pprint_bamstats(sample_stats, textfile=None)

Pretty-prints sample_stats to stdout.

Parameters:
  • sample_stats (dict) – The input is the dictionary returned by the bamstats call.
  • textfile (string) – If given, output is written to this file instead of stdout.

bbcflib.mapseq.read_sets(reads, keep_unmapped=False)

Groups the alignments in a BAM file by read.

reads should be an iterator over reads, such as the object returned by pysam.Samfile. The SAM/BAM file must be sorted by read name. read_sets removes all unmapped reads and returns an iterator over lists of AlignedRead objects corresponding to the same read.
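
A hedged sketch using the old-style pysam interface; the BAM file name is a placeholder and the file must be sorted by read name:

import pysam

# Iterate over groups of alignments that belong to the same read.
bam = pysam.Samfile('aligned_sorted_by_name.bam', 'rb')
for alignments in read_sets(bam, keep_unmapped=False):
    nhits = len(alignments)    # number of reported hits for this read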

bbcflib.mapseq.remove_duplicate_reads(bamfile, chromosomes, maxhits=None, pilesize=1, convert=False)

Filters a bam file for multi-hits above ‘maxhits’ and for duplicate reads beyond ‘pilesize’.

Reads with NH tag > maxhits are discarded, each genomic position will have at most ‘pilesize’ reads per library and per strand. If the ‘convert’ flag is True, the reference sequence ids are replaced by their names as provided in ‘chromosomes’.
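
A hedged sketch mirroring the way 'map_reads' combines this with 'poisson_threshold'; the mean 'mu', the file name and the chromosome dictionary are illustrative:

mu = 2.0                                      # illustrative per-position read density
pilesize = poisson_threshold(mu, cutoff=0.95)
# Discard reads with NH > 5 and cap each position at 'pilesize' reads per strand.
filtered_bam = remove_duplicate_reads('sample.bam', assembly.chromosomes,
                                      maxhits=5, pilesize=pilesize, convert=True)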

bbcflib.mapseq.run_fastqc(ex, job, via='lsf')

Runs FastQC on the job's fastq files and returns the name of the report file.
