Pipeline modules ================ Preprocessing modules --------------------- Diagnostics ^^^^^^^^^^^ These modules can be used to assess the quality of the PAR-CLIP library. FastQCModule """""""""""" Uses `FastQC `_ to create a diagnostic report of various quality aspects of the fastq reads. **Pipeline input:** ``fastq`` **Pipeline output:** ================ ================= ============================================================ Parameter Default value Description ================ ================= ============================================================ kmer_length 7 length of the kmers used for detecting overrepresented kmers outdir_name fastQC name of the output directory that stores the report extra_flags [] list of extra flags passed to the command line tool ================ ================= ============================================================ .. _module_bam_analysis: BamPPModule """"""""""" Postprocesses a bam alignment file and creates statistics of mutation rates and mismatch positions. Can clip mutations from the end of the reads. **Pipeline input:** ``bam`` **Pipeline output:** ``bam`` ==================== ================= ========================================================== Parameter Default value Description ==================== ================= ========================================================== remove_n_edge_mut 0 clip mutations that are at the n outermost bases in each alignment max_mut_per_read 1 discard reads that have more mutations than the given integer min_base_quality 0 discard reads that have at least one base with a quality lower than the given integer min_avg_ali_quality 20 discard reads that have an average quality lower than the given integer min_mismatch_quality 20 discard reads with mismatches of quality lower than the given integer dump_raw_data False dump raw data for diagnostic purposes outdir_name bam_analysis name of the output directory storing diagnostics and plots ==================== ================= ========================================================== SoftclipAnalysisModule """""""""""""""""""""" If the mapper allows soft-clipping this module shows the most frequently clipped sequences. The output can be used to understand common unclipped contamination. Creates two files ``3prime_clipped.txt`` and ``3prime_clipped.txt``. Interpretation of the output: :: clipped bases: 2 | clipped sequences: 152 TG 128 CG 7 CT 4 CC 3 TT 3 TC 2 AT 1 CA 1 GG 1 GT 1 The header line contains the number of clipped bases and the number of affected alignments. Below the most commonly clipped sequences are enumerated along with the number of affected alignments in descending order. **Pipeline input:** ``bam`` **Pipeline output:** ==================== ================= ========================================================== Parameter Default value Description ==================== ================= ========================================================== outdir_name bam_analysis name of the output directory storing diagnostics and plots ==================== ================= ========================================================== TransitionFrequencyModule """"""""""""""""""""""""" Plots the frequency of transitions in a bar plot. **Pipeline input:** ``mpileup`` **Pipeline output:** ==================== ================= ========================================================== Parameter Default value Description ==================== ================= ========================================================== output_prefix file name prefix of the output file min_cov 5 only consider sites with at least this coverage y_axis_limit 0 limit of the y axis remove_tmp_files True remove temporary files ==================== ================= ========================================================== Shows the observed substitution rates for all possible base substitutions. The T->C transitions is displayed as P(C|T). Example output: .. figure:: imgs/transition_freq.png Duplicate removal ^^^^^^^^^^^^^^^^^ In order to make meaningful biological conclusions, PCR duplicates should be removed after sequencing. We offer two modules here. A naive approach implemented in :ref:`dupremov_module` and two modules using `umi_tools `_, applicable when the library contains unique molecular identifiers (UMIs). .. _dupremov_module: DuplicateRemovalModule """""""""""""""""""""" A poor man's approach to duplicate removal: this module removes fastq reads with the same sequence. If the PAR-CLIP library uses UMIs, it is recommended to use the more sophisticated modules based on `umi_tools `_. Please refer to :ref:`umi_tools_dedup_module` and :ref:`umi_tools_extract_module`. **Pipeline input:** ``fastq`` **Pipeline output:** ``fastq`` .. _umi_tools_extract_module: UmiToolsExtractModule """"""""""""""""""""" Together with :ref:`umi_tools_dedup_module` this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module extracts the UMI and appends it to the read name. As such it is recommended to run this module as one of the first steps in a pipeline. After mapping the :ref:`umi_tools_dedup_module` module can be used to remove duplicated reads. For further information, please also check the documentation of `umi_tools `_ :cite:`smith2017umi`. ==================== ================= ========================================================== Parameter Default value Description ==================== ================= ========================================================== extra_flags [] list of extra flags passed to the command line tool ==================== ================= ========================================================== **Pipeline input:** ``fastq`` **Pipeline output:** ``fastq`` .. _umi_tools_dedup_module: UmiToolsDedupModule """"""""""""""""""" Together with :ref:`umi_tools_extract_module` this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module deduplicates bam files based on extracted UMIs. This module has to be run after :ref:`umi_tools_extract_module`. For further information, please also check the documentation of `umi_tools `_ :cite:`smith2017umi`. **Pipeline input:** ``bam`` **Pipeline output:** ``bam`` ==================== ================= ========================================================== Parameter Default value Description ==================== ================= ========================================================== extra_flags [] list of extra flags passed to the command line tool ==================== ================= ========================================================== Adapter clipping ^^^^^^^^^^^^^^^^ Adapter clipping is an important step in PAR-CLIP libraries: in case of very small inserts, the beginning of the 3' adapter is present in the reads. :ref:`skewer_module` is the module of choice for removing these adapters. :ref:`clippy_module` clips adapters with less sensitity, but also detects partial 5' adapters. It can be useful when dealing with libraries that were generated from very low amounts of RNA. .. _skewer_module: SkewerAdapterClippingModule """"""""""""""""""""""""""" This module uses `skewer `_ :cite:`jiang2014skewer` to trim the 3' adapter from the PAR-CLIP reads. Additional arguments can be directly passed to skewer's command line call. Please consult skewer's documentation for a detailed description of all available options. **Pipeline input:** ``fastq`` **Pipeline output:** ``fastq`` ================ ================= ============================================================ Parameter Default value Description ================ ================= ============================================================ extra_args [] list of extra flags passed to the command line tool ================ ================= ============================================================ .. _clippy_module: ClippyAdapterClippingModule """"""""""""""""""""""""""" This module removes longer traces of adapters by looking for perfect matches of the adapter ends. It also removes random barcodes and adapters. If you are using :ref:`umi_tools_extract_module` to extract the UMIs don't forget to set ``clipped_5prime_bc`` to ``True``. **Pipeline input:** ``fastq`` **Pipeline output:** ``fastq`` ================= ================= ============================================================ Parameter Default value Description ================= ================= ============================================================ clip_len 10 minimum base pairs required to be detected as adapter sequence clipped_5prime_bc False UMIs already removed from the 5' end ================= ================= ============================================================ Mapping ^^^^^^^ With `Bowtie `_ and `STAR `_ we offer two very different alignment strategies. `STAR` can map spliced reads and use softclipping to remove contaminants automatically. Soft clipping however prevents from mapping transitions at either end of the alignment. Mappers feed unmapped reads back in the pipeline and thus can be chained. STARMapModule """"""""""""" `STAR `_ :cite:`dobin2013star` is a general purpose RNA-seq data mapper. Unmapped reads are returned to the pipeline in fastq format. For detailed configuration options, please also refer to STAR's user manual. **Pipeline input:** ``fastq`` **Pipeline output:** ``bam``, ``fastq`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ genome_index path to the directory containing the STAR genome index n_mismatch 1 maximum number of allowed mismatches n_multimap 1 maximum number of mapping positions. Maps uniquely by default allow_soft_clipping True enable softclipping outdir_name star_out name of the output directory extra_flags [] additional commandline options passed to STAR =================== ================= ============================================================ BowtieMapModule """"""""""""""" `Bowtie `_ :cite:`langmead2010aligning` is a genomic aligner and as such cannot map spliced reads. Unmapped reads are re-queued in the pipeline. **Pipeline input:** ``fastq`` **Pipeline output:** ``bam``, ``fastq`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ genome_index prefix of bowtie's genome index n_mismatch 1 maximum number of allowed mismatches n_multimap 1 maximum number of mapping positions. Maps uniquely by default extra_flags [] additional commandline options passed to bowtie =================== ================= ============================================================ Binding site prediction ^^^^^^^^^^^^^^^^^^^^^^^ We offer two different strategies for predicing binding sites. :ref:`bsfinder_module` does not require mock information and therefore cannot distinguish background binding from factor specific binding events. :ref:`mockinbird_module` is the recommended binding site predictor. It requires parameters trained on a mock experiment. .. _bsfinder_module: BSFinderModule """""""""""""" BSFinder calculates p-values by learning a statistical model on non-specific conversion events. This model cannot distinguish background binding from factor specific binding events. The prediction algorithm was presented in :cite:`torkler2015stammp`. **Pipeline input:** ``mpileup`` **Pipeline output:** ``table`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ pval_threshold 0.005 only sites with a p-value smaller than this are reported min_cov 2 minimum coverage of reported binding sites =================== ================= ============================================================ NaiveBSFinderModule """"""""""""""""""" The naive binding site finder predicts all sites that pass a specified minimum threshold of conversion events. **Pipeline input:** ``mpileup`` **Pipeline output:** ``table`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ min_transitions 2 only sites with at least this many conversion events are reported =================== ================= ============================================================ .. _mockinbird_module: MockinbirdModule """""""""""""""" Mockinbird is the core module that predicts binding sites by harnessing information from a mock experiments. Its input files are generated by the modules in the :ref:`mockinbird_rel_modules` section. **Pipeline input:** ``trtable``, ``mock_model`` **Pipeline output:** ``table`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ plot_dir mockinbird_plots directory for writing out diagnostic plots max_k_mock 10 sites with more specific conversions than ``max_k_mock`` are discarded extra_args [] additional arguments directly passed to the called script =================== ================= ============================================================ Miscellaneous ^^^^^^^^^^^^^ A collection of miscellaneous helper modules. .. _sort_index_module: SortIndexModule """"""""""""""" The SortIndexModule sorts and indexes a bam files using `samtools `_. This is generally required before generating a pileup file. **Pipeline input:** ``bam`` **Pipeline output:** ``bam`` PileupModule """""""""""" Uses `samtools `_ to create a pileup file. Pileup report coverage and transitions per genomic base and are the input of our predictors. Please be aware that the bam file has to be sorted. If in doubt, queue after the :ref:`sort_index_module`. **Pipeline input:** ``bam`` **Pipeline output:** ``mpileup`` NormalizationModule """"""""""""""""""" The normalization module calculates an occupancy by dividing the number of observed transitions by the coverage of a reference experiment. The appropriate reference experiment should reflect the pool of RNA the factor *sees* when *choosing* where to bind. Depending on the binding properties of the protein of interest, an RNA-seq experiment under PAR-CLIP conditions, PAR-CLIP of the RNA polymerase or protocols to capture transient binding such as 4SU-seq may be appropriate. Additionally, SNPs are removed by detecting elevated conversion rates in the normalization pileup file. **Pipeline input:** ``table`` **Pipeline output:** ``table`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ mut_snp_ratio 0.75 ratio of conversations to coverage in the normalization pileup for a site being detected as SNP =================== ================= ============================================================ QuantileCapModule """"""""""""""""" Caps the occupancy value at a given quantile. This module can help removing the influence of outliers on downstream analyses, such as the gene plot. **Pipeline input:** ``table`` **Pipeline output:** ``table`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ max_quantile 0.95 all occupancy values are capped to the value of this quantile =================== ================= ============================================================ Table2FastaModule """"""""""""""""" Converts a binding site table file to fasta by extracting the genomic sequence around the binding site. **Pipeline input:** ``table`` **Pipeline output:** ``fasta`` =================== ================= ============================================================ Parameter Default value Description =================== ================= ============================================================ genome_fasta path to the genome fasta file =================== ================= ============================================================ .. _mockinbird_rel_modules: Mockinbird related modules ^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`mockinbird_module` requires as input a joint dataset of factor of interest and the mock measurement and parameters estimated on the mock. The following modules can be used to create the required input data. .. _prediction_sites_module: PredictionSitesModule """"""""""""""""""""" This module creates a file that contains all sites that are considered in the prediction of binding sites. By default these are all genomic sites that have the transition nucleotide on either strand. This can be restricted by giving gff files of genomic regions of interest. **Pipeline input:** **Pipeline output:** ``sites`` ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ sites_file path to sites file. Will be created if does not exist yet. Will not be recreated if already existing. fasta_file path to genomic fasta file gff_file '' gff file for restricting predictions to specific regions transition_nucleotide T nucleotide that converts in the PAR-CLIP experiment ===================== ================= ============================================================ .. _mock_table_module: MockTableModule """"""""""""""" Converts the pileup file from the mock experiment to a mock table. Required by the :ref:`trtable_module`. **Pipeline input:** **Pipeline output:** ``mocktable`` ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ mock_pileup path to the mock pileup file mock_table path to mock table. Will be created if does not exist yet. Will not be recreated if already existing. ===================== ================= ============================================================ .. _trtable_module: TransitionTableModule """"""""""""""""""""" Combines mock table and factor pileup file to the so called transition table. Depends on the outputs of :ref:`mock_table_module`, :ref:`prediction_sites_module`. **Pipeline input:** ``sites``, ``mock_table``, ``mpileup`` **Pipeline output:** ``trtable`` LearnMockModule """"""""""""""" Learns the model parameters from the transition table. Requires inputs from :ref:`trtable_module` and :ref:`bam_stat_module`. **Pipeline input:** ``trtable`` **Pipeline output:** ``mock_model`` ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ mock_model path to the mock model pickle file. Will be created if does not exist. Will not be recreated if already existing mock_statistics path to the mock bam statistics n_mixture_components 5 number of mixture components for fitting the geometric mixture models em_iterations 250 number of iterations of the EM algorithm fitting the geometric mixture model ===================== ================= ============================================================ .. _bam_stat_module: BamStatisticsModule """"""""""""""""""" This module stores additional information from a ``bam`` file in a json file. If the predicted sites are restrained by a gff file in :ref:`prediction_sites_module`, the same gff file should be used for generating the statistics. **Pipeline input:** ``bam`` **Pipeline output:** ``stat_file`` ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ gff_file '' gff file for restricting the prediction sites ===================== ================= ============================================================ Postprocessing modules ---------------------- Plots ^^^^^ .. _centerplot_bs: CenterPlotBSModule """""""""""""""""" This module plots a bootstrapped metagene plot fixed at the start and end of each annotations. **Pipeline input:** ``table`` **Pipeline output:** ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ gff_file gff file of annotations used for the metagene plot output_prefix file name prefix for the output files labelCenterA label for the metagene start position labelCenterB label for the metagene end position labelBody label for the metagene body downstream_bp 1000 number of base pairs shown downstream of start upstream_bp 1000 number of base pairs shown upstream of the end gene_bp 750 number of base pairs inside the annotation, i.e. downstream of the start and upstream of the end min_trscr_size_bp 1500 filter out all transcript shorter than this size max_trscr_size_bp 100000 filter out all trascript longer than this size smoothing_window 20 size of the window used for smoothing the profile (in bp) remove_tmp_files True clean up temporary files bootstrap_iter 2500 number of bootstrap iterations n_processes 4 number of parallel processes ===================== ================= ============================================================ Example plot: .. figure:: imgs/nrd1_metagene.png :alt: size metagene plot :align: center A plot of the PAR-CLIP occupancy around the start and end sites of genomic annoations. The occupancy profile of the sense strand is depicted in blue, occupancy on the anti-sense strand is shown in green. The profiles are smoothed with a running mean. 95%-confidence intervals are calculated by bootstrap sampling the annotations. The bottom of the plot shows the occupancy profiles in heatmap representation. KmerPerPositionModule """"""""""""""""""""" Plots the kmer occurence frequencies around binding sites. **Pipeline input:** ``table`` **Pipeline output:** ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ genome_fasta path to genome in fasta format (requires index) output_prefix file name prefix for the output files kmer_k 3 length of kmers first_index 0 index of the first site plotted last_index 1500 index of the last site plotted width 50 number of base pairs around the binding site considered sort_key occupancy sort column of the table; valid values are occupancy, transitions, coverage and score. gff_exclude_path '' sites that overlap one of the annotations in this file are dropped gff_padding 20 annotations are extended by this amount of base pairs around start and end remove_tmp_files True clean up temporary files ===================== ================= ============================================================ .. _module_plot_heatmap: HeatmapPlotModule """"""""""""""""" Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth. **Pipeline input:** ``table`` **Pipeline output:** ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ gff_file gff file of annotations used for the metagene plot output_prefix file name prefix for the output files downstream_bp 4000 number of base pairs shown downstream of start upstream_bp 1000 number of base pairs shown upstream of the end min_trscr_size_bp 0 filter out all transcript shorter than this size max_trscr_size_bp 5000 filter out all transcripts longer than this size x_bins 500 number of bins in x direction (transcript length) y_bins 500 number of bins in y direction (grouping transcripts) x_pixels 500 pixel in x directions y_pixels 500 pixels in y direction remove_tmp_files True clean up temporary files ===================== ================= ============================================================ HeatmapSmallPlotModule """""""""""""""""""""" Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth. Similar to :ref:`module_plot_heatmap`, but more consise plot. **Pipeline input:** ``table`` **Pipeline output:** ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ gff_file gff file of annotations used for the metagene plot output_prefix file name prefix for the output files downstream_bp 500 number of base pairs shown downstream of start upstream_bp 1000 number of base pairs shown upstream of the end min_trscr_size_bp 0 filter out all transcript shorter than this size max_trscr_size_bp 5000 filter out all transcripts longer than this size x_bins 500 number of bins in x direction (transcript length) y_bins 500 number of bins in y direction (grouping transcripts) x_pixels 500 pixel in x directions y_pixels 500 pixels in y direction remove_tmp_files True clean up temporary files ===================== ================= ============================================================ Motif detection ^^^^^^^^^^^^^^^ XXmotifModule """"""""""""" Runs XXmotif, a tool for de-novo detection of overrepresented motifs. **Pipeline input:** ``table`` **Pipeline output:** ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ genome_fasta path to genome in fasta format (requires index) output_prefix file name prefix for the output files negative_set_gff path to a gff file used for sampling the negative set n_negative seqs 20000 number of negative sequences sampled first_index 0 index of the first site plotted last_index 1500 index of the last site plotted width 12 number of base pairs around the binding site considered sort_key occupancy sort column of the table; valid values are occupancy, transitions, coverage and score. gff_exclude_path '' sites that overlap one of the annotations in this file are dropped gff_padding 20 annotations are extended by this amount of base pairs around start and end remove_tmp_files True clean up temporary files ===================== ================= ============================================================ Miscellaneous ^^^^^^^^^^^^^ GffFilterModule """"""""""""""" Filter sites that overlap with a given gff annotation. Can be used to filter sites in highly abundant transcripts such as `tRNA` or `rRNA`. Pass the name of the gff features you want to exclude to the ``features`` option. **Pipeline input:** ``table`` **Pipeline output:** ``table`` ===================== ================= ============================================================ Parameter Default value Description ===================== ================= ============================================================ filter_gff path to gff file used for filtering file_postfix fil postfix appended to the table name padding_bp 20 annotations are extended by this amount of base pairs features [] list of features that are filtered. Excludes all by default. ===================== ================= ============================================================ Writing your own modules ------------------------ A straightforward way to implement your own pipeline module is by subclassing :class:`.CmdPipelineModule`. Here we use the :class:`.SkewerAdapterClippingModule` as an example. :: from mockinbird.utils import pipeline as pl class SkewerAdapterClippingModule(pl.CmdPipelineModule): .. _custommod_args: Adding module arguments ^^^^^^^^^^^^^^^^^^^^^^^ If your module accepts arguments, you have to modify the constructor ``__init__(self, pipeline)``: :: def __init__(self, pipeline): cfg_fmt = [ ('extra_args', cv.Annot(list, default=[])), ] super().__init__(pipeline, cfg_req=cfg_fmt) Each argument is a tuple consisting of a name, here ``extra_args`` and an annotation object :class:`.Annot`. Upon construction the annotation takes following keyword arguments: type: a callable that represents the base type of the value, here a ``list`` default: the default value if the option was not provided by the user. The default value ``None`` makes setting the value in the configuration file mandatory. converter: a callable that validates and converts the value the user set in the config file. Can raise a ``ValueError``, if the user entered an invalid value. By calling ``super().__init__()`` the list of arguments ``cfg_fmt`` is passed to the parent constructor. Following configurations are now valid in an configuration file: :: [...] - SkewerAdapterClippingModule: extra_args: - -k 12 - -d 0.01 :: [...] - SkewerAdapterClippingModule: extra_args: [] which is equivalent to falling back to the default argument :: [...] - SkewerAdapterClippingModule We define a variety of validators that can readily be used. Please refer to :ref:`api_config` for details. Preparing the module ^^^^^^^^^^^^^^^^^^^^ ``prepare(self, cfg)`` is the heart of the module and defines the commands that are executed when the module runs. It also registers the outputs and thus makes new files visible to downstream modules. ``cfg`` is a dict of all configuration options set by the user. Configuration of all options requested as described in :ref:`custommod_args` can be accessed by their names. The first action in the ``prepare`` method has to be the call to the parent's prepare method: :: def prepare(self, cfg): super().prepare(cfg) Information from previously run modules and global configuration options can be obtained through a reference to :class:`.Pipeline`, which can be accessed through the private attribute ``self._pipeline``. :: def prepare(self, cfg): super().prepare(cfg) pipeline = self._pipeline general_cfg = pipeline.get_config('general') read_cfg = pipeline.get_config('reads') output_dir = general_cfg['output_dir'] prefix = general_cfg['prefix'] Here we access the global configuration sections ``general`` and ``reads`` to obtain the path to the output directory and the file name prefix. :func:`.Pipeline.get_curfile` is used to obtain the path to files that are queued in the pipeline. The pipeline stores the most recent file path of each format. :: def prepare(self, cfg): [...] fastq_file = pipeline.get_curfile(fmt='fastq') Here we obtain the path to the most recently queued ``fastq`` file. Now we have all information to construct the command line call for running ``skewer``. The output path is stored in ``adapter_clipped_file``, a file path relative to the output directory passed by the command line script. :: def prepare(self, cfg): [...] adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped') cmd = [ 'skewer', fastq_file, '-x %s' % general_cfg['adapter3prime'], '-m tail', '--min %s' % read_cfg['min_len'], '--quiet', '--stdout', '> %r' % adapter_clipped_file, ] if cfg['extra_args']: cmd.extend(cfg['extra_args']) self._cmds.append(cmd) Having collected the command as a list of arguments in ``cmd``, we append all user defined arguments that are conveniently stored in ``cfg`` and queue the command by appending it to the private ``self._cmds`` variable. :: self._intermed_files.append(adapter_clipped_file) pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file) As a final step we register ``adapter_clipped_file`` as an intermediate file, meaning that it can be deleted provided this module is not the last in the pipeline. :func:`.Pipeline.upd_curfile` registers the file as new ``fastq`` file and thus makes it visible to following modules. **Note:** At time ``prepare()`` is called, files to the passed paths may or may not exist. \ Only queue new commands by appending to ``self._cmds``, never try to execute commands. The full module ^^^^^^^^^^^^^^^ Putting everything together, the code of the :class:`.SkewerAdapterClippingModule` looks like this: :: import os from mockinbird.utils import pipeline as pl from mockinbird.utils import config_validation as cv class SkewerAdapterClippingModule(pl.CmdPipelineModule): def __init__(self, pipeline): cfg_fmt = [ ('extra_args', cv.Annot(list, default=[])), ] super().__init__(pipeline, cfg_req=cfg_fmt) def prepare(self, cfg): super().prepare(cfg) pipeline = self._pipeline general_cfg = pipeline.get_config('general') read_cfg = pipeline.get_config('reads') output_dir = general_cfg['output_dir'] prefix = general_cfg['prefix'] fastq_file = pipeline.get_curfile(fmt='fastq') adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped') cmd = [ 'skewer', fastq_file, '-x %s' % general_cfg['adapter3prime'], '-m tail', '--min %s' % read_cfg['min_len'], '--quiet', '--stdout', '> %r' % adapter_clipped_file, ] if cfg['extra_args']: cmd.extend(cfg['extra_args']) self._cmds.append(cmd) self._intermed_files.append(adapter_clipped_file) pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file) The parent class :class:`.CmdPipelineModule` and the :class:`Pipeline` take care of all required steps such as config parsing and validation, executing commands and cleaning up files. Custom modules in config files ------------------------------ Having created your own module, you have to make the class importable in python. The cleanest way to achieve that is to make your module installable by writing a ``setup.py``. For more information please refer to available documentation, e.g. `here `_. In the config file you have to refer to the module with its full import path, e.g.: :: - mypackage.mymodule.SkewerAdapterClippingModule: extra_args: - -k 2