nf-core/isoseq
Genome annotation with PacBio Iso-Seq. Takes raw subreads as input, generates Full Length Non Chimeric (FLNC) sequences and produces a bed annotation.
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- CCS - Generate CCS sequences
- LIMA - Remove primer sequences from CCS
- ISOSEQ REFINE - Detect and remove chimeric reads
- BAMTOOLS CONVERT - Convert bam file into fasta file
- TAMA POLYA CLEAN UP - Detect and trim polyA tails from reads
- GUNZIP - Decompress FLNC fastas (uLTRA path only)
- ULTRA or MINIMAP2 - Map FLNCs to the genome
- BIOPERL - Remove spurious alignments (uLTRA path only, Issue #11)
- TAMA COLLAPSE - Clean gene models
- TAMA FILE LIST - Prepare list file for TAMA merge
- TAMA MERGE - Merge all annotations into one for each sample with TAMA merge
- Pipeline information - Report metrics generated during the workflow execution
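Each step writes to its own directory, as described in the sections of this document. As a minimal sketch, assuming a results directory laid out as documented here, the following Python helper reports which of those directories are present after a run. The function name `list_step_dirs` is hypothetical and is not part of the pipeline.

```python
from pathlib import Path

# Output directory names as documented on this page.
# 06.2_ULTRA and 06_MINIMAP2 are alternatives; 06.1 and 06.3 exist on the uLTRA path only.
STEP_DIRS = [
    "01_PBCCS",
    "02_LIMA",
    "03_ISOSEQ3_REFINE",
    "04_BAMTOOLS_CONVERT",
    "05_GSTAMA_POLYACLEANUP",
    "06.1_GUNZIP",
    "06.2_ULTRA",
    "06.3_PERL_BIOPERL",
    "06_MINIMAP2",
    "07_GSTAMA_COLLAPSE",
    "08_GSTAMA_FILELIST",
    "09_GSTAMA_MERGE",
    "multiqc",
    "pipeline_info",
]


def list_step_dirs(results_dir):
    """Return a dict mapping each documented directory name to True if it exists."""
    root = Path(results_dir)
    return {name: (root / name).is_dir() for name in STEP_DIRS}
```

Calling `list_step_dirs("results")` on a minimap2 run would be expected to report `06_MINIMAP2` as present and the three uLTRA-only directories as absent.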
CCS
Output files
- 01_PBCCS/
  - <sample>.chunk<X>.bam: The CCS sequences
  - <sample>.chunk<X>.bam.pbi: The PacBio index of CCS files
  - <sample>.chunk<X>.metrics.json.gz: Statistics for each ZMW
  - <sample>.chunk<X>.report.json: General statistics about generated CCS sequences in json format
  - <sample>.chunk<X>.report.txt: General statistics about generated CCS sequences in txt format
 
CCS generates a Circular Consensus Sequence from subreads. It reports the number of selected and discarded ZMWs and the reasons for discarding them.
LIMA
Output files
- 02_LIMA/
  - <sample>.chunk<X>_flnc.json: Metadata about generated xml file
  - <sample>.chunk<X>_flnc.lima.clips: Clipped sequences
  - <sample>.chunk<X>_flnc.lima.counts: Statistics about detected primer pairs
  - <sample>.chunk<X>_flnc.lima.guess: Statistics about detected primer pairs
  - <sample>.chunk<X>_flnc.lima.report: Detailed statistics on primer pairs for each sequence
  - <sample>.chunk<X>_flnc.lima.summary: General statistics about selected and rejected sequences
  - <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam: Selected sequences
  - <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam.pbi: PacBio index of selected sequences
  - <sample>.chunk<X>_flnc.primer_5p--primer_3p.consensusreadset.xml: Selected sequences metadata
 
LIMA cleans the generated CCS. It selects sequences containing a valid pair of primers and removes the primers.
ISOSEQ REFINE
Output files
- 03_ISOSEQ3_REFINE/
  - <sample>.chunk<X>.bam: Selected sequences
  - <sample>.chunk<X>.bam.pbi: PacBio index of selected sequences
  - <sample>.chunk<X>.consensusreadset.xml: Metadata
  - <sample>.chunk<X>.filter_summary.json: Number of Full Length, Full Length Non Chimeric, Full Length Non Chimeric PolyA
  - <sample>.chunk<X>.report.csv: Primers and insert length of each read
 
ISOSEQ REFINE discards chimeric reads.
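The filter_summary.json file can be used to check how many full-length reads survive chimera removal. The sketch below is illustrative only: the key names `num_reads_fl` and `num_reads_flnc` are an assumption (this document does not list the keys), and the function name `flnc_fraction` is hypothetical. Check the keys in your own file before relying on it.

```python
import json


def flnc_fraction(summary_text):
    """Return the fraction of full-length reads kept as FLNC, or None if undefined.

    ASSUMPTION: the key names below are not given in this document; adjust
    them to match the keys found in your own filter_summary.json.
    """
    summary = json.loads(summary_text)
    full_length = summary.get("num_reads_fl", 0)
    flnc = summary.get("num_reads_flnc", 0)
    if full_length == 0:
        return None
    return flnc / full_length
```

The function takes the file content as a string, so it can be applied to the text read from any <sample>.chunk<X>.filter_summary.json file.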
BAMTOOLS CONVERT
Output files
- 04_BAMTOOLS_CONVERT/
  - <sample>.chunk<X>.fasta: The reads in fasta format.
 
BAMTOOLS CONVERT converts reads from BAM format into fasta format.
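A quick sanity check on the converted reads is to look at their lengths. The following is a minimal sketch of a plain FASTA reader; the function name `fasta_lengths` is hypothetical, and it only assumes the standard FASTA layout (a `>` header line followed by one or more sequence lines).

```python
def fasta_lengths(lines):
    """Yield (read_name, sequence_length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the identifier, i.e. the text before the first whitespace.
            header = line[1:].split()
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length
```

Because the function accepts any iterable of lines, an open file handle on a <sample>.chunk<X>.fasta file can be passed directly.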
TAMA POLYA CLEAN UP
Output files
- 05_GSTAMA_POLYACLEANUP/
  - <sample>.chunk<X>_tama.fa.gz: The polyA tail free reads.
  - <sample>.chunk<X>_polya_flnc_report.txt.gz: Length of removed tails.
  - <sample>.chunk<X>_tama_tails.fa.gz: Sequence of removed tails.
 
GSTAMA_POLYACLEANUP (TAMA polyA cleanup) removes polyA tails from the selected reads.
GUNZIP
Output files
- 06.1_GUNZIP/
  - <sample>.chunk<X>_tama.fa: The polyA tail free reads, uncompressed.
 
GUNZIP decompresses the FLNCs so that they can be aligned with uLTRA (uLTRA does not handle gzip-compressed input yet).
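For readers who want to reproduce this step by hand, decompression can be sketched with the Python standard library. This is not the pipeline's own implementation; the function name `gunzip_file` and both paths are hypothetical.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress a gzip file to dst_path, streaming so large files are not held in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, `gunzip_file("sample.chunk1_tama.fa.gz", "sample.chunk1_tama.fa")` would write an uncompressed copy next to the original.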
ULTRA or MINIMAP2
Output files
- 06.2_ULTRA/ or 06_MINIMAP2/
  - <sample>.chunk<X>.sam: The aligned reads.
 
MINIMAP2 or uLTRA aligns reads to the genome.
BIOPERL
Output files
- 06.3_PERL_BIOPERL/
  - <sample>.chunk<X>_filtered.sam: The aligned reads with spurious alignments removed.
 
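BIOPERL removes spurious alignments from the uLTRA output. Some alignments are sometimes reported with a problematic gap (N) in their CIGAR string. This can happen when using a GFF file converted to a GTF file. See Issue #11 in the uLTRA repository.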
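The exact filtering rule is not spelled out in this document. As a hedged illustration only, the sketch below assumes that a spurious alignment is one whose CIGAR string begins or ends with a gap (N) operation; that criterion, and the function names `cigar_has_edge_gap` and `filter_sam_lines`, are hypothetical and may differ from what the pipeline's script does.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_has_edge_gap(cigar):
    """Return True if the first or last CIGAR operation is a gap (N)."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    return bool(ops) and (ops[0] == "N" or ops[-1] == "N")


def filter_sam_lines(lines):
    """Yield header lines and alignments that pass the ASSUMED edge-gap check."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 6 of a SAM record is the CIGAR string.
        if len(fields) >= 6 and not cigar_has_edge_gap(fields[5]):
            yield line
```

Header lines are passed through unchanged so that the filtered output remains a valid SAM file.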
TAMA COLLAPSE
Output files
- 07_GSTAMA_COLLAPSE/
  - <sample>.chunk<X>_collapsed.bed: This is a bed12 format file containing the final collapsed version of your transcriptome
  - <sample>.chunk<X>_local_density_error.txt: This file contains the log of filtering for local density error around the splice junctions
  - <sample>.chunk<X>_polya.txt: This file contains the reads with potential poly A truncation
  - <sample>.chunk<X>_read.txt: This file contains information for all mapped reads from the input SAM/BAM file.
  - <sample>.chunk<X>_strand_check.txt: This file shows instances where the SAM flag strand information conflicts with the GMAP strand information.
  - <sample>.chunk<X>_trans_read.bed: This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing.
  - <sample>.chunk<X>_trans_report.txt: This file contains collapsing information for each transcript
  - <sample>.chunk<X>_varcov.txt: This file contains the coverage information for each variant detected.
  - <sample>.chunk<X>_variants.txt: This file contains the variants called
 
TAMA Collapse is a tool that collapses redundant transcript models in your Iso-Seq data.
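The collapsed bed12 file can be summarised with a few lines of Python. The following sketch counts transcript models by number of exons, relying only on the standard BED12 layout in which column 10 is the block (exon) count. The function name `exon_count_distribution` is hypothetical.

```python
from collections import Counter


def exon_count_distribution(lines):
    """Count transcript models by number of exons, using BED12 column 10 (blockCount)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a BED12 record
        counts[int(fields[9])] += 1
    return counts
```

A large share of single-exon models is worth a closer look, although whether it indicates a problem depends on the organism and the library.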
TAMA FILE LIST
Output files
- 08_GSTAMA_FILELIST/
  - <sample>.tsv: A tsv listing bed files to merge with TAMA merge
  - all_samples.tsv: A tsv listing bed files from all samples to merge with TAMA merge
 
TAMA FILELIST is an in-house script that generates the input file list for TAMA merge.
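To give an idea of what such a list might look like, here is a minimal sketch that builds one tab-separated row per bed file. The four-column layout (bed path, cap flag, merge priority, source name) and the default values `capped` and `1,1,1` are assumptions made for illustration; they are not taken from this document, and the pipeline's own script may write something different. The function name `build_filelist` is hypothetical.

```python
from pathlib import Path

SUFFIX = "_collapsed.bed"


def build_filelist(bed_paths, cap_flag="capped", priority="1,1,1"):
    """Return tab-separated rows: bed path, cap flag, merge priority, source name.

    ASSUMPTION: the column layout and default values are illustrative only.
    """
    rows = []
    for bed in sorted(str(p) for p in bed_paths):
        name = Path(bed).name
        # Use the file name without the collapse suffix as the source name.
        source = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
        rows.append("\t".join([bed, cap_flag, priority, source]))
    return "\n".join(rows) + "\n" if rows else ""
```

Sorting the paths keeps the output stable between runs, which makes the resulting tsv easier to compare.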
TAMA MERGE
Output files
- 09_GSTAMA_MERGE/
  - <sample>.bed: This is the main merged annotation file.
  - <sample>_gene_report.txt: This contains a report of the genes from the merged file.
  - <sample>_merge.txt: This contains a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID.
  - <sample>_trans_report.txt: This contains the source information for each merged transcript.
 
TAMA Merge is a tool that merges multiple transcriptomes while maintaining source information. When there are two or more samples, output files corresponding to all_samples are also stored if the --tama_merge_all parameter is set.
MultiQC
Output files
- multiqc/
  - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
  - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
  - multiqc_plots/: directory containing static images from the report in various formats.
 
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
 
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
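When troubleshooting, a quick first step is to count tasks by status in execution_trace.txt. The sketch below assumes the trace file is tab-separated with a header row that includes a `status` column, which is the default layout as far as I know; if the trace fields were customised, adjust the column name. The function name `task_status_counts` is hypothetical.

```python
import csv
from collections import Counter


def task_status_counts(lines):
    """Count tasks per value of the 'status' column in a tab-separated trace file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader if row.get("status"))
```

An open file handle on pipeline_info/execution_trace.txt can be passed as `lines`; any status other than a successful or cached one points to the tasks worth inspecting first.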