Alignments Pipeline

A flexible, automated Nextflow pipeline for producing enrichment data from PhIP-Seq experiments, given demultiplexed fastq files along with annotation files for both the experimental samples and the peptides in the phage library being used.

(Figure: directed acyclic graph of the pipeline)

Directed Acyclic Graph (DAG) defining the execution order and dependencies of each individual process. The major steps of the pipeline can be summarized as:

(1) Build a Bowtie index from the relevant peptide oligos.

(2) Align each sample to the library reference using Bowtie end-to-end alignment, allowing up to N mismatches (default 2). The user specifies the size of both the reads and the peptides; the low-quality 3’ end of each read is trimmed to match the reference length before alignment.

(3) Obtain peptide counts for each sample alignment using samtools-idxstats (Li et al., 2009) while, in parallel, computing common alignment statistics such as raw total sequences, reads_mapped, error_rate, and average_quality, by default.

(4) Organize the resulting dataset, containing the enrichment matrix, sample metadata, and peptide metadata, using the xarray package (Hoyer and Hamman, 2017).

(5) Optionally, compute summary statistics such as counts per million, size factors, and fold enrichment, as well as model fits for estimates of significance.

(6) By default, output all of the results computed above as a pickled binary of the combined xarray object, along with individual CSVs for each statistic computed. Results may also be exported in PhIPData (Chen et al., 2022) format via an RDS file, as well as in tall CSV format.
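To make the layout produced in step (4) concrete, here is a toy sketch of how the combined dataset is organized in xarray. The dimension and variable names mirror the example output shown later on this page; the values themselves are made up purely for illustration.

import numpy as np
import xarray as xr

# Toy counts matrix: 4 peptides x 2 samples
counts = np.random.poisson(5, size=(4, 2))

# The combined dataset carries the enrichment matrix alongside the sample
# and peptide annotation tables, all indexed by integer ids.
ds = xr.Dataset(
    {
        "counts": (("peptide_id", "sample_id"), counts),
        "sample_table": (
            ("sample_id", "sample_metadata"),
            [["MEGSUB"], ["MEGSUB"]],
        ),
        "peptide_table": (
            ("peptide_id", "peptide_metadata"),
            [["OC43"], ["OC43"], ["229E"], ["229E"]],
        ),
    },
    coords={
        "peptide_id": np.arange(4),
        "sample_id": np.arange(2),
        "sample_metadata": ["library_batch"],
        "peptide_metadata": ["Virus"],
    },
)
print(ds)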

Input files

Sample table

A CSV listing all samples to be run through the pipeline, where one of the columns must be fastq_filepath. By default, the pipeline assumes these file paths are relative to the project directory where the pipeline is being executed.

Note

If the file paths share some other prefix, see the --reads_prefix parameter.

As an example, let’s assume there’s a directory ngs/ containing all the fastq files for a project. To organize these files (excluding barcode files) into a minimal sample table describing each of their relative paths, we might use the following command.

(echo "fastq_filepath" && ls ngs/*R1*.gz)  > sample_table.csv

See also

An example of a simple sample table can be found here.

In addition to the file path provided for each sample, you may include as many colorful annotations as you would like so long as the CSV stays tidy. One caveat is that you should not include a column named sample_id; this name is reserved for the integer ids that the pipeline adds to the sample annotation table (peptide_id is similarly reserved for the peptide table) to tie the annotations to the enrichment tables. Many of the phippery API utilities are generalized to help you index and otherwise manipulate the data to your liking using any combination of these annotations, so go wild with annotations!

Internally, data types are handled through conversion to pandas data types, so it’s best to keep the values within each column a consistent type. For missing data, we recommend empty strings, “”, but “NaN” and “N/A” also work (hopefully) as expected.
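As a quick sanity check before launching the pipeline, you can read the table back with pandas and confirm that the required column is present and that each column parses as a single, consistent dtype. This is only a suggested pre-flight check, not part of the pipeline itself.

import pandas as pd

# Read the sample table created above; empty fields are parsed as NaN by default
sample_table = pd.read_csv("sample_table.csv")

# The pipeline requires this column (see --sample_table below)
assert "fastq_filepath" in sample_table.columns

# Columns that mix types (e.g. numbers and strings) will show up as 'object'
print(sample_table.dtypes)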

Note

Some of the optional workflows have additional required annotations, so keep an eye out for those.

Note

The fastq files pointed to by the sample table described above are assumed to have uniform (trimmed) read lengths. During alignment, this is enforced by reads being trimmed on the 3’ end to match the length specified by the --oligo_tile_length parameter.

See pipeline parameters for more.

Peptide table

A CSV where one of the columns must be “oligo”, containing the oligonucleotide sequence encoding a peptide in the phage library. Adapters are assumed to be encoded by lower-case nucleotides and are ultimately discarded from the output of the pipeline; conversely, the oligonucleotide encoding of the expressed peptide should be upper case. Similar to the sample annotation table, you may include any annotations you like for the peptides (e.g. “Virus”, “Strain”, “Loci”, etc.), except an annotation named peptide_id, which is again reserved for pipeline execution.
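As with the sample table, a quick pandas check can catch formatting problems before a run. The sketch below strips the lower-case adapter characters from each oligo and checks that the remaining upper-case (expressed) region matches the default --oligo_tile_length of 117; this is purely an optional, illustrative pre-flight check.

import pandas as pd

peptide_table = pd.read_csv("peptide_table.csv")

# Drop the lower-case adapter characters, keeping the upper-case
# oligonucleotide region that encodes the expressed peptide
expressed = peptide_table["oligo"].str.replace(r"[a-z]", "", regex=True)

# Every expressed region should match the --oligo_tile_length value (117 by default)
print((expressed.str.len() == 117).all())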

See also

An example of a simple peptide table can be found here.

Pipeline results

The primary use of this pipeline is to process raw sequencing data, produce the peptide counts table, apply statistical methods (such as edgeR), then combine and organize the results from these workflows for the user to analyze however they wish. By default, the pipeline will produce the following outputs:

results
├── pickle_data
│   └── data.phip
├── rds_data
│   └── PhIPData.rds
└── wide_data
    ├── data_counts.csv.gz
    ├── data_cpm.csv.gz
    ├── data_edgeR_hits.csv.gz
    ├── data_edgeR_logfc.csv.gz
    ├── data_edgeR_logpval.csv.gz
    ├── data_peptide_annotation_table.csv.gz
    ├── data_sample_annotation_table.csv.gz
    └── data_size_factors.csv.gz

4 directories, 11 files

See the example page for a more detailed explanation of these outputs.
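If you would rather work with the wide CSV outputs directly (rather than the xarray pickle used in the examples below), they can be read with pandas. A minimal sketch, assuming the default file names shown above and that the first column of each wide CSV holds the index (peptide_id or sample_id):

import pandas as pd

# Wide counts matrix: rows are peptides, columns are samples
counts = pd.read_csv("results/wide_data/data_counts.csv.gz", index_col=0)

# Sample annotation table, indexed by sample_id
samples = pd.read_csv("results/wide_data/data_sample_annotation_table.csv.gz", index_col=0)

print(counts.shape, samples.shape)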

Parameters

Below, we describe each of the possible parameters that may be passed to the pipeline. Parameters with a “*” next to the name must be given values explicitly in the nextflow run command unless you wish to use the default values described below. Otherwise, the parameter value is only required for the relevant optional workflow.

--sample_table

  • help: Table describing each input sample, minimally containing the column ‘fastq_filepath’ with the name of each file to be analyzed. Control samples are indicated with a value of ‘beads_only’ in the column ‘control_status’.

  • wb_type: file

  • required: True

--reads_prefix

  • help: Folder which contains the files listed in the sample table

  • wb_type: folder

  • required: True

--read_length

  • help: Length of the reads in the input fastq files. Reads are trimmed on the 3’ end to match the --oligo_tile_length reference length before alignment (see the note in the sample table section above).

  • wb_type: integer

--fastq_stream_func

  • help: Set this to ‘cat’ if the fastq files are not gzipped

  • wb_type: string

  • default: zcat

--peptide_table

  • help: Table describing each peptide in the library, minimally containing the column ‘oligo’ with the sequence used for each peptide

  • wb_type: file

  • required: True

--oligo_tile_length

  • help: Length of the oligonucleotide sequence encoding any given peptide in the library. See the alignment approach section for more.

  • wb_type: integer

  • default: 117

--dataset_prefix

  • help: String which is prepended to all output files

  • wb_type: string

  • default: data

--output_pickle_xarray

  • help: Generate output files in xarray pickle format

  • wb_type: bool

  • default: True

--output_tall_csv

  • help: Generate output files in tall CSV format

  • wb_type: bool

  • default: True

--output_wide_csv

  • help: Generate output files in wide CSV format

  • wb_type: bool

  • default: True

--n_mismatches

  • help: Number of mismatches allowed

  • wb_type: integer

  • default: 2

--bowtie_optional_args

  • help: Other bowtie options

  • wb_type: string

  • default: --tryhard --nomaqround --norc --best --sam --quiet

--replicate_sequence_counts

  • help: Flag controlling whether counts are replicated across peptides that share an identical oligo sequence in the library

  • wb_type: bool

  • default: True

Optional Parameters

We provide a popular (at least for us) selection of the features found in the phippery Python API as optional workflows during pipeline execution. To run any one of these optional workflows, you set the relevant boolean flag parameter to true. Additionally, you may need to provide certain annotation features and factor levels in the sample or peptide table.

Our example pan-CoV dataset includes library abundance control samples that are appropriately annotated in the sample table, meaning we can run the CPM enrichment workflow like so:

(base) ubuntu phippery/phip-flow ‹V1.04*› » nextflow run main.nf -profile docker --run_cpm_enr_workflow true
N E X T F L O W  ~  version 21.04.3
Launching `main.nf` [distracted_banach] - revision: 9ea43df075
P H I P - F L O W!
Matsen, Overbaugh, and Minot Labs
Fred Hutchinson CRC, Seattle WA
================================
sample_table    : /home/jared/MatsenGroup/Projects/phippery/phip-flow/data/pan-cov-example/sample_table.csv
peptide_table   : /home/jared/MatsenGroup/Projects/phippery/phip-flow/data/pan-cov-example/peptide_table.csv
results         : /home/jared/MatsenGroup/Projects/phippery/phip-flow/results/


executor >  local (29)
[2c/30601d] process > ALIGN:validate_sample_table (1)    [100%] 1 of 1
[1d/073399] process > ALIGN:validate_peptide_table (1)   [100%] 1 of 1
[37/6937e7] process > ALIGN:generate_fasta_reference (1) [100%] 1 of 1
[61/a636a9] process > ALIGN:generate_index (1)           [100%] 1 of 1
[1d/454757] process > ALIGN:short_read_alignment (1)     [100%] 6 of 6
[0d/320a4d] process > ALIGN:sam_to_counts (6)            [100%] 6 of 6
[f4/687d71] process > ALIGN:sam_to_stats (6)             [100%] 6 of 6
[c6/036a15] process > ALIGN:collect_phip_data (1)        [100%] 1 of 1
[c0/3eb016] process > ALIGN:replicate_counts (1)         [100%] 1 of 1
[e8/fae63a] process > STATS:counts_per_million (1)       [100%] 1 of 1
[a9/81d0b7] process > STATS:size_factors (1)             [100%] 1 of 1
[11/031d9e] process > STATS:cpm_fold_enrichment (1)      [100%] 1 of 1
[-        ] process > STATS:fit_predict_neg_binom        -
[-        ] process > STATS:fit_predict_zscore           -
[7e/df19de] process > STATS:merge_binary_datasets        [100%] 1 of 1
[c0/5b1faf] process > DSOUT:dump_binary                  [100%] 1 of 1
[-        ] process > DSOUT:dump_wide_csv                -
[-        ] process > DSOUT:dump_tall_csv                -
[-        ] process > AGG:split_samples                  -
[-        ] process > AGG:aggregate_organisms            -
[-        ] process > AGG:join_organisms                 -

We can then use the phippery Python API to read in the data and take a look at the results.

>>> import phippery
>>> ds = phippery.load("results/pickle_data/data.phip")
>>> ds
<xarray.Dataset>
Dimensions:           (sample_id: 6, peptide_id: 10047, sample_metadata: 9,
                       peptide_metadata: 7)
Coordinates:
  * sample_id         (sample_id) int64 0 1 2 3 4 5
  * peptide_id        (peptide_id) int64 0 1 2 3 4 ... 10043 10044 10045 10046
  * sample_metadata   (sample_metadata) object 'library_batch' ... 'average_q...'
  * peptide_metadata  (peptide_metadata) object 'Virus' ... 'Prot_Start'
Data variables:
    counts            (peptide_id, sample_id) int64 0 1 0 3 0 3 ... 0 1 0 0 2 1
    sample_table      (sample_id, sample_metadata) object 'MEGSUB' ... 37.3
    peptide_table     (peptide_id, peptide_metadata) object 'OC43' ... 4052
    size_factors      (peptide_id, sample_id) float64 0.0 1.0 0.0 ... 2.29 1.0
    cpm               (peptide_id, sample_id) float64 0.0 94.85 ... 560.2 245.6
    enrichment        (peptide_id, sample_id) float64 0.02065 1.979 ... 5.093
>>> ds.counts.to_pandas()
sample_id   0  1  2  3  4  5
peptide_id
0           0  1  0  3  0  3
1           7  2  3  0  5  1
2           2  1  0  0  0  0
3           0  2  0  0  0  0
4           0  0  0  0  0  0
...        .. .. .. .. .. ..
10042       0  1  0  0  0  0
10043       1  2  0  0  0  0
10044       0  1  0  0  0  0
10045       8  0  0  0  1  0
10046       0  1  0  0  2  1

[10047 rows x 6 columns]
>>> ds.enrichment.to_pandas()
sample_id          0         1         2          3          4          5
peptide_id
0           0.020650  1.979353  0.020651  34.805577   0.020651  15.238485
1           1.552418  0.447580  4.091275   0.002347   3.289535   0.578875
2           1.328661  0.671337  0.007004   0.007004   0.007004   0.007004
3           0.010433  1.989571  0.010433   0.010433   0.010433   0.010433
4           1.000000  1.000000  1.000000   1.000000   1.000000   1.000000
...              ...       ...       ...        ...        ...        ...
10042       0.020650  1.979353  0.020651   0.020651   0.020651   0.020651
10043       0.666665  1.333336  0.006992   0.006992   0.006992   0.006992
10044       0.020650  1.979353  0.020651   0.020651   0.020651   0.020651
10045       1.997354  0.002643  0.002643   0.002643   0.742907   0.002643
10046       0.020650  1.979353  0.020651   0.020651  11.589568   5.093262

[10047 rows x 6 columns]
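Because the annotation tables travel with the enrichment data, you can subset the matrix by any annotation using ordinary xarray and pandas operations. A small sketch, assuming your sample table includes the ‘control_status’ column described under --sample_table:

import phippery

ds = phippery.load("results/pickle_data/data.phip")

# Recover the sample annotation table as a pandas DataFrame indexed by sample_id
sample_df = ds.sample_table.to_pandas()

# sample_ids of the beads-only controls (the column and value come from your own table)
beads_ids = sample_df.index[sample_df["control_status"] == "beads_only"]

# Subset the counts matrix to just those samples
beads_counts = ds.counts.sel(sample_id=beads_ids).to_pandas()
print(beads_counts.shape)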

BEER

Warning

This workflow has not been fully tested and may be very slow. For errors which may arise from the BEER workflow, we recommend directing questions to the BEER developers. If you would like to run BEER outside of the pipeline, note that by default the pipeline runs edgeR and writes those results into the PhIPData object file, which can be loaded directly and used with the BEER library.

--run_BEER

  • help: Flag for running edgeR and BEER using the infrastructure in the BEER pipeline. See the R Vignettes and BEER Paper for more on this method. Enrichments, edgeR hits, and annotations are tied into a PhIPData object and exported to an RDS binary object file. The object file is then saved in the params.results directory below the rds_data/ sub-directory. Additionally, these results will be tied back into the xarray object used by the Python phippery API, as well as any CSV outputs (wide & tall).

  • wb_type: bool

  • default: False

CPM Enrichment

--run_cpm_enr_workflow

  • help: Flag for running the enrichment workflow using counts per million as a pre-processing step to compute fold enrichment of empirical IP samples over library abundance controls. If true, the sample annotation table must provide a column “control_status” in which a subset of samples is labeled as “library”, indicating those samples are library abundance controls. For a description of how this function works in more detail, see phippery.normalize.enrichment(). A rough sketch of the underlying computation is shown after this parameter description.

  • wb_type: bool

  • default: False
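To make the idea concrete, the sketch below computes counts per million and a naive fold enrichment over the library controls directly from the combined dataset. This is not the pipeline’s exact implementation (phippery.normalize.enrichment() also handles pseudocounts and edge cases); it is just the gist, and it assumes a ‘control_status’ column with a subset of samples labeled ‘library’.

import phippery

ds = phippery.load("results/pickle_data/data.phip")
counts = ds.counts.to_pandas()            # peptide_id x sample_id
sample_df = ds.sample_table.to_pandas()   # sample annotations

# Counts per million for every sample (column)
cpm = counts / counts.sum(axis=0) * 1e6

# Mean library-abundance CPM across the 'library' control samples
lib_ids = sample_df.index[sample_df["control_status"] == "library"]
lib_cpm = cpm[lib_ids].mean(axis=1)

# Naive fold enrichment of each sample over the library abundance;
# the pipeline's values will differ slightly because of pseudocounts
fold_enr = cpm.div(lib_cpm, axis=0)
print(fold_enr.head())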

Z-Score

--run_zscore_fit_predict

  • help: Flag for running the Z-score workflow (the STATS:fit_predict_zscore process shown above), which computes a Z-score of enrichment for each peptide in each sample. These Z-scores are used by the VirScan organism summary workflow below (see --zscore_threshold).

  • wb_type: bool

  • default: False

VirScan Organism Summary

This workflow summarizes the epitope (peptide) hits for each organism, grouped according to the peptide annotation table input to the pipeline.

Note that this analysis workflow was created with the VirScan peptide library in mind, but it can be used for any peptide assay being analyzed. For example, you could run this workflow on the example data like so:

nextflow run matsengrp/phip-flow -r V1.12 \
        -profile docker \
        --summarize_by_organism true \
        --peptide_seq_col "Prot" \
        --peptide_org_col "Virus" \
        --results "$(date -I)"

--summarize_by_organism

  • help: Flag used to control the summary of results by organism. Requires that the peptide table includes information regarding the source organism for each epitope. It is possible to annotate an epitope as being contained in multiple organisms by including multiple lines with the same peptide. When this flag is enabled, an additional output table will be produced (aggregated_data/organism.summary.csv.gz) which summarizes the number of epitopes with Z-scores above the threshold (--zscore_threshold, described below) for each organism. The sequence of each peptide is taken into account to filter out overlapping peptide hits: for any pair of peptides which overlap by more than the allowed number of amino acids (--max_overlap), only the higher-scoring peptide (in terms of Z-score) is retained, with overlaps determined by exact k-mer matching. A peptide is marked as a ‘hit’ when it is above the threshold in all replicates of that sample; when it is above the threshold in only a subset of replicates, it is marked as ‘discordant’. The Epitope Binding Score (EBS) is also calculated for each peptide as the mean Z-score across all replicates from the same sample, and at the organism level the max and mean EBS are reported. Finally, all of these results are reported for the subset of epitopes which are marked as ‘public’ (using --public_epitopes_csv), indicating that there is independent experimental evidence supporting the presence of binding antibodies in a human population. A sketch of the hit/discordant/EBS logic is shown after this parameter description.

  • wb_type: bool

  • default: False
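The sketch below illustrates the hit/discordant/EBS logic described above on a toy Z-score table. The peptide and replicate names are purely illustrative, and this is a simplified restatement of the logic rather than the workflow’s actual code.

import pandas as pd

# Toy Z-scores: rows are peptides, columns are two replicates of one sample
zscores = pd.DataFrame(
    {"rep_1": [5.1, 3.0, 0.4], "rep_2": [4.2, 1.1, 0.2]},
    index=["pep_0", "pep_1", "pep_2"],
)
threshold = 2.5  # matches the --zscore_threshold default

above = zscores > threshold
hit = above.all(axis=1)                 # above threshold in every replicate
discordant = above.any(axis=1) & ~hit   # above threshold in only some replicates

# Epitope Binding Score (EBS): mean Z-score across replicates of the sample
ebs = zscores.mean(axis=1)

print(pd.DataFrame({"hit": hit, "discordant": discordant, "EBS": ebs}))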

--peptide_org_col

  • help: Column in the peptide table indicating the organism for each peptide

  • wb_type: string

  • default: organism

--peptide_seq_col

  • help: Column in the peptide table containing the peptide sequence (used to match against public epitopes provided with --public_epitopes_csv)

  • wb_type: string

  • default: seq

--max_overlap

  • help: Maximum allowed overlap (in amino acids) between detected peptides

  • wb_type: integer

  • default: 7

--zscore_threshold

  • help: Minimum Z-score threshold

  • wb_type: float

  • default: 2.5

--sample_grouping_col

  • help: Column in the sample table used for mapping replicates to samples

  • wb_type: string

  • default:

--public_epitopes_csv

  • help: Optional, a CSV containing public epitopes

  • wb_type: file

--public_epitopes_col

  • help: In the public epitopes CSV, the column containing the translated amino acid sequence

  • wb_type: string

  • default: peptide_translate

--nxf_profile

  • help: Profile used for resource allocation (options: standard / docker / cluster)

  • wb_env: PROFILE

  • wb_type: string

  • default: standard