.. _sec_quick_start:

========
Examples
========

There are a few primary steps to PhIP-Seq analysis after the sequencing and
demultiplexing of samples. To address each of these, we provide a flexible
`Nextflow automated pipeline `_ for producing the enrichment data when given
Next Generation Sequencing data (demultiplexed fastq files), along with
coupled sample and peptide library annotation files, as input.
Below, we'll give a brief overview of running the pipeline on some example
data, followed by a typical approach for running it on your own data.

Pan-CoV example dataset
^^^^^^^^^^^^^^^^^^^^^^^

The dataset provided with this pipeline is derived from a pre-COVID-19 healthy
adult serum sample, along with serum from a SARS-CoV-2 infected convalescent
individual, both run in duplicate across two separate batches of the Pan-Human
CoV full proteome library. Huge thanks to the authors of
`Stoddard et al. 2021 `_ and the wonderful folks at the `Overbaugh Lab `_
for obtaining this data and allowing us to use it here.

.. _sec_align_soup_nutz:

The example data is included and run by default unless specified otherwise.
This input includes the next generation sequencing files for each of the six
samples described in the `sample table `_: the replicates of the samples
described above, as well as two replicates of the input phage library. We'll
begin by aligning the fastq files to the oligo library described by the
oligonucleotide coding sequences in the `peptide table `_.

Running locally
+++++++++++++++

.. _sec_clone_template:

To run the following example, you must first have installed `docker `_ and
`Nextflow `_. For our approach to easy installation, see the
:ref:`installation page `.

To run the pipeline, we can use the git-aware `nextflow run `_ command,
which runs the specified version (``-r V1.12``) of the pipeline in the
current working directory. We recommend using the latest version, which can
be found in the `release section `_ of the repository. Note that no cloning
of the repository is necessary if we're simply running the pipeline and do
not wish to modify the source code. Simply:

::

    $ nextflow run matsengrp/phip-flow -r V1.12 -profile docker --output_tall_csv true

Here we specified three parameters: two that are native to ``Nextflow``
(denoted with a single **'-'** prefix) and one that is specific to
``PhIP-Flow`` (denoted with a double **'- -'** prefix). Because we did not
specify a sample table or peptide table, the pipeline will run on the default
example data. Additionally, we did not specify a results directory, so the
pipeline will, by default, write to ``results/``: (1) a pickled binary
`xarray DataSet object `_ (the primary data structure for using the
:ref:`Python CLI `), and (2) an RDS file for the respective
`PhIPData Object `_. The options ``--output_tall_csv`` (default false) and
``--output_wide_csv`` (default true) each specify one of two optional output
formats: a tall CSV and a collection of wide CSVs.

.. seealso:: For more on the parameters that specify pipeline behavior, see
   the :ref:`alignments pipeline parameters ` section.

.. note:: For more on these CSV formats, see this `great blog post `_ on the
   topic.

By default, this command ran the pipeline on the example dataset described
above. The files can be viewed in the `phip-flow git repo `_.
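As a first look at the results, the pickled dataset can be opened directly in
Python once the run completes. Below is a minimal sketch, assuming the default
``results/`` output directory and that ``data.phip`` is a standard Python
pickle of the ``xarray.Dataset`` described above (the :ref:`Python CLI ` also
ships its own utilities for working with this file):

.. code-block:: python3

    import pickle

    import xarray  # xarray must be installed for pickle to reconstruct the Dataset

    # Path assumes the default results directory used in this example.
    with open("results/pickle_data/data.phip", "rb") as f:
        ds = pickle.load(f)

    # The dataset bundles the count matrices together with the sample and
    # peptide annotation tables, indexed by sample_id and peptide_id.
    print(ds)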
In short, the workflow (1) used ``bowtie`` to align all the reads described in
the sample annotation table to the reference phage library described in the
peptide table, (2) computed some useful statistics on raw counts (such as the
edgeR log fold enrichment), and (3) formatted the data into a single coherent
dataset. For more detail about the exact steps that were run, see the
:ref:`nextflow pipeline page `.

Running on HPC (cluster)
++++++++++++++++++++++++

Above, we specified ``-profile docker`` as a parameter option, which assumes
you are running locally with ``Docker`` and ``Nextflow`` installed. For high
performance computing systems, we can instead specify the ``-profile cluster``
option to run the default configuration on a `slurm `_ cluster. This option
assumes the cluster has loaded modules or installs for Singularity and
Nextflow. Here's an example script we might execute to run the pipeline on
the Fred Hutch Rhino machines:

.. code-block:: bash

    #!/bin/bash
    set -e
    source /app/lmod/lmod/init/profile
    module load nextflow
    module load Singularity
    export PATH=$SINGULARITYROOT/bin/:$PATH

    nextflow run matsengrp/phip-flow -r V1.12 \
        --read_length 125 \
        --oligo_tile_length 117 \
        --output_tall_csv true \
        --results "$(date -I)" \
        -profile cluster \
        -resume

Here, we specified a few more parameters, ``--read_length`` and
``--oligo_tile_length``, which define our input data for alignment purposes.
See the :ref:`Creating and running your own data ` section for more details
on these.

Example results (tall CSV)
++++++++++++++++++++++++++

Now, let's take a quick look at the results from the Pan-CoV example dataset
that was run. By default, the pipeline runs the Pan-CoV example data and
writes the results to a directory named *results/*. The pickled binary
`xarray `_ object is output by default, and we additionally specified that a
tall-style dataset ("data-tall.csv") as well as a collection of wide-style
data matrices be written. Let's take a quick look:

::

    results
    ├── pickle_data
    │   └── data.phip
    ├── rds_data
    │   └── PhIPData.rds
    ├── tall_data
    │   └── data-tall.csv.gz
    └── wide_data
        ├── data_counts.csv.gz
        ├── data_cpm.csv.gz
        ├── data_edgeR_hits.csv.gz
        ├── data_edgeR_logfc.csv.gz
        ├── data_edgeR_logpval.csv.gz
        ├── data_peptide_annotation_table.csv.gz
        ├── data_sample_annotation_table.csv.gz
        └── data_size_factors.csv.gz

    4 directories, 11 files

Note that the CSV outputs are gzipped, so we'll need to unzip them before
moving forward. The following command unzips all the CSV files in the results
directory:

::

    gunzip results/**/*.csv.gz

Let's take a look at how you might use **ggplot2** to visualize the data found
in the tall-formatted CSV. We'll start by plotting the individual sample
enrichments, colored by infection status.

.. code-block:: R

    library(ggplot2)
    library(dplyr)
    library(viridis)

    phip_data <- read.table(
        "results/tall_data/data-tall.csv",  # gunzip first
        header=TRUE, sep=","
    ) %>%
        filter(Protein == "spike") %>%
        filter(Virus == "SARSCoV2")

    # Plot per-sample enrichments along the spike protein
    p <- phip_data %>%
        ggplot(aes(
            x=Prot_Start, y=counts,
            group=factor(sample_id),
            color=factor(patient_status)
        )) +
        theme_bw() +
        geom_line() +
        ggtitle("SARS-CoV-2 Spike Protein Enrichments") +
        labs(y="# peptide alignments", x="Locus", color="infection status")

.. figure:: images/example-counts-R.svg
   :width: 700
   :alt: example results
   :align: left

   Example data counts plotted as a function of location on the Spike protein
   of SARS-CoV-2, colored by the infection status of the sample.
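The same tall table is just as convenient in Python. Here's a small pandas
sketch mirroring the R filtering above (column names are the ones used in the
R example, and the CSV is assumed to have been ``gunzip``-ed as shown):

.. code-block:: python3

    import pandas as pd

    tall = pd.read_csv("results/tall_data/data-tall.csv")

    # Restrict to SARS-CoV-2 spike peptides, as in the R example above.
    spike = tall.query("Protein == 'spike' and Virus == 'SARSCoV2'")

    # Tall format has one row per (sample, peptide) pair, so grouped
    # summaries are one-liners: total spike counts per sample here.
    summary = (
        spike.groupby(["patient_status", "sample_id"])["counts"]
        .sum()
        .reset_index()
    )
    print(summary)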
Example results (wide CSV)
++++++++++++++++++++++++++

Looking at the files in the wide-format subdirectory, we get back the peptide
and sample annotation tables, both with an index (i.e. first) column:
"peptide_id" and "sample_id", respectively. These indices map directly back to
the rows and columns of each of the output enrichment matrices. By default,
the phip-flow pipeline outputs the raw counts as well as counts per million
and size factor normalizations (`Anders and Huber, 2010 `_) of the matrix.
Let's use seaborn's ``heatmap`` to plot the same samples' binding to the OC43
spike protein as a heatmap.

.. code-block:: python3

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    cpm = pd.read_csv("results/wide_data/data_cpm.csv", index_col=0, header=0)
    cpm.columns = cpm.columns.astype(int)

    sample_table = pd.read_csv("results/wide_data/data_sample_annotation_table.csv")
    peptide_table = pd.read_csv("results/wide_data/data_peptide_annotation_table.csv")

    OC43_spike = peptide_table.query("Full_name == 'OC43_SC0776_spike'")
    non_null_samples = sample_table.query("patient_status.notnull()", engine="python")
    cpm_OC43_spike = cpm.loc[OC43_spike.index, non_null_samples.index]

    fig, ax = plt.subplots(figsize=[7, 3])
    sns.heatmap(
        cpm_OC43_spike.transpose(),
        yticklabels=non_null_samples["patient_status"],
        xticklabels=OC43_spike["Prot_Start"],
        cbar_kws={'label': 'binding counts per million'},
        ax=ax, cmap="YlGnBu", vmax=2000
    )
    for label in ax.xaxis.get_ticklabels()[::2]:
        label.set_visible(False)
    ax.set_title("OC43 Spike Binding - \n Strain: SC0776")
    ax.set_xlabel("Locus")
    plt.tight_layout()

.. figure:: images/example-heatmap-Py-2.svg
   :width: 700
   :alt: example heatmap results
   :align: left

   A heatmap of peptide alignment counts per million across the OC43 Spike
   protein. Each row corresponds to a sample replicate.

.. _example_own_data:

Creating and running your own data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _alignment_approach:

Alignment approach
++++++++++++++++++

We designed this pipeline to work well with the common protocol for PhIP-Seq
experiments: single-read, shotgun sequencing of the phage library after the
antibody immunoprecipitation step. We expect the input reads to be trimmed to
a uniform length (specified with the ``--read_length`` parameter), which
should be longer than or equal to the oligonucleotide sequence encoding any
given peptide in the provided library (specified with the
``--oligo_tile_length`` parameter). Additionally, we assume that your peptide
library is comprised of uniform-length oligonucleotide sequences (tiles), and
that sequencing adapters are designed such that the 5' (high-quality) end of
the read is where alignment should begin.

Concretely, we align reads to the peptide-encoding oligo sequences via
`bowtie (-n mode) `_. This means alignments may have no more than
``--n_mismatches`` mismatches (where ``--n_mismatches`` is a number 0-3, with
a default of 2) in the first ``--oligo_tile_length`` bases (where
``--oligo_tile_length`` is a number 5 or greater, with a default of 117) on
the high-quality (left) end of the read. Additional parameters for alignment
can be specified with the ``--bowtie_optional_args`` parameter, which by
default is ``--tryhard --nomaqround --norc --best --sam --quiet``. This
specifies that the aligner should not consider the reverse complement of the
oligo sequences, and that it should report *only* the best alignment for each
read. This can be modified as you see fit.
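To make those ``-n``-mode semantics concrete, here is a toy Python sketch of
the seed filter just described. This is an illustration of the rule only, not
the pipeline's code; the real alignment is done by ``bowtie``, whose ``-n``
mode also enforces a summed quality ceiling (``-e``) at mismatched positions
that this sketch ignores.

.. code-block:: python3

    def passes_seed_filter(read: str, oligo: str,
                           n_mismatches: int = 2,
                           oligo_tile_length: int = 117) -> bool:
        """Return True if the read has at most n_mismatches mismatches in
        the first oligo_tile_length bases (the 5', high-quality end)."""
        seed_len = min(oligo_tile_length, len(read), len(oligo))
        mismatches = sum(
            r != o for r, o in zip(read[:seed_len], oligo[:seed_len])
        )
        return mismatches <= n_mismatches

    # Two mismatches in the seed region pass under the default settings.
    assert passes_seed_filter("ACGTACGT", "ACGAACGA", oligo_tile_length=8)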
It's worth noting that while we only report a single alignment per read, if
there are identical oligo sequences in the peptide table (described below),
the parameter ``--replicate_sequence_counts`` (default True) ensures that the
resulting alignment counts for all replicate sequences are reported as the sum
of alignments across each.

Ideally, the ``--read_length`` is the same as the length specified by the
``--oligo_tile_length`` parameter. If the ``--read_length`` is greater than
the ``--oligo_tile_length``, we use bowtie's ``--trim3`` parameter to trim the
3' end of the reads to match the ``--oligo_tile_length``. If for some reason
reads are shorter than the ``--oligo_tile_length``, or variable in length, we
recommend setting ``--read_length`` to ``0``, and potentially allowing more
``--n_mismatches``, so that reads are *not* trimmed before alignment and
partial reads still have a chance of being reported.

If you would like to modify the behavior of the alignment approach beyond the
scope of the ``--bowtie_optional_args`` parameter described above, or even
swap out the alignment tool itself, and you have experience with Nextflow, it
should be relatively straightforward to modify the
`alignment template script `_, its `respective parameters `_, and the
`associated process definition `_.

Input file requirements
+++++++++++++++++++++++

Input to the pipeline is dependent upon the following:

- **NGS files**: Single-read, demultiplexed fastq files for each of the
  samples. We do not currently support paired-end reads.

- **sample annotation table**: a CSV containing a column *fastq_filepath*,
  where each row contains a path, relative to where the pipeline is run,
  pointing to the respective fastq file.

- **peptide annotation table**: a CSV containing a column *oligo*, where each
  row contains a single peptide from the complete library used in the antibody
  immunoprecipitation step. These sequences will be built into the reference
  index to which all samples are aligned.

As an example, let's assume there's some directory *ngs/* containing all the
fastq files for a project. To organize these files (excluding barcode files)
into a minimal sample table describing each of their relative paths, we might
use the following command.

.. code-block:: bash

    (echo "fastq_filepath" && ls ngs/*R1*.gz) > sample_table.csv
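If you'd rather build the sample table programmatically, here's a minimal
pandas sketch equivalent to the shell one-liner above (the *ngs/* directory
and ``R1`` naming follow the example; the ``library_batch`` column is
hypothetical, included only to show how extra annotations can be added):

.. code-block:: python3

    import glob

    import pandas as pd

    # Collect the demultiplexed fastq files, mirroring `ls ngs/*R1*.gz`
    # (the pattern excludes barcode files in this example layout).
    fastq_files = sorted(glob.glob("ngs/*R1*.gz"))

    sample_table = pd.DataFrame({"fastq_filepath": fastq_files})

    # Any extra per-sample annotation columns added here are carried through
    # to all output formats. "library_batch" is a hypothetical example.
    sample_table["library_batch"] = "batch_1"

    sample_table.to_csv("sample_table.csv", index=False)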
Now, we must have a peptide annotation file describing the phage library used
in this particular study. Usually, we expect something of this nature to have
been created during the phage library design, prior to synthesizing the
library. For the purposes of this pipeline, we must have a column denoting
the oligonucleotide sequence.

Here's a peek at what a `phage-dms `_ peptide annotation might look like:

::

    Virus,Protein,Loc,aa_sub,Loc_Rel,is_wt,oligo
    BG505,gp120,1,G,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTGGTGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,E,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTGAAGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,D,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTGACGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,V,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTGTTGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,A,30,TRUE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTGCTGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,R,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTCGTGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,S,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTTCTGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,K,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTAAAGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc
    BG505,gp120,1,N,30,FALSE,aggaattctacgctgagtGGAGGAGGTGGTTCTGGTGGTGGAGGTTCAGGTGGTGGTGGAAGTAACGAGAACCTGTGGGTGACCGTGTATTACGGCGTTCCTGTCTGGAAAtgatagcaagcttgcc

.. warning:: Currently, only *upper case* oligonucleotides will be included as
   part of the reference index when aligning the reads. Historically, we have
   encoded the barcodes with lower case letters.

With these, we can simply use the same command as shown above; however, now
we specify the ``--sample_table`` and ``--peptide_table`` parameters to the
``run`` command:

.. code-block:: bash

    #!/bin/bash
    set -e
    source /app/lmod/lmod/init/profile
    module load nextflow
    module load Singularity
    export PATH=$SINGULARITYROOT/bin/:$PATH

    nextflow run matsengrp/phip-flow -r main \
        --sample_table sample_table.csv \
        --peptide_table peptide_table.csv \
        --output_tall_csv true \
        --output_wide_csv true \
        --results "$(date -I)" \
        -profile cluster \
        -resume

Note that while here we specified nothing but the fastq filepaths in the
sample table, we could have populated the CSV with any number of useful
annotations pertaining to the fastq files in each of the rows. Any of the
annotations added here will be tied in correctly to all output formats for
more organized downstream analysis and plotting.

If you want to run some of the more advanced analyses available through this
pipeline, such as fold enrichment, differential selection, or model fitting
for estimates of significance, you will need to include special annotations
in either of the annotation tables. The requirements and descriptions of
these columns can be found in the :ref:`optional workflows ` section of the
documentation.
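Before launching a run on new data, a quick sanity check of both tables
against the requirements above can save a failed run. Here's a small pandas
sketch (the file names match the run command above; the lower-case-adapter
regex encodes the upper-case-tile convention from the warning and the
uniform-tile-length assumption described earlier, so adjust it if your
library is encoded differently):

.. code-block:: python3

    import os

    import pandas as pd

    samples = pd.read_csv("sample_table.csv")
    peptides = pd.read_csv("peptide_table.csv")

    # Required columns for the pipeline.
    assert "fastq_filepath" in samples.columns
    assert "oligo" in peptides.columns

    # Every fastq path must exist relative to where the pipeline is run.
    missing = [fp for fp in samples["fastq_filepath"] if not os.path.exists(fp)]
    assert not missing, f"missing fastq files: {missing}"

    # Only the upper-case portion of each oligo is indexed for alignment;
    # strip lower-case adapters/barcodes and check tiles are uniform length.
    tiles = peptides["oligo"].str.replace(r"[acgt]", "", regex=True)
    tile_lengths = tiles.str.len().unique()
    assert len(tile_lengths) == 1, f"non-uniform tile lengths: {tile_lengths}"

    print(f"{len(samples)} samples, {len(peptides)} peptides, "
          f"tile length {tile_lengths[0]}")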