gctree

The primary gctree command.

genotype collapsed tree inference and simulation

usage: gctree [-h] {test,infer,simulate} ...

Sub-commands

test

run tests on library functions

gctree test [-h] [--outbase OUTBASE] [--img_type IMG_TYPE] [--verbose]

Named Arguments

--outbase: output file base name
--img_type: output image file type
--verbose: flag for verbose messaging

infer

likelihood ranking of parsimony trees

gctree infer [-h] [--root ROOT] [--colormapfile COLORMAPFILE]
             [--chain_split CHAIN_SPLIT] [--frame2 {1,2,3}]
             [--positionmapfile POSITIONMAPFILE]
             [--positionmapfile2 POSITIONMAPFILE2] [--idmapfile IDMAPFILE]
             [--isotype_mapfile ISOTYPE_MAPFILE]
             [--isotype_names ISOTYPE_NAMES [ISOTYPE_NAMES ...]]
             [--mutability MUTABILITY] [--substitution SUBSTITUTION]
             [--branching_process_ranking_coeff BRANCHING_PROCESS_RANKING_COEFF]
             [--ranking_coeffs RANKING_COEFFS RANKING_COEFFS RANKING_COEFFS]
             [--ranking_strategy RANKING_STRATEGY] [--use_old_mut_parsimony]
             [--show_nucleotide_mutations] [--summarize_forest] [--tree_stats]
             [--outbase OUTBASE] [--img_type IMG_TYPE] [--verbose]
             [--frame {1,2,3}] [--idlabel]
             infiles [infiles ...]

Positional Arguments

infiles: Input files for inference. If two filenames are passed, the first shall be a dnapars outfile (verbose output with sequences at each site), and the second shall be an abundance file containing allele frequencies (sequence counts) in the format: SeqID, Nobs. If a single filename is passed, it shall be the name of a pickled history DAG object created by gctree. A new pickled forest will be output only if new annotations (such as isotypes) are added.
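As a rough illustration of the two-file form, the following Python sketch writes an abundance file in the SeqID, Nobs layout described above and assembles a matching gctree infer command line. The file names (outfile, abundances.csv), the sequence IDs, and the counts are placeholders. The sketch also assumes that the abundance file is plain comma-separated text with no header row, which this page does not confirm. The command is only printed, not run.

```python
import csv
import io
import shlex

# Placeholder counts: sequence ID -> number of observations (Nobs).
abundances = {"root": 0, "seq1": 3, "seq2": 1}

def write_abundance_csv(counts, handle):
    """Write one 'SeqID,Nobs' row per sequence (assumed: no header row)."""
    writer = csv.writer(handle, lineterminator="\n")
    for seq_id, nobs in counts.items():
        writer.writerow([seq_id, nobs])

buffer = io.StringIO()
write_abundance_csv(abundances, buffer)
abundance_text = buffer.getvalue()

# Two positional infiles: the dnapars outfile first, then the abundance file.
infer_cmd = ["gctree", "infer", "outfile", "abundances.csv",
             "--root", "root", "--frame", "1"]
print(abundance_text, end="")
print(shlex.join(infer_cmd))
```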

Named Arguments

--root

name of root sequence (outgroup root), default "root"

--colormapfile

File containing color map in the tab-separated format: "SeqID color"

--chain_split

when using concatenated heavy and light chains, this is the 0-based index at which the 2nd chain begins, needed for determining coding frame in both chains, and also to correctly calculate context-based Poisson likelihood.
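To make the 0-based convention concrete, here is a small Python sketch that splits a concatenated sequence at a chain_split index. The sequences are made-up placeholders and the helper names are hypothetical. Reading frame N as "the first full codon starts at 0-based offset N-1" is an assumption of this sketch, not something this page states.

```python
def split_chains(sequence, chain_split):
    """Return (first_chain, second_chain); chain_split is the 0-based start of chain 2."""
    return sequence[:chain_split], sequence[chain_split:]

def codons(chain, frame=1):
    """List the complete codons of a chain, assuming frame N starts at offset N-1."""
    start = frame - 1
    usable = len(chain) - (len(chain) - start) % 3
    return [chain[i:i + 3] for i in range(start, usable, 3)]

heavy, light = "ATGGCC", "GGTACGT"  # placeholder chains
concatenated = heavy + light
first, second = split_chains(concatenated, chain_split=len(heavy))
print(first, second)
print(codons(second, frame=2))
```

With this reading, --chain_split would be the length of the first chain, and --frame2 would describe the frame of the second chain counted from its own first base.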

--frame2

Possible choices: 1, 2, 3

codon frame for the second chain when using the chain_split option

--positionmapfile

file containing a list of position numbers (e.g. IMGT) corresponding to indices in sequence

--positionmapfile2

positionmapfile for the 2nd chain when using the chain_split option

--idmapfile

input filename for a csv file mapping sequence names to original sequence ids. For use by isotype ranking. Such a file can be produced by deduplicate when it is provided the --idmapfile option.

--isotype_mapfile

filename for a csv file mapping original sequence ids to observed isotypes. For example, each line should have the format ‘some_sequence_id, some_isotype’.

--isotype_names

A list of isotype names used in isotype_mapfile, in order of most naive to most differentiated. Default is equivalent to providing the argument --isotype_names IgM IgD IgG3 IgG1 IgG2 IgE IgA
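The ordering matters because it defines which isotype changes move toward a more differentiated state. A minimal Python sketch, using the default order quoted above and a hypothetical helper name, checks whether a parent-to-child isotype change respects that order and assembles the corresponding command-line arguments. Treating "same or later in the list" as the allowed direction is an assumption of the sketch.

```python
DEFAULT_ISOTYPE_ORDER = ["IgM", "IgD", "IgG3", "IgG1", "IgG2", "IgE", "IgA"]

def respects_switch_order(parent, child, order=DEFAULT_ISOTYPE_ORDER):
    """True if child is the same isotype as parent or later in the order."""
    return order.index(parent) <= order.index(child)

# Arguments equivalent to the documented default.
isotype_args = ["--isotype_names", *DEFAULT_ISOTYPE_ORDER]

print(respects_switch_order("IgM", "IgG1"))
print(respects_switch_order("IgA", "IgM"))
print(" ".join(isotype_args))
```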

--mutability

Path to mutability model file. If --mutability and --substitution are both provided, they will be used to rank trees after likelihood and isotype parsimony. This shall be a csv file with the first column containing fivemers and the second column containing mutability scores. See a file excerpt in the documentation for mutation_model.MutationModel().

--substitution

Path to substitution model file. If --mutability and --substitution are both provided, they will be used to rank trees after likelihood and isotype parsimony. This shall be a csv file with the first column containing fivemers and the next four columns containing targeting probabilities for bases A, C, G, and T, respectively. See a file excerpt in the documentation for mutation_model.MutationModel().

--branching_process_ranking_coeff

This argument is deprecated. Use --ranking_strategy instead. Coefficient used for branching process likelihood, when ranking trees by a linear combination of traits. This value will be ignored if --ranking_coeffs argument is not also provided.

--ranking_coeffs

This argument is deprecated. Use --ranking_strategy instead. List of coefficients for ranking trees by a linear combination of traits. Coefficients are in order: isotype parsimony, mutation model parsimony, number of alleles. A coefficient of -1 will be applied to branching process likelihood. If not provided, trees will be ranked lexicographically by likelihood, isotype parsimony, and context-based Poisson likelihood in that order.

--ranking_strategy

Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. Two types of expressions are permitted, and they cannot be combined (for example, B+C,A is not a valid ranking strategy expression):

Lexicographic orderings, like B,C,A, which means choose trees to minimize branching process log loss, then minimize context log loss, then minimize number of alleles.
Linear combinations of criteria, like B+2C-1.1A, which means choose trees to minimize the specified linear combination of criteria. If a linear combination expression has a leading -, use = instead of a space to separate the argument from its value, e.g. --ranking_strategy=-B+R.

Ranking criteria are specified using the following identifiers. All are minimized by default:

B: branching process log loss
I: isotype parsimony
C: context-based Poisson log loss
M: old mutability parsimony
A: number of alleles
R: sitewise reversions to naive sequence

To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion with a coefficient of zero, as in B+2C+0A, or B,C,0A.

To maximize instead of minimizing a criterion in lexicographic ranking, provide a negative coefficient. For example, B,-A will first minimize branching process log loss, then maximize the number of alleles.

A ranking strategy string containing a single ranking criterion identifier will be interpreted as a lexicographic ordering. gctree infer --verbose will describe the ranking strategy used. Examine this output to make sure it is as expected.
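The grammar described above can be sketched as a small classifier. This is an illustration written for this page, not gctree's own parser: the function names and the regular expression are hypothetical, and edge cases may be handled differently by the real tool. It distinguishes lexicographic from linear expressions and rejects mixtures such as B+C,A.

```python
import re

# One term: optional sign, optional numeric coefficient, one criterion identifier.
TERM = re.compile(r"([+-]?)(\d+(?:\.\d*)?|\.\d+)?([BICMAR])")

def parse_terms(piece):
    """Parse e.g. 'B+2C-1.1A' into [(1.0, 'B'), (2.0, 'C'), (-1.1, 'A')]."""
    terms, pos = [], 0
    while pos < len(piece):
        match = TERM.match(piece, pos)
        # Every term after the first must carry an explicit + or - sign.
        if match is None or (pos > 0 and not match.group(1)):
            raise ValueError(f"cannot parse {piece!r} at position {pos}")
        sign, number, ident = match.groups()
        coeff = float(number) if number else 1.0
        terms.append((-coeff if sign == "-" else coeff, ident))
        pos = match.end()
    if not terms:
        raise ValueError("empty ranking expression")
    return terms

def classify_strategy(expr):
    """Return ('lexicographic' or 'linear', terms) for a strategy expression."""
    pieces = [parse_terms(p) for p in expr.replace(" ", "").split(",")]
    if len(pieces) > 1:
        if any(len(p) != 1 for p in pieces):
            raise ValueError("lexicographic and linear forms cannot be combined")
        return "lexicographic", [p[0] for p in pieces]
    terms = pieces[0]
    # A single identifier is treated as a lexicographic ordering.
    return ("lexicographic" if len(terms) == 1 else "linear"), terms

print(classify_strategy("B,C,A"))
print(classify_strategy("B+2C-1.1A"))
print(classify_strategy("B,-A"))
```

When passing an expression with a leading minus on the command line, the documented form is a single token such as --ranking_strategy=-B+R, so that the value is not mistaken for another option.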

--use_old_mut_parsimony

This argument is deprecated. Use the identifier ‘M’ with the argument --ranking_strategy instead. Use old mutability parsimony instead of Poisson context likelihood. Not recommended unless attempting to reproduce results from older versions of gctree. This argument will have no effect unless an S5F model is provided with the arguments --mutability and --substitution.

--show_nucleotide_mutations

If provided, branches in rendered tree will be annotated with nucleotide mutations.

--summarize_forest

write a file [outbase].forest_summary.log with a summary of traits for trees in the forest.

--tree_stats

write a file [outbase].tree_stats.log with stats for all trees in the forest. For large forests, this is slow and memory intensive.

--outbase

output file base name

--img_type

output image file type

--verbose

flag for verbose messaging

--frame

Possible choices: 1, 2, 3

codon frame

--idlabel

label nodes with their sequence ids in output tree images, and write a fasta alignment mapping those sequence ids to sequences. This is the easiest way to access inferred ancestral sequences.

simulate

gctree simulation under neutral and target-selective models

gctree simulate [-h] [--sequence2 SEQUENCE2] [--lambda LAMBDA_]
                [--lambda0 [LAMBDA0 ...]] [--n N] [--N N] [--T T [T ...]]
                [--seed SEED] [--target_dist TARGET_DIST] [--plotAA PLOTAA]
                [--outbase OUTBASE] [--img_type IMG_TYPE] [--verbose]
                [--frame {1,2,3}] [--idlabel]
                sequence mutability substitution

Positional Arguments

sequence: seed root nucleotide sequence
mutability: path to mutability model file
substitution: path to substitution model file

Named Arguments

--sequence2

Second seed root nucleotide sequence. For simulating heavy/light chain co-evolution.

--lambda

Poisson branching parameter

--lambda0

List of one or two elements with the baseline mutation rates. Space separated input values. First element belonging to seed sequence one and optionally the next to sequence 2. If only one rate is provided for two sequences, this rate will be used on both.
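The one-or-two-rates rule can be sketched in Python as follows. The helper name resolve_lambda0 is hypothetical, and the seed sequences, model file names, rate, and random seed in the command are placeholders chosen only to show the argument layout. The command is printed rather than run.

```python
import shlex

def resolve_lambda0(rates, n_sequences):
    """Return one baseline mutation rate per seed sequence."""
    if len(rates) == 1:
        # A single rate is reused for every seed sequence.
        return list(rates) * n_sequences
    if len(rates) == n_sequences:
        return list(rates)
    raise ValueError("expected one rate, or one rate per seed sequence")

# Positional arguments first: sequence, mutability, substitution.
simulate_cmd = [
    "gctree", "simulate",
    "ATGGCCGGT", "mutability.csv", "substitution.csv",
    "--sequence2", "GGTACGTTA",
    "--lambda0", "0.25",
    "--seed", "1",
]
print(resolve_lambda0([0.25], n_sequences=2))
print(shlex.join(simulate_cmd))
```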

--n

cells downsampled

--N

target simulation size

--T

observation time; if None, the simulation runs until termination and all leaves are taken

--seed

integer random seed

--target_dist

The number of non-synonymous mutations the target should be away from the root.

--plotAA

Plot trees with collapsing and coloring on amino acid level.

--outbase

output file base name

--img_type

output image file type

--verbose

flag for verbose messaging

--frame

Possible choices: 1, 2, 3

codon frame

--idlabel

label nodes with their sequence ids in output tree images, and write a fasta alignment mapping those sequence ids to sequences. This is the easiest way to access inferred ancestral sequences.

Parsimony utilities

Additional command line utilities for processing data for genotype collapse, inferring parsimony trees with phylip, and prepping parsimony trees for analysis with gctree.

deduplicate

Deduplicate sequences in a fasta file, write to stdout in phylip format, and create a few other files (see arguments). Headers must be a unique ID of less than or equal to 10 ASCII characters. An additional sequence representing the outgroup/root must be included (even if one or more observed sequences are identical to it).
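The idea of collapsing identical sequences into genotypes with abundances can be illustrated with a short Python sketch. This is a simplification written for this page, not the deduplicate implementation: it reuses the first original ID seen as the representative ID, whereas the real tool writes its own mapping of new unique ids to original ids. It also assumes the root record itself is not counted as an observation. The records are placeholders.

```python
from collections import Counter

def collapse_genotypes(records, root_id="root"):
    """Collapse identical sequences; return {representative_id: (sequence, abundance)}."""
    root_seq = dict(records)[root_id]
    observed = [(sid, seq) for sid, seq in records if sid != root_id]
    counts = Counter(seq for _, seq in observed)
    # Observed sequences identical to the root are counted under the root ID.
    collapsed = {root_id: (root_seq, counts.get(root_seq, 0))}
    for sid, seq in observed:
        already_seen = any(seq == s for s, _ in collapsed.values())
        if not already_seen:
            collapsed[sid] = (seq, counts[seq])
    return collapsed

records = [("root", "ACGT"), ("cell1", "ACGA"), ("cell2", "ACGA"), ("cell3", "ACGT")]
print(collapse_genotypes(records))
```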

usage: deduplicate [-h] [--abundance_file ABUNDANCE_FILE]
                   [--idmapfile IDMAPFILE] [--id_abundances] [--root ROOT]
                   [--frame {1,2,3}] [--colorfile COLORFILE]
                   [--colormap COLORMAP]
                   infile [infile ...]

Positional Arguments

infile: Fasta file whose headers are unique IDs of at most 10 characters. Because dnapars names internal nodes with integers, a node name must include at least one non-numeric character (unless using the id_abundances option).
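A quick pre-flight check of header IDs against the constraints stated above might look like the following Python sketch. The function name and messages are hypothetical, and deduplicate may enforce these rules differently or report them in another way.

```python
def check_headers(headers, id_abundances=False):
    """Return a list of problems found in fasta header IDs."""
    problems = []
    seen = set()
    for header in headers:
        if header in seen:
            problems.append(f"{header}: duplicate ID")
        seen.add(header)
        if len(header) > 10 or not header.isascii():
            problems.append(f"{header}: must be at most 10 ASCII characters")
        # Purely numeric names collide with dnapars internal node names,
        # unless integer IDs are being interpreted as abundances.
        if header.isdigit() and not id_abundances:
            problems.append(f"{header}: needs at least one non-numeric character")
    return problems

for problem in check_headers(["root", "cell_1", "12345", "averyveryverylongname", "cell_1"]):
    print(problem)
```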

Named Arguments

--abundance_file

filename for the output file containing the counts.

--idmapfile

filename for the output file containing the map of new unique ids to original seq ids.

--id_abundances

flag to interpret integer ids in input as abundances

--root

ID of the root sequence in fasta file. This ID will be used as the unique id for the root sequence, and any observed sequences matching the root sequence.

--frame

Possible choices: 1, 2, 3

codon frame

--colorfile

optional input csv filename for colors of each cell.

--colormap

optional output filename for colors map.

mkconfig

Read a PHYLIP-format file and produce an appropriate config file for passing to dnapars.

dnapars doesn’t play very well in a pipeline. It prompts the user for configuration information and reads responses from stdin. The config file generated by this script is meant to mimic the responses to the expected prompts.

Typical usage is,

$ mkconfig sequence.phy dnapars > dnapars.cfg
$ dnapars < dnapars.cfg
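The same two-step pipeline can be driven from Python. The sketch below is an assumption-laden illustration: the function names are hypothetical, the file names are placeholders, and run_dnapars only works if both mkconfig and dnapars are installed and on PATH. It passes dnapars as the treeprog positional argument listed in the usage line. Only the argument list is printed here; nothing is executed.

```python
import subprocess

def mkconfig_argv(phylip_path, treeprog="dnapars", quick=False):
    """Build a mkconfig command line from its documented arguments."""
    argv = ["mkconfig"]
    if quick:
        argv.append("--quick")
    return argv + [phylip_path, treeprog]

def run_dnapars(phylip_path, cfg_path="dnapars.cfg"):
    """Write the config file, then feed it to dnapars on stdin."""
    with open(cfg_path, "w") as cfg:
        subprocess.run(mkconfig_argv(phylip_path), stdout=cfg, check=True)
    with open(cfg_path) as cfg:
        subprocess.run(["dnapars"], stdin=cfg, check=True)

print(mkconfig_argv("sequence.phy", quick=True))
```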

usage: mkconfig [-h] [--quick] [--jumble JUMBLE] [--bootstrap BOOTSTRAP]
                phylip treeprog

Positional Arguments

phylip: PHYLIP input
treeprog: dnapars, dnaml, or seqboot

Named Arguments

--quick: quicker (less thorough) dnapars
--jumble: search tree space with this many random permutations of the input sequences
--bootstrap: input is seqboot output with this many samples

phylip_parse

Given an output file from one of the PHYLIP tools (dnaml or dnapars), produce a CollapsedForest containing the trees in that file.

usage: phylip_parse [-h] [--outputfile OUTPUTFILE] [--root ROOT]
                    phylip_outfile abundance_file

Positional Arguments

phylip_outfile: dnaml outfile (verbose output with inferred ancestral sequences, option 5).
abundance_file: count file

Named Arguments

--outputfile: output file.
--root: root sequence id

isotype

Given gctree inference outputs, and a file mapping original sequence names (the original sequence ids referenced as values in idmapfile) with format “Original SeqID, Isotype”, this utility:

Adds observed isotypes to each observed node in the collapsed trees output by gctree inference. If cells with the same sequence but different isotypes are observed, then collapsed tree nodes must be ‘exploded’ into new nodes with the appropriate isotypes and abundances. Each unique sequence ID generated by gctree is prepended to its observed isotype, and a new isotyped.idmap mapping these new sequence IDs to original sequence IDs is written in the output directory.
Resolves isotypes of unobserved ancestral genotypes in a way that minimizes isotype switching and obeys isotype switching order. If observed isotypes of an observed internal node and its children violate switching order, then the observed internal node is replaced with an unobserved node with the same sequence, and the observed internal node is placed as a child leaf. This procedure always allows switching order conflicts to be resolved, and should usually increase isotype transitions required in the resulting tree.
Renders each new collapsed tree with colors and labels reflecting observed or inferred isotypes, and writes a fasta and newick file just like the gctree inference pipeline.
Prints for each collapsed tree, the original branching process log likelihood, the original node count, the isotype parsimony score, and the new node count after isotype additions. The isotype parsimony score is just a count of how many isotype transitions are required along tree edges. Changes in node count after isotype additions indicate that either observed nodes had to be exploded based on observed isotypes, or isotype switching order violations required internal nodes to be expanded as leaf nodes.
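Since the isotype parsimony score is described as a count of isotype transitions along tree edges, it can be sketched in a few lines of Python. This is only an illustration of the definition, not gctree's implementation: the tree is reduced to a hypothetical list of (parent isotype, child isotype) pairs, and the isotype labels are placeholders.

```python
def isotype_parsimony(edges):
    """Count tree edges whose parent and child isotypes differ."""
    return sum(1 for parent, child in edges if parent != child)

# Toy tree: each tuple is (parent isotype, child isotype) for one edge.
toy_edges = [
    ("IgM", "IgM"),
    ("IgM", "IgG1"),
    ("IgG1", "IgG1"),
    ("IgG1", "IgA"),
]
score = isotype_parsimony(toy_edges)
print(score)
```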

This tool doesn’t make any judgements about which tree is best. Tree output order is the same as in gctree inference: ranking is by branching process likelihood before isotype additions. A determination of which is the best tree is left to the user, based on likelihoods, isotype parsimony score, and changes in the number of nodes after isotype additions.

usage: isotype [-h] [--trees TREES [TREES ...]]
               [--isotype_names ISOTYPE_NAMES [ISOTYPE_NAMES ...]]
               [--out_directory OUT_DIRECTORY]
               idmapfile isotype_mapfile

Positional Arguments

idmapfile: filename for a csv file mapping sequence names to original sequence ids, like the one output by deduplicate.
isotype_mapfile: filename for a csv file mapping original sequence ids to observed isotypes. For example, each line should have the format ‘some_sequence_id, some_isotype’.

Named Arguments

--trees: filenames for collapsed tree pickle files output by gctree inference
--isotype_names: A list of isotype names used in isotype_mapfile, in order of most naive to most differentiated. Default is equivalent to providing the argument --isotype_names IgM IgD IgG3 IgG1 IgG2 IgE IgA
--out_directory: Directory in which to place output. Default is working directory.