Getting Started

Installation

The Python dependencies are defined in the requirements file and should be installed automatically when using conda or pip.

In addition to the Python dependencies, Clade-o-Matic requires Jellyfish 2.3.0 and snp-dists 0.8.2. Install the latest released version from conda:

conda create -c bioconda -c conda-forge -n cladeomatic cladeomatic

Install using pip:

pip install cladeomatic

Install the latest master branch version directly from GitHub:

conda install jellyfish snp-dists
pip install git+https://github.com/phac-nml/cladeomatic.git

Quick Start

Basic Usage

If you run cladeomatic, you should see the following usage statement:

Usage: cladeomatic <command> [options] <required arguments>

To get minimal usage for a command use:
cladeomatic command

To get full help for a command use one of:
cladeomatic command -h
cladeomatic command --help

Available commands:

create     Identify population structure and develop typing scheme
genotype   Call genotypes from a VCF and a scheme file
benchmark  Test developed scheme using labeled samples and scheme
namer      Rename genotypes within a scheme

For further reference on the available options, refer to the Usage Guide.

Create Scheme

Option 1 - De novo tree-based

This mode discovers clades and lineages that meet the membership size and SNP requirements.

Input requirements are:

Newick-formatted tree
Reference (Outgroup) sequence (.fasta / .gbk)
VCF (Must use the same reference sequence as above)
Name of Reference (Outgroup) sequence (Must be the same as the reference sequence)
Metadata file

cladeomatic create --in_nwk examples/small_test/tree.nwk --in_var examples/small_test/snps.vcf --in_meta examples/small_test/sample.meta.txt --outdir small_test_cladeomatic/ --root_name root.0 --reference examples/small_test/root.gbk
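
As a rough illustration of what "meeting membership size and SNP requirements" means, the sketch below filters a list of candidate clades by two thresholds. It is not Clade-o-Matic's implementation: the `CandidateClade` class, the `select_clades` function and the `min_members` / `min_snps` parameter names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class CandidateClade:
    node_id: str          # label of the tree node defining the clade (hypothetical)
    members: int          # number of samples in the clade
    supporting_snps: int  # number of SNPs supporting the clade


def select_clades(candidates, min_members=2, min_snps=1):
    """Keep candidates that satisfy both thresholds (names are assumptions)."""
    return [
        c for c in candidates
        if c.members >= min_members and c.supporting_snps >= min_snps
    ]


candidates = [
    CandidateClade("n1", members=5, supporting_snps=3),
    CandidateClade("n2", members=1, supporting_snps=4),  # too few members
    CandidateClade("n3", members=6, supporting_snps=0),  # no supporting SNP
]
selected = select_clades(candidates, min_members=2, min_snps=1)
print([c.node_id for c in selected])  # ['n1']
```

Only the first candidate passes both checks; the other two each fail one of them.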

Option 2 - Predefined groups

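This mode will attempt to define a scheme from a group manifest, using the groups that meet the membership size and SNP requirements. Note that every group id must be unique across all ranks to use this feature. Cladeomatic uses these ids for its internal representation of the genotypes, but they can be mapped to whatever nomenclature is desired in the output. The table below contrasts genotypes that reuse ids across ranks (invalid_genotype) with equivalents in which every id is unique (valid_genotype):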

sample_id	invalid_genotype	valid_genotype
A	0.1	0.1
B	0.1.1.1	0.2.4.7
C	0.1.1.2	0.2.4.8
D	0.1.1.2	0.2.4.8
E	0.2	0.3
F	0.2.1	0.3.5
G	0.2.2	0.3.6
H	0.2.2	0.3.6
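
The uniqueness rule can be checked before running create. The sketch below is one possible interpretation, not Clade-o-Matic's own validation: it assumes genotypes are dot-separated ids and treats an id as reused if it appears at more than one rank or under more than one parent path. The function name `find_reused_ids` is hypothetical.

```python
from collections import defaultdict


def find_reused_ids(genotypes):
    """Return the group ids that occur in more than one place in the hierarchy.

    A "place" is the pair (rank, path of parent ids). This is an assumed
    reading of "unique across all ranks", not the tool's actual check.
    """
    places = defaultdict(set)
    for genotype in genotypes:
        parts = genotype.split(".")
        for rank, group_id in enumerate(parts):
            places[group_id].add((rank, tuple(parts[:rank])))
    return sorted(gid for gid, seen in places.items() if len(seen) > 1)


# The two genotype columns from the table above.
invalid = ["0.1", "0.1.1.1", "0.1.1.2", "0.1.1.2", "0.2", "0.2.1", "0.2.2", "0.2.2"]
valid = ["0.1", "0.2.4.7", "0.2.4.8", "0.2.4.8", "0.3", "0.3.5", "0.3.6", "0.3.6"]

print(find_reused_ids(invalid))  # ['1', '2']
print(find_reused_ids(valid))    # []
```

In the invalid column the ids 1 and 2 each appear at several ranks, while in the valid column every id occupies exactly one position in the hierarchy.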

Input requirements are:

TSV formatted group file (sample_id, genotype)
VCF
Reference sequence (.fasta / .gbk)
Name of outgroup sequence
Metadata file

cladeomatic create --in_groups examples/small_test/groups.tsv --in_var examples/small_test/snps.vcf --in_meta examples/small_test/sample.meta.txt --outdir small_test_cladeomatic_groups/ --root_name root.0 --reference examples/small_test/root.gbk

Outputs:

{Output folder name}
├── {prefix}-altseq.fasta - Artificial sequence which has a different base from the reference at every position in scheme
├── {prefix}-biohansel.fasta  - biohansel formatted kmer fasta file
├── {prefix}-biohansel.meta.txt - descriptions of biohansel kmers: kmername,target_position,target_base
├── {prefix}-clades.info.txt - Information on each individual clade, including supporting SNPs and metadata associations
├── {prefix}-dist.mat.txt - tab-delimited distance matrix from snp-dists
├── {prefix}-extracted.kmers.txt - Raw kmer output of extracted kmers with positions mapped
├── {prefix}-filtered.vcf - VCF file where invalid sites have been removed
├── {prefix}-genotypes.distance.txt - Histogram of node distances
├── {prefix}-genotypes.raw.txt - Tree or group file without filtering
├── {prefix}-genotypes.selected.txt - Nodes which meet the user criteria
├── {prefix}-genotypes.supported.txt - Nodes which were selected based on the supported nodes
├── {prefix}-kmer.scheme.txt - Cladeomatic kmer based scheme
├── {prefix}-params.log - Selected parameters for the run
├── {prefix}-sample.distances.html - Histogram of node distances
├── {prefix}-snps.scheme.txt - Cladeomatic SNP based scheme
├── {prefix}-snps.info.txt
└── pseudo.seqs.fasta - reconstructed fasta sequences based on reference sequence and vcf

Genotype:

Genotype samples with the developed scheme, using a VCF file called against the same reference that was used to build the scheme.

Input requirements are:

VCF
Clade-O-Matic Scheme
Metadata file (sample_id,genotype) * Produced by “create” {prefix}-genotypes.selected.txt
(Optional) Metadata file (sample_id,genotype) * Produced by “create” {prefix}-genotypes.selected.txt

cladeomatic genotype --in_var examples/small_test/snps.vcf --in_scheme examples/small_test/cladeomatic-snp.scheme.txt --sample_meta examples/small_test/sample.meta.txt --genotype_meta examples/small_test/genotype.meta.txt --outfile genotype.calls.txt

VCF files do not include positions where every sample is either the reference base or missing. This is a problem for calling genotypes from a VCF file, because the missing and reference states cannot be distinguished at those positions. A workaround is to include a sequence that differs from the reference at every position targeted by the scheme. The create module generates such a sequence ({prefix}-altseq.fasta), in which every position used by the scheme is flipped to a base different from the reference. This is not an ideal solution, but it allows users to run the genotype module on SNIPPY-CORE output built from their query sequence and the “alt” sequence.
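
The idea behind the "alt" sequence can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used by the create module: positions are taken to be 1-based, the substitution table `FLIP` is arbitrary, and the function name `make_alt_sequence` is hypothetical.

```python
# Any base that differs from the reference will do; this mapping is arbitrary.
FLIP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def make_alt_sequence(reference, scheme_positions):
    """Return a copy of `reference` changed at each 1-based scheme position."""
    ref = reference.upper()
    alt = list(ref)
    for pos in set(scheme_positions):
        # Non-ACGT characters (e.g. N) are replaced with A.
        alt[pos - 1] = FLIP.get(ref[pos - 1], "A")
    return "".join(alt)


reference = "ACGTACGTNN"
positions = [1, 4, 9]
alt = make_alt_sequence(reference, positions)
print(alt)  # CCGAACGTAN
```

Because the alt sequence differs from the reference at every scheme position, those positions become variant sites and are written to the VCF, which is what lets reference calls be told apart from missing data.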

Outputs:

Outputs a file with the genotype calls for each input sample.

Benchmark Scheme:

Benchmark the scheme based on the output of the genotype tool. At this point, only VCF-based genotyping is supported.

Input requirements are:

TXT file produced by genotype module with predicted and expected genotypes, or tsv with predicted and submitted genotype information
Clade-O-Matic scheme file used to call genotypes
VCF
Name of column for predicted genotype
Name of column for submitted genotype

cladeomatic benchmark --in_var examples/small_test/snps.vcf --in_scheme examples/small_test/cladeomatic-kmer.scheme.txt --in_genotype examples/small_test/genotype.calls.txt --submitted_genotype_col genotype --predicted_genotype_col predicted_genotype --outdir benchmark

The benchmark tool reports the F1 scores for calling genotypes with the provided scheme and, for each sample, any sites responsible for the submitted genotype not being called.
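
For readers unfamiliar with the metric, the sketch below computes a standard per-genotype F1 score from paired submitted and predicted labels. It assumes exact-match scoring of whole genotype strings; the benchmark module may score differently (for example, by rank), and the function name `f1_per_genotype` is hypothetical.

```python
from collections import Counter


def f1_per_genotype(submitted, predicted):
    """Compute precision, recall and F1 for every submitted genotype label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, call in zip(submitted, predicted):
        if truth == call:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the submitted genotype was missed
            fp[call] += 1   # the predicted genotype was called wrongly
    scores = {}
    for label in sorted(set(submitted)):
        called = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / called if called else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Illustrative labels only; not taken from the example data set.
submitted = ["0.1", "0.1", "0.3.5", "0.3.6"]
predicted = ["0.1", "0.1", "0.3.6", "0.3.6"]
scores = f1_per_genotype(submitted, predicted)
for label, s in scores.items():
    print(label, round(s["f1"], 3))
```

Here genotype 0.1 is always called correctly, 0.3.5 is never called, and 0.3.6 is over-called once, which lowers its precision and therefore its F1.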

Outputs:

OutputFolderName
├── {prefix}-scheme.scores.txt
└── {prefix}-sample.results.txt