truvari package

Overview

Truvari - SV comparison and annotation toolkit

See help() of specific functions / objects for details

VariantRecord methods:

entry_boundaries() entry_distance() entry_gt_comp() entry_is_filtered() entry_is_present() entry_reciprocal_overlap() entry_same_variant_type() entry_shared_ref_context() entry_seq_similarity() entry_size() entry_size_similarity() entry_to_hash() entry_to_key() entry_within() entry_variant_type()

Extra methods:

allele_freq_annos() bed_ranges() build_anno_tree() calc_af() calc_hwe() compress_index_vcf() create_pos_haplotype() get_gt() get_scalebin() get_sizebin() get_svtype() msa2vcf() overlap_percent() overlaps() phab() reciprocal_overlap() ref_ranges() seqsim() sizesim() unroll_compare() vcf_ranges()

Dev methods:

benchdir_count_entries() chunker() cmd_exe() consolidate_phab_vcfs() coords_within() count_entries() file_zipper() help_unknown_cmd() make_temp_filename() opt_gz_open() optimize_df_memory() performance_metrics() region_filter() restricted_float() restricted_int() setup_logging() vcf_to_df()

Objects:

Bench BenchOutput GT RegionVCFIterator LogFileStderr MatchResult Matcher StatsBox SV

Data:

truvari.HEADERMAT truvari.QUALBINS truvari.SVTYTYPE truvari.SZBINMAX truvari.SZBINS truvari.SZBINTYPE

VariantRecord Methods

entry_boundaries

truvari.entry_boundaries(entry, ins_inflate=False)[source]

Get the start and end of an entry

Parameters:
  • entry (pysam.VariantRecord) – entry to get bounds

  • ins_inflate (bool, optional) – inflate INS boundaries by sv length

Returns:

start/end

Return type:

tuple (int, int)

entry_distance

truvari.entry_distance(entryA, entryB)[source]

Calculate the start and end distances of the pair. Negative distances indicate entryA is upstream of entryB

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

Returns:

starts and ends distance

Return type:

tuple (int, int)

entry_gt_comp

truvari.entry_gt_comp(entryA, entryB, sampleA=None, sampleB=None)[source]

Compare the genotypes of two entries

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

  • sampleA (string, optional) – sample of entryA to check

  • sampleB (string, optional) – sample of entryB to check

Returns:

True if the genotypes are concordant

Return type:

bool

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_gt_comp(a, b)
True

entry_is_filtered

truvari.entry_is_filtered(entry, values=None)[source]

Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’

Parameters:
  • entry (pysam.VariantRecord) – entry to check

  • values (set, optional) – set of filter values for intersection

Returns:

True if entry should be filtered

Return type:

bool

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> truvari.entry_is_filtered(next(v)) # PASS shouldn't be filtered
False
>>> truvari.entry_is_filtered(next(v), set(["lowQ"])) # Call isn't lowQ, so filter
True

entry_is_present

truvari.entry_is_present(entry, sample=None, allow_missing=True)[source]

Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present

Parameters:
  • entry (pysam.VariantRecord) – entry to check

  • sample (string, optional) – sample name

Returns:

True if variant is present in the sample

Return type:

bool

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> truvari.entry_is_present(next(v))
True

entry_reciprocal_overlap

truvari.entry_reciprocal_overlap(entry1, entry2)[source]

Calculates reciprocal overlap of two entries

Parameters:
  • entry1 (pysam.VariantRecord) – First entry

  • entry2 (pysam.VariantRecord) – Second entry

Returns:

The reciprocal overlap

Return type:

float

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_reciprocal_overlap(a, b)
0

entry_same_variant_type

truvari.entry_same_variant_type(entryA, entryB, dup_to_ins=False)[source]

Check if entryA svtype == entryB svtype

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

  • dup_to_ins (bool) – Convert DUP to INS types

Returns:

True if entry SVTYPEs match

Return type:

bool

entry_seq_similarity

truvari.entry_seq_similarity(entryA, entryB, ref=None, min_len=0)[source]

Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

  • ref (pysam.FastaFile) – Reference genome

  • min_len (float, optional) – Minimum length of reference context to generate

Returns:

sequence similarity

Return type:

float

entry_size

truvari.entry_size(entry)[source]

Determine the size of the variant.

Note

How size is determined

  • Starts by trying to use INFO/SVLEN

  • If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.

  • If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF

  • Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))

Parameters:

entry (pysam.VariantRecord) – entry to look at

Returns:

the entry’s size

Return type:

int

entry_size_similarity

truvari.entry_size_similarity(entryA, entryB)[source]

Calculate the size similarity and difference for two entries

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

Returns:

size similarity and size diff (A - B)

Return type:

(float, int)

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_size_similarity(a, b)
(0.07142857142857142, 13)

entry_shared_ref_context

truvari.entry_shared_ref_context(entryA, entryB, ref, use_ref_seq=False, min_len=0)[source]

Get the shared reference context of two entries and create the haplotype

Parameters:
  • entryA (pysam.VariantRecord) – first entry

  • entryB (pysam.VariantRecord) – second entry

  • ref (pysam.FastaFile) – Reference genome

  • use_ref_seq (bool, optional) – If True, use the reference genome to get the variant sequence instead the entries

  • min_len (int, optional) – Minimum length of the reference context to create

Returns:

sequences created

Return type:

tuple : (string, string)

entry_to_hash

truvari.entry_to_hash(entry, hasher=<built-in function openssl_sha1>)[source]

Turn variant into a key and hash with provided hasher

Parameters:
  • entry (method) – entry

  • hasher – hashing function

Returns:

hash

Return type:

string

entry_to_key

truvari.entry_to_key(entry, prefix='', bounds=False)[source]

Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.

Warning

If a caller redundantly calls a variant exactly the same it will not have a unique key

Parameters:
  • entry (pysam.VariantRecord) – entry to stringify

  • prefix (string, optional) – prefix

  • bounds (bool, optional) – use entry_boundaries

Returns:

hashable string uniquely identifying the variant

Return type:

string

entry_variant_type

truvari.entry_variant_type(entry)[source]

Return entry’s SVTYPE

Note

How svtype is determined:

  • Starts by trying to use INFO/SVTYPE

  • If SVTYPE is unavailable, infer if entry is a insertion or deletion by looking at the REF/ALT sequence size differences

  • If REF/ALT sequences are not available, try to parse the <INS>, <DEL>, etc from the ALT column.

  • Otherwise, assume ‘UNK’

Parameters:

entry (pysam.VariantRecord)

Returns:

SV type

Return type:

truvari.SV

entry_within

..autofunction:: entry_within

Extra Methods

allele_freq_annos

truvari.allele_freq_annos(entry, samples=None)[source]

Calculate allele annotations for a VCF Entry

Parameters:
  • entry (pysam.VariantRecord) – Entry with samples to parse

  • samples (list of strings, optional) – Subset of samples from the entry over which to calculate annos

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count
MAC - minor allele count
AN - number of called alleles

Return type:

dict

Example
>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/multi.vcf.gz')
>>> truvari.allele_freq_annos(next(v))
{'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}

bed_ranges

truvari.bed_ranges(bed, chunk_size=10000000)[source]

Chunk bed regions into pieces. Useful for multiprocessing.

Parameters:
  • bed (string) – Filename of bed to chunk

  • chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000)
>>> print(len([_ for _ in gen]))
992

build_anno_tree

truvari.build_anno_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]

Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file

By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.

idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”

Parameters:
  • filename (string) – Path to file to parse, can be compressed

  • chrom_col (integer, optional) – Index of column in file with chromosome

  • start_col (integer, optional) – Index of column in file with start

  • end_col (integer, optional) – Index of column in file with end

  • one_based (bool, optional) – True if coordinates are one-based

  • is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior

  • comment (string, optional) – ignore lines if they start with this string

  • idxfmt (string, optional) – Index of column in file with chromosome

Returns:

dictionary with chromosome keys and intervaltree.IntervalTree values

Return type:

dict

calc_af

truvari.calc_af(gts)[source]

Calculate allele annotations for a list of genotypes

Parameters:

gts (list) – Genotype tuples

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count for GT 1
MAC - minor allele count
AN - number of called alleles
N_HEMI - Number of partial genotypes (length of 1 or a single missing)
N_HOMREF - homozygous reference GT count
N_HET - heterozygous GT count
N_HOMALT - homozygous alternate GT count
N_MISS - Number of missing genotypes (all .)

Return type:

dict

calc_hwe

truvari.calc_hwe(nref, nalt, nhet)[source]

Calculate Hardy Weinberg equilibrium and excess heterozygosity

Parameters:
  • nref (int) – Number of reference alleles

  • nalt (int) – Number of alternate alleles

  • nhet (int) – Number of heterozygous sites

Returns:

The HWE and ExcHet calculated

Return type:

tuple (float, float)

check_vcf_index

truvari.check_vcf_index(vcf_path)[source]

Return true if an index file is found for the vcf

compress_index_vcf

truvari.compress_index_vcf(fn, fout=None, remove=True)[source]

Compress and index a vcf. If no fout is provided, write to fn + ‘.gz’

Parameters:
  • fn (string) – filename of vcf to work on

  • fout (string, optional) – output filename

  • remove (bool) – remove fn after fout is made

create_pos_haplotype

truvari.create_pos_haplotype(a1, a2, ref, min_len=0)[source]

Create haplotypes of two allele’s regions that are assumed to be overlapping

Parameters:
  • a1 (tuple) – tuple of chrom, start, end, seq

  • a2 (tuple) – tuple of chrom, start, end, seq

  • ref (pysam.FastaFile) – Reference genome

  • min_len (int, optional) – Minimum length of the haplotype sequence to create

Returns:

allele haplotype sequences created

Rtupe:

tuple (string, string)

get_gt

truvari.get_gt(gt)[source]

Turn a genotype tuple into a GT object

Parameters:

gt (tuple (int, int)) – Genotype tuple

Returns:

A GT of the genotype

Return type:

truvari.GT

Example
>>> import truvari
>>> truvari.get_gt((0, 0))
<GT.REF: 0>
>>> truvari.get_gt((0, 1, 1))
<GT.UNK: 4>

get_scalebin

truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]

Scale variable x from rdomain to tdomain with step sizes return key, index

Parameters:
  • x (number) – Number to scale

  • rmin (number, optional) – The minimum of the range of your measurement

  • rmax (number, optional) – The maximum of the range of your measurement

  • tmin (number, optional) – The minimum of the range of your desired target scaling

  • tmax (number, optional) – The maximum of the range of your measurement

  • step (number, optional) – The step size of bins of target range

Returns:

The bin-string of the scaled variable

Return type:

string

Example
>>> import truvari
>>> truvari.get_scalebin(4, 1, 5, 0, 20, 5)
('[15,20)', 3)
>>> truvari.get_scalebin(6, 1, 5, 0, 20, 5)
('>=20', 4)

get_sizebin

truvari.get_sizebin(sz)[source]

Bin a given size into truvari.SZBINS

Parameters:

sz (integer) – SVLEN to bin into the SZBINS

Returns:

SZBIN

Return type:

string

get_svtype

truvari.get_svtype(svtype)[source]

Turn a SVTYPE string to a truvari.SV object

Parameters:

svtype (string)

Returns:

A SV of the SVTYPE

Return type:

truvari.SV

Example
>>> import truvari
>>> truvari.get_svtype("INS")
<SV.INS: 2>
>>> truvari.get_svtype("foo")
<SV.UNK: 6>

msa2vcf

truvari.msa2vcf(msa, anchor_base='N')[source]

Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string

Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries

Example (for dealing with test coverage not being seen)
>>> import truvari
>>> from truvari.phab import fasta_reader
>>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/"
>>> msa_file = "fm_92be9763c7114fd6ce736a7008114ad0.msa"
>>> seqs = open(msa_dir + msa_file).read()
>>> fasta = dict(fasta_reader(seqs))
>>> m_entries_str = truvari.msa2vcf(fasta)

overlap_percent

truvari.overlap_percent(astart, aend, bstart, bend)[source]

Calculates the percent of range A which overlaps with range B

Parameters:
  • astart (int) – First range’s start position

  • aend (int) – First range’s end position

  • bstart (int) – Second range’s start position

  • bend (int) – Second range’s end position

Returns:

overlap percent

Return type:

float

overlaps

truvari.overlaps(s1, e1, s2, e2)[source]

Check if two start/end ranges have overlap

Parameters:
  • s1 (int) – range 1 start

  • e1 (int) – range 1 end

  • s2 (int) – range 2 start

  • e2 (int) – range 2 end

Returns:

True if ranges overlap

Return type:

bool

phab

truvari.phab(var_regions, base_vcf, ref_fn, output_fn, bSamples=None, buffer=100, mafft_params='--auto --thread 1', comp_vcf=None, cSamples=None, prefix_comp=False, threads=1, method='mafft', passonly=True, max_size=50000)[source]

Harmonize variants with MSA. Runs on a set of regions given files to create haplotypes and writes results to a compressed/indexed VCF

Parameters:
  • var_regions (list) – List of tuples of region’s (chrom, start, end)

  • base_vcf (str) – VCF file name

  • ref_fn (str) – Reference file name

  • output_fn – Output VCF.gz

  • bSamples (list) – Samples from base_vcf to create haplotypes

  • buffer (int) – Flanking reference bases to added to regions

  • mafft_params (str) – Parameters for mafft

  • comp_vcf (str) – VCF file name

  • cSamples (list) – Samples from comp_vcf to create haplotypes

  • prefix_comp (bool) – Ensure unique sample names by prefixing comp samples

  • threads (int) – Number of threads to use

  • method (str) – Alignment method to use mafft or wfa

reciprocal_overlap

truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]

Calculates reciprocal overlap of two ranges

Parameters:
  • astart (int) – First range’s start position

  • aend (int) – First range’s end position

  • bstart (int) – Second range’s start position

  • bend (int) – Second range’s end position

Returns:

reciprocal overlap

Return type:

float

ref_ranges

truvari.ref_ranges(reference, chunk_size=10000000)[source]

Chunk reference into pieces. Useful for multiprocessing.

Parameters:
  • reference (string) – Filename of reference to chunk

  • chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000)
>>> print(len([_ for _ in gen]))
1000

seqsim

truvari.seqsim(allele1, allele2)[source]

Calculate similarity of two sequences

Parameters:
  • allele1 (pysam.VariantRecord) – first entry

  • allele2 (pysam.VariantRecord) – second entry

Returns:

sequence similarity

Return type:

float

sizesim

truvari.sizesim(sizeA, sizeB)[source]

Calculate the size similarity percent and size diff for two sizes

Parameters:
  • sizeA (int) – first size

  • sizeB (int) – second size

Returns:

size similarity percent and size diff (A - B)

Return type:

(float, int)

unroll_compare

truvari.unroll_compare(seqA, seqB, p)[source]

Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details

Parameters:
  • seqA (string) – sequence upstream sequence

  • seqB (string) – sequence downstream sequence

  • p (integer) – how many bases upstream seqA is from seqB

Returns:

sequence similarity of seqA vs seqB after unrolling

Return type:

float

vcf_ranges

truvari.vcf_ranges(vcf, min_dist=1000)[source]

Chunk vcf into discrete pieces. Useful for multiprocessing.

Parameters:
  • vcf (string) – Filename of vcf to find ranges

  • min_dist (int, optional) – Minimum distance between entries for new range

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz")
>>> print(len([_ for _ in gen]))
228

Dev methods

benchdir_count_entries

truvari.benchdir_count_entries(benchdir, regions, within=False, threads=4)[source]

Count the number of variants per bed region in Truvari bench directory by state

Returns a pandas dataframe of the counts

chunker

truvari.chunker(matcher, *files)[source]

Given a Matcher and multiple files, zip them and create chunks

Yields tuple of the chunk of calls, and an identifier of the chunk

cmd_exe

truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]

Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.

Parameters:
  • cmd (string) – The command to be executed.

  • stdin (bytes) – stdinput to pipe to command

  • timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never

  • cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal

  • pipefail (boolean) – Set to True if the cmd contains pipes |

Returns:

namedtuple of
ret_code - integer, exit code of the command
stdout - string, captured standard output of the command
stderr - binary string, captured standard error of the command
run_time - datetime.timedelta, execution time

Return type:

namedtuple (ret_code, stdout, stderr, run_time)

Example
>>> import truvari
>>> ret = truvari.cmd_exe("echo 'hello world'")
>>> ret.ret_code
0
>>> ret.stdout
'hello world\n'
>>> import truvari
>>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught
>>> ret.ret_code
214

consolidate_phab_vcfs

coords_within

truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]

Returns if a span is within the provided [start, end). All coordinates assumed 0-based

Parameters:
  • qstart (integer) – query start position

  • qend (integer) – query end position

  • start (integer) – start of span

  • end (integer) – end of span

  • end_within (bool) – if true, qend <= rend, else qend < rend

Returns:

If the coordinates are within the span

Return type:

bool

count_entries

truvari.count_entries(vcf, chroms, regions, within)[source]

Count the number of variants per bed region a VCF regions is a list of lists with sub-lists having 0:3 of chrom, start, end Returns a list of the counts in the same order as the regions

file_zipper

truvari.file_zipper(*start_files)[source]

Zip files to yield the entries in order. Each file must be sorted in the same order. start_files is a tuple of (‘key’, iterable) where key is the identifier (so we know which file the yielded entry came from) and iterable is usually a pysam.VariantFile

yields key, pysam.VariantRecord

help_unknown_cmd

truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]

Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.

Parameters:
  • user_cmd (string) – user command

  • avail_cmds (list of strings) – available commands

  • threshold (float, optional) – max score (0-1) to give a guess

Returns:

best guess from :param: avail_cmds or None

Return type:

string

Example
>>> import truvari
>>> truvari.help_unknown_cmd('banno', ["bench", "anno"])
'anno'
>>> truvari.help_unknown_cmd('random', ["bench", "anno"])

make_temp_filename

truvari.make_temp_filename(tmpdir=None, suffix='')[source]

Get a random filename in a tmpdir with an optional extension

opt_gz_open

truvari.opt_gz_open(in_fn)[source]

Chooses file handler for plain-text files or *.gz files.

returns a generator which yields lines of the file

optimize_df_memory

truvari.optimize_df_memory(df)[source]

Optimize DataFrame memory by trying to downcast every column in-place. Returns the pre/post size

Parameters:

df (pandas.DataFrame) – Dataframe to optimize

Returns:

tuple (pre_size, post_size)

Return type:

int, sizes in bytes

performance_metrics

truvari.performance_metrics(tpbase, tp, fn, fp)[source]

Calculates precision, recall, and f1 given counts by state Can return None if values are zero

Parameters:
  • tpbase (int) – found base count

  • tp (int) – found comp count

  • fn (int) – missed base count

  • fp (int) – missed comp count

Returns:

tuple of precision, recall, f1

Return type:

tuple, float

region_filter

truvari.region_filter(vcf, tree, inside=True, with_region=False)[source]

Chooses to stream or fetch entries inside/outside a VCF If the VCF is over 25Mb or the number of regions is above 1k, use stream

restricted_float

truvari.restricted_float(x)[source]

Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with argparse.ArgumentParser.add_argument type parameter

Parameters:

x (float) – number to check

Returns:

input number

Return type:

float

Example
>>> import truvari
>>> truvari.restricted_float(0.5)
0.5
>>> truvari.restricted_float(2)
Traceback (most recent call last):
argparse.ArgumentTypeError: 2.0 not in range (0, 1)

restricted_int

truvari.restricted_int(x)[source]

Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with argparse.ArgumentParser.add_argument type parameter

Parameters:

x (int) – number to check

Returns:

input number

Return type:

float

Example
>>> import truvari
>>> truvari.restricted_int(5)
5
>>> truvari.restricted_int(-2)
Traceback (most recent call last):
argparse.ArgumentTypeError: -2 is < 0

setup_logging

truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]

Create default logger

Parameters:
  • debug (boolean, optional) – Set log-level to logging.DEBUG

  • stream (file handler, optional) – Where log is written

  • log_format (string, optional) – Format of log lines

  • show_version (bool, optional) – Start log with truvari version and command line args

vcf_to_df

truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]

Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.

Parameters:
  • fn (string) – File name of VCF to open and turn into a DataFrame

  • with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns

  • with_format – Add the FORMAT fields from the VCF to the DataFrame columns

  • sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True

  • no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name

  • alleles (boolean, optional) – Add column with allele sequences

Returns:

Converted VCF

Return type:

pandas.DataFrame

Example
>>> import truvari
>>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True)
>>> df.columns
Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual',
       'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN',
       'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom',
       'NA12878_AD_ref', 'NA12878_AD_alt'],
      dtype='object')

Objects

Bench

class truvari.Bench(matcher=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False, short_circuit=False)[source]

Object to perform operations of truvari bench

You can build a Bench worker with default matching parameters via:

m_bench = truvari.Bench()

If you’d like to customize the parameters, build and edit a Matcher and pass it to the Bench init

matcher = truvari.Matcher()
matcher.params.pctseq = 0.50
m_bench = truvari.Bench(matcher)

To run on a chunk of pysam.VariantRecord already loaded, pass them in as lists to:

matches = m_bench.compare_calls(base_variants, comp_variants)

For advanced parsing, one can build the match matrix for a chunk of calls with:

match_matrix = m_bench.build_matrix(base_variants, comp_variants)

This can then be used for things such as creating custom match pickers or adding new matching checks.

If you want to run on existing files, simply supply the arguments to init

m_bench = truvari.Bench(matcher, base_vcf, comp_vcf, outdir)
output = m_bench.run()

Note that running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.

build_matrix(base_variants, comp_variants, chunk_id=0, skip_gt=False)[source]

Builds MatchResults, returns them as a numpy matrix

check_refine_candidate(result)[source]

Adds this region as a candidate for refinement if there are unmatched variants

compare_calls(base_variants, comp_variants, chunk_id=0)[source]

Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults

compare_chunk(chunk)[source]

Given a filtered chunk, (from chunker) compare all of the calls

param_dict()[source]

Returns the parameters as a dict

run()[source]

Runs bench and returns the resulting truvari.BenchOutput

BenchOutput

class truvari.BenchOutput(bench, matcher)[source]

Makes all of the output files for a Bench.run

The variable BenchOutput.vcf_filenames holds a dictonary. The keys are tpb, tpc for true positive base/comp vcf filename and fn, fp. The variable stats_box holds a StatsBox.

close_outputs()[source]

Close all the files

write_match(match)[source]

Annotate a MatchResults’ entries then write to the apppropriate file and do the stats counting. Writer is responsible for handling FPs between sizefilt-sizemin

StatsBox

class truvari.StatsBox[source]

Make a blank stats box for counting TP/FP/FN and calculating performance

calc_performance()[source]

Calculate the precision/recall

clean_out()[source]

When reusing a StatsBox (typically inside refine), gt numbers are typically invalidated. This removes those numbers from self to make a cleaner report

write_json(out_name)[source]

Write stats as json to file

GT

class truvari.GT(value)[source]

Genotypes

  • REF = <GT.REF: 0>

  • HET = <GT.HET: 1>

  • HOM = <GT.HOM: 2>

  • NON = <GT.NON: 3> - Non-Genotyped (e.g. ./.)

  • UNK = <GT.UNK: 4> - Undetermined Genotype

RegionVCFIterator

LogFileStderr

class truvari.LogFileStderr(fn)[source]

Write to stderr and a file

Useful in conjunction with setup_logging()

Example
>>> import truvari
>>> truvari.setup_logging(stream=LogFileStderr('log.txt'))
flush()[source]

Flush handlers

write(*args)[source]

Write to handlers

MatchResult

class truvari.MatchResult[source]

A base/comp match holder

calc_score()[source]

Unite the similarity measures and make a score

Matcher

class truvari.Matcher(args=None)[source]

Holds matching parameters. Allows calls to be checked for filtering and matches to be made

Example
>>> import pysam
>>> import truvari
>>> mat = truvari.Matcher()
>>> mat.params.pctseq = 0
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> one = next(v); two = next(v)
>>> mat.build_match(one, two)
<truvari.bench.MatchResult (False 2.381)>

Look at Matcher.make_match_params() for a list of all params and their defaults

build_match(base, comp, matid=None, skip_gt=False, short_circuit=False)[source]

Build a MatchResult if skip_gt, don’t do genotype comparison if short_circuit, return after first failure

filter_call(entry, base=False)[source]

Returns True if the call should be filtered Base has different filtering requirements, so let the method know

static make_match_params()[source]

Makes a simple namespace of matching parameters. Holds defaults

static make_match_params_from_args(args)[source]

Makes a simple namespace of matching parameters

SV

class truvari.SV(value)[source]

SVtypes

  • SNP = <SV.SNP: 0>

  • DEL = <SV.DEL: 1>

  • INS = <SV.INS: 2>

  • DUP = <SV.DUP: 3>

  • INV = <SV.INV: 4>

  • NON = <SV.NON: 5> - Not a variant (monomorphic ref?)

  • UNK = <SV.UNK: 6> - Unknown type

Data

HEADERMAT

regular expression of vcf header INFO/FORMAT fields with groups

truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)')

Compiled regular expression object.

QUALBINS

0-100 quality score bin strings (step size 10)

truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SVTYTYPE

pandas.CategoricalDtype of truvari.SV

SZBINMAX

integer list of maximum size for size bins

truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINS

string list of size bins

truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINTYPE

pandas.CategoricalDtype of truvari.SZBINS