truvari package

Overview

Truvari - SV comparison and annotation toolkit

See help() of specific functions / objects for details

VariantRecord methods:

entry_boundaries() entry_distance() entry_gt_comp() entry_is_filtered() entry_is_present() entry_reciprocal_overlap() entry_same_variant_type() entry_shared_ref_context() entry_seq_similarity() entry_size() entry_size_similarity() entry_to_hash() entry_to_key() entry_within() entry_variant_type()

Extra methods:

allele_freq_annos() bed_ranges() build_anno_tree() calc_af() calc_hwe() compress_index_vcf() create_pos_haplotype() get_gt() get_scalebin() get_sizebin() get_svtype() msa2vcf() overlap_percent() overlaps() phab() reciprocal_overlap() ref_ranges() seqsim() sizesim() unroll_compare() vcf_ranges()

Dev methods:

benchdir_count_entries() chunker() cmd_exe() consolidate_phab_vcfs() coords_within() count_entries() file_zipper() help_unknown_cmd() make_temp_filename() opt_gz_open() optimize_df_memory() performance_metrics() region_filter() restricted_float() restricted_int() setup_logging() vcf_to_df()

Objects:

Bench BenchOutput GT RegionVCFIterator LogFileStderr MatchResult Matcher StatsBox SV

Data:

truvari.HEADERMAT truvari.QUALBINS truvari.SVTYTYPE truvari.SZBINMAX truvari.SZBINS truvari.SZBINTYPE

VariantRecord Methods

entry_boundaries

truvari.entry_boundaries(entry, ins_inflate=False)[source]

Get the start and end of an entry

Parameters:

entry (pysam.VariantRecord) – entry to get bounds
ins_inflate (bool, optional) – inflate INS boundaries by sv length

Returns:

start/end

Return type:

tuple (int, int)

entry_distance

truvari.entry_distance(entryA, entryB)[source]

Calculate the start and end distances of the pair. Negative distances indicate entryA is upstream of entryB

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry

Returns:

starts and ends distance

Return type:

tuple (int, int)

entry_gt_comp

truvari.entry_gt_comp(entryA, entryB, sampleA=None, sampleB=None)[source]

Compare the genotypes of two entries

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry
sampleA (string, optional) – sample of entryA to check
sampleB (string, optional) – sample of entryB to check

Returns:

True if the genotypes are concordant

Return type:

bool

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_gt_comp(a, b)
True

entry_is_filtered

truvari.entry_is_filtered(entry, values=None)[source]

Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’

Parameters:

entry (pysam.VariantRecord) – entry to check
values (set, optional) – set of filter values for intersection

Returns:

True if entry should be filtered

Return type:

bool

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> truvari.entry_is_filtered(next(v)) # PASS shouldn't be filtered
False
>>> truvari.entry_is_filtered(next(v), set(["lowQ"])) # Call isn't lowQ, so filter
True

entry_is_present

truvari.entry_is_present(entry, sample=None, allow_missing=True)[source]

Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present

Parameters:

entry (pysam.VariantRecord) – entry to check
sample (string, optional) – sample name

Returns:

True if variant is present in the sample

Return type:

bool

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> truvari.entry_is_present(next(v))
True

entry_reciprocal_overlap

truvari.entry_reciprocal_overlap(entry1, entry2)[source]

Calculates reciprocal overlap of two entries

Parameters:

entry1 (pysam.VariantRecord) – First entry
entry2 (pysam.VariantRecord) – Second entry

Returns:

The reciprocal overlap

Return type:

float

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_reciprocal_overlap(a, b)
0

entry_same_variant_type

truvari.entry_same_variant_type(entryA, entryB, dup_to_ins=False)[source]

Check if entryA svtype == entryB svtype

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry
dup_to_ins (bool) – Convert DUP to INS types

Returns:

True if entry SVTYPEs match

Return type:

bool

entry_seq_similarity

truvari.entry_seq_similarity(entryA, entryB, ref=None, min_len=0)[source]

Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry
ref (pysam.FastaFile) – Reference genome
min_len (float, optional) – Minimum length of reference context to generate

Returns:

sequence similarity

Return type:

float

entry_size

truvari.entry_size(entry)[source]

Determine the size of the variant.

Note

How size is determined

Starts by trying to use INFO/SVLEN
If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.
If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF
Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))

Parameters:: entry (pysam.VariantRecord) – entry to look at
Returns:: the entry’s size
Return type:: int

entry_size_similarity

truvari.entry_size_similarity(entryA, entryB)[source]

Calculate the size similarity and difference for two entries

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry

Returns:

size similarity and size diff (A - B)

Return type:

(float, int)

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> truvari.entry_size_similarity(a, b)
(0.07142857142857142, 13)

entry_shared_ref_context

truvari.entry_shared_ref_context(entryA, entryB, ref, use_ref_seq=False, min_len=0)[source]

Get the shared reference context of two entries and create the haplotype

Parameters:

entryA (pysam.VariantRecord) – first entry
entryB (pysam.VariantRecord) – second entry
ref (pysam.FastaFile) – Reference genome
use_ref_seq (bool, optional) – If True, use the reference genome to get the variant sequence instead the entries
min_len (int, optional) – Minimum length of the reference context to create

Returns:

sequences created

Return type:

tuple : (string, string)

entry_to_hash

truvari.entry_to_hash(entry, hasher=<built-in function openssl_sha1>)[source]

Turn variant into a key and hash with provided hasher

Parameters:

entry (method) – entry
hasher – hashing function

Returns:

hash

Return type:

string

entry_to_key

truvari.entry_to_key(entry, prefix='', bounds=False)[source]

Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.

Warning

If a caller redundantly calls a variant exactly the same it will not have a unique key

Parameters:

entry (pysam.VariantRecord) – entry to stringify
prefix (string, optional) – prefix
bounds (bool, optional) – use entry_boundaries

Returns:

hashable string uniquely identifying the variant

Return type:

string

entry_variant_type

truvari.entry_variant_type(entry)[source]

Return entry’s SVTYPE

Note

How svtype is determined:

Starts by trying to use INFO/SVTYPE
If SVTYPE is unavailable, infer if entry is a insertion or deletion by looking at the REF/ALT sequence size differences
If REF/ALT sequences are not available, try to parse the <INS>, <DEL>, etc from the ALT column.
Otherwise, assume ‘UNK’

Parameters:: entry (pysam.VariantRecord)
Returns:: SV type
Return type:: truvari.SV

entry_within

..autofunction:: entry_within

Extra Methods

allele_freq_annos

truvari.allele_freq_annos(entry, samples=None)[source]

Calculate allele annotations for a VCF Entry

Parameters:

entry (pysam.VariantRecord) – Entry with samples to parse
samples (list of strings, optional) – Subset of samples from the entry over which to calculate annos

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count
MAC - minor allele count
AN - number of called alleles

Return type:

dict

Example

>>> import truvari
>>> import pysam
>>> v = pysam.VariantFile('repo_utils/test_files/variants/multi.vcf.gz')
>>> truvari.allele_freq_annos(next(v))
{'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}

bed_ranges

truvari.bed_ranges(bed, chunk_size=10000000)[source]

Chunk bed regions into pieces. Useful for multiprocessing.

Parameters:

bed (string) – Filename of bed to chunk
chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000)
>>> print(len([_ for _ in gen]))
992

build_anno_tree

truvari.build_anno_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]

Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file

By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.

idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”

Parameters:

filename (string) – Path to file to parse, can be compressed
chrom_col (integer, optional) – Index of column in file with chromosome
start_col (integer, optional) – Index of column in file with start
end_col (integer, optional) – Index of column in file with end
one_based (bool, optional) – True if coordinates are one-based
is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior
comment (string, optional) – ignore lines if they start with this string
idxfmt (string, optional) – Index of column in file with chromosome

Returns:

dictionary with chromosome keys and intervaltree.IntervalTree values

Return type:

dict

calc_af

truvari.calc_af(gts)[source]

Calculate allele annotations for a list of genotypes

Parameters:

gts (list) – Genotype tuples

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count for GT 1
MAC - minor allele count
AN - number of called alleles
N_HEMI - Number of partial genotypes (length of 1 or a single missing)
N_HOMREF - homozygous reference GT count
N_HET - heterozygous GT count
N_HOMALT - homozygous alternate GT count
N_MISS - Number of missing genotypes (all .)

Return type:

dict

calc_hwe

truvari.calc_hwe(nref, nalt, nhet)[source]

Calculate Hardy Weinberg equilibrium and excess heterozygosity

Parameters:

nref (int) – Number of reference alleles
nalt (int) – Number of alternate alleles
nhet (int) – Number of heterozygous sites

Returns:

The HWE and ExcHet calculated

Return type:

tuple (float, float)

check_vcf_index

truvari.check_vcf_index(vcf_path)[source]: Return true if an index file is found for the vcf

compress_index_vcf

truvari.compress_index_vcf(fn, fout=None, remove=True)[source]

Compress and index a vcf. If no fout is provided, write to fn + ‘.gz’

Parameters:

fn (string) – filename of vcf to work on
fout (string, optional) – output filename
remove (bool) – remove fn after fout is made

create_pos_haplotype

truvari.create_pos_haplotype(a1, a2, ref, min_len=0)[source]

Create haplotypes of two allele’s regions that are assumed to be overlapping

Parameters:

a1 (tuple) – tuple of chrom, start, end, seq
a2 (tuple) – tuple of chrom, start, end, seq
ref (pysam.FastaFile) – Reference genome
min_len (int, optional) – Minimum length of the haplotype sequence to create

Returns:

allele haplotype sequences created

Rtupe:

tuple (string, string)

get_gt

truvari.get_gt(gt)[source]

Turn a genotype tuple into a GT object

Parameters:: gt (tuple (int, int)) – Genotype tuple
Returns:: A GT of the genotype
Return type:: truvari.GT

Example

>>> import truvari
>>> truvari.get_gt((0, 0))
<GT.REF: 0>
>>> truvari.get_gt((0, 1, 1))
<GT.UNK: 4>

get_scalebin

truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]

Scale variable x from rdomain to tdomain with step sizes return key, index

Parameters:

x (number) – Number to scale
rmin (number, optional) – The minimum of the range of your measurement
rmax (number, optional) – The maximum of the range of your measurement
tmin (number, optional) – The minimum of the range of your desired target scaling
tmax (number, optional) – The maximum of the range of your measurement
step (number, optional) – The step size of bins of target range

Returns:

The bin-string of the scaled variable

Return type:

string

Example

>>> import truvari
>>> truvari.get_scalebin(4, 1, 5, 0, 20, 5)
('[15,20)', 3)
>>> truvari.get_scalebin(6, 1, 5, 0, 20, 5)
('>=20', 4)

get_sizebin

truvari.get_sizebin(sz)[source]

Bin a given size into truvari.SZBINS

Parameters:: sz (integer) – SVLEN to bin into the SZBINS
Returns:: SZBIN
Return type:: string

get_svtype

truvari.get_svtype(svtype)[source]

Turn a SVTYPE string to a truvari.SV object

Parameters:: svtype (string)
Returns:: A SV of the SVTYPE
Return type:: truvari.SV

Example

>>> import truvari
>>> truvari.get_svtype("INS")
<SV.INS: 2>
>>> truvari.get_svtype("foo")
<SV.UNK: 6>

msa2vcf

truvari.msa2vcf(msa, anchor_base='N')[source]

Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string

Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries

Example (for dealing with test coverage not being seen)

>>> import truvari
>>> from truvari.phab import fasta_reader
>>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/"
>>> msa_file = "fm_92be9763c7114fd6ce736a7008114ad0.msa"
>>> seqs = open(msa_dir + msa_file).read()
>>> fasta = dict(fasta_reader(seqs))
>>> m_entries_str = truvari.msa2vcf(fasta)

overlap_percent

truvari.overlap_percent(astart, aend, bstart, bend)[source]

Calculates the percent of range A which overlaps with range B

Parameters:

astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position

Returns:

overlap percent

Return type:

float

overlaps

truvari.overlaps(s1, e1, s2, e2)[source]

Check if two start/end ranges have overlap

Parameters:

s1 (int) – range 1 start
e1 (int) – range 1 end
s2 (int) – range 2 start
e2 (int) – range 2 end

Returns:

True if ranges overlap

Return type:

bool

phab

truvari.phab(var_regions, base_vcf, ref_fn, output_fn, bSamples=None, buffer=100, mafft_params='--auto --thread 1', comp_vcf=None, cSamples=None, prefix_comp=False, threads=1, method='mafft', passonly=True, max_size=50000)[source]

Harmonize variants with MSA. Runs on a set of regions given files to create haplotypes and writes results to a compressed/indexed VCF

Parameters:

var_regions (list) – List of tuples of region’s (chrom, start, end)
base_vcf (str) – VCF file name
ref_fn (str) – Reference file name
output_fn – Output VCF.gz
bSamples (list) – Samples from base_vcf to create haplotypes
buffer (int) – Flanking reference bases to added to regions
mafft_params (str) – Parameters for mafft
comp_vcf (str) – VCF file name
cSamples (list) – Samples from comp_vcf to create haplotypes
prefix_comp (bool) – Ensure unique sample names by prefixing comp samples
threads (int) – Number of threads to use
method (str) – Alignment method to use mafft or wfa

reciprocal_overlap

truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]

Calculates reciprocal overlap of two ranges

Parameters:

astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position

Returns:

reciprocal overlap

Return type:

float

ref_ranges

truvari.ref_ranges(reference, chunk_size=10000000)[source]

Chunk reference into pieces. Useful for multiprocessing.

Parameters:

reference (string) – Filename of reference to chunk
chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000)
>>> print(len([_ for _ in gen]))
1000

seqsim

truvari.seqsim(allele1, allele2)[source]

Calculate similarity of two sequences

Parameters:

allele1 (pysam.VariantRecord) – first entry
allele2 (pysam.VariantRecord) – second entry

Returns:

sequence similarity

Return type:

float

sizesim

truvari.sizesim(sizeA, sizeB)[source]

Calculate the size similarity percent and size diff for two sizes

Parameters:

sizeA (int) – first size
sizeB (int) – second size

Returns:

size similarity percent and size diff (A - B)

Return type:

(float, int)

unroll_compare

truvari.unroll_compare(seqA, seqB, p)[source]

Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details

Parameters:

seqA (string) – sequence upstream sequence
seqB (string) – sequence downstream sequence
p (integer) – how many bases upstream seqA is from seqB

Returns:

sequence similarity of seqA vs seqB after unrolling

Return type:

float

vcf_ranges

truvari.vcf_ranges(vcf, min_dist=1000)[source]

Chunk vcf into discrete pieces. Useful for multiprocessing.

Parameters:

vcf (string) – Filename of vcf to find ranges
min_dist (int, optional) – Minimum distance between entries for new range

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz")
>>> print(len([_ for _ in gen]))
228

Dev methods

benchdir_count_entries

truvari.benchdir_count_entries(benchdir, regions, within=False, threads=4)[source]

Count the number of variants per bed region in Truvari bench directory by state

Returns a pandas dataframe of the counts

chunker

truvari.chunker(matcher, *files)[source]

Given a Matcher and multiple files, zip them and create chunks

Yields tuple of the chunk of calls, and an identifier of the chunk

cmd_exe

truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]

Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.

Parameters:

cmd (string) – The command to be executed.
stdin (bytes) – stdinput to pipe to command
timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never
cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal
pipefail (boolean) – Set to True if the cmd contains pipes |

Returns:

namedtuple of
ret_code - integer, exit code of the command
stdout - string, captured standard output of the command
stderr - binary string, captured standard error of the command
run_time - datetime.timedelta, execution time

Return type:

namedtuple (ret_code, stdout, stderr, run_time)

Example

>>> import truvari
>>> ret = truvari.cmd_exe("echo 'hello world'")
>>> ret.ret_code
0
>>> ret.stdout
'hello world\n'
>>> import truvari
>>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught
>>> ret.ret_code
214

consolidate_phab_vcfs

coords_within

truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]

Returns if a span is within the provided [start, end). All coordinates assumed 0-based

Parameters:

qstart (integer) – query start position
qend (integer) – query end position
start (integer) – start of span
end (integer) – end of span
end_within (bool) – if true, qend <= rend, else qend < rend

Returns:

If the coordinates are within the span

Return type:

bool

count_entries

truvari.count_entries(vcf, chroms, regions, within)[source]: Count the number of variants per bed region a VCF regions is a list of lists with sub-lists having 0:3 of chrom, start, end Returns a list of the counts in the same order as the regions

file_zipper

truvari.file_zipper(*start_files)[source]

Zip files to yield the entries in order. Each file must be sorted in the same order. start_files is a tuple of (‘key’, iterable) where key is the identifier (so we know which file the yielded entry came from) and iterable is usually a pysam.VariantFile

yields key, pysam.VariantRecord

help_unknown_cmd

truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]

Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.

Parameters:

user_cmd (string) – user command
avail_cmds (list of strings) – available commands
threshold (float, optional) – max score (0-1) to give a guess

Returns:

best guess from :param: avail_cmds or None

Return type:

string

Example

>>> import truvari
>>> truvari.help_unknown_cmd('banno', ["bench", "anno"])
'anno'
>>> truvari.help_unknown_cmd('random', ["bench", "anno"])

make_temp_filename

truvari.make_temp_filename(tmpdir=None, suffix='')[source]: Get a random filename in a tmpdir with an optional extension

opt_gz_open

truvari.opt_gz_open(in_fn)[source]

Chooses file handler for plain-text files or *.gz files.

returns a generator which yields lines of the file

optimize_df_memory

truvari.optimize_df_memory(df)[source]

Optimize DataFrame memory by trying to downcast every column in-place. Returns the pre/post size

Parameters:: df (pandas.DataFrame) – Dataframe to optimize
Returns:: tuple (pre_size, post_size)
Return type:: int, sizes in bytes

performance_metrics

truvari.performance_metrics(tpbase, tp, fn, fp)[source]

Calculates precision, recall, and f1 given counts by state Can return None if values are zero

Parameters:

tpbase (int) – found base count
tp (int) – found comp count
fn (int) – missed base count
fp (int) – missed comp count

Returns:

tuple of precision, recall, f1

Return type:

tuple, float

region_filter

truvari.region_filter(vcf, tree, inside=True, with_region=False)[source]: Chooses to stream or fetch entries inside/outside a VCF If the VCF is over 25Mb or the number of regions is above 1k, use stream

restricted_float

truvari.restricted_float(x)[source]

Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with argparse.ArgumentParser.add_argument type parameter

Parameters:: x (float) – number to check
Returns:: input number
Return type:: float

Example

>>> import truvari
>>> truvari.restricted_float(0.5)
0.5
>>> truvari.restricted_float(2)
Traceback (most recent call last):
argparse.ArgumentTypeError: 2.0 not in range (0, 1)

restricted_int

truvari.restricted_int(x)[source]

Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with argparse.ArgumentParser.add_argument type parameter

Parameters:: x (int) – number to check
Returns:: input number
Return type:: float

Example

>>> import truvari
>>> truvari.restricted_int(5)
5
>>> truvari.restricted_int(-2)
Traceback (most recent call last):
argparse.ArgumentTypeError: -2 is < 0

setup_logging

truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]

Create default logger

Parameters:

debug (boolean, optional) – Set log-level to logging.DEBUG
stream (file handler, optional) – Where log is written
log_format (string, optional) – Format of log lines
show_version (bool, optional) – Start log with truvari version and command line args

vcf_to_df

truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]

Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.

Parameters:

fn (string) – File name of VCF to open and turn into a DataFrame
with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns
with_format – Add the FORMAT fields from the VCF to the DataFrame columns
sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True
no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name
alleles (boolean, optional) – Add column with allele sequences

Returns:

Converted VCF

Return type:

pandas.DataFrame

Example

>>> import truvari
>>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True)
>>> df.columns
Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual',
       'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN',
       'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom',
       'NA12878_AD_ref', 'NA12878_AD_alt'],
      dtype='object')

Objects

Bench

class truvari.Bench(matcher=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False, short_circuit=False)[source]

Object to perform operations of truvari bench

You can build a Bench worker with default matching parameters via:

m_bench = truvari.Bench()

If you’d like to customize the parameters, build and edit a Matcher and pass it to the Bench init

matcher = truvari.Matcher()
matcher.params.pctseq = 0.50
m_bench = truvari.Bench(matcher)

To run on a chunk of pysam.VariantRecord already loaded, pass them in as lists to:

matches = m_bench.compare_calls(base_variants, comp_variants)

For advanced parsing, one can build the match matrix for a chunk of calls with:

match_matrix = m_bench.build_matrix(base_variants, comp_variants)

This can then be used for things such as creating custom match pickers or adding new matching checks.

If you want to run on existing files, simply supply the arguments to init

m_bench = truvari.Bench(matcher, base_vcf, comp_vcf, outdir)
output = m_bench.run()

Note that running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.

build_matrix(base_variants, comp_variants, chunk_id=0, skip_gt=False)[source]: Builds MatchResults, returns them as a numpy matrix

check_refine_candidate(result)[source]: Adds this region as a candidate for refinement if there are unmatched variants

compare_calls(base_variants, comp_variants, chunk_id=0)[source]: Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults

compare_chunk(chunk)[source]: Given a filtered chunk, (from chunker) compare all of the calls

param_dict()[source]: Returns the parameters as a dict

run()[source]: Runs bench and returns the resulting truvari.BenchOutput

BenchOutput

class truvari.BenchOutput(bench, matcher)[source]

Makes all of the output files for a Bench.run

The variable BenchOutput.vcf_filenames holds a dictonary. The keys are tpb, tpc for true positive base/comp vcf filename and fn, fp. The variable stats_box holds a StatsBox.

close_outputs()[source]: Close all the files

write_match(match)[source]: Annotate a MatchResults’ entries then write to the apppropriate file and do the stats counting. Writer is responsible for handling FPs between sizefilt-sizemin

StatsBox

class truvari.StatsBox[source]

Make a blank stats box for counting TP/FP/FN and calculating performance

calc_performance()[source]: Calculate the precision/recall

clean_out()[source]: When reusing a StatsBox (typically inside refine), gt numbers are typically invalidated. This removes those numbers from self to make a cleaner report

write_json(out_name)[source]: Write stats as json to file

GT

class truvari.GT(value)[source]

Genotypes

REF = <GT.REF: 0>
HET = <GT.HET: 1>
HOM = <GT.HOM: 2>
NON = <GT.NON: 3> - Non-Genotyped (e.g. ./.)
UNK = <GT.UNK: 4> - Undetermined Genotype

RegionVCFIterator

LogFileStderr

class truvari.LogFileStderr(fn)[source]

Write to stderr and a file

Useful in conjunction with setup_logging()

Example

>>> import truvari
>>> truvari.setup_logging(stream=LogFileStderr('log.txt'))

flush()[source]: Flush handlers

write(*args)[source]: Write to handlers

MatchResult

class truvari.MatchResult[source]

A base/comp match holder

calc_score()[source]: Unite the similarity measures and make a score

Matcher

class truvari.Matcher(args=None)[source]

Holds matching parameters. Allows calls to be checked for filtering and matches to be made

Example

>>> import pysam
>>> import truvari
>>> mat = truvari.Matcher()
>>> mat.params.pctseq = 0
>>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> one = next(v); two = next(v)
>>> mat.build_match(one, two)
<truvari.bench.MatchResult (False 2.381)>

Look at Matcher.make_match_params() for a list of all params and their defaults

build_match(base, comp, matid=None, skip_gt=False, short_circuit=False)[source]: Build a MatchResult if skip_gt, don’t do genotype comparison if short_circuit, return after first failure

filter_call(entry, base=False)[source]: Returns True if the call should be filtered Base has different filtering requirements, so let the method know

static make_match_params()[source]: Makes a simple namespace of matching parameters. Holds defaults

static make_match_params_from_args(args)[source]: Makes a simple namespace of matching parameters

SV

class truvari.SV(value)[source]

SVtypes

SNP = <SV.SNP: 0>
DEL = <SV.DEL: 1>
INS = <SV.INS: 2>
DUP = <SV.DUP: 3>
INV = <SV.INV: 4>
NON = <SV.NON: 5> - Not a variant (monomorphic ref?)
UNK = <SV.UNK: 6> - Unknown type

Data

HEADERMAT

regular expression of vcf header INFO/FORMAT fields with groups

truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)'): Compiled regular expression object.

QUALBINS

0-100 quality score bin strings (step size 10)

truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SVTYTYPE

pandas.CategoricalDtype of truvari.SV

SZBINMAX

integer list of maximum size for size bins

truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINS

string list of size bins

truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINTYPE

pandas.CategoricalDtype of truvari.SZBINS