Truvari package

Overview

Truvari - SV comparison and annotation toolkit

See help() of specific functions / objects for details

Objects:

Bench BenchOutput GT LogFileStderr MatchResult StatsBox SV VariantFile VariantRecord VariantParams

Extra methods:

bed_ranges() benchdir_count_entries() best_seqsim() build_region_tree() check_vcf_index() chunker() cmd_exe() compress_index_vcf() coords_within() count_entries() extend_region_tree() file_zipper() help_unknown_cmd() get_gt() get_scalebin() get_sizebin() get_svtype() make_temp_filename() merge_region_tree_overlaps() msa2vcf() opt_gz_open() optimize_df_memory() overlap_percent() overlaps() performance_metrics() read_bed_tree() reciprocal_overlap() restricted_float() restricted_int() ref_ranges() roll_seqsim() seqsim() setup_logging() sizesim() unroll_seqsim() vcf_ranges() vcf_to_df()

Data:

truvari.HEADERMAT truvari.QUALBINS truvari.SVTYTYPE truvari.SZBINMAX truvari.SZBINS truvari.SZBINTYPE

Variant Handling

class truvari.VariantFile(filename, *args, params=None, **kwargs)[source]

Wrapper around pysam.VariantFile with helper functions for iteration.

fetch(*args, **kwargs)[source]

Fetch variants from the pysam.VariantFile, wrapping them into truvari.VariantRecords.

Parameters:
  • args – Positional arguments for the pysam.VariantFile.fetch method.

  • kwargs – Keyword arguments for the pysam.VariantFile.fetch method.

Returns:

Iterator of truvari.VariantRecord objects.

Return type:

iterator

fetch_bed(bed_fn, inside=True, with_region=False)[source]

Fetch variants from the VCF based on regions defined in a BED file.

Parameters:
  • bed_fn (str) – Path to the BED file defining regions of interest.

  • inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.

  • with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.

Returns:

Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).

Return type:

iterator

fetch_regions(tree, inside=True, with_region=False)[source]

Fetch variants from the VCF based on regions defined in a tree of chrom:IntervalTree.

Parameters:
  • tree (dict) – Tree of chrom:IntervalTree defining regions of interest.

  • inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.

  • with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.

Returns:

Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).

Return type:

iterator

write(record)[source]

Write a truvari.VariantRecord to the pysam.VariantFile.

Parameters:

record (truvari.VariantRecord) – The truvari.VariantRecord to be written.

class truvari.VariantRecord(record, params=None)[source]

Wrapper around pysam.VariantRecords with helper functions of variant properties and basic comparisons

allele_freq_annos(samples=None)[source]

Calculate allele annotations for a VCF Entry

Parameters:

samples (list of strings, optional) – Subset of samples from the entry over which to calculate annos

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count
MAC - minor allele count
AN - number of called alleles

Return type:

dict

Example
>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/multi.vcf.gz')
>>> e = next(v)
>>> e.allele_freq_annos()
{'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}
bnd_direction_strand()[source]

Parses a BND ALT string to determine its direction and strand.

A BND (breakend) ALT string indicates a structural variant breakpoint. This method parses the ALT string to determine:

  • The direction: “left” means the piece is anchored on the left side of the breakpoint, while “right” means it’s anchored on the right.

  • The strand: “direct” indicates the base is on the direct strand, and “complement” indicates the base is on the complement strand.

Note

This method assumes that self.is_bnd() is True, meaning the variant is a BND-type structural variant.

Returns:

A tuple containing the direction (“left” or “right”) and the strand (“direct” or “complement”).

Return type:

tuple (str, str)

Raises:

ValueError – If the ALT string does not follow the expected BND format.

bnd_match(other)[source]

Build a MatchResult for bnds

bnd_position()[source]

Extracts the chromosome and position from a BND ALT string.

Breakend (BND) ALT strings indicate structural variant breakpoints that span across chromosomes or positions. This method parses the ALT string to extract the target chromosome and position of the breakpoint.

Returns:

A tuple containing the chromosome (as a string) and the position (as an integer).

Return type:

tuple (str, int)

Raises:

ValueError – If the ALT string does not follow the expected BND format.

boundaries(ins_inflate=False)[source]

Get the start and end of an entry

Parameters:

ins_inflate (bool, optional) – inflate INS boundaries by sv length

Returns:

start/end

Return type:

tuple (int, int)

compare_gts(other, match)[source]

Populates genotype-specific comparison details in a MatchResult.

This method compares the genotypes of a “base” (reference) variant and a “comparison” variant and updates the provided MatchResult object with the results. It computes the genotype counts and determines the difference between the base and comparison genotypes.

Parameters:
  • other (truvari.VariantRecord) – The other variant entry.

  • match (truvari.MatchResult) – The MatchResult object to update with genotype comparison details.

Updates the MatchResult object with the following attributes:
  • base_gt: The genotype of the base sample.

  • base_gt_count: The count of the reference allele (1) in the base genotype.

  • comp_gt: The genotype of the comparison sample.

  • comp_gt_count: The count of the reference allele (1) in the comparison genotype.

  • gt_match: The absolute difference between base_gt_count and comp_gt_count.

copy()[source]

Copies this truvari.VariantRecord

cpx_match(other)[source]

Attempt to match non-BND variants with a BND. This is performed by decomposing the non-BND into its BND representations before calling bnd_match. This will make at least two BNDs, so the MatchResults are sorted and the greater is returned.

decompose()[source]

Decompose SVs into BND Records

This method decomposes structural variants (SVs) of <DEL>, <DUP>, and <INV> into Breakend (BND) records. The decomposed variants are stored internally and returned as a list.

Returns:

list: A list of new BND variant records. List will be empty when no valid decomposition

Notes:
  • For INV, four BNDs are created.

  • For DEL and DUP, two BNDs are created.

  • DUP are assumed to be of type DUP:TANDEM.

distance(other)[source]

Calculate the start and end distances of the pair. Negative distances indicate self is upstream of other

Parameters:

other (truvari.VariantRecord) – Other to compare

Returns:

starts and ends distance

Return type:

tuple (int, int)

filter_call(base=False)[source]

Determines whether a variant call should be filtered based on Truvari parameters or specific requirements.

This method evaluates a variant entry (entry) and checks if it should be excluded from further processing based on filtering criteria such as monomorphic reference, multi-allelic records, filtering status, sample presence, or unsupported single-end BNDs.

Parameters:
  • entry (truvari.VariantRecord) – The variant entry to evaluate.

  • base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call or the “comparison” call. Filtering behavior may differ based on this flag.

Returns:

True if the variant should be filtered (excluded), otherwise False.

Return type:

bool

Raises:

ValueError – If the entry is multi-allelic and check_multi is enabled in the Truvari parameters.

Filtering Logic:
  • Monomorphic Reference: If check_monref is enabled and the entry is a monomorphic reference, it is filtered.

  • Multi-Allelic Records: If check_multi is enabled and the entry is multi-allelic, an error is raised.

  • Filtered Variants: If passonly is enabled and the entry is flagged as filtered, it is excluded.

  • Sample Presence: If no_ref is set to include the entry’s type (base or comparison) or pick == ‘ac’, the sample must be present in the entry.

  • Single-End BNDs: Single-end BNDs are always excluded.

filter_size(base=False)[source]

Determines whether a variant entry should be filtered based on its size.

This method evaluates the size of a variant and checks if it falls outside the specified size thresholds. Filtering criteria depend on whether the entry is a “base” (reference) or “comparison” call.

Parameters:
  • entry (truvari.VariantRecord) – The variant entry to evaluate.

  • base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call. If True, the sizemin parameter is used as the minimum size threshold. Otherwise, the sizefilt parameter is used.

Returns:

True if the variant should be filtered due to its size, otherwise False.

Return type:

bool

Filtering Logic:
  • Minimum Size (Base): If base=True and the size is less than sizemin, the variant is filtered.

  • Minimum Size (Comparison): If base=False and the size is less than sizefilt, the variant is filtered.

  • Maximum Size: If the size exceeds sizemax, the variant is filtered.

get_alt()[source]

Returns either the resolved ALT or _record.alt[0]

get_record()[source]

Return internal pysam.VariantRecord

get_ref()[source]

Returns either the resolved REF or _record.ref

gt(sample=0)[source]

Returns the raw GT field from a record for a specific sample. Returns None if GT is not in FORMAT

is_bnd()[source]

Returns if a record is a resolved BND

is_filtered(values=None)[source]

Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’

Parameters:

values (set, optional) – set of filter values for intersection

Returns:

True if entry should be filtered

Return type:

bool

Example
>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> e = next(v)
>>> e.is_filtered() # PASS shouldn't be filtered
False
>>> e = next(v)
>>> e.is_filtered(["lowQ"]) # Call isn't lowQ, so filter
True
is_monrefstar()[source]

Returns if a record is a reference monozygotic or star-alt

is_multi()[source]

Returns if a record is multi-allelic

is_present(sample=0, allow_missing=True)[source]

Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present

Parameters:

sample (string, optional) – sample name

Returns:

True if variant is present in the sample

Return type:

bool

Example
>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> e = next(v)
>>> e.is_present(allow_missing=False)
True
is_resolved()[source]

Checks if an SV is a sequence resolved by checking it is has an alt, is not symbolic, is not monref, is not BND

Returns:

is_resolved

Return type:

bool

is_single_bnd()[source]

Returns if a record is a single-end BND

is_symbolic()[source]

Returns if a record is a symbolic variant (e.g. ALT == <DEL>)

match(other)[source]

Build a MatchResult from comparison of two VariantRecords If self and other are non-bnd, calls VariantRecord.var_match, If self and other are bnd, calls VariantRecord.bnd_match If decompose is on, calls VariantRecord.cpx_match Otherwise, returns a default MatchResult

move_record(out_vcf, take_samples=None, into_samples=None)[source]

Similar to pysam.VariantRecord.translate, except this allows adding new FORMAT fields and optionally selects a subset of samples

recovl(other, ins_inflate=True)[source]

Calculates reciprocal overlap of two entries

Parameters:
  • other (truvari.VariantRecord) – Other entry to compare with

  • ins_inflate (bool) – inflate entry boundaries

Returns:

The reciprocal overlap

Return type:

float

Example
>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> a.recovl(b)
0
resolve(ref)[source]

Attempts to resolve the REF/ALT sequences of a structural variant (SV) and updates the instance attributes.

This method tries to reconstruct the REF and ALT sequences for specific types of structural variants (e.g., deletions, inversions, duplications) based on the provided reference genome sequence. If successful, it stores the resolved sequences in self._resolved_ref and self._resolved_alt, and may adjust self.end.

Parameters:

ref (pysam.FastaFile) – The reference genome used to resolve the structural variant. Typically a pysam.FastaFile or equivalent object providing access to reference sequences.

Returns:

True if the REF/ALT sequences are successfully resolved, otherwise False.

Return type:

bool

Updates:
  • self._resolved_ref: The resolved REF sequence.

  • self._resolved_alt: The resolved ALT sequence.

  • self.end: May be adjusted for specific variant types (e.g., duplications).

Resolution Logic:
  • If ref is None, the ALT is <CNV> or <INS>, or the start position is invalid, the method returns False.

  • The method fetches the sequence from the reference genome using the variant’s chrom, start, and end.

  • The resolution depends on the variant type:
    • Deletion (`DEL`): The REF sequence is the fetched reference sequence, and the ALT is the first base.

    • Inversion (`INV`): The REF sequence is the fetched reference sequence, and the ALT is the reverse complement of the REF.

    • Duplication (`DUP`) (when self.params.dup_to_ins is enabled): The REF is the first base of the fetched sequence, and the ALT is the full sequence. The end is adjusted to be start + 1.

  • Returns False if the variant type is unsupported or cannot be resolved.

same_type(other)[source]

Check if self.var_type() == other.var_type() with extra handling for dup-to-ins

Parameters:
  • other (truvari.VariantRecord) – Other entry to compare with

  • dup_to_ins (bool) – Convert DUP to INS types

Returns:

True if entry SVTYPEs match

Return type:

bool

seqsim(other)[source]

Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.

Assumes that is either a sequence resolved variant or self.resolve has been called.

Parameters:
  • other (truvari.VariantRecord) – Other to compare with

  • roll (bool) – compare the lexicographically minimum rotation of sequences

Returns:

sequence similarity

Return type:

float

sizesim(other)[source]

Calculate the size similarity and difference for two entries

Parameters:

other (truvari.VariantRecord) – Other entry to compare with

Returns:

size similarity and size diff (A - B)

Return type:

(float, int)

Example
>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> a.sizesim(b)
(0.07142857142857142, 13)
to_hash(hasher=<built-in function openssl_sha1>)[source]

Turn variant into a key and hash with provided hasher

Parameters:

hasher – hashing function

Returns:

hash

Return type:

string

to_key(prefix='', bounds=False)[source]

Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.

Warning

If a caller redundantly calls a variant exactly the same it will not have a unique key

Parameters:
  • prefix (string, optional) – prefix

  • bounds (bool, optional) – use entry_boundaries

Returns:

hashable string uniquely identifying the variant

Return type:

string

var_match(other)[source]

Build a MatchResult

var_size()[source]

Determine the size of the variant.

Note

How size is determined

  • Starts by trying to use INFO/SVLEN

  • If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.

  • If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF

  • Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))

Returns:

the entry’s size

Return type:

int

var_type()[source]

Return entry’s SVTYPE

Note

the first of these checks that succeeds determins the svtype

  • BND if pysam.VariantRecord.alleles_variant_types[1] == “BND”

  • Checking INFO/SVTYPE

  • if is_resolved, REF and ALT seq size difference

  • if symbolic (e.g. <DEL>) parse the ALT. Subtypes are ignored (e.g. DUP:ME becomes DUP)

  • svtype is unknown (UNK)

Returns:

SV type

Return type:

truvari.SV

within(rstart, rend)[source]

Extract entry boundaries and type to call truvari.coords_within

within_tree(tree)[source]

Extract entry and tree boundaries to call truvari.coords_within

class truvari.VariantParams(args=None, **kwargs)[source]

Holds variant parsing and matching parameters.

Attributes

Attribute

Description

refdist

Distance threshold for comparing positions in the reference genome. Default: 500.

pctseq

Minimum percentage of sequence similarity required for a match. Default: 0.70 (70%).

pctsize

Minimum percentage of size similarity required for a match. Default: 0.70 (70%).

pctovl

Minimum percentage of reciprocal overlap required for comparing variants. Default: 0.0 (disabled).

typeignore

Whether to ignore variant type mismatches during comparison. Default: False.

no_roll

Whether to disable rolling of sequences for comparisons. Default: False.

chunksize

Number of entries to process in each chunk. Default: 1000.

bSample

Sample name or index for the “base” (a.k.a. self) variants during comparisons. Default: 0.

cSample

Sample name or index for the “comparison” (a.k.a. other) variants during comparisons. Default: 0.

dup_to_ins

Whether to treat duplications as insertions for some operations. Default: False.

bnddist

Maximum allowed distance for breakend (BND) comparisons. Default: 100.

sizemin

Minimum variant size to consider. Default: 50.

sizefilt

Minimum size filter for comparison in the “comparison” dataset. Default: 30.

sizemax

Maximum variant size to consider. Default: -1, off.

passonly

Whether to only consider variants with a “PASS” filter status. Default: False.

no_ref

Whether to ignore reference homozygous variants in (a)ll, (b)ase, or (c)omp VCF Default: False (off)`.

pick

Strategy for picking matches by Bench (single, ac, multi).

ignore_monref

Whether to ignore monoallelic reference calls. Default: True.

check_multi

Whether to check for and handle multi-allelic records. Default: True.

check_monref

Whether to check for monoallelic reference calls. Default: True.

no_single_bnd

Whether to exclude single-end breakends (BNDs) from comparisons. Default: True.

write_resolved

Whether to write resolved REF/ALT sequences to output. Default: False.

decompose

Decompose symbolic variants (e.g. <DEL>) into BNDs when comparing with BNDs. Default: True

short_circuit

Whether to enable short-circuit logic for early exits in comparisons. Default: False.

skip_gt

Whether to skip genotype comparisons. Default: False.

max_resolve

Maximum size of a variant to attempt sequence resolving. Default: 25000.

Objects

class truvari.MatchResult[source]

Holds results from a matching operation

Attributes

Attribute

Description

base

The base (a.k.a. self) variant

comp

The comp (a.k.a. other) variant

state

Boolean of if variants match

seqsim

Sequence similarity of variants

sizesim

Size similarity of variants

ovlpct

Reciprocal overlap ov variants

sizediff

Base variant var_size minus comp variant var_size

st_dist

Base variant start position minus comp variant start position

ed_dist

Base variant end position minus comp variant end position

gt_match

Boolean of if genotypes match

score

TruScore of matches

matid

Place to put MatchIds, not populated by truvari.VariantRecord.match

calc_score()[source]

Unite the similarity measures and make a score

class truvari.GT(value)[source]

Genotypes

  • REF = <GT.REF: 0>

  • HET = <GT.HET: 1>

  • HOM = <GT.HOM: 2>

  • NON = <GT.NON: 3> - Non-Genotyped (e.g. ./.)

  • UNK = <GT.UNK: 4> - Undetermined Genotype

class truvari.SV(value)[source]

SVtypes

  • SNP = <SV.SNP: 0>

  • DEL = <SV.DEL: 1>

  • INS = <SV.INS: 2>

  • DUP = <SV.DUP: 3>

  • INV = <SV.INV: 4>

  • NON = <SV.NON: 5> - Not a variant (monomorphic ref?)

  • UNK = <SV.UNK: 6> - Unknown type

class truvari.Bench(params=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False)[source]

Object to perform operations of truvari bench

You can build a Bench worker with default matching parameters via:

m_bench = truvari.Bench()

If you’d like to customize the parameters, build and edit a VariantParams and pass it to the Bench init

params = truvari.VariantParams(pctseq=0.50)
m_bench = truvari.Bench(params)

To run on a chunk of truvari.VariantRecord already loaded, pass them in as lists to:

matches = m_bench.compare_calls(base_variants, comp_variants)

For advanced parsing, one can build the match matrix for a chunk of calls with:

match_matrix = m_bench.build_matrix(base_variants, comp_variants)

This can then be used for things such as creating custom match pickers or adding new matching checks.

If you want to run on existing files, simply supply the arguments to init

m_bench = truvari.Bench(params, base_vcf, comp_vcf, outdir)
output = m_bench.run()

Note

Running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.

build_matrix(base_variants, comp_variants, chunk_id=0)[source]

Builds MatchResults, returns them as a numpy matrix

check_refine_candidate(result)[source]

Adds this region as a candidate for refinement if there are unmatched variants

compare_calls(base_variants, comp_variants, chunk_id=0)[source]

Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults

compare_chunk(chunk)[source]

Given a filtered chunk, (from chunker) compare all of the calls

param_dict()[source]

Returns the parameters as a dict

run()[source]

Runs bench and returns the resulting truvari.BenchOutput

class truvari.BenchOutput(bench, params)[source]

Makes all of the output files for a Bench.run

The variable BenchOutput.vcf_filenames holds a dictonary. The keys are tpb, tpc for true positive base/comp vcf filename and fn, fp. The variable stats_box holds a StatsBox.

close_outputs()[source]

Close all the files

write_match(match)[source]

Annotate a MatchResults’ entries then write to the apppropriate file and do the stats counting. Writer is responsible for handling FPs between sizefilt-sizemin

class truvari.StatsBox[source]

Make a blank stats box for counting TP/FP/FN and calculating performance

calc_performance()[source]

Calculate the precision/recall

write_json(out_name)[source]

Write stats as json to file

class truvari.LogFileStderr(fn)[source]

Write to stderr and a file

Useful in conjunction with setup_logging()

Example
>>> import truvari
>>> truvari.setup_logging(stream=LogFileStderr('log.txt'))
flush()[source]

Flush handlers

write(*args)[source]

Write to handlers

Extra Methods

truvari.bed_ranges(bed, chunk_size=10000000)[source]

Chunk bed regions into pieces. Useful for multiprocessing.

Parameters:
  • bed (string) – Filename of bed to chunk

  • chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000)
>>> print(len([_ for _ in gen]))
992
truvari.benchdir_count_entries(benchdir, regions, within=True, threads=4)[source]

Count the number of variants per bed region in Truvari bench directory by state

Returns a pandas dataframe of the counts

truvari.best_seqsim(a_seq, b_seq, st_dist)[source]

Returns best of roll, unroll, and direct sequence similarity

Warning

roll_seqsim is only called when both sequences are < 500bp in length

truvari.build_region_tree(vcfA, vcfB=None, includebed=None)[source]

Builds a dictionary of chromosome names mapped to IntervalTree objects containing regions based on the input VCF files and an optional BED file.

Parameters

vcfApysam.VariantFile

The primary VCF file used to define the regions and contigs.

vcfBpysam.VariantFile, optional

The secondary VCF file used for comparison. If provided, its contigs are considered to exclude any contigs not present in vcfA. Default is None.

includebedstr, optional

Path to a BED file specifying regions to include. If provided, these regions override those defined by the VCF files. Contigs in the BED file not found in vcfA are excluded. Default is None.

Returns

dict

A dictionary where keys are chromosome names (str), and values are IntervalTree objects representing the regions.

Notes

  • If includebed is provided:
    • The function reads regions from the BED file and uses them to define the regions in all_regions.

    • Contigs in the BED file that are not present in vcfA are excluded, and a warning is logged.

  • Contigs in vcfB that are not present in vcfA are ignored, and a warning is logged.

  • If a contig in vcfA lacks a length definition, the program logs an error and exits with code 10.

Raises

SystemExit

If a contig in vcfA has no length defined, the program terminates with an error message and exit code 10.

Logging

  • Logs warnings if contigs in vcfB are not found in vcfA.

  • Logs warnings if contigs in the BED file are not found in vcfA.

  • Logs an error and exits if any contig in vcfA lacks a length definition.

truvari.read_bed_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]

Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file

By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.

idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”

Parameters:
  • filename (string) – Path to file to parse, can be compressed

  • chrom_col (integer, optional) – Index of column in file with chromosome

  • start_col (integer, optional) – Index of column in file with start

  • end_col (integer, optional) – Index of column in file with end

  • one_based (bool, optional) – True if coordinates are one-based

  • is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior

  • comment (string, optional) – ignore lines if they start with this string

  • idxfmt (string, optional) – Index of column in file with chromosome

Returns:

dictionary with chromosome keys and intervaltree.IntervalTree values, number of lines parsed

Return type:

tuple (dict, int)

truvari.check_vcf_index(vcf_path)[source]

Return true if an index file is found for the vcf

truvari.chunker(params, *files)[source]

Given a truvari.VariantParams and multiple files, zip them and create chunks

Yields tuple of the chunk of calls, and an identifier of the chunk

truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]

Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.

Parameters:
  • cmd (string) – The command to be executed.

  • stdin (bytes) – stdinput to pipe to command

  • timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never

  • cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal

  • pipefail (boolean) – Set to True if the cmd contains pipes |

Returns:

namedtuple of
ret_code - integer, exit code of the command
stdout - string, captured standard output of the command
stderr - binary string, captured standard error of the command
run_time - datetime.timedelta, execution time

Return type:

namedtuple (ret_code, stdout, stderr, run_time)

Example
>>> import truvari
>>> ret = truvari.cmd_exe("echo 'hello world'")
>>> ret.ret_code
0
>>> ret.stdout
'hello world\n'
>>> import truvari
>>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught
>>> ret.ret_code
214
truvari.compress_index_vcf(fn, fout=None, remove=True)[source]

Compress and index a vcf. If no fout is provided, write to fn + ‘.gz’

Parameters:
  • fn (string) – filename of vcf to work on

  • fout (string, optional) – output filename

  • remove (bool) – remove fn after fout is made

truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]

Returns if a span is within the provided [start, end). All coordinates assumed 0-based One exception is made for REPL types with anchor bases. This is for TR callers and GIABTR benchmark where variants will span the entire includebed region.

Parameters:
  • qstart (integer) – query start position

  • qend (integer) – query end position

  • start (integer) – start of span

  • end (integer) – end of span

  • end_within (bool) – if true, qend <= rend, else qend < rend

Returns:

If the coordinates are within the span

Return type:

bool

truvari.count_entries(vcf, chroms, regions, within)[source]

Count the number of variants per bed region a VCF regions is a list of lists with sub-lists having 0:3 of chrom, start, end Returns a list of the counts in the same order as the regions

truvari.extend_region_tree(tree, pad)[source]

Extends all intervals by a fixed number of bases on each side Returns a copy of this tree

truvari.file_zipper(*start_files)[source]

Zip multiple files to yield their entries in order.

The function takes as input tuples of (key, iterable), where:

  • key is an identifier (used to track which file the yielded entry comes from).

  • iterable is an iterable object, typically a truvari.VariantFile.

The function iterates through all input files in a coordinated manner, yielding the entries in order.

Parameters:

start_files (tuple) – A variable-length argument list of tuples (key, iterable).

Yields:

A tuple of (key, truvari.VariantRecord), where key is the file identifier and the second element is the next record from the corresponding file.

Return type:

tuple

Raises:

StopIteration – Raised when all input files have been exhausted.

Logs:
  • Logs a summary of the zipping process after all files have been processed.

truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]

Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.

Parameters:
  • user_cmd (string) – user command

  • avail_cmds (list of strings) – available commands

  • threshold (float, optional) – max score (0-1) to give a guess

Returns:

best guess from :param: avail_cmds or None

Return type:

string

Example
>>> import truvari
>>> truvari.help_unknown_cmd('banno', ["bench", "anno"])
'anno'
>>> truvari.help_unknown_cmd('random', ["bench", "anno"])
truvari.get_gt(gt)[source]

Turn a genotype tuple into a GT object

Parameters:

gt (tuple (int, int)) – Genotype tuple

Returns:

A GT of the genotype

Return type:

truvari.GT

Example
>>> import truvari
>>> truvari.get_gt((0, 0))
<GT.REF: 0>
>>> truvari.get_gt((0, 1, 1))
<GT.UNK: 4>
truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]

Scale variable x from rdomain to tdomain with step sizes return key, index

Parameters:
  • x (number) – Number to scale

  • rmin (number, optional) – The minimum of the range of your measurement

  • rmax (number, optional) – The maximum of the range of your measurement

  • tmin (number, optional) – The minimum of the range of your desired target scaling

  • tmax (number, optional) – The maximum of the range of your measurement

  • step (number, optional) – The step size of bins of target range

Returns:

The bin-string of the scaled variable

Return type:

string

Example
>>> import truvari
>>> truvari.get_scalebin(4, 1, 5, 0, 20, 5)
('[15,20)', 3)
>>> truvari.get_scalebin(6, 1, 5, 0, 20, 5)
('>=20', 4)
truvari.get_sizebin(sz)[source]

Bin a given size into truvari.SZBINS

Parameters:

sz (integer) – SVLEN to bin into the SZBINS

Returns:

SZBIN

Return type:

string

truvari.get_svtype(svtype)[source]

Turn a SVTYPE string to a truvari.SV object

Parameters:

svtype (string)

Returns:

A SV of the SVTYPE

Return type:

truvari.SV

Example
>>> import truvari
>>> truvari.get_svtype("INS")
<SV.INS: 2>
>>> truvari.get_svtype("foo")
<SV.UNK: 7>
truvari.make_temp_filename(tmpdir=None, suffix='')[source]

Get a random filename in a tmpdir with an optional extension

truvari.merge_region_tree_overlaps(tree)[source]

Runs IntervalTree.merge_overlaps on all trees. Info Logs the count of chromosomes with overlapping regions and Debug Logs the chromosome names with overlapping regions.

truvari.msa2vcf(msa, anchor_base='N')[source]

Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string

Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries

Example (for dealing with test coverage not being seen)
>>> import truvari
>>> from truvari.phab import fasta_reader
>>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/"
>>> msa_file = "fm_03ffe346bf2196b0ceebee5beb9a6acb.msa"
>>> seqs = open(msa_dir + msa_file).read()
>>> fasta = dict(fasta_reader(seqs))
>>> m_entries_str = truvari.msa2vcf(fasta)
truvari.opt_gz_open(in_fn)[source]

Chooses file handler for plain-text files or *.gz files.

returns a generator which yields lines of the file

truvari.optimize_df_memory(df)[source]

Optimize DataFrame memory by trying to downcast every column in-place. Returns the pre/post size

Parameters:

df (pandas.DataFrame) – Dataframe to optimize

Returns:

tuple (pre_size, post_size)

Return type:

int, sizes in bytes

truvari.overlap_percent(astart, aend, bstart, bend)[source]

Calculates the percent of range A which overlaps with range B

Parameters:
  • astart (int) – First range’s start position

  • aend (int) – First range’s end position

  • bstart (int) – Second range’s start position

  • bend (int) – Second range’s end position

Returns:

overlap percent

Return type:

float

truvari.overlaps(s1, e1, s2, e2)[source]

Check if two start/end ranges have overlap

Parameters:
  • s1 (int) – range 1 start

  • e1 (int) – range 1 end

  • s2 (int) – range 2 start

  • e2 (int) – range 2 end

Returns:

True if ranges overlap

Return type:

bool

truvari.performance_metrics(tpbase, tp, fn, fp)[source]

Calculates precision, recall, and f1 given counts by state Can return None if values are zero

Parameters:
  • tpbase (int) – found base count

  • tp (int) – found comp count

  • fn (int) – missed base count

  • fp (int) – missed comp count

Returns:

tuple of precision, recall, f1

Return type:

tuple, float

truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]

Calculates reciprocal overlap of two ranges

Parameters:
  • astart (int) – First range’s start position

  • aend (int) – First range’s end position

  • bstart (int) – Second range’s start position

  • bend (int) – Second range’s end position

Returns:

reciprocal overlap

Return type:

float

truvari.restricted_float(x)[source]

Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with argparse.ArgumentParser.add_argument type parameter

Parameters:

x (float) – number to check

Returns:

input number

Return type:

float

Example
>>> import truvari
>>> truvari.restricted_float(0.5)
0.5
>>> truvari.restricted_float(2)
Traceback (most recent call last):
argparse.ArgumentTypeError: 2.0 not in range (0, 1)
truvari.restricted_int(x)[source]

Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with argparse.ArgumentParser.add_argument type parameter

Parameters:

x (int) – number to check

Returns:

input number

Return type:

float

Example
>>> import truvari
>>> truvari.restricted_int(5)
5
>>> truvari.restricted_int(-2)
Traceback (most recent call last):
argparse.ArgumentTypeError: -2 is < 0
truvari.ref_ranges(reference, chunk_size=10000000)[source]

Chunk reference into pieces. Useful for multiprocessing.

Parameters:
  • reference (string) – Filename of reference to chunk

  • chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000)
>>> print(len([_ for _ in gen]))
1000
truvari.roll_seqsim(a_seq, b_seq)[source]

Compare the lexicographically smallest rotations of two sequences

truvari.seqsim(allele1, allele2)[source]

Calculate similarity of two sequences

Parameters:
  • allele1 (str) – first entry

  • allele2 (str) – second entry

Returns:

sequence similarity

Return type:

float

truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]

Create default logger

Parameters:
  • debug (boolean, optional) – Set log-level to logging.DEBUG

  • stream (file handler, optional) – Where log is written

  • log_format (string, optional) – Format of log lines

  • show_version (bool, optional) – Start log with truvari version and command line args

truvari.sizesim(sizeA, sizeB)[source]

Calculate the size similarity percent and size diff for two sizes

Parameters:
  • sizeA (int) – first size

  • sizeB (int) – second size

Returns:

size similarity percent and size diff (A - B)

Return type:

(float, int)

truvari.unroll_seqsim(seqA, seqB, p)[source]

Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details

Parameters:
  • seqA (string) – sequence upstream sequence

  • seqB (string) – sequence downstream sequence

  • p (integer) – how many bases upstream seqA is from seqB

Returns:

sequence similarity of seqA vs seqB after unrolling

Return type:

float

truvari.vcf_ranges(vcf, min_dist=1000)[source]

Chunk vcf into discrete pieces. Useful for multiprocessing.

Parameters:
  • vcf (string) – Filename of vcf to find ranges

  • min_dist (int, optional) – Minimum distance between entries for new range

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example
>>> import truvari
>>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz")
>>> print(len([_ for _ in gen]))
228
truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]

Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.

Parameters:
  • fn (string) – File name of VCF to open and turn into a DataFrame

  • with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns

  • with_format – Add the FORMAT fields from the VCF to the DataFrame columns

  • sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True

  • no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name

  • alleles (boolean, optional) – Add column with allele sequences

Returns:

Converted VCF

Return type:

pandas.DataFrame

Example
>>> import truvari
>>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True)
>>> df.columns
Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual',
       'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN',
       'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom',
       'NA12878_AD_ref', 'NA12878_AD_alt'],
      dtype='object')

Data

HEADERMAT

regular expression of vcf header INFO/FORMAT fields with groups

truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)')

Compiled regular expression object.

QUALBINS

0-100 quality score bin strings (step size 10)

truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SVTYTYPE

pandas.CategoricalDtype of truvari.SV

SZBINMAX

integer list of maximum size for size bins

truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINS

string list of size bins

truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINTYPE

pandas.CategoricalDtype of truvari.SZBINS