Truvari package

Overview

Truvari - SV comparison and annotation toolkit

See help() of specific functions / objects for details

Objects:

Bench BenchOutput GT LogFileStderr MatchResult StatsBox SV VariantFile VariantRecord VariantParams

Extra methods:

bed_ranges() benchdir_count_entries() best_seqsim() build_region_tree() check_vcf_index() chunker() cmd_exe() compress_index_vcf() coords_within() count_entries() extend_region_tree() file_zipper() help_unknown_cmd() get_gt() get_scalebin() get_sizebin() get_svtype() make_temp_filename() merge_region_tree_overlaps() msa2vcf() opt_gz_open() optimize_df_memory() overlap_percent() overlaps() performance_metrics() read_bed_tree() reciprocal_overlap() restricted_float() restricted_int() ref_ranges() roll_seqsim() seqsim() setup_logging() sizesim() unroll_seqsim() vcf_ranges() vcf_to_df()

Data:

truvari.HEADERMAT truvari.QUALBINS truvari.SVTYTYPE truvari.SZBINMAX truvari.SZBINS truvari.SZBINTYPE

Variant Handling

class truvari.VariantFile(filename, *args, params=None, **kwargs)[source]

Wrapper around pysam.VariantFile with helper functions for iteration.

fetch(*args, **kwargs)[source]

Fetch variants from the pysam.VariantFile, wrapping them into truvari.VariantRecords.

Parameters:

args – Positional arguments for the pysam.VariantFile.fetch method.
kwargs – Keyword arguments for the pysam.VariantFile.fetch method.

Returns:

Iterator of truvari.VariantRecord objects.

Return type:

iterator

fetch_bed(bed_fn, inside=True, with_region=False)[source]

Fetch variants from the VCF based on regions defined in a BED file.

Parameters:

bed_fn (str) – Path to the BED file defining regions of interest.
inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.
with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.

Returns:

Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).

Return type:

iterator

fetch_regions(tree, inside=True, with_region=False)[source]

Fetch variants from the VCF based on regions defined in a tree of chrom:IntervalTree.

Parameters:

tree (dict) – Tree of chrom:IntervalTree defining regions of interest.
inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.
with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.

Returns:

Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).

Return type:

iterator

write(record)[source]

Write a truvari.VariantRecord to the pysam.VariantFile.

Parameters:: record (truvari.VariantRecord) – The truvari.VariantRecord to be written.

class truvari.VariantRecord(record, params=None)[source]

Wrapper around pysam.VariantRecords with helper functions of variant properties and basic comparisons

allele_freq_annos(samples=None)[source]

Calculate allele annotations for a VCF Entry

Parameters:

samples (list of strings, optional) – Subset of samples from the entry over which to calculate annos

Returns:

Dictonary of
AF - allele frequency
MAF - minor allele frequency
ExcHet - excess heterozygosity
HWE - hardy weinberg equilibrium
AC - allele count
MAC - minor allele count
AN - number of called alleles

Return type:

dict

Example

>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/multi.vcf.gz')
>>> e = next(v)
>>> e.allele_freq_annos()
{'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}

bnd_direction_strand()[source]

Parses a BND ALT string to determine its direction and strand.

A BND (breakend) ALT string indicates a structural variant breakpoint. This method parses the ALT string to determine:

The direction: “left” means the piece is anchored on the left side of the breakpoint, while “right” means it’s anchored on the right.
The strand: “direct” indicates the base is on the direct strand, and “complement” indicates the base is on the complement strand.

Note

This method assumes that self.is_bnd() is True, meaning the variant is a BND-type structural variant.

Returns:: A tuple containing the direction (“left” or “right”) and the strand (“direct” or “complement”).
Return type:: tuple (str, str)
Raises:: ValueError – If the ALT string does not follow the expected BND format.

bnd_match(other)[source]: Build a MatchResult for bnds

bnd_position()[source]

Extracts the chromosome and position from a BND ALT string.

Breakend (BND) ALT strings indicate structural variant breakpoints that span across chromosomes or positions. This method parses the ALT string to extract the target chromosome and position of the breakpoint.

Returns:: A tuple containing the chromosome (as a string) and the position (as an integer).
Return type:: tuple (str, int)
Raises:: ValueError – If the ALT string does not follow the expected BND format.

boundaries(ins_inflate=False)[source]

Get the start and end of an entry

Parameters:: ins_inflate (bool, optional) – inflate INS boundaries by sv length
Returns:: start/end
Return type:: tuple (int, int)

compare_gts(other, match)[source]

Populates genotype-specific comparison details in a MatchResult.

This method compares the genotypes of a “base” (reference) variant and a “comparison” variant and updates the provided MatchResult object with the results. It computes the genotype counts and determines the difference between the base and comparison genotypes.

Parameters:

other (truvari.VariantRecord) – The other variant entry.
match (truvari.MatchResult) – The MatchResult object to update with genotype comparison details.

Updates the MatchResult object with the following attributes:

base_gt: The genotype of the base sample.
base_gt_count: The count of the reference allele (1) in the base genotype.
comp_gt: The genotype of the comparison sample.
comp_gt_count: The count of the reference allele (1) in the comparison genotype.
gt_match: The absolute difference between base_gt_count and comp_gt_count.

copy()[source]: Copies this truvari.VariantRecord

cpx_match(other)[source]: Attempt to match non-BND variants with a BND. This is performed by decomposing the non-BND into its BND representations before calling bnd_match. This will make at least two BNDs, so the MatchResults are sorted and the greater is returned.

decompose()[source]

Decompose SVs into BND Records

This method decomposes structural variants (SVs) of <DEL>, <DUP>, and <INV> into Breakend (BND) records. The decomposed variants are stored internally and returned as a list.

Returns:

list: A list of new BND variant records. List will be empty when no valid decomposition

Notes:

For INV, four BNDs are created.
For DEL and DUP, two BNDs are created.
DUP are assumed to be of type DUP:TANDEM.

distance(other)[source]

Calculate the start and end distances of the pair. Negative distances indicate self is upstream of other

Parameters:: other (truvari.VariantRecord) – Other to compare
Returns:: starts and ends distance
Return type:: tuple (int, int)

filter_call(base=False)[source]

Determines whether a variant call should be filtered based on Truvari parameters or specific requirements.

This method evaluates a variant entry (entry) and checks if it should be excluded from further processing based on filtering criteria such as monomorphic reference, multi-allelic records, filtering status, sample presence, or unsupported single-end BNDs.

Parameters:

entry (truvari.VariantRecord) – The variant entry to evaluate.
base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call or the “comparison” call. Filtering behavior may differ based on this flag.

Returns:

True if the variant should be filtered (excluded), otherwise False.

Return type:

bool

Raises:

ValueError – If the entry is multi-allelic and check_multi is enabled in the Truvari parameters.

Filtering Logic:

Monomorphic Reference: If check_monref is enabled and the entry is a monomorphic reference, it is filtered.
Multi-Allelic Records: If check_multi is enabled and the entry is multi-allelic, an error is raised.
Filtered Variants: If passonly is enabled and the entry is flagged as filtered, it is excluded.
Sample Presence: If no_ref is set to include the entry’s type (base or comparison) or pick == ‘ac’, the sample must be present in the entry.
Single-End BNDs: Single-end BNDs are always excluded.

filter_size(base=False)[source]

Determines whether a variant entry should be filtered based on its size.

This method evaluates the size of a variant and checks if it falls outside the specified size thresholds. Filtering criteria depend on whether the entry is a “base” (reference) or “comparison” call.

Parameters:

entry (truvari.VariantRecord) – The variant entry to evaluate.
base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call. If True, the sizemin parameter is used as the minimum size threshold. Otherwise, the sizefilt parameter is used.

Returns:

True if the variant should be filtered due to its size, otherwise False.

Return type:

bool

Filtering Logic:

Minimum Size (Base): If base=True and the size is less than sizemin, the variant is filtered.
Minimum Size (Comparison): If base=False and the size is less than sizefilt, the variant is filtered.
Maximum Size: If the size exceeds sizemax, the variant is filtered.

get_alt()[source]: Returns either the resolved ALT or _record.alt[0]

get_record()[source]: Return internal pysam.VariantRecord

get_ref()[source]: Returns either the resolved REF or _record.ref

gt(sample=0)[source]: Returns the raw GT field from a record for a specific sample. Returns None if GT is not in FORMAT

is_bnd()[source]: Returns if a record is a resolved BND

is_filtered(values=None)[source]

Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’

Parameters:: values (set, optional) – set of filter values for intersection
Returns:: True if entry should be filtered
Return type:: bool

Example

>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> e = next(v)
>>> e.is_filtered() # PASS shouldn't be filtered
False
>>> e = next(v)
>>> e.is_filtered(["lowQ"]) # Call isn't lowQ, so filter
True

is_monrefstar()[source]: Returns if a record is a reference monozygotic or star-alt

is_multi()[source]: Returns if a record is multi-allelic

is_present(sample=0, allow_missing=True)[source]

Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present

Parameters:: sample (string, optional) – sample name
Returns:: True if variant is present in the sample
Return type:: bool

Example

>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> e = next(v)
>>> e.is_present(allow_missing=False)
True

is_resolved()[source]

Checks if an SV is a sequence resolved by checking it is has an alt, is not symbolic, is not monref, is not BND

Returns:: is_resolved
Return type:: bool

is_single_bnd()[source]: Returns if a record is a single-end BND

is_symbolic()[source]: Returns if a record is a symbolic variant (e.g. ALT == <DEL>)

match(other)[source]: Build a MatchResult from comparison of two VariantRecords If self and other are non-bnd, calls VariantRecord.var_match, If self and other are bnd, calls VariantRecord.bnd_match If decompose is on, calls VariantRecord.cpx_match Otherwise, returns a default MatchResult

move_record(out_vcf, take_samples=None, into_samples=None)[source]: Similar to pysam.VariantRecord.translate, except this allows adding new FORMAT fields and optionally selects a subset of samples

recovl(other, ins_inflate=True)[source]

Calculates reciprocal overlap of two entries

Parameters:

other (truvari.VariantRecord) – Other entry to compare with
ins_inflate (bool) – inflate entry boundaries

Returns:

The reciprocal overlap

Return type:

float

Example

>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> a.recovl(b)
0

resolve(ref)[source]

Attempts to resolve the REF/ALT sequences of a structural variant (SV) and updates the instance attributes.

This method tries to reconstruct the REF and ALT sequences for specific types of structural variants (e.g., deletions, inversions, duplications) based on the provided reference genome sequence. If successful, it stores the resolved sequences in self._resolved_ref and self._resolved_alt, and may adjust self.end.

Parameters:: ref (pysam.FastaFile) – The reference genome used to resolve the structural variant. Typically a pysam.FastaFile or equivalent object providing access to reference sequences.
Returns:: True if the REF/ALT sequences are successfully resolved, otherwise False.
Return type:: bool

Updates:

self._resolved_ref: The resolved REF sequence.
self._resolved_alt: The resolved ALT sequence.
self.end: May be adjusted for specific variant types (e.g., duplications).

Resolution Logic:

If ref is None, the ALT is <CNV> or <INS>, or the start position is invalid, the method returns False.
The method fetches the sequence from the reference genome using the variant’s chrom, start, and end.
The resolution depends on the variant type:
- Deletion (`DEL`): The REF sequence is the fetched reference sequence, and the ALT is the first base.
- Inversion (`INV`): The REF sequence is the fetched reference sequence, and the ALT is the reverse complement of the REF.
- Duplication (`DUP`) (when self.params.dup_to_ins is enabled): The REF is the first base of the fetched sequence, and the ALT is the full sequence. The end is adjusted to be start + 1.
Returns False if the variant type is unsupported or cannot be resolved.

same_type(other)[source]

Check if self.var_type() == other.var_type() with extra handling for dup-to-ins

Parameters:

other (truvari.VariantRecord) – Other entry to compare with
dup_to_ins (bool) – Convert DUP to INS types

Returns:

True if entry SVTYPEs match

Return type:

bool

seqsim(other)[source]

Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.

Assumes that is either a sequence resolved variant or self.resolve has been called.

Parameters:

other (truvari.VariantRecord) – Other to compare with
roll (bool) – compare the lexicographically minimum rotation of sequences

Returns:

sequence similarity

Return type:

float

sizesim(other)[source]

Calculate the size similarity and difference for two entries

Parameters:: other (truvari.VariantRecord) – Other entry to compare with
Returns:: size similarity and size diff (A - B)
Return type:: (float, int)

Example

>>> import truvari
>>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz')
>>> a = next(v)
>>> b = next(v)
>>> a.sizesim(b)
(0.07142857142857142, 13)

to_hash(hasher=<built-in function openssl_sha1>)[source]

Turn variant into a key and hash with provided hasher

Parameters:: hasher – hashing function
Returns:: hash
Return type:: string

to_key(prefix='', bounds=False)[source]

Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.

Warning

If a caller redundantly calls a variant exactly the same it will not have a unique key

Parameters:

prefix (string, optional) – prefix
bounds (bool, optional) – use entry_boundaries

Returns:

hashable string uniquely identifying the variant

Return type:

string

var_match(other)[source]: Build a MatchResult

var_size()[source]

Determine the size of the variant.

Note

How size is determined

Starts by trying to use INFO/SVLEN
If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.
If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF
Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))

Returns:: the entry’s size
Return type:: int

var_type()[source]

Return entry’s SVTYPE

Note

the first of these checks that succeeds determins the svtype

BND if pysam.VariantRecord.alleles_variant_types[1] == “BND”
Checking INFO/SVTYPE
if is_resolved, REF and ALT seq size difference
if symbolic (e.g. <DEL>) parse the ALT. Subtypes are ignored (e.g. DUP:ME becomes DUP)
svtype is unknown (UNK)

Returns:: SV type
Return type:: truvari.SV

within(rstart, rend)[source]: Extract entry boundaries and type to call truvari.coords_within

within_tree(tree)[source]: Extract entry and tree boundaries to call truvari.coords_within

class truvari.VariantParams(args=None, **kwargs)[source]

Holds variant parsing and matching parameters.

Attributes

Attribute	Description
refdist	Distance threshold for comparing positions in the reference genome. Default: 500.
pctseq	Minimum percentage of sequence similarity required for a match. Default: 0.70 (70%).
pctsize	Minimum percentage of size similarity required for a match. Default: 0.70 (70%).
pctovl	Minimum percentage of reciprocal overlap required for comparing variants. Default: 0.0 (disabled).
typeignore	Whether to ignore variant type mismatches during comparison. Default: False.
no_roll	Whether to disable rolling of sequences for comparisons. Default: False.
chunksize	Number of entries to process in each chunk. Default: 1000.
bSample	Sample name or index for the “base” (a.k.a. self) variants during comparisons. Default: 0.
cSample	Sample name or index for the “comparison” (a.k.a. other) variants during comparisons. Default: 0.
dup_to_ins	Whether to treat duplications as insertions for some operations. Default: False.
bnddist	Maximum allowed distance for breakend (BND) comparisons. Default: 100.
sizemin	Minimum variant size to consider. Default: 50.
sizefilt	Minimum size filter for comparison in the “comparison” dataset. Default: 30.
sizemax	Maximum variant size to consider. Default: -1, off.
passonly	Whether to only consider variants with a “PASS” filter status. Default: False.
no_ref	Whether to ignore reference homozygous variants in (a)ll, (b)ase, or (c)omp VCF Default: False (off)`.
pick	Strategy for picking matches by Bench (single, ac, multi).
ignore_monref	Whether to ignore monoallelic reference calls. Default: True.
check_multi	Whether to check for and handle multi-allelic records. Default: True.
check_monref	Whether to check for monoallelic reference calls. Default: True.
no_single_bnd	Whether to exclude single-end breakends (BNDs) from comparisons. Default: True.
write_resolved	Whether to write resolved REF/ALT sequences to output. Default: False.
decompose	Decompose symbolic variants (e.g. <DEL>) into BNDs when comparing with BNDs. Default: True
short_circuit	Whether to enable short-circuit logic for early exits in comparisons. Default: False.
skip_gt	Whether to skip genotype comparisons. Default: False.
max_resolve	Maximum size of a variant to attempt sequence resolving. Default: 25000.

Objects

class truvari.MatchResult[source]

Holds results from a matching operation

Attributes

Attribute	Description
base	The base (a.k.a. self) variant
comp	The comp (a.k.a. other) variant
state	Boolean of if variants match
seqsim	Sequence similarity of variants
sizesim	Size similarity of variants
ovlpct	Reciprocal overlap ov variants
sizediff	Base variant var_size minus comp variant var_size
st_dist	Base variant start position minus comp variant start position
ed_dist	Base variant end position minus comp variant end position
gt_match	Boolean of if genotypes match
score	TruScore of matches
matid	Place to put MatchIds, not populated by truvari.VariantRecord.match

calc_score()[source]: Unite the similarity measures and make a score

class truvari.GT(value)[source]

Genotypes

REF = <GT.REF: 0>
HET = <GT.HET: 1>
HOM = <GT.HOM: 2>
NON = <GT.NON: 3> - Non-Genotyped (e.g. ./.)
UNK = <GT.UNK: 4> - Undetermined Genotype

class truvari.SV(value)[source]

SVtypes

SNP = <SV.SNP: 0>
DEL = <SV.DEL: 1>
INS = <SV.INS: 2>
DUP = <SV.DUP: 3>
INV = <SV.INV: 4>
NON = <SV.NON: 5> - Not a variant (monomorphic ref?)
UNK = <SV.UNK: 6> - Unknown type

class truvari.Bench(params=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False)[source]

Object to perform operations of truvari bench

You can build a Bench worker with default matching parameters via:

m_bench = truvari.Bench()

If you’d like to customize the parameters, build and edit a VariantParams and pass it to the Bench init

params = truvari.VariantParams(pctseq=0.50)
m_bench = truvari.Bench(params)

To run on a chunk of truvari.VariantRecord already loaded, pass them in as lists to:

matches = m_bench.compare_calls(base_variants, comp_variants)

For advanced parsing, one can build the match matrix for a chunk of calls with:

match_matrix = m_bench.build_matrix(base_variants, comp_variants)

This can then be used for things such as creating custom match pickers or adding new matching checks.

If you want to run on existing files, simply supply the arguments to init

m_bench = truvari.Bench(params, base_vcf, comp_vcf, outdir)
output = m_bench.run()

Note

Running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.

build_matrix(base_variants, comp_variants, chunk_id=0)[source]: Builds MatchResults, returns them as a numpy matrix

check_refine_candidate(result)[source]: Adds this region as a candidate for refinement if there are unmatched variants

compare_calls(base_variants, comp_variants, chunk_id=0)[source]: Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults

compare_chunk(chunk)[source]: Given a filtered chunk, (from chunker) compare all of the calls

param_dict()[source]: Returns the parameters as a dict

run()[source]: Runs bench and returns the resulting truvari.BenchOutput

class truvari.BenchOutput(bench, params)[source]

Makes all of the output files for a Bench.run

The variable BenchOutput.vcf_filenames holds a dictonary. The keys are tpb, tpc for true positive base/comp vcf filename and fn, fp. The variable stats_box holds a StatsBox.

close_outputs()[source]: Close all the files

write_match(match)[source]: Annotate a MatchResults’ entries then write to the apppropriate file and do the stats counting. Writer is responsible for handling FPs between sizefilt-sizemin

class truvari.StatsBox[source]

Make a blank stats box for counting TP/FP/FN and calculating performance

calc_performance()[source]: Calculate the precision/recall

write_json(out_name)[source]: Write stats as json to file

class truvari.LogFileStderr(fn)[source]

Write to stderr and a file

Useful in conjunction with setup_logging()

Example

>>> import truvari
>>> truvari.setup_logging(stream=LogFileStderr('log.txt'))

flush()[source]: Flush handlers

write(*args)[source]: Write to handlers

Extra Methods

truvari.bed_ranges(bed, chunk_size=10000000)[source]

Chunk bed regions into pieces. Useful for multiprocessing.

Parameters:

bed (string) – Filename of bed to chunk
chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000)
>>> print(len([_ for _ in gen]))
992

truvari.benchdir_count_entries(benchdir, regions, within=True, threads=4)[source]

Count the number of variants per bed region in Truvari bench directory by state

Returns a pandas dataframe of the counts

truvari.best_seqsim(a_seq, b_seq, st_dist)[source]: Returns best of roll, unroll, and direct sequence similarity

Warning

roll_seqsim is only called when both sequences are < 500bp in length

truvari.build_region_tree(vcfA, vcfB=None, includebed=None)[source]

Builds a dictionary of chromosome names mapped to IntervalTree objects containing regions based on the input VCF files and an optional BED file.

Parameters

vcfApysam.VariantFile: The primary VCF file used to define the regions and contigs.
vcfBpysam.VariantFile, optional: The secondary VCF file used for comparison. If provided, its contigs are considered to exclude any contigs not present in vcfA. Default is None.
includebedstr, optional: Path to a BED file specifying regions to include. If provided, these regions override those defined by the VCF files. Contigs in the BED file not found in vcfA are excluded. Default is None.

Returns

dict: A dictionary where keys are chromosome names (str), and values are IntervalTree objects representing the regions.

Notes

If includebed is provided:
- The function reads regions from the BED file and uses them to define the regions in all_regions.
- Contigs in the BED file that are not present in vcfA are excluded, and a warning is logged.
Contigs in vcfB that are not present in vcfA are ignored, and a warning is logged.
If a contig in vcfA lacks a length definition, the program logs an error and exits with code 10.

Raises

SystemExit: If a contig in vcfA has no length defined, the program terminates with an error message and exit code 10.

Logging

Logs warnings if contigs in vcfB are not found in vcfA.
Logs warnings if contigs in the BED file are not found in vcfA.
Logs an error and exits if any contig in vcfA lacks a length definition.

truvari.read_bed_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]

Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file

By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.

idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”

Parameters:

filename (string) – Path to file to parse, can be compressed
chrom_col (integer, optional) – Index of column in file with chromosome
start_col (integer, optional) – Index of column in file with start
end_col (integer, optional) – Index of column in file with end
one_based (bool, optional) – True if coordinates are one-based
is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior
comment (string, optional) – ignore lines if they start with this string
idxfmt (string, optional) – Index of column in file with chromosome

Returns:

dictionary with chromosome keys and intervaltree.IntervalTree values, number of lines parsed

Return type:

tuple (dict, int)

truvari.check_vcf_index(vcf_path)[source]: Return true if an index file is found for the vcf

truvari.chunker(params, *files)[source]

Given a truvari.VariantParams and multiple files, zip them and create chunks

Yields tuple of the chunk of calls, and an identifier of the chunk

truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]

Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.

Parameters:

cmd (string) – The command to be executed.
stdin (bytes) – stdinput to pipe to command
timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never
cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal
pipefail (boolean) – Set to True if the cmd contains pipes |

Returns:

namedtuple of
ret_code - integer, exit code of the command
stdout - string, captured standard output of the command
stderr - binary string, captured standard error of the command
run_time - datetime.timedelta, execution time

Return type:

namedtuple (ret_code, stdout, stderr, run_time)

Example

>>> import truvari
>>> ret = truvari.cmd_exe("echo 'hello world'")
>>> ret.ret_code
0
>>> ret.stdout
'hello world\n'
>>> import truvari
>>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught
>>> ret.ret_code
214

truvari.compress_index_vcf(fn, fout=None, remove=True)[source]

Compress and index a vcf. If no fout is provided, write to fn + ‘.gz’

Parameters:

fn (string) – filename of vcf to work on
fout (string, optional) – output filename
remove (bool) – remove fn after fout is made

truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]

Returns if a span is within the provided [start, end). All coordinates assumed 0-based One exception is made for REPL types with anchor bases. This is for TR callers and GIABTR benchmark where variants will span the entire includebed region.

Parameters:

qstart (integer) – query start position
qend (integer) – query end position
start (integer) – start of span
end (integer) – end of span
end_within (bool) – if true, qend <= rend, else qend < rend

Returns:

If the coordinates are within the span

Return type:

bool

truvari.count_entries(vcf, chroms, regions, within)[source]: Count the number of variants per bed region a VCF regions is a list of lists with sub-lists having 0:3 of chrom, start, end Returns a list of the counts in the same order as the regions

truvari.extend_region_tree(tree, pad)[source]: Extends all intervals by a fixed number of bases on each side Returns a copy of this tree

truvari.file_zipper(*start_files)[source]

Zip multiple files to yield their entries in order.

The function takes as input tuples of (key, iterable), where:

key is an identifier (used to track which file the yielded entry comes from).
iterable is an iterable object, typically a truvari.VariantFile.

The function iterates through all input files in a coordinated manner, yielding the entries in order.

Parameters:: start_files (tuple) – A variable-length argument list of tuples (key, iterable).
Yields:: A tuple of (key, truvari.VariantRecord), where key is the file identifier and the second element is the next record from the corresponding file.
Return type:: tuple
Raises:: StopIteration – Raised when all input files have been exhausted.

Logs:

Logs a summary of the zipping process after all files have been processed.

truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]

Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.

Parameters:

user_cmd (string) – user command
avail_cmds (list of strings) – available commands
threshold (float, optional) – max score (0-1) to give a guess

Returns:

best guess from :param: avail_cmds or None

Return type:

string

Example

>>> import truvari
>>> truvari.help_unknown_cmd('banno', ["bench", "anno"])
'anno'
>>> truvari.help_unknown_cmd('random', ["bench", "anno"])

truvari.get_gt(gt)[source]

Turn a genotype tuple into a GT object

Parameters:: gt (tuple (int, int)) – Genotype tuple
Returns:: A GT of the genotype
Return type:: truvari.GT

Example

>>> import truvari
>>> truvari.get_gt((0, 0))
<GT.REF: 0>
>>> truvari.get_gt((0, 1, 1))
<GT.UNK: 4>

truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]

Scale variable x from rdomain to tdomain with step sizes return key, index

Parameters:

x (number) – Number to scale
rmin (number, optional) – The minimum of the range of your measurement
rmax (number, optional) – The maximum of the range of your measurement
tmin (number, optional) – The minimum of the range of your desired target scaling
tmax (number, optional) – The maximum of the range of your measurement
step (number, optional) – The step size of bins of target range

Returns:

The bin-string of the scaled variable

Return type:

string

Example

>>> import truvari
>>> truvari.get_scalebin(4, 1, 5, 0, 20, 5)
('[15,20)', 3)
>>> truvari.get_scalebin(6, 1, 5, 0, 20, 5)
('>=20', 4)

truvari.get_sizebin(sz)[source]

Bin a given size into truvari.SZBINS

Parameters:: sz (integer) – SVLEN to bin into the SZBINS
Returns:: SZBIN
Return type:: string

truvari.get_svtype(svtype)[source]

Turn a SVTYPE string to a truvari.SV object

Parameters:: svtype (string)
Returns:: A SV of the SVTYPE
Return type:: truvari.SV

Example

>>> import truvari
>>> truvari.get_svtype("INS")
<SV.INS: 2>
>>> truvari.get_svtype("foo")
<SV.UNK: 7>

truvari.make_temp_filename(tmpdir=None, suffix='')[source]: Get a random filename in a tmpdir with an optional extension

truvari.merge_region_tree_overlaps(tree)[source]: Runs IntervalTree.merge_overlaps on all trees. Info Logs the count of chromosomes with overlapping regions and Debug Logs the chromosome names with overlapping regions.

truvari.msa2vcf(msa, anchor_base='N')[source]

Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string

Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries

Example (for dealing with test coverage not being seen)

>>> import truvari
>>> from truvari.phab import fasta_reader
>>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/"
>>> msa_file = "fm_03ffe346bf2196b0ceebee5beb9a6acb.msa"
>>> seqs = open(msa_dir + msa_file).read()
>>> fasta = dict(fasta_reader(seqs))
>>> m_entries_str = truvari.msa2vcf(fasta)

truvari.opt_gz_open(in_fn)[source]

Chooses file handler for plain-text files or *.gz files.

returns a generator which yields lines of the file

truvari.optimize_df_memory(df)[source]

Optimize DataFrame memory by trying to downcast every column in-place. Returns the pre/post size

Parameters:: df (pandas.DataFrame) – Dataframe to optimize
Returns:: tuple (pre_size, post_size)
Return type:: int, sizes in bytes

truvari.overlap_percent(astart, aend, bstart, bend)[source]

Calculates the percent of range A which overlaps with range B

Parameters:

astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position

Returns:

overlap percent

Return type:

float

truvari.overlaps(s1, e1, s2, e2)[source]

Check if two start/end ranges have overlap

Parameters:

s1 (int) – range 1 start
e1 (int) – range 1 end
s2 (int) – range 2 start
e2 (int) – range 2 end

Returns:

True if ranges overlap

Return type:

bool

truvari.performance_metrics(tpbase, tp, fn, fp)[source]

Calculates precision, recall, and f1 given counts by state Can return None if values are zero

Parameters:

tpbase (int) – found base count
tp (int) – found comp count
fn (int) – missed base count
fp (int) – missed comp count

Returns:

tuple of precision, recall, f1

Return type:

tuple, float

truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]

Calculates reciprocal overlap of two ranges

Parameters:

astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position

Returns:

reciprocal overlap

Return type:

float

truvari.restricted_float(x)[source]

Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with argparse.ArgumentParser.add_argument type parameter

Parameters:: x (float) – number to check
Returns:: input number
Return type:: float

Example

>>> import truvari
>>> truvari.restricted_float(0.5)
0.5
>>> truvari.restricted_float(2)
Traceback (most recent call last):
argparse.ArgumentTypeError: 2.0 not in range (0, 1)

truvari.restricted_int(x)[source]

Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with argparse.ArgumentParser.add_argument type parameter

Parameters:: x (int) – number to check
Returns:: input number
Return type:: float

Example

>>> import truvari
>>> truvari.restricted_int(5)
5
>>> truvari.restricted_int(-2)
Traceback (most recent call last):
argparse.ArgumentTypeError: -2 is < 0

truvari.ref_ranges(reference, chunk_size=10000000)[source]

Chunk reference into pieces. Useful for multiprocessing.

Parameters:

reference (string) – Filename of reference to chunk
chunk_size (int, optional) – Length of reference chunks

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000)
>>> print(len([_ for _ in gen]))
1000

truvari.roll_seqsim(a_seq, b_seq)[source]: Compare the lexicographically smallest rotations of two sequences

truvari.seqsim(allele1, allele2)[source]

Calculate similarity of two sequences

Parameters:

allele1 (str) – first entry
allele2 (str) – second entry

Returns:

sequence similarity

Return type:

float

truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]

Create default logger

Parameters:

debug (boolean, optional) – Set log-level to logging.DEBUG
stream (file handler, optional) – Where log is written
log_format (string, optional) – Format of log lines
show_version (bool, optional) – Start log with truvari version and command line args

truvari.sizesim(sizeA, sizeB)[source]

Calculate the size similarity percent and size diff for two sizes

Parameters:

sizeA (int) – first size
sizeB (int) – second size

Returns:

size similarity percent and size diff (A - B)

Return type:

(float, int)

truvari.unroll_seqsim(seqA, seqB, p)[source]

Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details

Parameters:

seqA (string) – sequence upstream sequence
seqB (string) – sequence downstream sequence
p (integer) – how many bases upstream seqA is from seqB

Returns:

sequence similarity of seqA vs seqB after unrolling

Return type:

float

truvari.vcf_ranges(vcf, min_dist=1000)[source]

Chunk vcf into discrete pieces. Useful for multiprocessing.

Parameters:

vcf (string) – Filename of vcf to find ranges
min_dist (int, optional) – Minimum distance between entries for new range

Returns:

generator of tuples (ref_name, start, stop)

Return type:

iterator

Example

>>> import truvari
>>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz")
>>> print(len([_ for _ in gen]))
228

truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]

Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.

Parameters:

fn (string) – File name of VCF to open and turn into a DataFrame
with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns
with_format – Add the FORMAT fields from the VCF to the DataFrame columns
sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True
no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name
alleles (boolean, optional) – Add column with allele sequences

Returns:

Converted VCF

Return type:

pandas.DataFrame

Example

>>> import truvari
>>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True)
>>> df.columns
Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual',
       'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN',
       'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom',
       'NA12878_AD_ref', 'NA12878_AD_alt'],
      dtype='object')

Data

HEADERMAT

regular expression of vcf header INFO/FORMAT fields with groups

truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)'): Compiled regular expression object.

QUALBINS

0-100 quality score bin strings (step size 10)

truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SVTYTYPE

pandas.CategoricalDtype of truvari.SV

SZBINMAX

integer list of maximum size for size bins

truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINS

string list of size bins

truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

SZBINTYPE

pandas.CategoricalDtype of truvari.SZBINS