Truvari package
Overview
Truvari - SV comparison and annotation toolkit
See help() of specific functions / objects for details
Objects:
Bench
BenchOutput
GT
LogFileStderr
MatchResult
StatsBox
SV
VariantFile
VariantRecord
VariantParams
Extra methods:
bed_ranges()
benchdir_count_entries()
best_seqsim()
build_region_tree()
check_vcf_index()
chunker()
cmd_exe()
compress_index_vcf()
coords_within()
count_entries()
extend_region_tree()
file_zipper()
help_unknown_cmd()
get_gt()
get_scalebin()
get_sizebin()
get_svtype()
make_temp_filename()
merge_region_tree_overlaps()
msa2vcf()
opt_gz_open()
optimize_df_memory()
overlap_percent()
overlaps()
performance_metrics()
phab()
read_bed_tree()
reciprocal_overlap()
restricted_float()
restricted_int()
ref_ranges()
roll_seqsim()
seqsim()
setup_logging()
sizesim()
unroll_seqsim()
vcf_ranges()
vcf_to_df()
Data:
truvari.HEADERMAT
truvari.QUALBINS
truvari.SVTYTYPE
truvari.SZBINMAX
truvari.SZBINS
truvari.SZBINTYPE
Variant Handling
- class truvari.VariantFile(filename, *args, params=None, **kwargs)[source]
Wrapper around pysam.VariantFile with helper functions for iteration.
- fetch(*args, **kwargs)[source]
Fetch variants from the pysam.VariantFile, wrapping them into truvari.VariantRecords.
- Parameters:
args – Positional arguments for the pysam.VariantFile.fetch method.
kwargs – Keyword arguments for the pysam.VariantFile.fetch method.
- Returns:
Iterator of truvari.VariantRecord objects.
- Return type:
iterator
- fetch_bed(bed_fn, inside=True, with_region=False)[source]
Fetch variants from the VCF based on regions defined in a BED file.
- Parameters:
bed_fn (str) – Path to the BED file defining regions of interest.
inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.
with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.
- Returns:
Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).
- Return type:
iterator
- fetch_regions(tree, inside=True, with_region=False)[source]
Fetch variants from the VCF based on regions defined in a tree of chrom:IntervalTree.
- Parameters:
tree (dict) – Tree of chrom:IntervalTree defining regions of interest.
inside (bool) – If True, fetch variants inside the regions. If False, fetch variants outside the regions.
with_region (bool) – If True, return tuples of (truvari.VariantRecord, region). Defaults to False.
- Returns:
Iterator of truvari.VariantRecord objects or tuples of (truvari.VariantRecord, region).
- Return type:
iterator
- class truvari.VariantRecord(record, params=None)[source]
Wrapper around pysam.VariantRecords with helper functions of variant properties and basic comparisons
- allele_freq_annos(samples=None)[source]
Calculate allele annotations for a VCF Entry
- Parameters:
samples (list of strings, optional) – Subset of samples from the entry over which to calculate annos
- Returns:
- Dictonary ofAF - allele frequencyMAF - minor allele frequencyExcHet - excess heterozygosityHWE - hardy weinberg equilibriumAC - allele countMAC - minor allele countAN - number of called alleles
- Return type:
dict
- Example
>>> import truvari >>> v = truvari.VariantFile('repo_utils/test_files/variants/multi.vcf.gz') >>> e = next(v) >>> e.allele_freq_annos() {'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}
- bnd_direction_strand()[source]
Parses a BND ALT string to determine its direction and strand.
A BND (breakend) ALT string indicates a structural variant breakpoint. This method parses the ALT string to determine:
The direction: “left” means the piece is anchored on the left side of the breakpoint, while “right” means it’s anchored on the right.
The strand: “direct” indicates the base is on the direct strand, and “complement” indicates the base is on the complement strand.
Note
This method assumes that self.is_bnd() is True, meaning the variant is a BND-type structural variant.
- Returns:
A tuple containing the direction (“left” or “right”) and the strand (“direct” or “complement”).
- Return type:
tuple (str, str)
- Raises:
ValueError – If the ALT string does not follow the expected BND format.
- bnd_position()[source]
Extracts the chromosome and position from a BND ALT string.
Breakend (BND) ALT strings indicate structural variant breakpoints that span across chromosomes or positions. This method parses the ALT string to extract the target chromosome and position of the breakpoint.
- Returns:
A tuple containing the chromosome (as a string) and the position (as an integer).
- Return type:
tuple (str, int)
- Raises:
ValueError – If the ALT string does not follow the expected BND format.
- boundaries(ins_inflate=False)[source]
Get the start and end of an entry
- Parameters:
ins_inflate (bool, optional) – inflate INS boundaries by sv length
- Returns:
start/end
- Return type:
tuple (int, int)
- compare_gts(other, match)[source]
Populates genotype-specific comparison details in a MatchResult.
This method compares the genotypes of a “base” (reference) variant and a “comparison” variant and updates the provided MatchResult object with the results. It computes the genotype counts and determines the difference between the base and comparison genotypes.
- Parameters:
other (truvari.VariantRecord) – The other variant entry.
match (truvari.MatchResult) – The MatchResult object to update with genotype comparison details.
- Updates the MatchResult object with the following attributes:
base_gt: The genotype of the base sample.
base_gt_count: The count of the reference allele (1) in the base genotype.
comp_gt: The genotype of the comparison sample.
comp_gt_count: The count of the reference allele (1) in the comparison genotype.
gt_match: The absolute difference between base_gt_count and comp_gt_count.
- cpx_match(other)[source]
Attempt to match non-BND variants with a BND. This is performed by decomposing the non-BND into its BND representations before calling bnd_match. This will make at least two BNDs, so the MatchResults are sorted and the greater is returned.
- decompose()[source]
Decompose SVs into BND Records
This method decomposes structural variants (SVs) of <DEL>, <DUP>, and <INV> into Breakend (BND) records. The decomposed variants are stored internally and returned as a list.
- Returns:
list: A list of new BND variant records. List will be empty when no valid decomposition
- Notes:
For INV, four BNDs are created.
For DEL and DUP, two BNDs are created.
DUP are assumed to be of type DUP:TANDEM.
- distance(other)[source]
Calculate the start and end distances of the pair. Negative distances indicate self is upstream of other
- Parameters:
other (
truvari.VariantRecord) – Other to compare- Returns:
starts and ends distance
- Return type:
tuple (int, int)
- filter_call(base=False)[source]
Determines whether a variant call should be filtered based on Truvari parameters or specific requirements.
This method evaluates a variant entry (entry) and checks if it should be excluded from further processing based on filtering criteria such as monomorphic reference, multi-allelic records, filtering status, sample presence, or unsupported single-end BNDs.
- Parameters:
entry (truvari.VariantRecord) – The variant entry to evaluate.
base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call or the “comparison” call. Filtering behavior may differ based on this flag.
- Returns:
True if the variant should be filtered (excluded), otherwise False.
- Return type:
bool
- Raises:
ValueError – If the entry is multi-allelic and check_multi is enabled in the Truvari parameters.
- Filtering Logic:
Monomorphic Reference: If check_monref is enabled and the entry is a monomorphic reference, it is filtered.
Multi-Allelic Records: If check_multi is enabled and the entry is multi-allelic, an error is raised.
Filtered Variants: If passonly is enabled and the entry is flagged as filtered, it is excluded.
Sample Presence: If no_ref is set to include the entry’s type (base or comparison) or pick == ‘ac’, the sample must be present in the entry.
Single-End BNDs: Single-end BNDs are always excluded.
- filter_size(base=False)[source]
Determines whether a variant entry should be filtered based on its size.
This method evaluates the size of a variant and checks if it falls outside the specified size thresholds. Filtering criteria depend on whether the entry is a “base” (reference) or “comparison” call.
- Parameters:
entry (truvari.VariantRecord) – The variant entry to evaluate.
base (bool, optional) – A flag indicating whether the entry is the “base” (reference) call. If True, the sizemin parameter is used as the minimum size threshold. Otherwise, the sizefilt parameter is used.
- Returns:
True if the variant should be filtered due to its size, otherwise False.
- Return type:
bool
- Filtering Logic:
Minimum Size (Base): If base=True and the size is less than sizemin, the variant is filtered.
Minimum Size (Comparison): If base=False and the size is less than sizefilt, the variant is filtered.
Maximum Size: If the size exceeds sizemax, the variant is filtered.
- gt(sample=0)[source]
Returns the raw GT field from a record for a specific sample. Returns None if GT is not in FORMAT
- is_filtered(values=None)[source]
Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’
- Parameters:
values (set, optional) – set of filter values for intersection
- Returns:
True if entry should be filtered
- Return type:
bool
- Example
>>> import truvari >>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> e = next(v) >>> e.is_filtered() # PASS shouldn't be filtered False >>> e = next(v) >>> e.is_filtered(["lowQ"]) # Call isn't lowQ, so filter True
- is_present(sample=0, allow_missing=True)[source]
Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present
- Parameters:
sample (string, optional) – sample name
- Returns:
True if variant is present in the sample
- Return type:
bool
- Example
>>> import truvari >>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> e = next(v) >>> e.is_present(allow_missing=False) True
- is_resolved()[source]
Checks if an SV is a sequence resolved by checking it is has an alt, is not symbolic, is not monref, is not BND
- Returns:
is_resolved
- Return type:
bool
- match(other)[source]
Build a MatchResult from comparison of two VariantRecords If self and other are non-bnd, calls VariantRecord.var_match, If self and other are bnd, calls VariantRecord.bnd_match If decompose is on, calls VariantRecord.cpx_match Otherwise, returns a default MatchResult
- move_record(out_vcf, take_samples=None, into_samples=None)[source]
Similar to pysam.VariantRecord.translate, except this allows adding new FORMAT fields and optionally selects a subset of samples
- recovl(other, ins_inflate=True)[source]
Calculates reciprocal overlap of two entries
- Parameters:
other (
truvari.VariantRecord) – Other entry to compare withins_inflate (bool) – inflate entry boundaries
- Returns:
The reciprocal overlap
- Return type:
float
- Example
>>> import truvari >>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> a = next(v) >>> b = next(v) >>> a.recovl(b) 0
- resolve(ref)[source]
Attempts to resolve the REF/ALT sequences of a structural variant (SV) and updates the instance attributes.
This method tries to reconstruct the REF and ALT sequences for specific types of structural variants (e.g., deletions, inversions, duplications) based on the provided reference genome sequence. If successful, it stores the resolved sequences in self._resolved_ref and self._resolved_alt, and may adjust self.end.
- Parameters:
ref (pysam.FastaFile) – The reference genome used to resolve the structural variant. Typically a pysam.FastaFile or equivalent object providing access to reference sequences.
- Returns:
True if the REF/ALT sequences are successfully resolved, otherwise False.
- Return type:
bool
- Updates:
self._resolved_ref: The resolved REF sequence.
self._resolved_alt: The resolved ALT sequence.
self.end: May be adjusted for specific variant types (e.g., duplications).
- Resolution Logic:
If ref is None, the ALT is <CNV> or <INS>, or the start position is invalid, the method returns False.
The method fetches the sequence from the reference genome using the variant’s chrom, start, and end.
- The resolution depends on the variant type:
Deletion (`DEL`): The REF sequence is the fetched reference sequence, and the ALT is the first base.
Inversion (`INV`): The REF sequence is the fetched reference sequence, and the ALT is the reverse complement of the REF.
Duplication (`DUP`) (when self.params.dup_to_ins is enabled): The REF is the first base of the fetched sequence, and the ALT is the full sequence. The end is adjusted to be start + 1.
Returns False if the variant type is unsupported or cannot be resolved.
- same_type(other)[source]
Check if self.var_type() == other.var_type() with extra handling for dup-to-ins
- Parameters:
other (
truvari.VariantRecord) – Other entry to compare withdup_to_ins (bool) – Convert DUP to INS types
- Returns:
True if entry SVTYPEs match
- Return type:
bool
- seqsim(other)[source]
Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.
Assumes that is either a sequence resolved variant or self.resolve has been called.
- Parameters:
other (
truvari.VariantRecord) – Other to compare withroll (bool) – compare the lexicographically minimum rotation of sequences
- Returns:
sequence similarity
- Return type:
float
- sizesim(other)[source]
Calculate the size similarity and difference for two entries
- Parameters:
other (
truvari.VariantRecord) – Other entry to compare with- Returns:
size similarity and size diff (A - B)
- Return type:
(float, int)
- Example
>>> import truvari >>> v = truvari.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> a = next(v) >>> b = next(v) >>> a.sizesim(b) (0.07142857142857142, 13)
- to_hash(hasher=<built-in function openssl_sha1>)[source]
Turn variant into a key and hash with provided hasher
- Parameters:
hasher – hashing function
- Returns:
hash
- Return type:
string
- to_key(prefix='', bounds=False)[source]
Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.
Warning
If a caller redundantly calls a variant exactly the same it will not have a unique key
- Parameters:
prefix (string, optional) – prefix
bounds (bool, optional) – use entry_boundaries
- Returns:
hashable string uniquely identifying the variant
- Return type:
string
- var_size()[source]
Determine the size of the variant.
Note
How size is determined
Starts by trying to use INFO/SVLEN
If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.
If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF
Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))
- Returns:
the entry’s size
- Return type:
int
- var_type()[source]
Return entry’s SVTYPE
Note
the first of these checks that succeeds determins the svtype
BND if pysam.VariantRecord.alleles_variant_types[1] == “BND”
Checking INFO/SVTYPE
if is_resolved, REF and ALT seq size difference
if symbolic (e.g. <DEL>) parse the ALT. Subtypes are ignored (e.g. DUP:ME becomes DUP)
svtype is unknown (UNK)
- Returns:
SV type
- Return type:
- class truvari.VariantParams(args=None, **kwargs)[source]
Holds variant parsing and matching parameters.
Attributes
Attribute
Description
refdist
Distance threshold for comparing positions in the reference genome. Default: 500.
pctseq
Minimum percentage of sequence similarity required for a match. Default: 0.70 (70%).
pctsize
Minimum percentage of size similarity required for a match. Default: 0.70 (70%).
pctovl
Minimum percentage of reciprocal overlap required for comparing variants. Default: 0.0 (disabled).
typeignore
Whether to ignore variant type mismatches during comparison. Default: False.
no_roll
Whether to disable rolling of sequences for comparisons. Default: False.
chunksize
Number of entries to process in each chunk. Default: 1000.
bSample
Sample name or index for the “base” (a.k.a. self) variants during comparisons. Default: 0.
cSample
Sample name or index for the “comparison” (a.k.a. other) variants during comparisons. Default: 0.
dup_to_ins
Whether to treat duplications as insertions for some operations. Default: False.
bnddist
Maximum allowed distance for breakend (BND) comparisons. Default: 100.
sizemin
Minimum variant size to consider. Default: 50.
sizefilt
Minimum size filter for comparison in the “comparison” dataset. Default: 30.
sizemax
Maximum variant size to consider. Default: -1, off.
passonly
Whether to only consider variants with a “PASS” filter status. Default: False.
no_ref
Whether to ignore reference homozygous variants in (a)ll, (b)ase, or (c)omp VCF Default: False (off)`.
pick
Strategy for picking matches by Bench (single, ac, multi).
ignore_monref
Whether to ignore monoallelic reference calls. Default: True.
check_multi
Whether to check for and handle multi-allelic records. Default: True.
check_monref
Whether to check for monoallelic reference calls. Default: True.
no_single_bnd
Whether to exclude single-end breakends (BNDs) from comparisons. Default: True.
write_resolved
Whether to write resolved REF/ALT sequences to output. Default: False.
decompose
Decompose symbolic variants (e.g. <DEL>) into BNDs when comparing with BNDs. Default: True
short_circuit
Whether to enable short-circuit logic for early exits in comparisons. Default: False.
skip_gt
Whether to skip genotype comparisons. Default: False.
max_resolve
Maximum size of a variant to attempt sequence resolving. Default: 25000.
Objects
- class truvari.MatchResult[source]
Holds results from a matching operation
Attributes
Attribute
Description
base
The base (a.k.a. self) variant
comp
The comp (a.k.a. other) variant
state
Boolean of if variants match
seqsim
Sequence similarity of variants
sizesim
Size similarity of variants
ovlpct
Reciprocal overlap ov variants
sizediff
Base variant var_size minus comp variant var_size
st_dist
Base variant start position minus comp variant start position
ed_dist
Base variant end position minus comp variant end position
gt_match
Boolean of if genotypes match
score
TruScore of matches
matid
Place to put MatchIds, not populated by truvari.VariantRecord.match
- class truvari.GT(value)[source]
Genotypes
REF = <GT.REF: 0>
HET = <GT.HET: 1>
HOM = <GT.HOM: 2>
NON = <GT.NON: 3> - Non-Genotyped (e.g. ./.)
UNK = <GT.UNK: 4> - Undetermined Genotype
- class truvari.SV(value)[source]
SVtypes
SNP = <SV.SNP: 0>
DEL = <SV.DEL: 1>
INS = <SV.INS: 2>
DUP = <SV.DUP: 3>
INV = <SV.INV: 4>
NON = <SV.NON: 5> - Not a variant (monomorphic ref?)
UNK = <SV.UNK: 6> - Unknown type
- class truvari.Bench(params=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False)[source]
Object to perform operations of truvari bench
You can build a Bench worker with default matching parameters via:
m_bench = truvari.Bench()
If you’d like to customize the parameters, build and edit a
VariantParamsand pass it to the Bench initparams = truvari.VariantParams(pctseq=0.50) m_bench = truvari.Bench(params)
To run on a chunk of
truvari.VariantRecordalready loaded, pass them in as lists to:matches = m_bench.compare_calls(base_variants, comp_variants)
For advanced parsing, one can build the match matrix for a chunk of calls with:
match_matrix = m_bench.build_matrix(base_variants, comp_variants)
This can then be used for things such as creating custom match pickers or adding new matching checks.
If you want to run on existing files, simply supply the arguments to init
m_bench = truvari.Bench(params, base_vcf, comp_vcf, outdir) output = m_bench.run()
Note
Running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.
- build_matrix(base_variants, comp_variants, chunk_id=0)[source]
Builds MatchResults, returns them as a numpy matrix
- check_refine_candidate(result)[source]
Adds this region as a candidate for refinement if there are unmatched variants
- compare_calls(base_variants, comp_variants, chunk_id=0)[source]
Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults
- run()[source]
Runs bench and returns the resulting
truvari.BenchOutput
- class truvari.BenchOutput(bench, params)[source]
Makes all of the output files for a Bench.run
The variable BenchOutput.vcf_filenames holds a dictonary. The keys are tpb, tpc for true positive base/comp vcf filename and fn, fp. The variable stats_box holds a
StatsBox.
Extra Methods
- truvari.bed_ranges(bed, chunk_size=10000000)[source]
Chunk bed regions into pieces. Useful for multiprocessing.
- Parameters:
bed (string) – Filename of bed to chunk
chunk_size (int, optional) – Length of reference chunks
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000) >>> print(len([_ for _ in gen])) 992
- truvari.benchdir_count_entries(benchdir, regions, within=True, threads=4)[source]
Count the number of variants per bed region in Truvari bench directory by state
Returns a pandas dataframe of the counts
- truvari.best_seqsim(a_seq, b_seq, st_dist)[source]
Returns best of roll, unroll, and direct sequence similarity .. warning:
`roll_seqsim` is only called when both sequences are < 500bp in length
- truvari.build_region_tree(vcfA, vcfB=None, includebed=None)[source]
Builds a dictionary of chromosome names mapped to IntervalTree objects containing regions based on the input VCF files and an optional BED file.
Parameters
- vcfApysam.VariantFile
The primary VCF file used to define the regions and contigs.
- vcfBpysam.VariantFile, optional
The secondary VCF file used for comparison. If provided, its contigs are considered to exclude any contigs not present in vcfA. Default is None.
- includebedstr, optional
Path to a BED file specifying regions to include. If provided, these regions override those defined by the VCF files. Contigs in the BED file not found in vcfA are excluded. Default is None.
Returns
- dict
A dictionary where keys are chromosome names (str), and values are IntervalTree objects representing the regions.
Notes
- If includebed is provided:
The function reads regions from the BED file and uses them to define the regions in all_regions.
Contigs in the BED file that are not present in vcfA are excluded, and a warning is logged.
Contigs in vcfB that are not present in vcfA are ignored, and a warning is logged.
If a contig in vcfA lacks a length definition, the program logs an error and exits with code 10.
Raises
- SystemExit
If a contig in vcfA has no length defined, the program terminates with an error message and exit code 10.
Logging
Logs warnings if contigs in vcfB are not found in vcfA.
Logs warnings if contigs in the BED file are not found in vcfA.
Logs an error and exits if any contig in vcfA lacks a length definition.
- truvari.read_bed_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]
Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file
By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.
idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”
- Parameters:
filename (string) – Path to file to parse, can be compressed
chrom_col (integer, optional) – Index of column in file with chromosome
start_col (integer, optional) – Index of column in file with start
end_col (integer, optional) – Index of column in file with end
one_based (bool, optional) – True if coordinates are one-based
is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior
comment (string, optional) – ignore lines if they start with this string
idxfmt (string, optional) – Index of column in file with chromosome
- Returns:
dictionary with chromosome keys and
intervaltree.IntervalTreevalues, number of lines parsed- Return type:
tuple (dict, int)
- truvari.chunker(params, *files)[source]
Given a truvari.VariantParams and multiple files, zip them and create chunks
Yields tuple of the chunk of calls, and an identifier of the chunk
- truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]
Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.
- Parameters:
cmd (string) – The command to be executed.
stdin (bytes) – stdinput to pipe to command
timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never
cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal
pipefail (boolean) – Set to True if the cmd contains pipes |
- Returns:
- namedtuple ofret_code - integer, exit code of the commandstdout - string, captured standard output of the commandstderr - binary string, captured standard error of the commandrun_time - datetime.timedelta, execution time
- Return type:
namedtuple (ret_code, stdout, stderr, run_time)
- Example
>>> import truvari >>> ret = truvari.cmd_exe("echo 'hello world'") >>> ret.ret_code 0 >>> ret.stdout 'hello world\n' >>> import truvari >>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught >>> ret.ret_code 214
- truvari.compress_index_vcf(fn, fout=None, remove=True)[source]
Compress and index a vcf. If no fout is provided, write to fn + ‘.gz’
- Parameters:
fn (string) – filename of vcf to work on
fout (string, optional) – output filename
remove (bool) – remove fn after fout is made
- truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]
Returns if a span is within the provided [start, end). All coordinates assumed 0-based One exception is made for REPL types with anchor bases. This is for TR callers and GIABTR benchmark where variants will span the entire includebed region.
- Parameters:
qstart (integer) – query start position
qend (integer) – query end position
start (integer) – start of span
end (integer) – end of span
end_within (bool) – if true, qend <= rend, else qend < rend
- Returns:
If the coordinates are within the span
- Return type:
bool
- truvari.count_entries(vcf, chroms, regions, within)[source]
Count the number of variants per bed region a VCF regions is a list of lists with sub-lists having 0:3 of chrom, start, end Returns a list of the counts in the same order as the regions
- truvari.extend_region_tree(tree, pad)[source]
Extends all intervals by a fixed number of bases on each side Returns a copy of this tree
- truvari.file_zipper(*start_files)[source]
Zip multiple files to yield their entries in order.
The function takes as input tuples of (key, iterable), where:
key is an identifier (used to track which file the yielded entry comes from).
iterable is an iterable object, typically a truvari.VariantFile.
The function iterates through all input files in a coordinated manner, yielding the entries in order.
- Parameters:
start_files (tuple) – A variable-length argument list of tuples (key, iterable).
- Yields:
A tuple of (key, truvari.VariantRecord), where key is the file identifier and the second element is the next record from the corresponding file.
- Return type:
tuple
- Raises:
StopIteration – Raised when all input files have been exhausted.
- Logs:
Logs a summary of the zipping process after all files have been processed.
- truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]
Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.
- Parameters:
user_cmd (string) – user command
avail_cmds (list of strings) – available commands
threshold (float, optional) – max score (0-1) to give a guess
- Returns:
best guess from :param: avail_cmds or None
- Return type:
string
- Example
>>> import truvari >>> truvari.help_unknown_cmd('banno', ["bench", "anno"]) 'anno' >>> truvari.help_unknown_cmd('random', ["bench", "anno"])
- truvari.get_gt(gt)[source]
Turn a genotype tuple into a GT object
- Parameters:
gt (tuple (int, int)) – Genotype tuple
- Returns:
A
GTof the genotype- Return type:
- Example
>>> import truvari >>> truvari.get_gt((0, 0)) <GT.REF: 0> >>> truvari.get_gt((0, 1, 1)) <GT.UNK: 4>
- truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]
Scale variable x from rdomain to tdomain with step sizes return key, index
- Parameters:
x (number) – Number to scale
rmin (number, optional) – The minimum of the range of your measurement
rmax (number, optional) – The maximum of the range of your measurement
tmin (number, optional) – The minimum of the range of your desired target scaling
tmax (number, optional) – The maximum of the range of your measurement
step (number, optional) – The step size of bins of target range
- Returns:
The bin-string of the scaled variable
- Return type:
string
- Example
>>> import truvari >>> truvari.get_scalebin(4, 1, 5, 0, 20, 5) ('[15,20)', 3) >>> truvari.get_scalebin(6, 1, 5, 0, 20, 5) ('>=20', 4)
- truvari.get_sizebin(sz)[source]
Bin a given size into
truvari.SZBINS- Parameters:
sz (integer) – SVLEN to bin into the SZBINS
- Returns:
SZBIN
- Return type:
string
- truvari.get_svtype(svtype)[source]
Turn a SVTYPE string to a
truvari.SVobject- Parameters:
svtype (string)
- Returns:
A
SVof the SVTYPE- Return type:
- Example
>>> import truvari >>> truvari.get_svtype("INS") <SV.INS: 2> >>> truvari.get_svtype("foo") <SV.UNK: 7>
- truvari.make_temp_filename(tmpdir=None, suffix='')[source]
Get a random filename in a tmpdir with an optional extension
- truvari.merge_region_tree_overlaps(tree)[source]
Runs IntervalTree.merge_overlaps on all trees. Info Logs the count of chromosomes with overlapping regions and Debug Logs the chromosome names with overlapping regions.
- truvari.msa2vcf(msa, anchor_base='N')[source]
Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string
Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries
- Example (for dealing with test coverage not being seen)
>>> import truvari >>> from truvari.phab import fasta_reader >>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/" >>> msa_file = "fm_03ffe346bf2196b0ceebee5beb9a6acb.msa" >>> seqs = open(msa_dir + msa_file).read() >>> fasta = dict(fasta_reader(seqs)) >>> m_entries_str = truvari.msa2vcf(fasta)
- truvari.opt_gz_open(in_fn)[source]
Chooses file handler for plain-text files or *.gz files.
returns a generator which yields lines of the file
- truvari.optimize_df_memory(df)[source]
Optimize DataFrame memory by trying to downcast every column in-place. Returns the pre/post size
- Parameters:
df (
pandas.DataFrame) – Dataframe to optimize- Returns:
tuple (pre_size, post_size)
- Return type:
int, sizes in bytes
- truvari.overlap_percent(astart, aend, bstart, bend)[source]
Calculates the percent of range A which overlaps with range B
- Parameters:
astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position
- Returns:
overlap percent
- Return type:
float
- truvari.overlaps(s1, e1, s2, e2)[source]
Check if two start/end ranges have overlap
- Parameters:
s1 (int) – range 1 start
e1 (int) – range 1 end
s2 (int) – range 2 start
e2 (int) – range 2 end
- Returns:
True if ranges overlap
- Return type:
bool
- truvari.performance_metrics(tpbase, tp, fn, fp)[source]
Calculates precision, recall, and f1 given counts by state Can return None if values are zero
- Parameters:
tpbase (int) – found base count
tp (int) – found comp count
fn (int) – missed base count
fp (int) – missed comp count
- Returns:
tuple of precision, recall, f1
- Return type:
tuple, float
- truvari.phab(var_regions, base_vcf, ref_fn, output_fn, bSamples=None, buffer=100, mafft_params='--auto --thread 1', comp_vcf=None, cSamples=None, prefix_comp=False, threads=1, method='mafft', passonly=True, max_size=50000)[source]
Harmonize variants with MSA. Runs on a set of regions given files to create haplotypes and writes results to a compressed/indexed VCF
- Parameters:
var_regions (
list) – List of tuples of region’s (chrom, start, end)base_vcf (
str) – VCF file nameref_fn (
str) – Reference file nameoutput_fn – Output VCF.gz
bSamples (
list) – Samples from base_vcf to create haplotypesbuffer (
int) – Flanking reference bases to added to regionsmafft_params (
str) – Parameters for mafftcomp_vcf (
str) – VCF file namecSamples (
list) – Samples from comp_vcf to create haplotypesprefix_comp (
bool) – Ensure unique sample names by prefixing comp samplesthreads (
int) – Number of threads to usemethod (
str) – Alignment method to use mafft or wfa
- truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]
Calculates reciprocal overlap of two ranges
- Parameters:
astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position
- Returns:
reciprocal overlap
- Return type:
float
- truvari.restricted_float(x)[source]
Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with
argparse.ArgumentParser.add_argumenttype parameter- Parameters:
x (float) – number to check
- Returns:
input number
- Return type:
float
- Example
>>> import truvari >>> truvari.restricted_float(0.5) 0.5 >>> truvari.restricted_float(2) Traceback (most recent call last): argparse.ArgumentTypeError: 2.0 not in range (0, 1)
- truvari.restricted_int(x)[source]
Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with
argparse.ArgumentParser.add_argumenttype parameter- Parameters:
x (int) – number to check
- Returns:
input number
- Return type:
float
- Example
>>> import truvari >>> truvari.restricted_int(5) 5 >>> truvari.restricted_int(-2) Traceback (most recent call last): argparse.ArgumentTypeError: -2 is < 0
- truvari.ref_ranges(reference, chunk_size=10000000)[source]
Chunk reference into pieces. Useful for multiprocessing.
- Parameters:
reference (string) – Filename of reference to chunk
chunk_size (int, optional) – Length of reference chunks
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000) >>> print(len([_ for _ in gen])) 1000
- truvari.roll_seqsim(a_seq, b_seq)[source]
Compare the lexicographically smallest rotations of two sequences
- truvari.seqsim(allele1, allele2)[source]
Calculate similarity of two sequences
- Parameters:
allele1 (str) – first entry
allele2 (str) – second entry
- Returns:
sequence similarity
- Return type:
float
- truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]
Create default logger
- Parameters:
debug (boolean, optional) – Set log-level to logging.DEBUG
stream (file handler, optional) – Where log is written
log_format (string, optional) – Format of log lines
show_version (bool, optional) – Start log with truvari version and command line args
- truvari.sizesim(sizeA, sizeB)[source]
Calculate the size similarity percent and size diff for two sizes
- Parameters:
sizeA (int) – first size
sizeB (int) – second size
- Returns:
size similarity percent and size diff (A - B)
- Return type:
(float, int)
- truvari.unroll_seqsim(seqA, seqB, p)[source]
Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details
- Parameters:
seqA (string) – sequence upstream sequence
seqB (string) – sequence downstream sequence
p (integer) – how many bases upstream seqA is from seqB
- Returns:
sequence similarity of seqA vs seqB after unrolling
- Return type:
float
- truvari.vcf_ranges(vcf, min_dist=1000)[source]
Chunk vcf into discrete pieces. Useful for multiprocessing.
- Parameters:
vcf (string) – Filename of vcf to find ranges
min_dist (int, optional) – Minimum distance between entries for new range
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz") >>> print(len([_ for _ in gen])) 228
- truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]
Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.
- Parameters:
fn (string) – File name of VCF to open and turn into a DataFrame
with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns
with_format – Add the FORMAT fields from the VCF to the DataFrame columns
sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True
no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name
alleles (boolean, optional) – Add column with allele sequences
- Returns:
Converted VCF
- Return type:
pandas.DataFrame
- Example
>>> import truvari >>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True) >>> df.columns Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual', 'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN', 'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom', 'NA12878_AD_ref', 'NA12878_AD_alt'], dtype='object')
Data
HEADERMAT
regular expression of vcf header INFO/FORMAT fields with groups
- truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)')
Compiled regular expression object.
QUALBINS
0-100 quality score bin strings (step size 10)
- truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SVTYTYPE
pandas.CategoricalDtype of truvari.SV
SZBINMAX
integer list of maximum size for size bins
- truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SZBINS
string list of size bins
- truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SZBINTYPE
pandas.CategoricalDtype of truvari.SZBINS