truvari package
Overview
Truvari - SV comparison and annotation toolkit
See help() of specific functions / objects for details
VariantRecord methods:
entry_boundaries()
entry_distance()
entry_gt_comp()
entry_is_filtered()
entry_is_present()
entry_reciprocal_overlap()
entry_same_variant_type()
entry_shared_ref_context()
entry_seq_similarity()
entry_size()
entry_size_similarity()
entry_to_hash()
entry_to_key()
entry_within()
entry_variant_type()
Extra methods:
allele_freq_annos()
bed_ranges()
build_anno_tree()
calc_af()
calc_hwe()
compress_index_vcf()
create_pos_haplotype()
get_gt()
get_scalebin()
get_sizebin()
get_svtype()
msa2vcf()
overlap_percent()
overlaps()
phab()
reciprocal_overlap()
ref_ranges()
seqsim()
sizesim()
unroll_compare()
vcf_ranges()
Dev methods:
benchdir_count_entries()
chunker()
cmd_exe()
consolidate_phab_vcfs()
coords_within()
count_entries()
file_zipper()
help_unknown_cmd()
make_temp_filename()
opt_gz_open()
optimize_df_memory()
performance_metrics()
region_filter()
restricted_float()
restricted_int()
setup_logging()
vcf_to_df()
Objects:
Bench
BenchOutput
GT
RegionVCFIterator
LogFileStderr
MatchResult
Matcher
StatsBox
SV
Data:
truvari.HEADERMAT
truvari.QUALBINS
truvari.SVTYTYPE
truvari.SZBINMAX
truvari.SZBINS
truvari.SZBINTYPE
VariantRecord Methods
entry_boundaries
entry_distance
- truvari.entry_distance(entryA, entryB)[source]
Calculate the start and end distances of the pair. Negative distances indicate entryA is upstream of entryB
- Parameters:
entryA (
pysam.VariantRecord
) – first entryentryB (
pysam.VariantRecord
) – second entry
- Returns:
starts and ends distance
- Return type:
tuple (int, int)
entry_gt_comp
- truvari.entry_gt_comp(entryA, entryB, sampleA=None, sampleB=None)[source]
Compare the genotypes of two entries
- Parameters:
entryA (
pysam.VariantRecord
) – first entryentryB (
pysam.VariantRecord
) – second entrysampleA (string, optional) – sample of entryA to check
sampleB (string, optional) – sample of entryB to check
- Returns:
True if the genotypes are concordant
- Return type:
bool
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> a = next(v) >>> b = next(v) >>> truvari.entry_gt_comp(a, b) True
entry_is_filtered
- truvari.entry_is_filtered(entry, values=None)[source]
Checks if entry should be filtered given the filter values provided. If values is None, assume that filter must have PASS or be blank ‘.’
- Parameters:
entry (
pysam.VariantRecord
) – entry to checkvalues (set, optional) – set of filter values for intersection
- Returns:
True if entry should be filtered
- Return type:
bool
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> truvari.entry_is_filtered(next(v)) # PASS shouldn't be filtered False >>> truvari.entry_is_filtered(next(v), set(["lowQ"])) # Call isn't lowQ, so filter True
entry_is_present
- truvari.entry_is_present(entry, sample=None, allow_missing=True)[source]
Checks if entry’s sample genotype is present and is heterozygous or homozygous (a.k.a. present) If allow_missing, just check for a 1 in the genotype. Otherwise, a missing (‘.’) genotype isn’t considered present
- Parameters:
entry (
pysam.VariantRecord
) – entry to checksample (string, optional) – sample name
- Returns:
True if variant is present in the sample
- Return type:
bool
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> truvari.entry_is_present(next(v)) True
entry_reciprocal_overlap
- truvari.entry_reciprocal_overlap(entry1, entry2)[source]
Calculates reciprocal overlap of two entries
- Parameters:
entry1 (
pysam.VariantRecord
) – First entryentry2 (
pysam.VariantRecord
) – Second entry
- Returns:
The reciprocal overlap
- Return type:
float
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> a = next(v) >>> b = next(v) >>> truvari.entry_reciprocal_overlap(a, b) 0
entry_same_variant_type
- truvari.entry_same_variant_type(entryA, entryB, dup_to_ins=False)[source]
Check if entryA svtype == entryB svtype
- Parameters:
entryA (
pysam.VariantRecord
) – first entryentryB (
pysam.VariantRecord
) – second entrydup_to_ins (bool) – Convert DUP to INS types
- Returns:
True if entry SVTYPEs match
- Return type:
bool
entry_seq_similarity
- truvari.entry_seq_similarity(entryA, entryB, ref=None, min_len=0)[source]
Calculate sequence similarity of two entries. If reference is not None, compare their shared reference context. Otherwise, use the unroll technique.
- Parameters:
entryA (
pysam.VariantRecord
) – first entryentryB (
pysam.VariantRecord
) – second entryref (
pysam.FastaFile
) – Reference genomemin_len (float, optional) – Minimum length of reference context to generate
- Returns:
sequence similarity
- Return type:
float
entry_size
- truvari.entry_size(entry)[source]
Determine the size of the variant.
Note
How size is determined
Starts by trying to use INFO/SVLEN
If SVLEN is unavailable and ALT field is an SV (e.g. <INS>, <DEL>, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS.
If len of vcf.REF and vcf.ALT[0] are equal, return len of vcf.REF
Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0])))
- Parameters:
entry (
pysam.VariantRecord
) – entry to look at- Returns:
the entry’s size
- Return type:
int
entry_size_similarity
- truvari.entry_size_similarity(entryA, entryB)[source]
Calculate the size similarity and difference for two entries
- Parameters:
entryA (
pysam.VariantRecord
) – first entryentryB (
pysam.VariantRecord
) – second entry
- Returns:
size similarity and size diff (A - B)
- Return type:
(float, int)
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> a = next(v) >>> b = next(v) >>> truvari.entry_size_similarity(a, b) (0.07142857142857142, 13)
entry_to_hash
entry_to_key
- truvari.entry_to_key(entry, prefix='', bounds=False)[source]
Turn a vcf entry into a hashable key string. Use the prefix (base/comp) to indicate the source VCF when consolidating multiple files’ calls. If bounds: call entry_boundaries for start/stop.
Warning
If a caller redundantly calls a variant exactly the same it will not have a unique key
- Parameters:
entry (
pysam.VariantRecord
) – entry to stringifyprefix (string, optional) – prefix
bounds (bool, optional) – use entry_boundaries
- Returns:
hashable string uniquely identifying the variant
- Return type:
string
entry_variant_type
- truvari.entry_variant_type(entry)[source]
Return entry’s SVTYPE
Note
How svtype is determined:
Starts by trying to use INFO/SVTYPE
If SVTYPE is unavailable, infer if entry is a insertion or deletion by looking at the REF/ALT sequence size differences
If REF/ALT sequences are not available, try to parse the <INS>, <DEL>, etc from the ALT column.
Otherwise, assume ‘UNK’
- Parameters:
entry (
pysam.VariantRecord
)- Returns:
SV type
- Return type:
entry_within
..autofunction:: entry_within
Extra Methods
allele_freq_annos
- truvari.allele_freq_annos(entry, samples=None)[source]
Calculate allele annotations for a VCF Entry
- Parameters:
entry (
pysam.VariantRecord
) – Entry with samples to parsesamples (list of strings, optional) – Subset of samples from the entry over which to calculate annos
- Returns:
- Dictonary ofAF - allele frequencyMAF - minor allele frequencyExcHet - excess heterozygosityHWE - hardy weinberg equilibriumAC - allele countMAC - minor allele countAN - number of called alleles
- Return type:
dict
- Example
>>> import truvari >>> import pysam >>> v = pysam.VariantFile('repo_utils/test_files/variants/multi.vcf.gz') >>> truvari.allele_freq_annos(next(v)) {'AF': 0.5, 'MAF': 0.5, 'ExcHet': 1.0, 'HWE': 1.0, 'MAC': 1, 'AC': [1, 1], 'AN': 2, 'N_HEMI': 0, 'N_HOMREF': 0, 'N_HET': 1, 'N_HOMALT': 0, 'N_MISS': 2}
bed_ranges
- truvari.bed_ranges(bed, chunk_size=10000000)[source]
Chunk bed regions into pieces. Useful for multiprocessing.
- Parameters:
bed (string) – Filename of bed to chunk
chunk_size (int, optional) – Length of reference chunks
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.bed_ranges("repo_utils/test_files/beds/giab.bed", 1000) >>> print(len([_ for _ in gen])) 992
build_anno_tree
- truvari.build_anno_tree(filename, chrom_col=0, start_col=1, end_col=2, one_based=False, comment='#', idxfmt=None)[source]
Build an dictionary of IntervalTrees for each chromosome from tab-delimited annotation file
By default, the file is assumed to be a bed-format. If custom chrom/start/end are used, the columns can be specified.
idxfmt is a string that will be formatted with the line number from the file (e.g. “num %d” will make intervals with data=”num 0”, “num 1”. By default the data will be interger line number. If intervals will be compared between anno_trees, set idxfmt to “”
- Parameters:
filename (string) – Path to file to parse, can be compressed
chrom_col (integer, optional) – Index of column in file with chromosome
start_col (integer, optional) – Index of column in file with start
end_col (integer, optional) – Index of column in file with end
one_based (bool, optional) – True if coordinates are one-based
is_pyintv (bool, optional) – add 1 to end position to correct pyintervaltree.overlap behavior
comment (string, optional) – ignore lines if they start with this string
idxfmt (string, optional) – Index of column in file with chromosome
- Returns:
dictionary with chromosome keys and
intervaltree.IntervalTree
values- Return type:
dict
calc_af
- truvari.calc_af(gts)[source]
Calculate allele annotations for a list of genotypes
- Parameters:
gts (list) – Genotype tuples
- Returns:
- Dictonary ofAF - allele frequencyMAF - minor allele frequencyExcHet - excess heterozygosityHWE - hardy weinberg equilibriumAC - allele count for GT 1MAC - minor allele countAN - number of called allelesN_HEMI - Number of partial genotypes (length of 1 or a single missing)N_HOMREF - homozygous reference GT countN_HET - heterozygous GT countN_HOMALT - homozygous alternate GT countN_MISS - Number of missing genotypes (all .)
- Return type:
dict
calc_hwe
- truvari.calc_hwe(nref, nalt, nhet)[source]
Calculate Hardy Weinberg equilibrium and excess heterozygosity
- Parameters:
nref (int) – Number of reference alleles
nalt (int) – Number of alternate alleles
nhet (int) – Number of heterozygous sites
- Returns:
The HWE and ExcHet calculated
- Return type:
tuple (float, float)
check_vcf_index
compress_index_vcf
create_pos_haplotype
- truvari.create_pos_haplotype(a1, a2, ref, min_len=0)[source]
Create haplotypes of two allele’s regions that are assumed to be overlapping
- Parameters:
a1 (tuple) – tuple of chrom, start, end, seq
a2 (tuple) – tuple of chrom, start, end, seq
ref (
pysam.FastaFile
) – Reference genomemin_len (int, optional) – Minimum length of the haplotype sequence to create
- Returns:
allele haplotype sequences created
- Rtupe:
tuple (string, string)
get_gt
get_scalebin
- truvari.get_scalebin(x, rmin=0, rmax=100, tmin=0, tmax=100, step=10)[source]
Scale variable x from rdomain to tdomain with step sizes return key, index
- Parameters:
x (number) – Number to scale
rmin (number, optional) – The minimum of the range of your measurement
rmax (number, optional) – The maximum of the range of your measurement
tmin (number, optional) – The minimum of the range of your desired target scaling
tmax (number, optional) – The maximum of the range of your measurement
step (number, optional) – The step size of bins of target range
- Returns:
The bin-string of the scaled variable
- Return type:
string
- Example
>>> import truvari >>> truvari.get_scalebin(4, 1, 5, 0, 20, 5) ('[15,20)', 3) >>> truvari.get_scalebin(6, 1, 5, 0, 20, 5) ('>=20', 4)
get_sizebin
- truvari.get_sizebin(sz)[source]
Bin a given size into
truvari.SZBINS
- Parameters:
sz (integer) – SVLEN to bin into the SZBINS
- Returns:
SZBIN
- Return type:
string
get_svtype
- truvari.get_svtype(svtype)[source]
Turn a SVTYPE string to a
truvari.SV
object- Parameters:
svtype (string)
- Returns:
A
SV
of the SVTYPE- Return type:
- Example
>>> import truvari >>> truvari.get_svtype("INS") <SV.INS: 2> >>> truvari.get_svtype("foo") <SV.UNK: 6>
msa2vcf
- truvari.msa2vcf(msa, anchor_base='N')[source]
Parse an MSA dict of {name: alignment, …} and returns its VCF entries as a string
Assumes one entry in the MSA has the name ref_${chrom}:${start}-${end} which gives VCF entries coordinates Provide anchor_base to prevent ‘N’ from being used as an anchor base Returns a string of entries
- Example (for dealing with test coverage not being seen)
>>> import truvari >>> from truvari.phab import fasta_reader >>> msa_dir = "repo_utils/test_files/external/fake_mafft/lookup/" >>> msa_file = "fm_92be9763c7114fd6ce736a7008114ad0.msa" >>> seqs = open(msa_dir + msa_file).read() >>> fasta = dict(fasta_reader(seqs)) >>> m_entries_str = truvari.msa2vcf(fasta)
overlap_percent
- truvari.overlap_percent(astart, aend, bstart, bend)[source]
Calculates the percent of range A which overlaps with range B
- Parameters:
astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position
- Returns:
overlap percent
- Return type:
float
overlaps
phab
- truvari.phab(var_regions, base_vcf, ref_fn, output_fn, bSamples=None, buffer=100, mafft_params='--auto --thread 1', comp_vcf=None, cSamples=None, prefix_comp=False, threads=1, method='mafft', passonly=True, max_size=50000)[source]
Harmonize variants with MSA. Runs on a set of regions given files to create haplotypes and writes results to a compressed/indexed VCF
- Parameters:
var_regions (
list
) – List of tuples of region’s (chrom, start, end)base_vcf (
str
) – VCF file nameref_fn (
str
) – Reference file nameoutput_fn – Output VCF.gz
bSamples (
list
) – Samples from base_vcf to create haplotypesbuffer (
int
) – Flanking reference bases to added to regionsmafft_params (
str
) – Parameters for mafftcomp_vcf (
str
) – VCF file namecSamples (
list
) – Samples from comp_vcf to create haplotypesprefix_comp (
bool
) – Ensure unique sample names by prefixing comp samplesthreads (
int
) – Number of threads to usemethod (
str
) – Alignment method to use mafft or wfa
reciprocal_overlap
- truvari.reciprocal_overlap(astart, aend, bstart, bend)[source]
Calculates reciprocal overlap of two ranges
- Parameters:
astart (int) – First range’s start position
aend (int) – First range’s end position
bstart (int) – Second range’s start position
bend (int) – Second range’s end position
- Returns:
reciprocal overlap
- Return type:
float
ref_ranges
- truvari.ref_ranges(reference, chunk_size=10000000)[source]
Chunk reference into pieces. Useful for multiprocessing.
- Parameters:
reference (string) – Filename of reference to chunk
chunk_size (int, optional) – Length of reference chunks
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.ref_ranges("repo_utils/test_files/references/reference.fa", 1000) >>> print(len([_ for _ in gen])) 1000
seqsim
sizesim
unroll_compare
- truvari.unroll_compare(seqA, seqB, p)[source]
Unroll two sequences and compare. See https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c for details
- Parameters:
seqA (string) – sequence upstream sequence
seqB (string) – sequence downstream sequence
p (integer) – how many bases upstream seqA is from seqB
- Returns:
sequence similarity of seqA vs seqB after unrolling
- Return type:
float
vcf_ranges
- truvari.vcf_ranges(vcf, min_dist=1000)[source]
Chunk vcf into discrete pieces. Useful for multiprocessing.
- Parameters:
vcf (string) – Filename of vcf to find ranges
min_dist (int, optional) – Minimum distance between entries for new range
- Returns:
generator of tuples (ref_name, start, stop)
- Return type:
iterator
- Example
>>> import truvari >>> gen = truvari.vcf_ranges("repo_utils/test_files/variants/input1.vcf.gz") >>> print(len([_ for _ in gen])) 228
Dev methods
benchdir_count_entries
chunker
cmd_exe
- truvari.cmd_exe(cmd, stdin=None, timeout=-1, cap_stderr=True, pipefail=False)[source]
Executes a command through the shell and captures the output while enabling process handling with timeouts and keyboard interrupts.
- Parameters:
cmd (string) – The command to be executed.
stdin (bytes) – stdinput to pipe to command
timeout (int) – Timeout for the command in minutes. So 1440 means 24 hours. -1 means never
cap_stderr (boolean) – If True, capture the stderr and return it as part of the returned cmd_results. Otherwise, stderr will be streamed through to the terminal
pipefail (boolean) – Set to True if the cmd contains pipes |
- Returns:
- namedtuple ofret_code - integer, exit code of the commandstdout - string, captured standard output of the commandstderr - binary string, captured standard error of the commandrun_time - datetime.timedelta, execution time
- Return type:
namedtuple (ret_code, stdout, stderr, run_time)
- Example
>>> import truvari >>> ret = truvari.cmd_exe("echo 'hello world'") >>> ret.ret_code 0 >>> ret.stdout 'hello world\n' >>> import truvari >>> ret = truvari.cmd_exe("sleep 5", pipefail=True, timeout=2/60) # Error logged is caught >>> ret.ret_code 214
consolidate_phab_vcfs
coords_within
- truvari.coords_within(qstart, qend, rstart, rend, end_within)[source]
Returns if a span is within the provided [start, end). All coordinates assumed 0-based
- Parameters:
qstart (integer) – query start position
qend (integer) – query end position
start (integer) – start of span
end (integer) – end of span
end_within (bool) – if true, qend <= rend, else qend < rend
- Returns:
If the coordinates are within the span
- Return type:
bool
count_entries
file_zipper
- truvari.file_zipper(*start_files)[source]
Zip files to yield the entries in order. Each file must be sorted in the same order. start_files is a tuple of (‘key’, iterable) where key is the identifier (so we know which file the yielded entry came from) and iterable is usually a pysam.VariantFile
yields key, pysam.VariantRecord
help_unknown_cmd
- truvari.help_unknown_cmd(user_cmd, avail_cmds, threshold=0.8)[source]
Guess the command in avail_cmds that’s most similar to user_cmd. If there is no guess below threshold, don’t guess.
- Parameters:
user_cmd (string) – user command
avail_cmds (list of strings) – available commands
threshold (float, optional) – max score (0-1) to give a guess
- Returns:
best guess from :param: avail_cmds or None
- Return type:
string
- Example
>>> import truvari >>> truvari.help_unknown_cmd('banno', ["bench", "anno"]) 'anno' >>> truvari.help_unknown_cmd('random', ["bench", "anno"])
make_temp_filename
opt_gz_open
optimize_df_memory
performance_metrics
- truvari.performance_metrics(tpbase, tp, fn, fp)[source]
Calculates precision, recall, and f1 given counts by state Can return None if values are zero
- Parameters:
tpbase (int) – found base count
tp (int) – found comp count
fn (int) – missed base count
fp (int) – missed comp count
- Returns:
tuple of precision, recall, f1
- Return type:
tuple, float
region_filter
restricted_float
- truvari.restricted_float(x)[source]
Restrict float to range (0,1). Raises argparse.ArgumentTypeError if float is out of range Used with
argparse.ArgumentParser.add_argument
type parameter- Parameters:
x (float) – number to check
- Returns:
input number
- Return type:
float
- Example
>>> import truvari >>> truvari.restricted_float(0.5) 0.5 >>> truvari.restricted_float(2) Traceback (most recent call last): argparse.ArgumentTypeError: 2.0 not in range (0, 1)
restricted_int
- truvari.restricted_int(x)[source]
Restrict int to positive. Raises argparse.ArgumentTypeError if int is negative Used with
argparse.ArgumentParser.add_argument
type parameter- Parameters:
x (int) – number to check
- Returns:
input number
- Return type:
float
- Example
>>> import truvari >>> truvari.restricted_int(5) 5 >>> truvari.restricted_int(-2) Traceback (most recent call last): argparse.ArgumentTypeError: -2 is < 0
setup_logging
- truvari.setup_logging(debug=False, stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, log_format='%(asctime)s [%(levelname)s] %(message)s', show_version=False)[source]
Create default logger
- Parameters:
debug (boolean, optional) – Set log-level to logging.DEBUG
stream (file handler, optional) – Where log is written
log_format (string, optional) – Format of log lines
show_version (bool, optional) – Start log with truvari version and command line args
vcf_to_df
- truvari.vcf_to_df(fn, with_info=True, with_format=True, sample=None, no_prefix=False, alleles=False)[source]
Parse a vcf file and turn it into a dataframe. Tries its best to pull info/format tags from the sample information. For Formats with Number=G, append _ref, _het, _hom. For things with Number=R, append _ref, _alt. Specify which sample with its name or index in the VCF.
- Parameters:
fn (string) – File name of VCF to open and turn into a DataFrame
with_info (boolean, optional) – Add the INFO fields from the VCF to the DataFrame columns
with_format – Add the FORMAT fields from the VCF to the DataFrame columns
sample (int/string, optional) – Sample from the VCF to parse. Only used when with_format==True
no_prefix (boolean, optional) – Don’t prefix FORMAT fields with sample name
alleles (boolean, optional) – Add column with allele sequences
- Returns:
Converted VCF
- Return type:
pandas.DataFrame
- Example
>>> import truvari >>> df = truvari.vcf_to_df("repo_utils/test_files/variants/input2.vcf.gz", True, True) >>> df.columns Index(['chrom', 'start', 'end', 'id', 'svtype', 'svlen', 'szbin', 'qual', 'filter', 'is_pass', 'QNAME', 'QSTART', 'QSTRAND', 'SVTYPE', 'SVLEN', 'NA12878_GT', 'NA12878_PL_ref', 'NA12878_PL_het', 'NA12878_PL_hom', 'NA12878_AD_ref', 'NA12878_AD_alt'], dtype='object')
Objects
Bench
- class truvari.Bench(matcher=None, base_vcf=None, comp_vcf=None, outdir=None, includebed=None, extend=0, debug=False, do_logging=False, short_circuit=False)[source]
Object to perform operations of truvari bench
You can build a Bench worker with default matching parameters via:
m_bench = truvari.Bench()
If you’d like to customize the parameters, build and edit a
Matcher
and pass it to the Bench initmatcher = truvari.Matcher() matcher.params.pctseq = 0.50 m_bench = truvari.Bench(matcher)
To run on a chunk of
pysam.VariantRecord
already loaded, pass them in as lists to:matches = m_bench.compare_calls(base_variants, comp_variants)
For advanced parsing, one can build the match matrix for a chunk of calls with:
match_matrix = m_bench.build_matrix(base_variants, comp_variants)
This can then be used for things such as creating custom match pickers or adding new matching checks.
If you want to run on existing files, simply supply the arguments to init
m_bench = truvari.Bench(matcher, base_vcf, comp_vcf, outdir) output = m_bench.run()
Note that running on files must write to an output directory and is the only way to use things like ‘includebed’. However, the returned BenchOutput has attributes pointing to all the results.
- build_matrix(base_variants, comp_variants, chunk_id=0, skip_gt=False)[source]
Builds MatchResults, returns them as a numpy matrix
- check_refine_candidate(result)[source]
Adds this region as a candidate for refinement if there are unmatched variants
- compare_calls(base_variants, comp_variants, chunk_id=0)[source]
Builds MatchResults, returns them as a numpy matrix if there’s at least one base and one comp variant. Otherwise, returns a list of the variants placed in MatchResults
- run()[source]
Runs bench and returns the resulting
truvari.BenchOutput
BenchOutput
StatsBox
GT
RegionVCFIterator
LogFileStderr
MatchResult
Matcher
- class truvari.Matcher(args=None)[source]
Holds matching parameters. Allows calls to be checked for filtering and matches to be made
- Example
>>> import pysam >>> import truvari >>> mat = truvari.Matcher() >>> mat.params.pctseq = 0 >>> v = pysam.VariantFile('repo_utils/test_files/variants/input1.vcf.gz') >>> one = next(v); two = next(v) >>> mat.build_match(one, two) <truvari.bench.MatchResult (False 2.381)>
Look at Matcher.make_match_params() for a list of all params and their defaults
- build_match(base, comp, matid=None, skip_gt=False, short_circuit=False)[source]
Build a MatchResult if skip_gt, don’t do genotype comparison if short_circuit, return after first failure
SV
Data
HEADERMAT
regular expression of vcf header INFO/FORMAT fields with groups
- truvari.HEADERMAT = re.compile('##\\w+=<ID=(?P<name>\\w+),Number=(?P<num>[\\.01AGR]),Type=(?P<type>\\w+)')
Compiled regular expression object.
QUALBINS
0-100 quality score bin strings (step size 10)
- truvari.QUALBINS = ['[0,10)', '[10,20)', '[20,30)', '[30,40)', '[40,50)', '[50,60)', '[60,70)', '[70,80)', '[80,90)', '[90,100)', '>=100']
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SVTYTYPE
pandas.CategoricalDtype
of truvari.SV
SZBINMAX
integer list of maximum size for size bins
- truvari.SZBINMAX = [1, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 600, 800, 1000, 2500, 5000, 9223372036854775807]
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SZBINS
string list of size bins
- truvari.SZBINS = ['SNP', '[1,5)', '[5,10)', '[10,15)', '[15,20)', '[20,30)', '[30,40)', '[40,50)', '[50,100)', '[100,200)', '[200,300)', '[300,400)', '[400,600)', '[600,800)', '[800,1k)', '[1k,2.5k)', '[2.5k,5k)', '>=5k']
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
SZBINTYPE
pandas.CategoricalDtype
of truvari.SZBINS