We performed VNTR genotyping on high-depth (35× on average) whole-genome sequencing data from 8,222 samples using the danbing-tk tool.
We ran the “danbing-tk align” module with the options “-gc 80 -ae -kf 4 1 -cth 45 -k 21 -qs pan -fai /dev/stdin -p 4” on each genome to calculate k-mer counts for each VNTR locus. We then computed locus-specific biases (LSBs) for both VNTR and non-VNTR loci and used the SAMtools 'bedcov' module to obtain coverage data. The k-mer counts, together with the LSBs, coverage files, and BED files, were passed to the script 'kmc2length.py' provided by danbing-tk to estimate VNTR lengths for 8,225 samples. Finally, we extracted k-mers by sliding a 21-bp window over each motif to calculate the dosage of each motif.
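The sliding-window step can be illustrated with a minimal sketch. The function names `motif_kmers` and `motif_dosage` are hypothetical, and summing the counts of a motif's k-mers is only an assumed stand-in for the dosage calculation; danbing-tk's own definition may differ.

```python
K = 21  # k-mer size, matching the "-k 21" option used for alignment


def motif_kmers(motif: str, k: int = K) -> list[str]:
    """Return every k-mer obtained by sliding a k-bp window over the motif.

    A motif shorter than k yields an empty list.
    """
    return [motif[i:i + k] for i in range(len(motif) - k + 1)]


def motif_dosage(motif: str, kmer_counts: dict[str, int], k: int = K) -> int:
    """ASSUMPTION: dosage is taken here as the summed count of the motif's k-mers."""
    return sum(kmer_counts.get(kmer, 0) for kmer in motif_kmers(motif, k))
```

A 22-bp motif, for example, yields two overlapping 21-mers, and k-mers absent from the count table contribute zero.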
For VNTR locus filtering, we started with the 80,518 loci provided by danbing-tk. We first removed 40,199 loci overlapping mobile elements detected by RepeatMasker, to ensure that the remaining loci are VNTRs. We then calculated batch-r2 (for a detailed definition, please refer to danbing-tk) and performed bias correction on the VNTR lengths of the 8,225 samples. Finally, we applied estimate-rate thresholds of 80% for loci and 50% for samples. As a result, 38,685 loci and 8,222 samples were retained.
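The estimate-rate thresholds can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the estimate rate is taken to be the fraction of non-missing length estimates, the thresholds are treated as lower bounds, loci are filtered before samples, and the function names and input layout are hypothetical.

```python
from typing import Optional, Sequence


def estimate_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that carry an estimate (None marks a missing one)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def filter_by_estimate_rate(
    lengths: Sequence[Sequence[Optional[float]]],
    min_locus_rate: float = 0.8,
    min_sample_rate: float = 0.5,
) -> tuple[list[int], list[int]]:
    """lengths[i][j] is the estimated length of locus i in sample j.

    Returns the indices of the loci and samples that pass.
    ASSUMPTION: loci are filtered first, then samples over the kept loci.
    """
    n_samples = len(lengths[0]) if lengths else 0
    kept_loci = [
        i for i, row in enumerate(lengths)
        if estimate_rate(row) >= min_locus_rate
    ]
    kept_samples = [
        j for j in range(n_samples)
        if estimate_rate([lengths[i][j] for i in kept_loci]) >= min_sample_rate
    ]
    return kept_loci, kept_samples
```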
For motif filtering, we started with the 4,456,881 motifs provided by danbing-tk. We removed motifs whose MAPE (mean absolute percentage error; for a detailed definition, please refer to danbing-tk) between short-read and long-read sequencing exceeded 0.5, as well as motifs that could not be found in our dataset, leaving 1,161,890 motifs. We then removed 148,245 motifs whose dosage was invariant across HGSVC2 haplotypes (for a detailed explanation, please refer to danbing-tk) and 96,707 motifs with an estimate rate below 20% across the 8,222 genomes. As a result, 916,938 motifs were retained.
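The motif filters can be summarized in a short sketch. The `MotifStats` record and its field names are hypothetical, and the per-motif statistics are assumed to have been computed beforehand. The last lines tally the counts reported for each filtering stage.

```python
from dataclasses import dataclass


@dataclass
class MotifStats:
    """Hypothetical per-motif summary; field names are illustrative only."""
    motif: str
    mape: float           # short-read vs. long-read mean absolute percentage error
    found: bool           # motif observed in the dataset
    invariant: bool       # dosage invariant across HGSVC2 haplotypes
    estimate_rate: float  # fraction of genomes with a dosage estimate


def keep_motif(m: MotifStats, max_mape: float = 0.5, min_rate: float = 0.2) -> bool:
    """Apply the three filters described above to one motif."""
    return (
        m.found
        and m.mape <= max_mape
        and not m.invariant
        and m.estimate_rate >= min_rate
    )


# Tally of the reported counts: motifs left after the MAPE/presence filter,
# minus the invariant motifs, minus the low-estimate-rate motifs.
remaining = 1_161_890 - 148_245 - 96_707
assert remaining == 916_938
```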
The results from danbing-tk are diploid. For clarity, VNTR lengths were rounded to the nearest integer, and the distinct lengths were classified as VNTR length polymorphisms (VNTR-LPs). Motif dosages were converted into copy numbers by dividing the dosage by the motif length, and the distinct copy numbers were termed VNTR motif polymorphisms (VNTR-MPs). In other words, for a given VNTR locus, each unique length identified across samples was considered a VNTR length polymorphism. For a given motif sequence, each unique copy number of the motif identified across samples was considered a VNTR motif polymorphism.
In this example, five motif polymorphisms were identified for the given VNTR motif across ten samples, with each color representing a unique motif polymorphism (left). Similarly, seven length polymorphisms were identified for the given VNTR locus across ten samples, with each color representing a unique length polymorphism, so this locus was counted as having seven VNTR length polymorphisms (right).
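The two counting rules can also be written as a minimal sketch. The function names are hypothetical. Rounding copy numbers to the nearest integer is an assumption made here for illustration (the text states rounding only for lengths), as is rounding halves upward.

```python
import math
from typing import Iterable, Optional


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 rounded up.

    ASSUMPTION: the tie-breaking rule is not specified in the text.
    """
    return math.floor(x + 0.5)


def count_length_polymorphisms(lengths: Iterable[Optional[float]]) -> int:
    """Number of distinct rounded lengths observed at one locus."""
    return len({round_half_up(x) for x in lengths if x is not None})


def count_motif_polymorphisms(
    dosages: Iterable[Optional[float]], motif_length: int
) -> int:
    """Number of distinct copy numbers (dosage / motif length) for one motif.

    ASSUMPTION: copy numbers are rounded before counting distinct values.
    """
    return len(
        {round_half_up(d / motif_length) for d in dosages if d is not None}
    )
```

With lengths of 100.2, 100.4, 101.6, 99.7, and 150.0 across five samples, the rounded values are 100, 100, 102, 100, and 150, so the locus would carry three length polymorphisms.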
Note that a VNTR locus may contain several motifs. In our study, after rigorous quality control, 38,685 VNTR loci and 916,938 motifs remained, and we found 2.5 million LPs and 11 million MPs in total.
Here we provide the VNTR genotyping tool we used, danbing-tk, which was downloaded from GitHub (ChaissonLab/danbing-tk) in March 2023. danbing-tk is licensed under the BSD 3-Clause License; we provide the version we used (danbing-tk-1.3.1) for download, which includes the original motif reference file TR_locus.4456881_motif.tsv.
We also provide the 916,938 motifs in our dataset that passed the quality control process described above. In this file, the first column gives the position in the reference genome of the VNTR locus where the motif is located, and the second column contains the motif sequence. This file can be used to skip the motif quality control step in research similar to ours.
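A minimal sketch for reading such a two-column file and grouping motifs by locus is shown below. It assumes the file is tab-separated with no header row and keeps the locus position as an opaque string. The function name `load_motifs` is hypothetical, and the file name in the usage note is a placeholder.

```python
import csv
from collections import defaultdict
from typing import Iterable


def load_motifs(lines: Iterable[str]) -> dict[str, list[str]]:
    """Group motif sequences by VNTR locus position.

    ASSUMPTIONS: tab-separated input, no header row, column 1 is the locus
    position and column 2 is the motif sequence.
    """
    motifs_by_locus: dict[str, list[str]] = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        locus, motif = row[0], row[1]
        motifs_by_locus[locus].append(motif)
    return dict(motifs_by_locus)
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `with open("filtered_motifs.tsv", newline="") as fh: motifs = load_motifs(fh)`, where `filtered_motifs.tsv` stands in for the actual file name.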
© 2024. Center for Big Data Research in Health, Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences