Last updated: 2022-06-02
Checks: 6 1
Knit directory: EMBRAPAImputation2022/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has staged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220303)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 95ddcaf. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/DArT2018/
Ignored: data/DArT2020/
Ignored: data/DArT2022/
Ignored: data/GBS/
Ignored: output/DArT2022/
Untracked files:
Untracked: DArTCommonMkrs.log
Untracked: MkrsRefandStudyPop.txt
Untracked: analysis/CheckImp.Rmd
Untracked: analysis/Duplicates.Rmd
Untracked: analysis/ImputationEMBRAPA_DCas22_6902.Rmd
Untracked: analysis/PrepareGenData.Rmd
Untracked: code/.DS_Store
Untracked: code/plink/
Untracked: data/AllDArTDuplicates.txt
Untracked: data/CommonMkrsGBSDArT
Untracked: data/CommonMkrsGBSDArT.txt
Untracked: data/CommonMkrsGBSDArTIssues
Untracked: data/DArTClones.csv
Untracked: data/DArTDupDiscordanceIndex.csv
Untracked: data/DArTDuplicates1.txt
Untracked: data/DArTDuplicates2.txt
Untracked: data/DArTDuplicates3.txt
Untracked: data/DArTDuplicates4.txt
Untracked: data/DArTGPInfo.csv
Untracked: data/DArTGPInfo.xlsx
Untracked: data/DArTGPInfo2.csv
Untracked: data/DArTGPInfo2.xlsx
Untracked: data/DArTGenotypingPlates/
Untracked: out.log
Untracked: output/AllChrDCas22_6902_StudyPopimputed.vcf.gz
Untracked: output/AllChrDCas22_6902_StudyPopimputed.vcf.gz.tbi
Untracked: output/AllChrGBSandDArTsitesCommonClones_RefPopImputed.vcf.gz
Untracked: output/AllChrGBSandDArTsitesCommonClones_RefPopImputed.vcf.gz.tbi
Untracked: output/BRTP_Phenotyping2022.txt
Untracked: output/DCas22_6902/
Untracked: output/DCas22_6902RefPopImputed.vcf.gz
Untracked: output/Dados GBS Atualizados.RData
Untracked: output/DadosGBSAtualizados.rds
Untracked: output/DadosGBSAtualizadosRenomeados.rds
Untracked: output/Duplicates/
Untracked: output/Figures/
Untracked: output/GBSDArTPCA.rds
Untracked: output/GmatrixDArTGS.rds
Untracked: output/GmatrixGBSandDArTGS.rds
Untracked: output/RefPop/
Untracked: output/StudyPop/
Unstaged changes:
Modified: .DS_Store
Modified: analysis/DArTImp.Rmd
Modified: analysis/GBS_DArTImp.Rmd
Modified: analysis/_site.yml
Modified: analysis/index.Rmd
Modified: data/.DS_Store
Modified: output/.DS_Store
Modified: output/out.log
Staged changes:
Modified: .DS_Store
New: analysis/DArTImp.Rmd
New: analysis/GBS_DArTImp.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish()
to start tracking its development.
library(here)
system(paste0("bcftools concat ",
"--output ",
"tassel_embrapa_newfastq.bial.recode.vcf.gz ",
"--output-type z --threads 18 ",
paste0("tassel_embrapa_newfastq.chr",1:18,
".bial.recode.vcf.gz",
collapse = " ")))
GBS data has 3,456 genotyped clones for 400,122 SNP markers
Chromosome | N˚Markers | Chromosome | N˚Markers | Chromosome | N˚Markers |
---|---|---|---|---|---|
Chr 1 | 36,510 | Chr 7 | 15,271 | Chr 13 | 18,743 |
Chr 2 | 28,503 | Chr 8 | 21,154 | Chr 14 | 24,541 |
Chr 3 | 25,488 | Chr 9 | 21,216 | Chr 15 | 23,530 |
Chr 4 | 22,497 | Chr 10 | 19,127 | Chr 16 | 18,031 |
Chr 5 | 24,234 | Chr 11 | 22,587 | Chr 17 | 18,385 |
Chr 6 | 23,841 | Chr 12 | 18,705 | Chr 18 | 17,759 |
library(here); library(tidyverse)
here() starts at /Users/lbd54/Documents/GitHub/EMBRAPAImputation2022
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
GBSVCFFile <- read.table(file = here::here("data", "GBS", "tassel_embrapa_newfastq.bial.recode.vcf"),
comment.char = "", header = T, skip = 31)
GBSMkrs <- GBSVCFFile %>% select(ID, REF, ALT, X.CHROM, POS) %>%
mutate(IDPol = paste0(ID,"_", REF, "/", ALT))
write.table(x = GBSMkrs, file = here::here("output", "DArT2022", "GBSMkrs.txt"),
quote = F, row.names = F)
GBSMkrs <- read.table(file = here::here("output", "DArT2022", "GBSMkrs.txt"), header = T)
DArTMkrs <- read.table(file = here::here("output", "DArT2022", "DArTMkrs.txt"), header = T)
CommonMkrs <- list(DArT = DArTMkrs$IDPol,
GBS = GBSMkrs$IDPol)
CMkrs <- inner_join(DArTMkrs, GBSMkrs, by = c("IDPol"), suffix = c("DArT", "GBS"))
CMkrs2 <- CMkrs %>% dplyr::select(IDDArT, REFDArT, ALTDArT, X.CHROMDArT, POS) %>%
rename(ID = IDDArT,
REF = REFDArT,
ALT = ALTDArT,
Chr = X.CHROMDArT,
Pos = POS)
write.table(CMkrs2, file = here::here("data", "CommonMkrsGBSDArT.txt"), row.names = F,
quote = F, sep = "\t")
write.table(CMkrs2 %>% dplyr::select(Chr, Pos), file = here::here("data", "CommonMkrsGBSDArT"),
row.names = F, col.names = F, quote = F, sep = "\t")
CMkrsIssues <- inner_join(DArTMkrs, GBSMkrs, by = c("ID"),
suffix = c("DArT", "GBS")) %>%
filter(IDPolDArT != IDPolGBS)
write.table(CMkrsIssues %>% dplyr::select(X.CHROMGBS, POS),
file = here::here("data", "CommonMkrsGBSDArTIssues"),
row.names = F, col.names = F, quote = F, sep = "\t")
Loading required package: grid
Chr | N˚SNP only DArT | N˚ DArT + GBS | N˚SNP GBS |
---|---|---|---|
1 | 881 | 183 | 36,327 |
2 | 593 | 103 | 28,400 |
3 | 581 | 101 | 25,387 |
4 | 588 | 94 | 22,403 |
5 | 533 | 101 | 24,133 |
6 | 552 | 104 | 23,737 |
7 | 366 | 62 | 15,209 |
8 | 438 | 94 | 21,060 |
9 | 431 | 89 | 21,127 |
10 | 577 | 87 | 19,040 |
11 | 507 | 84 | 22,503 |
12 | 381 | 87 | 18,618 |
13 | 434 | 74 | 18,669 |
14 | 592 | 115 | 24,426 |
15 | 435 | 81 | 23,449 |
16 | 360 | 77 | 17,954 |
17 | 464 | 79 | 18,306 |
18 | 408 | 72 | 17,687 |
Total | 10,628 | 1,560 | 400,122 |
vcftools --gzvcf output/DArT2022/DCas22_6902_DArTseqLD_AllSites_AllChrom_raw.vcf.gz --positions data/CommonMkrsGBSDArT --recode --stdout | bgzip -c -@ 3 > output/DCas22_6902/DArTCommonMkrs.vcf.gz
vcftools --gzvcf data/GBS/tassel_embrapa_newfastq.bial.recode.vcf.gz --positions data/CommonMkrsGBSDArT --recode --stdout | bgzip -c -@ 3 > output/DCas22_6902/GBSCommonMkrs.vcf.gz
GBSClones <- read.table(file = here::here("data", "GBS", "EM_matched_accessions-GBS.csv"),
sep = ",", header = T)
## Duplicates in GBS genotyping platform
dup <- GBSClones[GBSClones$uniquename_in_db %in%
GBSClones$uniquename_in_db[duplicated(GBSClones$uniquename_in_db)],] %>%
.[order(.[,2]),]
dup1 <- GBSClones[duplicated(GBSClones$uniquename_in_db),] %>% .[order(.[,2]),]
dup2 <- dup[!dup$x %in% dup1$x,] %>%
.[order(.[,2]),] %>% mutate(FILE = "2")
dup3 <- dup1[duplicated(dup1$uniquename_in_db),] %>%
.[order(.[,2]),] %>% mutate(FILE = "3")
dup1 <- dup1[!dup1$x %in% dup3$x,] %>% mutate(FILE = "1")
write.table(x = rbind(dup1, dup2, dup3), quote = F, row.names = F, sep = "\t",
file = here::here("data", "GBS", "AllGBSDuplicates.txt"))
write.table(x = dup1$x, file = here::here("data", "GBS", "GBSDuplicates1.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = dup2$x, file = here::here("data", "GBS", "GBSDuplicates2.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = dup3$x, file = here::here("data", "GBS", "GBSDuplicates3.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
### Marnin had an excellent idea of sums the counts for each allele for the duplicates individuals
## Duplicates in DArT genotyping platform
DArTClones <- read.table(file = here::here("data", "DArTGPInfo2.csv"),
sep = ",", header = T)
Dup <- DArTClones[DArTClones$Observation_unit_name %in%
DArTClones$Observation_unit_name[duplicated(DArTClones$Observation_unit_name)],]
Dup1 <- DArTClones %>% filter(duplicated(DArTClones$Observation_unit_name)) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>%
.[order(.[,"Observation_unit_name"]),] %>% mutate(FILE = "1")
Dup2 <- Dup %>% filter(!Dup$exported_tissue_sample_name %in%
Dup1$exported_tissue_sample_name) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>% mutate(FILE = "2")
Dup3 <- Dup %>% filter(!Dup$exported_tissue_sample_name %in%
(rbind(Dup1, Dup2) %>%
.[,"exported_tissue_sample_name"])) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>% mutate(FILE = "3")
Dup4 <- Dup %>% filter(!Dup$exported_tissue_sample_name %in%
(rbind(Dup1, Dup2, Dup3) %>%
.[,"exported_tissue_sample_name"])) %>%
mutate(FILE = "4")
write.table(x = rbind(Dup1, Dup2, Dup3, Dup4), quote = F, row.names = F, sep = "\t",
file = here::here("data", "AllDArTDuplicates.txt"))
write.table(x = Dup1$DarTMatch, file = here::here("data", "DArTDuplicates1.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = Dup2$DarTMatch, file = here::here("data", "DArTDuplicates2.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = Dup3$DarTMatch, file = here::here("data", "DArTDuplicates3.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = Dup4$DarTMatch, file = here::here("data", "DArTDuplicates4.txt"),
sep = "\t", row.names = F, col.names = F, quote = F)
Check the duplicates in each genotyping platform link
CommonClones <- GBSClones$uniquename_in_db[GBSClones$uniquename_in_db %in%
DArTClones$Observation_unit_name] %>%
unique
GBSComClones <- GBSClones[GBSClones$uniquename_in_db %in% CommonClones,] %>%
arrange(uniquename_in_db)
DArTComClones <- DArTClones[DArTClones$Observation_unit_name %in% CommonClones,] %>%
arrange(Observation_unit_name)
GBSComClones1 <- GBSComClones %>% filter(!duplicated(GBSComClones$uniquename_in_db)) %>%
.[order(.[,"uniquename_in_db"]),] %>% mutate(FILE = "G1")
GBSComClones2 <- GBSComClones %>% filter(!GBSComClones$x %in% GBSComClones1$x) %>%
filter(!duplicated(.[,"uniquename_in_db"])) %>% .[order(.[,"uniquename_in_db"]),] %>%
mutate(FILE = "G2")
GBSComClones3 <- GBSComClones %>% filter(!GBSComClones$x %in%
(rbind(GBSComClones1, GBSComClones2) %>%
.[,"x"])) %>%
filter(!duplicated(.[,"uniquename_in_db"])) %>% .[order(.[,"uniquename_in_db"]),] %>%
mutate(FILE = "G3")
write.table(rbind(GBSComClones1, GBSComClones2, GBSComClones3),
file = here::here("output", "DCas22_6902", "AllGBSComClones.txt"),
quote = F, sep = "\t")
write.table(GBSComClones1$x, file = here::here("output", "DCas22_6902", "GBSComClones1"),
row.names = F, col.names = F, quote = F, sep = "\t")
write.table(GBSComClones2$x, file = here::here("output", "DCas22_6902", "GBSComClones2"),
row.names = F, col.names = F, quote = F, sep = "\t")
write.table(GBSComClones3$x, file = here::here("output", "DCas22_6902", "GBSComClones3"),
row.names = F, col.names = F, quote = F, sep = "\t")
DArTComClones1 <- DArTComClones %>% filter(!duplicated(DArTComClones$Observation_unit_name)) %>%
.[order(.[,"Observation_unit_name"]),] %>% mutate(FILE = "D1")
DArTComClones2 <- DArTComClones %>% filter(!DArTComClones$DarTMatch %in% DArTComClones1$DarTMatch) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>% .[order(.[,"Observation_unit_name"]),] %>%
mutate(FILE = "D2")
DArTComClones3 <- DArTComClones %>% filter(!DArTComClones$DarTMatch %in%
(rbind(DArTComClones1, DArTComClones2) %>%
.[,"DarTMatch"])) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>% .[order(.[,"Observation_unit_name"]),] %>%
mutate(FILE = "D3")
DArTComClones4 <- DArTComClones %>% filter(!DArTComClones$DarTMatch %in%
(rbind(DArTComClones1, DArTComClones2, DArTComClones3) %>%
.[,"DarTMatch"])) %>%
filter(!duplicated(.[,"Observation_unit_name"])) %>% .[order(.[,"Observation_unit_name"]),] %>%
mutate(FILE = "D4")
write.table(rbind(DArTComClones1, DArTComClones2, DArTComClones3, DArTComClones4),
file = here::here("output", "DCas22_6902", "AllDArTComClones.txt"),
quote = F, sep = "\t")
write.table(DArTComClones1$DarTMatch, file = here::here("output", "DCas22_6902", "DArTComClones1"),
row.names = F, col.names = F, quote = F, sep = "\t")
write.table(DArTComClones2$DarTMatch, file = here::here("output", "DCas22_6902", "DArTComClones2"),
row.names = F, col.names = F, quote = F, sep = "\t")
write.table(DArTComClones3$DarTMatch, file = here::here("output", "DCas22_6902", "DArTComClones3"),
row.names = F, col.names = F, quote = F, sep = "\t")
write.table(DArTComClones4$DarTMatch, file = here::here("output", "DCas22_6902", "DArTComClones4"),
row.names = F, col.names = F, quote = F, sep = "\t")
### GBS Selecting files
system(paste0("vcftools --gzvcf output/DCas22_6902/GBSCommonMkrs.vcf.gz --keep output/DCas22_6902/GBSComClones", 1:3, " --recode --stdout > output/DCas22_6902/GBSComClones", 1:3, ".vcf"))
### DArT Selecting files
system(paste0("vcftools --gzvcf output/DCas22_6902/DArTCommonMkrs.vcf.gz --keep output/DCas22_6902/DArTComClones", 1:4, " --recode --stdout > output/DCas22_6902/DArTComClones", 1:4, ".vcf"))
### GBS
GBSC1 <- read.table(file = here::here("output", "DCas22_6902", "GBSComClones1.vcf"),
skip = 31, header = T, comment.char = "", check.names = F)
GBSC2 <- read.table(file = here::here("output", "DCas22_6902", "GBSComClones2.vcf"),
skip = 31, header = T, comment.char = "", check.names = F)
GBSC3 <- read.table(file = here::here("output", "DCas22_6902", "GBSComClones3.vcf"),
skip = 31, header = T, comment.char = "", check.names = F)
GBSClones <- read.table(file = here::here("data", "GBS", "EM_matched_accessions-GBS.csv"),
sep = ",", header = T)
colnames(GBSC1)[10:ncol(GBSC1)] <- GBSClones$uniquename_in_db[match(colnames(GBSC1)[10:ncol(GBSC1)], GBSClones$x)]
colnames(GBSC2)[10:ncol(GBSC2)] <- GBSClones$uniquename_in_db[match(colnames(GBSC2)[10:ncol(GBSC2)], GBSClones$x)]
colnames(GBSC3)[10:ncol(GBSC3)] <- GBSClones$uniquename_in_db[match(colnames(GBSC3)[10:ncol(GBSC3)], GBSClones$x)]
header <- c("##fileformat=VCFv4.0",
"##FILTER=<ID=PASS,Description=\"All filters passed\">",
"##Tassel=<ID=GenotypeTable,Version=5,Description=\"Reference allele is not known. The major allele was used as reference allele\">",
"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
"##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths for the reference and alternate alleles in the order listed\">",
"##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth (only filtered reads used for calling)\">",
"##FORMAT=<ID=GQ,Number=1,Type=Float,Description=\"Genotype Quality\">",
"##FORMAT=<ID=PL,Number=.,Type=Float,Description=\"Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic\">",
"##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">",
"##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">",
"##INFO=<ID=AF,Number=.,Type=Float,Description=\"Allele Frequency\">",
"##contig=<ID=1>",
"##contig=<ID=2>",
"##contig=<ID=3>",
"##contig=<ID=4>",
"##contig=<ID=5>",
"##contig=<ID=6>",
"##contig=<ID=7>",
"##contig=<ID=8>",
"##contig=<ID=9>",
"##contig=<ID=10>",
"##contig=<ID=11>",
"##contig=<ID=12>",
"##contig=<ID=13>",
"##contig=<ID=14>",
"##contig=<ID=15>",
"##contig=<ID=16>",
"##contig=<ID=17>",
"##contig=<ID=18>",
"##bcftools_concatVersion=1.14+htslib-1.14",
"##bcftools_concatCommand=concat --output tassel_embrapa_newfastq.bial.recode.vcf")
write_lines(header, file = here::here("output", "DCas22_6902", "GBSComClones1renamed.vcf"),
append = F)
write_lines(header, file = here::here("output", "DCas22_6902", "GBSComClones2renamed.vcf"),
append = F)
write_lines(header, file = here::here("output", "DCas22_6902", "GBSComClones3renamed.vcf"),
append = F)
write.table(x = GBSC1, file = here::here("output", "DCas22_6902", "GBSComClones1renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
write.table(x = GBSC2, file = here::here("output", "DCas22_6902", "GBSComClones2renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
write.table(x = GBSC3, file = here::here("output", "DCas22_6902", "GBSComClones3renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
### DArT
DArTC1 <- read.table(file = here::here("output", "DCas22_6902", "DArTComClones1.vcf"),
header = T, comment.char = "", check.names = F)
DArTC2 <- read.table(file = here::here("output", "DCas22_6902", "DArTComClones2.vcf"),
header = T, comment.char = "", check.names = F)
DArTC3 <- read.table(file = here::here("output", "DCas22_6902", "DArTComClones3.vcf"),
header = T, comment.char = "", check.names = F)
DArTC4 <- read.table(file = here::here("output", "DCas22_6902", "DArTComClones4.vcf"),
header = T, comment.char = "", check.names = F)
DArTClones <- read.table(file = here::here("data", "DArTGPInfo2.csv"),
sep = ",", header = T)
colnames(DArTC1)[10:ncol(DArTC1)] <- DArTClones$Observation_unit_name[match(colnames(DArTC1)[10:ncol(DArTC1)], DArTClones$DarTMatch)]
colnames(DArTC2)[10:ncol(DArTC2)] <- DArTClones$Observation_unit_name[match(colnames(DArTC2)[10:ncol(DArTC2)], DArTClones$DarTMatch)]
colnames(DArTC3)[10:ncol(DArTC3)] <- DArTClones$Observation_unit_name[match(colnames(DArTC3)[10:ncol(DArTC3)], DArTClones$DarTMatch)]
colnames(DArTC4)[10:ncol(DArTC4)] <- DArTClones$Observation_unit_name[match(colnames(DArTC4)[10:ncol(DArTC4)], DArTClones$DarTMatch)]
header <- c("##fileformat=VCFv4.0",
"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
"##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths for the reference and alternate alleles in the order listed\">",
"##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth (only filtered reads used for calling)\">",
"##FORMAT=<ID=PL,Number=3,Type=Float,Description=\"Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic\">")
write_lines(header, file = here::here("output", "DCas22_6902", "DArTComClones1renamed.vcf"),
append = F)
write_lines(header, file = here::here("output", "DCas22_6902", "DArTComClones2renamed.vcf"),
append = F)
write_lines(header, file = here::here("output", "DCas22_6902", "DArTComClones3renamed.vcf"),
append = F)
write_lines(header, file = here::here("output", "DCas22_6902", "DArTComClones4renamed.vcf"),
append = F)
write.table(x = DArTC1, file = here::here("output", "DCas22_6902", "DArTComClones1renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
write.table(x = DArTC2, file = here::here("output", "DCas22_6902", "DArTComClones2renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
write.table(x = DArTC3, file = here::here("output", "DCas22_6902", "DArTComClones3renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
write.table(x = DArTC4, file = here::here("output", "DCas22_6902", "DArTComClones4renamed.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
source(paste0("bgzip -c -@ 6 output/DCas22_6902/GBSComClones", 1:3, "renamed.vcf > output/DCas22_6902/GBSComClones", 1:3, "renamed.vcf.gz"))
source(paste0("bgzip -c -@ 6 output/DCas22_6902/DArTComClones", 1:4, "renamed.vcf > output/DCas22_6902/DArTComClones", 1:4, "renamed.vcf.gz"))
vcftools --gzvcf output/DCas22_6902/GBSComClones1renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones1renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT1_1
vcftools --gzvcf output/DCas22_6902/GBSComClones1renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones2renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT1_2
vcftools --gzvcf output/DCas22_6902/GBSComClones1renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones3renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT1_3
vcftools --gzvcf output/DCas22_6902/GBSComClones1renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones4renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT1_4
vcftools --gzvcf output/DCas22_6902/GBSComClones2renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones1renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT2_1
vcftools --gzvcf output/DCas22_6902/GBSComClones2renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones2renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT2_2
vcftools --gzvcf output/DCas22_6902/GBSComClones2renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones3renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT2_3
vcftools --gzvcf output/DCas22_6902/GBSComClones2renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones4renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT2_4
vcftools --gzvcf output/DCas22_6902/GBSComClones3renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones1renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT3_1
vcftools --gzvcf output/DCas22_6902/GBSComClones3renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones2renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT3_2
vcftools --gzvcf output/DCas22_6902/GBSComClones3renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones3renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT3_3
vcftools --gzvcf output/DCas22_6902/GBSComClones3renamed.vcf.gz --gzdiff \
output/DCas22_6902/DArTComClones4renamed.vcf.gz --diff-indv-discordance --out output/DCas22_6902/GBSDArT3_4
library(tidyverse); library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
GBSDArT3_4 <- read.table(file = here::here("output", "DCas22_6902", "GBSDArT1_1.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G1",
COMP2 = "D1") %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT1_2.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G1",
COMP2 = "D2")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT1_3.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G1",
COMP2 = "D3")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT1_4.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G1",
COMP2 = "D4")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT2_1.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G2",
COMP2 = "D1")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT2_2.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G2",
COMP2 = "D2")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT2_3.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G2",
COMP2 = "D3")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT2_4.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G2",
COMP2 = "D4")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT3_1.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G3",
COMP2 = "D1")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT3_2.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G3",
COMP2 = "D2")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT3_3.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G3",
COMP2 = "D3")) %>%
rbind(read.table(file = here::here("output", "DCas22_6902", "GBSDArT3_4.diff.indv"),
sep = "\t", header = T, check.names = F) %>%
mutate(COMP1 = "G3",
COMP2 = "D4")) %>%
filter(!is.na(DISCORDANCE)) %>%
mutate(N_NA = 1560 - N_COMMON_CALLED) %>%
select(INDV, COMP1, COMP2, N_NA, N_COMMON_CALLED, N_DISCORD, DISCORDANCE) %>% arrange(INDV, COMP1, COMP2)
### Check the names in each vcf file
AllGBSComCl <- read.table(file = here::here("output", "DCas22_6902", "AllGBSComClones.txt"),
header = T, sep = "\t")
AllDArTComCl <- read.table(file = here::here("output", "DCas22_6902", "AllDArTComClones.txt"),
header = T, sep = "\t")
GBSDArT3_4 %<>% left_join(AllGBSComCl, by = c("INDV" = "uniquename_in_db", "COMP1" = "FILE")) %>%
rename(INDV1 = x) %>% left_join(AllDArTComCl, by = c("INDV" = "Observation_unit_name", "COMP2" = "FILE")) %>%
rename(INDV2 = DarTMatch) %>% select(INDV, INDV1, INDV2, N_NA, N_COMMON_CALLED, N_DISCORD, DISCORDANCE)
write.table(GBSDArT3_4, file = here::here("output", "DCas22_6902", "GBSDArTAllDiscordanceIndex.csv"),
quote = F, sep = ",", dec = ".", row.names = F)
DupGBSDArT3_4Cl <- GBSDArT3_4 %>% select(INDV) %>%
.[duplicated(.[,1]),] %>% unique
UnGBSDArT3_4 <- GBSDArT3_4 %>%
.[!.[,"INDV"] %in% DupGBSDArT3_4Cl,] %>% arrange(INDV)
DupGBSDArT3_4 <- GBSDArT3_4 %>% arrange(INDV, DISCORDANCE) %>%
group_by(INDV) %>% slice_min(DISCORDANCE, n = 1) %>% filter(INDV2 != "BGM.0779.1")
DupGBSDArT3_4[duplicated(DupGBSDArT3_4$INDV),]
# A tibble: 0 × 7
# Groups: INDV [0]
# … with 7 variables: INDV <chr>, INDV1 <chr>, INDV2 <chr>, N_NA <dbl>,
# N_COMMON_CALLED <int>, N_DISCORD <int>, DISCORDANCE <dbl>
DupGBSDArT3_4[DupGBSDArT3_4$INDV %in% "BGM-0779",]
# A tibble: 1 × 7
# Groups: INDV [1]
INDV INDV1 INDV2 N_NA N_COMMON_CALLED N_DISCORD DISCORDANCE
<chr> <chr> <chr> <dbl> <int> <int> <dbl>
1 BGM-0779 CNPMF779:250370902 BGM.0… 845 715 220 0.308
length(unique(DupGBSDArT3_4$INDV2))
[1] 1036
write.table(DupGBSDArT3_4, file = here::here("output", "DCas22_6902", "GBSDArTCommonClonesSelected.csv"),
quote = F, sep = ",", dec = ".", row.names = F)
write.table(DupGBSDArT3_4$INDV1, file = here::here("output", "DCas22_6902", "GBSCommonClonesSelected.txt"),
quote = F, sep = "\t", row.names = F, col.names = F)
write.table(DupGBSDArT3_4$INDV2, file = here::here("output", "DCas22_6902", "DArTCommonClonesSelected.txt"),
quote = F, sep = "\t", row.names = F, col.names = F)
VCFFiles <- c("GBSCommonMkrs.vcf.gz",
"DArTCommonMkrs.vcf.gz")
IndFiles <- c("GBSCommonClonesSelected.txt",
"DArTCommonClonesSelected.txt")
OutFiles <- c("GBSCommonMkrsClones.vcf.gz",
"DArTCommonMkrsClones.vcf.gz")
system(paste0("vcftools --gzvcf output/DCas22_6902/", VCFFiles, " --keep output/DCas22_6902/",
IndFiles, " --recode --stdout | bgzip -c -@ 4 > output/DCas22_6902/",
OutFiles))
system("gzcat output/DCas22_6902/GBSCommonMkrsClones.vcf.gz | head -n31 > output/DCas22_6902/HeaderGBS.txt")
system("bgzip -d output/DCas22_6902/DArTCommonMkrsClones.vcf.gz")
header <- c("##fileformat=VCFv4.0",
"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
"##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths for the reference and alternate alleles in the order listed\">",
"##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth (only filtered reads used for calling)\">",
"##FORMAT=<ID=PL,Number=3,Type=Float,Description=\"Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic\">")
DArTVCFFileSel <- read.table(file = here::here("output", "DCas22_6902", "DArTCommonMkrsClones.vcf"),
sep = "\t", header = T, comment.char = "", check.names = F)
write_lines(header, file = here::here("output", "DCas22_6902", "DArTCommonMkrsClones2.vcf"),
append = F)
write.table(x = DArTVCFFileSel, file = here::here("output", "DCas22_6902", "DArTCommonMkrsClones2.vcf"),
quote = F, row.names = F, append = T, col.names = T, sep = "\t")
system(paste0("bgzip -c -@ 4 output/DCas22_6902/DArTCommonMkrsClones2.vcf > ",
"output/DCas22_6902/DArTCommonMkrsClones2.vcf.gz"))
system("tabix -f -p vcf output/DCas22_6902/GBSCommonMkrsClones.vcf.gz")
system("tabix -f -p vcf output/DCas22_6902/DArTCommonMkrsClones2.vcf.gz")
system(paste0("bcftools merge",
" --output output/DCas22_6902/AllSamplesGBSDArTCommonSites.vcf.gz",
" --merge snps --output-type z --threads 4",
" output/DCas22_6902/GBSCommonMkrsClones.vcf.gz",
" output/DCas22_6902/DArTCommonMkrsClones2.vcf.gz"))
system(paste0("vcftools --gzvcf output/DCas22_6902/AllSamplesGBSDArTCommonSites.vcf.gz ",
"--min-alleles 2 --max-alleles 2 --minDP 4, --maxDP 50 ",
"--recode --stdout | bgzip -c -@ 4 > ",
"output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf.gz"))
system(paste0("bgzip -d -@ 4 output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf.gz > ",
"output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf"))
system(paste0("code/plink/plink",
" --vcf output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf ",
"--make-bed --const-fid ",
"--out output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf"))
system(paste0("code/plink/plink --bfile output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf ",
"--genome --out output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf;"))
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
genome<-fread("output/DCas22_6902/AllSamplesGBSDArTCommonSitesFiltered.vcf.genome",
stringsAsFactors = F, header = T, check.names = F) %>%
as_tibble
dataset_samples<-read.table(file = here::here("output", "DCas22_6902", "GBSDArTCommonClonesSelected.csv"),
header = T, stringsAsFactors = F, check.names = F, sep = ",")
genome2 <- genome %>% filter(IID1 %in% dataset_samples$INDV1, IID2 %in% dataset_samples$INDV2) %>%
inner_join(dataset_samples, by = c("IID1" = "INDV1", "IID2" = "INDV2")) %>%
select(INDV, everything())
write.table(genome2, file = here::here("output", "DCas22_6902", "IDBPlinkCommonClones.csv"),
sep = ",", row.names = F, quote = F)
IDBMatches <- genome2 %>% filter(DST >= 0.9 | PI_HAT >= 0.65) %>%
select(IID1, IID2, DST, PI_HAT) %>% rename(NamesGBS = IID1,
NamesDArT = IID2) %>%
filter(!is.na(DST))
write.table(IDBMatches, file = here::here("output", "DCas22_6902", "CommonClonesVerifiedIBD.csv"),
sep = ",", quote = F, row.names = F)
write.table(IDBMatches$NamesGBS, file = here::here("output", "DCas22_6902", "GBSClonesVerifiedIDB.txt"),
sep = "\t", quote = F, row.names = F, col.names = F)
write.table(IDBMatches$NamesDArT, file = here::here("output", "DCas22_6902", "DArTClonesVerifiedIDB.txt"),
sep = "\t", quote = F, row.names = F, col.names = F)
GBSMkrs <- read.table(file = here::here("output", "DArT2022", "GBSMkrs.txt"),
header = T)
DArTMkrs <- read.table(file = here::here("output", "DArT2022", "DArTMkrs.txt"), header = T)
gbs_unique_sites <- GBSMkrs %>% # GBS sites
rename(CHROM = X.CHROM) %>%
anti_join(DArTMkrs, by = c("ID")) %>% # that are unique in GBS
select(CHROM, POS) %>% arrange(CHROM, POS); nrow(gbs_unique_sites) # 398,435 Mkrs GBS only
write.table(gbs_unique_sites,
file = here::here("output", "DCas22_6902", "gbs_unique_sites.txt"),
row.names = F, col.names = F, quote = F)
Prepare a VCF file of the GBS with the common clones with DArT
library(tidyverse); library(magrittr); require(furrr)
system(paste0("vcftools --gzvcf data/GBS/tassel_embrapa_newfastq.bial.recode.vcf.gz ",
"--keep output/DCas22_6902/GBSClonesVerifiedIDB.txt ",
"--positions output/DCas22_6902/gbs_unique_sites.txt ",
"--recode ",
"--stdout | bgzip -c -@ 7 > ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfiltered.vcf.gz"))
Prepare a VCF file of the DArT with the common clones with GBS
system(paste0("vcftools --gzvcf output/DCas22_6902/",
"DCas22_6902_DArTseqLD_AllSites_AllChrom_raw.vcf.gz ",
"--keep output/DCas22_6902/",
"DArTClonesVerifiedIDB.txt ",
" --recode ",
"--stdout | bgzip -c -@ 7 > ",
"output/DCas22_6902/",
"DArTsamples_UniqueSites_SelUnfiltered.vcf.gz"))
Now we need to rename the clones of the GBS VCF file with the same name in DArT, so we can proceed joint these two files.
system(paste0("tabix -f -p vcf output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfiltered.vcf.gz"))
system(paste0("tabix -f -p vcf output/DCas22_6902/",
"DArTsamples_UniqueSites_SelUnfiltered.vcf.gz"))
system(paste0("bcftools query --list-samples output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfiltered.vcf.gz ",
"> output/DCas22_6902/",
"GBSSelsampleNamesInOrder_GBSUniqueSites_unfiltered.txt"))
system(paste0("bcftools query --list-samples output/DCas22_6902/",
"DArTsamples_UniqueSites_SelUnfiltered.vcf.gz ",
"> output/DCas22_6902/",
"DArTSelsampleNamesInOrder_DArTUniqueSites_unfiltered.txt"))
Let’s see what is the match position of the GBS with DArT clones
library(tidyverse); library(magrittr)
GBSSelClones <- read.table(file = here::here("output", "DCas22_6902",
"GBSSelsampleNamesInOrder_GBSUniqueSites_unfiltered.txt"),
header = F, sep = "\t")
DArTSelClones <- read.table(file = here::here("output", "DCas22_6902",
"DArTSelsampleNamesInOrder_DArTUniqueSites_unfiltered.txt"),
header = F, sep = "\t")
GBSDArTCheckNames <- read.table(file = here::here("output", "DCas22_6902",
"CommonClonesVerifiedIBD.csv"),
sep = ",", header = T)
GBSDArTCheckNames %<>% .[match(DArTSelClones$V1, .$NamesDArT),]
NewGBSSelClones <- GBSDArTCheckNames %>%
.[match(GBSSelClones$V1, .$NamesGBS), "NamesDArT"]
write.table(NewGBSSelClones, file = here::here("output", "DCas22_6902",
"NewGBSClonesNames.txt"),
sep = "\t", quote = F, col.names = F, row.names = F)
write.table(GBSDArTCheckNames$NamesDArT,
file = here::here("output", "DCas22_6902", "GBSClonesPos.txt"),
sep = "\t", quote = F, col.names = F, row.names = F)
Change the names and the position of the Clones to the same as DArT
system(paste0("bcftools reheader --samples output/DCas22_6902/",
"NewGBSClonesNames.txt ",
"--output output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed.vcf.gz ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfiltered.vcf.gz"))
system(paste0("bcftools view --samples-file ",
"output/DCas22_6902/",
"GBSClonesPos.txt ",
"--output-type z --output-file ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed2.vcf.gz ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed.vcf.gz"))
system(paste0("vcftools --gzvcf output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed2.vcf.gz ",
"--exclude-positions output/DCas22_6902/",
"RmMKrsGBSPreImputedFilter.txt",
" --recode ",
"--stdout | bgzip -c -@ 7 > ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed2.vcf.gz"))
Index GBS VCF File before concat with DArT VCF
system(paste0("tabix -f -p vcf output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed2.vcf.gz"))
Concat DArT and GBS Mkrs from the common clones
system(paste0("bcftools concat --allow-overlaps ",
"--output output/DCas22_6902/",
"GBSandDArTsitesCommonClones_unfiltered.vcf.gz ",
"--output-type z --threads 7 ",
"output/DCas22_6902/",
"GBSsamples_UniqueSites_SelUnfilteredRenamed2.vcf.gz ",
"output/DCas22_6902/",
"DArTsamples_UniqueSites_SelUnfiltered.vcf.gz"))
system(paste0("vcftools --gzvcf output/DCas22_6902/",
"GBSandDArTsitesCommonClones_unfiltered.vcf.gz ",
"--min-alleles 2 --max-alleles 2 --minDP 5 --maxDP 50 ",
"--recode --stdout | bgzip -c -@ 7 > ",
"output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz"))
# Estimating Call rate
system(paste0("vcftools --gzvcf ",
"output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz ",
"--missing-site ",
"--out output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter"))
# Estimating mean filtered depth per site
system(paste0("vcftools --gzvcf ",
"output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz ",
"--site-mean-depth ",
"--out output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter"))
# Estimating Hardy-Weinberg equilibrium
system(paste0("vcftools --gzvcf ",
"output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz ",
"--hardy ",
"--out output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter"))
stats2filterOn <- read.table(file = here::here("output", "DCas22_6902",
"GBSandDArTsitesCommonClones_ready2filter.lmiss"),
stringsAsFactors = F, header = T) %>%
inner_join(read.table(file = here::here("output", "DCas22_6902",
"GBSandDArTsitesCommonClones_ready2filter.ldepth.mean"),
stringsAsFactors = F, header = T), by = c("CHR" = "CHROM", "POS")) %>%
inner_join(read.table(file = here::here("output", "DCas22_6902",
"GBSandDArTsitesCommonClones_ready2filter.hwe"),
stringsAsFactors = F, header = T), by = c("CHR", "POS"))
PreImputeFilterMkrs <- stats2filterOn %>% filter(F_MISS <= 0.4,
MEAN_DEPTH <= 120,
-log10(P_HWE) < 20) %>%
mutate(ID = paste0(CHR, "_", POS))
NDupMkrs <- PreImputeFilterMkrs[duplicated(PreImputeFilterMkrs$ID),] %>%
group_by(ID) %>% slice(1) %>% ungroup
write.table(NDupMkrs %>% select(CHR, POS),
file = here::here("output", "DCas22_6902", "RmMKrsGBSPreImputedFilter.txt"),
row.names = F, col.names = F, quote = F)
write.table(PreImputeFilterMkrs %>% select(CHR, POS),
file = here::here("output", "DCas22_6902", "MKrsPreImputedFilter.txt"),
row.names = F, col.names = F, quote = F)
system(paste0("vcftools --gzvcf output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz ",
"--positions ",
" output/DCas22_6902/",
"MkrsPreImputedFilter.txt ",
"--chr ", 1:18,
" --recode --stdout | ",
"bgzip -c -@ 7 > ",
"output/RefPop/chr", 1:18,
"_GBSandDArTsitesCommonClones_ready2Impute.vcf.gz"))
With the filters we got 98,066 SNP Mkrs for the Reference population imputation
runBeagle5Luc <- function(targetVCF, mapFile, outName, nthreads, maxmem = "500g",
impute = TRUE, ne = 1e+05, samplesToExclude = NULL){
system(paste0("java -Xms2g -Xmx", maxmem, " -jar /programs/beagle/beagle.jar ",
"gt=", targetVCF, " ", "map=", mapFile, " ",
"out=", outName, " ", "nthreads=", nthreads,
" impute=", impute, " ne=", ne,
ifelse(!is.null(samplesToExclude),
paste0(" excludesamples=", samplesToExclude), "")))}
targetVCFpath<-here::here("data", "RefPop") # location of the targetVCF
mapPath<-here::here("data", "CassavaGeneticMapV6updated/")
outPath<-here::here("output/")
outSuffix<-"DCas22_6902"
library(tidyverse); library(magrittr);
purrr::map(1:18,
~runBeagle5Luc(targetVCF=paste0(targetVCFpath,"/chr",.,
"_GBSandDArTsitesCommonClones_ready2Impute.vcf.gz"),
mapFile=paste0(mapPath,"chr",.,
"_cassava_cM_pred.v6_91019.map"),
outName=paste0(outPath,"chr",.,
"_GBSandDArTsitesCommonClones_RefPopImputed"),
nthreads=110))
system(paste0("vcftools --gzvcf output/DCas22_6902/",
"GBSandDArTsitesCommonClones_ready2filter.vcf.gz ",
"--positions ",
" output/DCas22_6902/",
"MkrsPreImputedFilter.txt ",
"--chr ", 1:18,
" --recode --stdout | ",
"bgzip -c -@ 7 > ",
"output/RefPop/chr", 1:18,
"_GBSandDArTsitesCommonClones_ready2Impute.vcf.gz"))
Prepare a VCF file of the GBS with the common clones with DArT
system(paste0("vcftools --gzvcf data/GBS/tassel_embrapa_newfastq.bial.recode.vcf.gz ",
"--remove output/DCas22_6902/GBSClonesVerifiedIDB.txt ",
"--exclude-positions data/CommonMkrsGBSDArTIssues ",
"--recode ",
"--stdout | bgzip -c -@ 7 > ",
"output/DCas22_6902/",
"GBSonlysamples_UniqueSites_SelUnfiltered.vcf.gz"))
Prepare the VCF File of the DArT genotyping clones not present in GBS, with the SNP markers common with the Reference Population.
purrr::map(1:18, ~system(paste0("vcftools --gzvcf output/DCas22_6902/",
"DCas22_6902_DArTseqLD_AllSites_AllChrom_raw.vcf.gz ",
"--remove output/DCas22_6902/",
"DArTClonesVerifiedIDB.txt ",
"--chr ", 1:18,
" --recode ",
"--stdout | bgzip -c -@ 7 > ",
"output/StudyPop/chr",
1:18, "_DArTonlysamples_UniqueSites_Unfiltered.vcf.gz")))
targetVCFpath <- here::here("data", "StudyPop/") # location of the targetVCF
RefPath <- here::here("output/")
mapPath <- here::here("data", "CassavaGeneticMapV6updated/")
outPath <- here::here("output/")
purrr::map(1:18,
~genomicMateSelectR::runBeagle5(targetVCF=paste0(targetVCFpath,"chr",.,
"_DArTonlysamples_UniqueSites_Unfiltered.vcf.gz"),
refVCF=paste0(RefPath,"chr",.,
"_GBSandDArTsitesCommonClones_RefPopImputed.vcf.gz"),
mapFile=paste0(mapPath,"chr",.,
"_cassava_cM_pred.v6_91019.map"),
outName=paste0(outPath,"chr",.,
"_DCas22_6902_StudyPopimputed"),
nthreads=110))
## Chromosome 4 had issues during the imputation, which required to increase the size of the window in Centimorgans to 731 to finish the imputation. These procedure is okay the window size is an argument in Beagle 5 to reduce the maximum RAM memory demanded for the imputation.
genomicMateSelectR::runBeagle5(targetVCF=paste0(targetVCFpath,"chr4",
"_DArTonlysamples_UniqueSites_Unfiltered.vcf.gz"),
refVCF=paste0(RefPath,"chr4",
"_GBSandDArTsitesCommonClones_RefPopImputed.vcf.gz"),
mapFile=paste0(mapPath,"chr4",
"_cassava_cM_pred.v6_91019.map"),
outName=paste0(outPath,"chr4",
"_DCas22_6902_StudyPopimputed"),
nthreads=110, window = 731)
## Indexing files
purrr::map(1:18,
~system(paste0("tabix -f -p vcf output/chr",.,
"_GBSandDArTsitesCommonClones_RefPopImputed.vcf.gz")))
system(paste0("bcftools concat ",
"--output ",
"output/AllChrGBSandDArTsitesCommonClones_RefPopImputed.vcf.gz ",
"--output-type z --threads 18 ",
paste0("output/chr",1:18,
"_GBSandDArTsitesCommonClones_RefPopImputed.vcf.gz",
collapse = " ")))
system(paste0("tabix -f -p vcf output/",
"AllChrGBSandDArTsitesCommonClones_RefPopImputed.vcf.gz"))
purrr::map(1:18,
~system(paste0("tabix -f -p vcf output/chr",.,
"_DCas22_6902_StudyPopimputed.vcf.gz")))
system(paste0("bcftools concat ",
"--output ",
"output/AllChrDCas22_6902_StudyPopimputed.vcf.gz ",
"--output-type z --threads 18 ",
paste0("output/chr",1.:18,
"_DCas22_6902_StudyPopimputed.vcf.gz",
collapse = " ")))
system(paste0("tabix -f -p vcf output/",
"AllChrDCas22_6902_StudyPopimputed.vcf.gz"))
system(paste0("bcftools merge",
" --output output/",
"DCas22_6902RefPopImputed.vcf.gz",
" --merge snps --output-type z --threads 7",
" output/AllChrGBSandDArTsitesCommonClones_RefPopImputed.vcf.gz",
" output/AllChrDCas22_6902_StudyPopimputed.vcf.gz"))
inPath<-here::here("output/")
outPath<-here::here("output/")
ncores <- 7
nclones <- 7827
require(furrr); options(mc.cores=ncores); plan(multiprocess)
genomicMateSelectR::postImputeFilter(inPath=inPath,
inName=paste0("DCas22_6902RefPopImputed"),
outPath=outPath,
outName=paste0("DCas22_6902_RefandStudyPopimputedAndFiltered"),
DR2thresh = 0, MAFthresh = (1/nclones)*2)
genomicMateSelectR::convertVCFtoDosage(pathIn = "output/", pathOut = "output/",
vcfName = "DCas22_6902_RefandStudyPopimputedAndFiltered")
snps <- read.table(file = here::here("output", "DCas22_6902_RefandStudyPopimputedAndFiltered.raw"),
stringsAsFactor=F, header = T) %>%
dplyr::select(-FID,-PAT,-MAT,-SEX,-PHENOTYPE) %>%
column_to_rownames(var = "IID") %>%
as.matrix()
saveRDS(snps,file = here::here("output", "DCas22_GBSandDArT_ReadyForGP_Dos.rds"))
library(tidyverse); library(CMplot)
snps <- readRDS(file = here::here("output", "DCas22_6902", "DCas22_GBSandDArT_ReadyForGP_Dos.rds"))
CMsnps <- tibble(SNP = colnames(snps),
chr = substring(SNP,1,3),
pos = substring(SNP,4)) %>%
mutate(chr = gsub(pattern = "_", replacement = "", x = chr) %>%
gsub(pattern = "S", replacement = "") %>% as.integer,
pos = gsub(pattern = "_[A-Z]", replacement = "", x = pos) %>%
gsub(pattern = "_", replacement = "", x = .) %>% as.integer)
CMplot(CMsnps, plot.type = "d", bin.size = 1e6, col = c("darkgreen", "yellow", "red"),
file = "jpg", memo = "GBSDArTDensityMkrs", dpi = 500, file.output = T, verbose = TRUE)
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] data.table_1.14.3 reactable_0.2.3 magrittr_2.0.3 ggvenn_0.1.9
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[13] tidyverse_1.3.1 here_1.0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 lubridate_1.8.0 assertthat_0.2.1 rprojroot_2.0.3
[5] digest_0.6.29 utf8_1.2.2 reactR_0.4.4 R6_2.5.1
[9] cellranger_1.1.0 backports_1.4.1 reprex_2.0.1 evaluate_0.15
[13] highr_0.9 httr_1.4.3 pillar_1.7.0 rlang_1.0.2
[17] readxl_1.4.0 rstudioapi_0.13 jquerylib_0.1.4 rmarkdown_2.14
[21] labeling_0.4.2 htmlwidgets_1.5.4 munsell_0.5.0 broom_0.8.0
[25] compiler_4.1.2 httpuv_1.6.5 modelr_0.1.8 xfun_0.30
[29] pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2 workflowr_1.7.0
[33] fansi_1.0.3 crayon_1.5.1 tzdb_0.3.0 dbplyr_2.1.1
[37] withr_2.5.0 later_1.3.0 jsonlite_1.8.0 gtable_0.3.0
[41] lifecycle_1.0.1 DBI_1.1.2 git2r_0.30.1 scales_1.2.0
[45] cli_3.3.0 stringi_1.7.6 farver_2.1.0 fs_1.5.2
[49] promises_1.2.0.1 xml2_1.3.3 bslib_0.3.1 ellipsis_0.3.2
[53] generics_0.1.2 vctrs_0.4.1 tools_4.1.2 glue_1.6.2
[57] crosstalk_1.2.0 hms_1.1.1 fastmap_1.1.0 yaml_2.3.5
[61] colorspace_2.0-3 rvest_1.0.2 knitr_1.38 haven_2.5.0
[65] sass_0.4.1