Last updated: 2022-02-01

Checks: 7 0

Knit directory: QuantitativeGen/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20220124) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 79038f7. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  output/QTLgenotypes.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Simulate.Rmd) and HTML (docs/Simulate.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 79038f7 LucianoRogerio 2022-02-01 Homework 2
html 60fd515 LucianoRogerio 2022-01-24 Build site.
Rmd aba23b1 LucianoRogerio 2022-01-24 wflow_publish(files = c("analysis/index.Rmd", "analysis/about.Rmd",
html 4f7264e LucianoRogerio 2022-01-24 Build site.
Rmd 6c15995 LucianoRogerio 2022-01-24 wflow_publish(files = c("analysis/index.Rmd", "analysis/Simulate.Rmd",

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Preliminaries

Pre-preparation

If you rate your familiarity with R as “low” please go to Data Carpentry: R for Genomics in advance of the first class and do the following:

  • Work through the material between at minimum chapters 1 (“Before We Start”) and 4 (“Data frames”), but preferably also 5 (“The dplyr package”).
  • This website includes instructions for installing R and Rstudio and basic instruction on how to interact with R.
  • Please install the AlphaSimR, workflowr, tidyverse, and sommer packages.

Learning objectives

  1. Reproducible scripting
  1. A reproducible analysis workflow includes a predictable input file and data structure, and outputs that are described, interpreted, and put in the context of the overall project in English.
  2. The audience of this workflow is akin to someone who might be reviewing a manuscript derived from the work. The most important audience is yourself, six months later, or your close collaborators (e.g., lab group) who may need to carry on the work after you move on.
  3. Whether you like it or not, you are a computational biologist. Lab biology experiments need to be reproducible. Computational biology analyses need to be reproducible.
  1. R markdown and workflowr
  1. R markdown language. It allows you to mingle script with formatted text, with script outputs. Note that Python and c++ scripts can be incorporated into R markdown. Help pages on rmarkdown:
    Cheat Sheet
    Authoring Basics
    In Literate Programming
  2. workflowr aims to make it easier to follow a coding workflow that will increase the communicability and reproducibility of your code, and constrain you somewhat to following that workflow. The package sets up a standard directory file structure and populates it with useful initial R markdown (.Rmd) files.
  3. workflowr also sets up the directory as a git version control repository. We will (probably) not teach git, but we encourage you all delving into it. Assume for the purpose of this class that you can use workflowr just to set up the file structure.There is more you can do with the package, and it’s worth checking it out (and its extensive documentation!)
  1. Homework
  1. Labs, including this one, will be coupled with homework. One possibility for homework will be to do it in a workflowr-created directory, and hand in that zipped directory.

Document explanation

There are three components to this document:

  1. Discussing “Reproducible Programming” and various practices and packages that can help out.
  2. Writing a simple script and describing it.
  3. Setting up a homework.
    In principle, we should cover the ideas of the homework in class. You will then make a clean rmarkdown script to show the ideas. The rmarkdown script should be embedded in a workflowr directory. The questions to answer through the lab and homework will be indented text.

Further background

  • R markdown is very good for documenting scripts, but less so programs. If code is linear and fairly simple, I call it a script. If code has loops, is potentially multi-purpose, and defines functions, I call it a program. workflowr provides a code directory for programs, and an analysis directory for scripts. Of course, scripts can (should) refer to functions, etc., in the code directory. Any raw data can go in data. Script outputs can go in output. Final figures and so on can go in docs.
  • If you write a series of functions that you will use repeatedly, it’s probably worth making a package out of them. That is not trivial, but it’s less difficult than it sounds. You do not have to submit your package to CRAN, but can just use it internally. The documentation of functions that goes along with making a package is very helpful over time.
  • If you write a program that you imagine will develop over time, learn version control (probably “git”. I don’t know any other version control…), here or here. Note that a public repository like github can be quite useful for making your data available once you publish your research. Note too that workflowr commands wrap a number of git commands to simplify the work flow, particularly if you are working in the RStudio environment.
  • Here are some useful articles:
    1. Ten Simple Rules for Reproducible Computational Research
    2. Good enough practices in scientific computing
    3. Creating and sharing reproducible research code the workflowr way

Loading packages

If your script depends on external packages, load them at the beginning. This shows users early on what the script dependencies are.

ip <- installed.packages()
packages_used <- c("tidyverse", "workflowr")

for (package in packages_used){
  isInstalled <- package %in% rownames(ip)
  if (!isInstalled){
    stop(paste("ERROR: you need to install the", package, "package"))
  } else{
    library(package, character.only=T)
  }
}#END packages_used
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.7
✓ tidyr   1.1.4     ✓ stringr 1.4.0
✓ readr   2.1.1     ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
This is workflowr version 1.7.0
Run ?workflowr for help getting started

Simulating Data

Simulation

Data simulation is an important tool for quantitative geneticists. It can be used for model testing and validation, exploring the impacts of certain assumptions, and in breeding can be used to optimize breeding approaches by simulating various breeding strategies and measuring performance In PLBRG simulation will be used throughout the course in the labs.

In subsequent labs you will be introduced to simulation tools that are particularly useful to plant and animal breeders, but to start we will simply use some random number generators to sample values from a normal distribution and calculate some simple statistics.

#The function rnorm returns simulated values from a normal distribution given input parameters for the mean and variance.
# here we will draw 1000 samples from a normal distribution with mean= 15 and standard deviation = 10
set.seed(12)
s=rnorm(n = 1000, mean = 15, sd = 10)
#The samples from rnorm are stored in the vector s
#Random number generators use seeds to generate a sequence of random numbers. If you want a simulation to be repeated using the same sequence of numbers to need to keep the same seed (workflowr does this by default)
hist(s)

Version Author Date
60fd515 LucianoRogerio 2022-01-24

Calculating the mean and variance

Now that we have sampled the normal random variables, let’s calculate some simple statistics. The purpose of this exercise if to introduce some basic coding concepts: the use of looping statements and basic matrix/vector operations.

First we will use two types of loops the sum the values in the vector s

for loop

  • A “for” loop will loop through the data a specified number of times.
  • Since we have 1000 data points we will need the loop to go from 1 to 1000 to access all values in the vector s
#Here we initialize a value to store the sum of the values in s
sumS = 0.0
#Here we initialize the loop
for(i in c(1:1000)){
  #here we give a function to add the values of s to sumS
  #since the for loop is set up with the indicator variable "i" which will count i = 1 to 1000 we access values in s using s[i]
  sumS=sumS+s[i]
} # end of the for loop to sum values i s
#calculate the mean
meanS=sumS/1000
#this could also be coded as ms=sumS/length(s)
print(meanS)
[1] 14.73563

while loop.

  • A While loop will keep looping through the data until some condition is met.
  • While loops are useful when it is unknown how many loops are needed but there is some success condition.
  • An example might be to go through a vector of numbers until you find a number >= 0.
#Since there is not a predefined set of numbers to loop through we have to initialize the indicator variable.
i=1
sumS=0.0
#In this case the condition to stop the loop will be when i reaches 1000
while(i<=length(s)){
  sumS=sumS+s[i]
  #We increase the value of i by 1 for each loop
  i=i+1
} #end the while loop to sum values in s
#calculate the mean
meanS=sumS/length(s)
print(meanS)
[1] 14.73563

In addition to loops we can also using simple matrix/vector operations to calculate a sum.

Calculating the inner product of two vectors

  • The inner product of two vectors is calculated by summing the product of the multiplication of corresponding elements of each vector.
  • If we have two vectors a and b each having 3 elements the inner product would be \(a[1]*b[1]+a[2]*b[2]+a[3]*b[3]\)
  • If b were a vector of 1s the inner product would be \(a[1]*1+a[2]*1+a[3]*1\), which is the sum of the vector a
  • Matrix algebra will be covered in more depth later in the class but I would encourage you to examin online resources for a refresher if you are not very familiar with matrix operations.
#creating a vector of 1s of length 1000 
#rep is an R function to create a vector of identical elements of some specified length
b=rep(1,1000)
#by default vectors in R are treated as columns
#to calculate and inner product you need the first vector to be a row and the second vector to be a column
#to transpose a column vector into a row vector use the function t()
sumS=t(s)%*%b
#in R %*% is used for matrix/vector multiplication
meanS=sumS/length(s)

print(meanS)
         [,1]
[1,] 14.73563

Variance

Now let’s apply the use of loops to calculate the variance

#To calculate the variance we will need to first get the sum of squares
#First lets create a vector of centered values (mean removed)
sc=s-meanS[1]
#because we just calculate meanS using matrix/vector operations R treats the object meanS as a vector with 1 element - if calculate using loops it would be a scalar and no need for [1] after meanS 
#The above line of code subtracts the mean from each of the 1000 elements in the vector s
#now we initialize a varaible for the sum of squares
sumSQ = 0.0
#Here we initialize the loop
for(i in c(1:1000)){
  sumSQ=sumSQ+sc[i]**2 #a**b is used to raise a to the power b 
} # end of the for loop to calculate the sum of squares
#calculate the variance
varS=sumSQ/(length(sc)-1)
#we lose 1 degree of freedom when using an estimate of the mean

print(varS)
[1] 92.03544

Homework

  1. Use workflowr to set up the file structure for this lab (I will walk you through it during our first lab session)
  2. Using this markdown file as a starting point, or creating your own markdown file, write code to calculate the variance using vector multiplication (no loops). Put the markdown file in the appropriate folder and build the html files using workflowr.

Luciano Solution Compress the full folder structure created by workflowr and upload to Canvas before the next lab period.


sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] workflowr_1.7.0 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [5] purrr_0.3.4     readr_2.1.1     tidyr_1.1.4     tibble_3.1.6   
 [9] ggplot2_3.3.5   tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8       lubridate_1.8.0  getPass_0.2-2    ps_1.6.0        
 [5] assertthat_0.2.1 rprojroot_2.0.2  digest_0.6.29    utf8_1.2.2      
 [9] R6_2.5.1         cellranger_1.1.0 backports_1.4.1  reprex_2.0.1    
[13] evaluate_0.14    highr_0.9        httr_1.4.2       pillar_1.6.4    
[17] rlang_0.4.12     readxl_1.3.1     rstudioapi_0.13  callr_3.7.0     
[21] whisker_0.4      jquerylib_0.1.4  rmarkdown_2.11   munsell_0.5.0   
[25] broom_0.7.11     compiler_4.1.1   httpuv_1.6.5     modelr_0.1.8    
[29] xfun_0.29        pkgconfig_2.0.3  htmltools_0.5.2  tidyselect_1.1.1
[33] fansi_1.0.2      crayon_1.4.2     tzdb_0.2.0       dbplyr_2.1.1    
[37] withr_2.4.3      later_1.3.0      grid_4.1.1       jsonlite_1.7.3  
[41] gtable_0.3.0     lifecycle_1.0.1  DBI_1.1.2        git2r_0.29.0    
[45] magrittr_2.0.1   scales_1.1.1     cli_3.1.0        stringi_1.7.6   
[49] fs_1.5.2         promises_1.2.0.1 xml2_1.3.3       bslib_0.3.1     
[53] ellipsis_0.3.2   generics_0.1.1   vctrs_0.3.8      tools_4.1.1     
[57] glue_1.6.0       hms_1.1.1        processx_3.5.2   fastmap_1.1.0   
[61] yaml_2.2.1       colorspace_2.0-2 rvest_1.0.2      knitr_1.37      
[65] haven_2.4.3      sass_0.4.0