Importing and annotating quantified data into R
Overview
Teaching: XX min
Exercises: XX min
Questions
How do we get our data into R?
Objectives
Learn how to import the quantifications into a SummarizedExperiment object.
Learn how to add additional gene annotations to the object.
Contribute!
This episode is intended to show how we can assemble a SummarizedExperiment starting from individual count, rowdata and coldata files. Moreover, we will practice adding annotations for the genes, and discuss related concepts and things to keep in mind (annotation sources, versions, ‘helper’ packages like tximeta).
Read the data
Counts
counts <- read.csv("data/GSE96870_counts_cerebellum.csv",
                   row.names = 1)
Sample annotations
coldata <- read.csv("data/GSE96870_coldata_cerebellum.csv",
                    row.names = 1)
Gene annotations
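Be careful when reading this file: the gene descriptions contain both commas and apostrophes (e.g., 5'), so it is read as tab-delimited text with quote handling disabled (quote = "").
# ENTREZID is kept as character; column 5 of the file provides the row names
rowranges <- read.delim("data/GSE96870_rowranges.tsv", sep = "\t",
                        colClasses = c(ENTREZID = "character"),
                        header = TRUE, quote = "", row.names = 5)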
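To see what disabling quote handling does, the following minimal sketch reads a small illustrative tab-separated table with quote = "". The gene names and descriptions in `toy_lines` are hypothetical stand-ins, not taken from the data file. It relies only on base R, where read.table() treats both " and ' as quote characters by default, read.delim() treats only " as one, and quote = "" switches quote handling off entirely.

```r
# Illustrative annotation lines (hypothetical); the second description
# contains an apostrophe and a comma, as described for the real file.
toy_lines <- c(
  "gene\tdescription",
  "GeneA\tplain description",
  "GeneB\t5'-nucleotidase, cytosolic II"
)

# quote = "" disables quote handling: every character between two tabs is
# kept literally, so the apostrophe and the comma survive unchanged.
toy_safe <- read.delim(text = toy_lines, sep = "\t", header = TRUE,
                       quote = "", row.names = 1)
print(toy_safe)

# For comparison, read.table() defaults to quote = "\"'", so the lone
# apostrophe opens a quoted string. The call is wrapped because the exact
# failure mode (warning, error, or a mangled field) is not asserted here.
toy_default <- tryCatch(
  suppressWarnings(read.table(text = toy_lines, sep = "\t", header = TRUE,
                              row.names = 1)),
  error = function(e) NULL
)

# TRUE would mean the default settings happened to parse this toy input
# the same way as the quote-free read.
identical(toy_default, toy_safe)
```

Setting quote = "" explicitly, as the lesson code does, keeps the behaviour the same regardless of which reader function is used.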
There are other ways of getting gene annotations, for example by querying an organism-level (org) annotation package, which we practice here. Whichever route is used, it is important to use the right annotation source and version.
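As one possible way to explore an org package before mapping identifiers, the sketch below lists the available key types and columns, retrieves several annotation columns for the Entrez ID used in this episode, and prints the package metadata, which records the underlying data sources and their versions. It assumes org.Mm.eg.db is installed; the choice of columns (SYMBOL, GENENAME, ENSEMBL) is illustrative.

```r
suppressPackageStartupMessages({
    library(org.Mm.eg.db)
})

# Identifier types that can be used as keys, and columns that can be retrieved
keytypes(org.Mm.eg.db)
columns(org.Mm.eg.db)

# select() returns a data.frame; a key with several matches in a column
# yields several rows, so the result can be longer than the input
res <- select(org.Mm.eg.db, keys = "497097",
              columns = c("SYMBOL", "GENENAME", "ENSEMBL"),
              keytype = "ENTREZID")
res

# Data sources and versions behind this annotation package
metadata(org.Mm.eg.db)
```

Unlike select(), mapIds() returns exactly one value per key, which makes it convenient for adding a single annotation column to an existing table.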
suppressPackageStartupMessages({
library(org.Mm.eg.db)
})
mapIds(org.Mm.eg.db, keys = "497097", column = "SYMBOL", keytype = "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
497097
"Xkr4"
Check feature types
table(rowranges$gbkey)
     C_region     D_segment          exon     J_segment      misc_RNA
           20            23          4008            94          1988
         mRNA         ncRNA precursor_RNA          rRNA          tRNA
        21198         12285          1187            35           413
    V_segment
          535
Assemble SummarizedExperiment
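suppressPackageStartupMessages({
    library(SummarizedExperiment)
})
# Gene and sample identifiers must match, in the same order, across the three tables
stopifnot(identical(rownames(rowranges), rownames(counts)),
          identical(rownames(coldata), colnames(counts)))
se <- SummarizedExperiment(
    assays = list(counts = as.matrix(counts)),
    rowRanges = as(rowranges, "GRanges"),
    colData = coldata
)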
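The assembly step can be tried out on a toy example that does not need the data files. The sketch below uses entirely made-up objects (`toy_counts`, `toy_coldata`, `toy_rowdata` and their contents are hypothetical) and deliberately stores the sample table in a different order from the count matrix, to show why the tables are aligned by name and checked before the object is built. It assumes the SummarizedExperiment package is installed.

```r
suppressPackageStartupMessages({
    library(SummarizedExperiment)
})

# Toy data (hypothetical): 3 genes x 2 samples
toy_counts <- matrix(1:6, nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
# Sample table deliberately listed in a different order than the count columns
toy_coldata <- data.frame(condition = c("ctrl", "trt"),
                          row.names = c("s2", "s1"))
# Gene table with the columns needed to coerce it to a GRanges object
toy_rowdata <- data.frame(seqnames = "chr1",
                          start = c(1, 101, 201),
                          end = c(50, 150, 250),
                          strand = "+",
                          row.names = c("g1", "g2", "g3"))

# Align by name instead of trusting the order in the files, then verify
toy_coldata <- toy_coldata[colnames(toy_counts), , drop = FALSE]
stopifnot(identical(rownames(toy_rowdata), rownames(toy_counts)),
          identical(rownames(toy_coldata), colnames(toy_counts)))

toy_se <- SummarizedExperiment(
    assays = list(counts = toy_counts),
    rowRanges = as(toy_rowdata, "GRanges"),
    colData = toy_coldata
)

# Inspect the pieces of the assembled object
dim(toy_se)
assay(toy_se, "counts")
colData(toy_se)
rowRanges(toy_se)

# Subsetting keeps counts and annotations in sync
toy_se[, toy_se$condition == "trt"]
```

Without the reordering step, the identity check would stop execution here, which is preferable to silently attaching sample annotations to the wrong columns.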
Save SummarizedExperiment
saveRDS(se, "data/GSE96870_se.rds")
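saveRDS() writes a single R object to a file, and readRDS() returns that object so it can be assigned to any name. A minimal round-trip sketch, using a hypothetical stand-in list and a temporary file so that it runs without the lesson data:

```r
# Stand-in object (hypothetical); in the lesson this would be `se`
toy_obj <- list(counts = matrix(1:4, nrow = 2), note = "toy")

# Write to a temporary file and read it back
toy_path <- tempfile(fileext = ".rds")
saveRDS(toy_obj, toy_path)
toy_restored <- readRDS(toy_path)

# The restored object is the same as the one that was saved
identical(toy_obj, toy_restored)
```

For the object saved above, the corresponding call would be readRDS("data/GSE96870_se.rds"); loading SummarizedExperiment beforehand makes its accessor functions available for the restored object.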
Session info
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Mm.eg.db_3.13.0 AnnotationDbi_1.54.1
[3] knitr_1.33 SummarizedExperiment_1.22.0
[5] Biobase_2.52.0 MatrixGenerics_1.4.0
[7] matrixStats_0.60.0 GenomicRanges_1.44.0
[9] GenomeInfoDb_1.28.1 IRanges_2.26.0
[11] S4Vectors_0.30.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 compiler_4.1.0 XVector_0.32.0
[4] bitops_1.0-7 tools_4.1.0 zlibbioc_1.38.0
[7] bit_4.0.4 evaluate_0.14 RSQLite_2.2.7
[10] memoise_2.0.0 lattice_0.20-44 pkgconfig_2.0.3
[13] png_0.1-7 rlang_0.4.11 Matrix_1.3-3
[16] DelayedArray_0.18.0 DBI_1.1.1 xfun_0.24
[19] fastmap_1.1.0 GenomeInfoDbData_1.2.6 stringr_1.4.0
[22] httr_1.4.2 Biostrings_2.60.1 vctrs_0.3.8
[25] bit64_4.0.5 grid_4.1.0 R6_2.5.0
[28] blob_1.2.2 magrittr_2.0.1 KEGGREST_1.32.0
[31] stringi_1.7.3 RCurl_1.98-1.3 cachem_1.0.5
[34] crayon_1.4.1
Key Points
Key point 1