my R journeys

Friday, January 17, 2020

Retrieve GO terms for a gene list from pantherdb.org

Pantherdb.org has what looks like a fairly up to date and comprehensive listing of genes and gene ontologies. I recently wanted to generate a comma separated list of gene ontology IDs and descriptions from a list of gene names using R, but the description of the API (here: http://www.pantherdb.org/help/PANTHERhelp.jsp#V.E.) uses javascript. Below is the strategy I used to get what I wanted in R.

BiocManager::install("httr")
library(stringr)
library(httr)
#parameters listed on the pantherdb web site...
# StringBody organism = new StringBody("Homo sapiens", ContentType.TEXT_PLAIN);
# FileBody fileData = new FileBody(new File("c:\\data_files\\gene_expression_files\\7_data\\humanEnsembl"), ContentType.TEXT_PLAIN);
# StringBody enrichmentType = new StringBody("process", ContentType.TEXT_PLAIN); //"function", "process", "cellular_location", "protein_class", "pathway", "fullGO_function", "fullGO_process", "fullGO_component", "reactome"
# StringBody testType = new StringBody("FISHER", ContentType.TEXT_PLAIN); // "FISHER", "BINOMIAL" . This parameter is optional; If not specified, the system will perform Fisher’s Exact test
# //StringBody cor = new StringBody("FDR", ContentType.TEXT_PLAIN);
# //StringBody cor = new StringBody("BONFERRONI", ContentType.TEXT_PLAIN);
# //StringBody cor = new StringBody("NONE", ContentType.TEXT_PLAIN);
# StringBody type = new StringBody("enrichment", ContentType.TEXT_PLAIN);

url<-"http://www.pantherdb.org/webservices/garuda/tools/enrichment/VER_2/enrichment.jsp?"
enrichmentType <- "fullGO_process"
organism <- "Homo sapiens"
cor <- "BONFERRONI"
type <- "enrichment"
genes<-"TSPAN6\nDPM1\nSCYL3\nC1orf112\nFGR\nCFH\nFUCA2\nGCLC\nNFYA\nSTPG1\nNIPAL3\nLAS1L\nENPP4\nSEMA3F\nCFTR\nANKIB1\nCYP51A1\nKRIT1\nRAD52\nBAD\nLAP3\nCD99\nHS3ST1\nAOC1\nWNT16\nHECW1\nMAD1L1\nLASP1\nSNX11\nTMEM176A\nM6PR"
response<-POST(url = url, encode = "multipart", body = list(organism = organism, enrichmentType = enrichmentType, geneList = genes, type = type, cor = cor))

content<-content(response, as = "parsed")
content<-unlist(str_split(content, "\n"))
content<-content[content != "Id\tName\tGeneId\traw P-value\tFDR"]
content<-content[content != "\r"]
content<-content[content != ""]
myColnames<-c("ID","Name","GeneId","raw_pValue","FDR")
myResult<-matrix("",nrow = length(content), ncol = 5)
for (i in seq_along(content)){
myResult[i,]<-unlist(str_split(content[i], "\t"))
}
colnames(myResult)<-myColnames
myResult<-as.data.frame(myResult, stringsAsFactors = FALSE)
myResult$raw_pValue<-as.numeric(myResult$raw_pValue)
myResult$FDR<-as.numeric(myResult$FDR)

myResult<-myResult[myResult$raw_pValue < .05,]
agg<-aggregate(myResult[,"ID"] ~ myResult[,"GeneId"], list(myResult[,"GeneId"]), toString)
agg2<-aggregate(myResult[,"Name"] ~ myResult[,"GeneId"], list(myResult[,"GeneId"]), toString)
finalResult<-cbind(agg,agg2[,2])
colnames(finalResult)<-c("geneID", "GOIDs", "descriptions")
View(finalResult)

here's what the output looks like:

Wednesday, January 15, 2020

Custom color gradient legend for a ggplot graph

This is likely a bad idea, but....

I've been adding colors to geom_points using scale_fill_manual(), and wanted some way of adding a figure legend at the bottom to show the color scale. Here's my solution:

nColors<-100

pal <- colorRampPalette(c("red","yellow","yellow","yellow","blue"))
myPal<-rev(pal(nColors))
myColors<-character(nColors)
for (i in 1:nColors){
myColors[i]<-myPal[i]
}

x1<-seq(1:nColors)/50
x2<-(x1+(1/50))
d<-data.frame(x1 = x1, xmax = x2, ymin = 1, ymax = 3, color = as.factor(x2))
ggplot() + geom_point(data = mpg, aes(displ, hwy)) + geom_rect(data = d, aes(xmin=x1, xmax=xmax, ymin=ymin, ymax=ymax), fill = myColors)

And the result:

Wednesday, February 27, 2019

R script Transcript lengths for RPKM, FPKM, TPM

I struggled with this one for a while and there are probably better methods, but here's how I did this:

Get the data for genes and transcript from Ensembl and find the biomart link or tab:

Then chose the relevant database (Ensembl Gene 95), dataset (Human Genes GRCh38.p12) and attributes (at least Gene Stable ID, Transcript Stable ID and Transcript length (including UTRs and CDS)), then click results and download the results. I used a csv format.

Open the downloaded file in R using your favorite method (read.csv(file.choose()) in my case, which created the data.frame "mart_export") and use the r "aggregate" function to combine the transcript length data by gene ID. In my case, I did this twice, once to obtain the numeric median of all transcript lengths, and again to get a comma separated list of the transcript lengths in case I wanted to use some other method in the future.

dim(mart_export) #227492 x 6 entries, because there are multiple transcripts for each gene
colnames(mart_export) #check what the column names are
length(unique(mart_export$`Gene stable ID`)) #64914 unique gene IDs
mMart<-mart_export[order(mart_export$`Gene stable ID`),] #I ordered by Gene ID just so I could check that things worked right
View(mMart) #you can see that for the the first gene "TSPAN6" there are 5 possible transcripts

mMart<-as.data.frame(mMart)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using toString
agg<-aggregate(mMart[,3] ~ mMart[,1], list(mMart[,1]), toString)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using median
agg1<-aggregate(mMart[,3] ~ mMart[,1], list(mMart[,1]), median)
tLengths<-agg1
#change the column names to something nice
colnames(tLengths)<-c("Gene Stable ID","Median transcript length")
#add the comma delimited list of transcript lengths to tLengths as a third column
tLengths$'transcript lengths'<-agg$`mMart[, 3]`
#check the results
View(tLengths)

#save the table.
write.csv(tLengths, file = "transcriptLengths.csv")