my R journeys: R script Transcript lengths for RPKM, FPKM, TPM

I struggled with this one for a while and there are probably better methods, but here's how I did this:

Go to Ensembl and find the BioMart link or tab to get the gene and transcript data.

Then choose the relevant database (Ensembl Genes 95), dataset (Human genes GRCh38.p12) and attributes (at least Gene stable ID, Transcript stable ID and Transcript length (including UTRs and CDS)), then click Results and download them. I used CSV format.

Open the downloaded file in R using your favorite method (read.csv(file.choose(), check.names = FALSE) in my case, which created the data.frame "mart_export"; check.names = FALSE keeps column names such as "Gene stable ID" intact instead of turning the spaces into dots) and use the R "aggregate" function to combine the transcript length data by gene ID. I did this twice: once to obtain the numeric median of all transcript lengths, and again to get a comma-separated list of the transcript lengths in case I wanted to use some other method in the future.
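To make the aggregation step concrete before running it on the full export, here is a minimal sketch on a made-up table. The gene and transcript IDs and the lengths are invented for illustration and are not Ensembl values; only the column headers mimic the BioMart export. It assumes the lengths sit in column 3 and the gene IDs in column 1.

```r
# Toy stand-in for the BioMart export; IDs and lengths are made up
csv_text <- "Gene stable ID,Transcript stable ID,Transcript length (including UTRs and CDS)
geneA,txA1,1000
geneA,txA2,3000
geneA,txA3,2000
geneB,txB1,500
geneB,txB2,700"

# check.names = FALSE keeps the headers with spaces exactly as written
toy <- read.csv(text = csv_text, check.names = FALSE)

# median transcript length per gene (column 3 grouped by column 1)
toy_median <- aggregate(toy[, 3], by = list(toy[, 1]), FUN = median)
# comma-separated list of transcript lengths per gene
toy_list <- aggregate(toy[, 3], by = list(toy[, 1]), FUN = toString)

# both results are ordered by the same grouping, so rows line up
toy_lengths <- data.frame(
  gene = toy_median$Group.1,
  median_length = toy_median$x,
  lengths = toy_list$x,
  stringsAsFactors = FALSE
)
print(toy_lengths)
```

When aggregate is given a plain vector plus a by list, the result columns are named Group.1 and x, which is why those names are used to assemble the final table.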

dim(mart_export) #227492 x 6 entries, because there are multiple transcripts for each gene
colnames(mart_export) #check what the column names are
length(unique(mart_export$`Gene stable ID`)) #64914 unique gene IDs
mMart<-mart_export[order(mart_export$`Gene stable ID`),] #I ordered by Gene ID just so I could check that things worked right
View(mMart) #you can see that for the first gene "TSPAN6" there are 5 possible transcripts

mMart<-as.data.frame(mMart)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using toString
#(with a vector plus a "by" list, the result columns are named Group.1 and x)
agg<-aggregate(mMart[,3], by = list(mMart[,1]), FUN = toString)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using median
agg1<-aggregate(mMart[,3], by = list(mMart[,1]), FUN = median)
tLengths<-agg1
#change the column names to something nice
colnames(tLengths)<-c("Gene Stable ID","Median transcript length")
#add the comma delimited list of transcript lengths to tLengths as a third column
#(agg and agg1 are grouped and ordered the same way, so the rows line up)
tLengths$'transcript lengths'<-agg$x
#check the results
View(tLengths)

#save the table.
write.csv(tLengths, file = "transcriptLengths.csv")
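For context on where these lengths end up, here is a rough sketch of how a per-gene length feeds into RPKM and TPM. The counts and lengths below are invented, and using the median transcript length as the gene length is just the choice made in this post, not the only option. FPKM uses the same formula as RPKM but counts fragments (read pairs) instead of single reads.

```r
# Hypothetical inputs: raw read counts for one sample and gene lengths in bases
counts <- c(geneA = 200, geneB = 300, geneC = 500)
lengths_bp <- c(geneA = 2000, geneB = 600, geneC = 1000)

# RPKM: reads per kilobase of transcript per million mapped reads
rpkm <- counts / (lengths_bp / 1000) / (sum(counts) / 1e6)

# TPM: normalise by length first, then scale so the sample sums to one million
rate <- counts / lengths_bp
tpm <- rate / sum(rate) * 1e6

print(round(rpkm))
print(round(tpm))
```

Because TPM divides by length before scaling, the values in a sample always add up to one million, which makes samples easier to compare than with RPKM.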

Wednesday, February 27, 2019