I struggled with this one for a while and there are probably better methods, but here's how I did this:
Get the gene and transcript data from Ensembl: go to the Ensembl site and find the BioMart link or tab.
Then choose the relevant database (Ensembl Genes 95), dataset (Human genes GRCh38.p12) and attributes (at least Gene stable ID, Transcript stable ID and Transcript length (including UTRs and CDS)), then click Results and download the results. I used CSV format.
Open the downloaded file in R using your favorite method and use R's "aggregate" function to combine the transcript length data by gene ID. I used read.csv(file.choose()), which created the data.frame "mart_export"; note that the code below refers to column names containing spaces, which read.csv only preserves when called with check.names = FALSE. In my case, I ran the aggregation twice: once to obtain the numeric median of all transcript lengths, and again to get a comma-separated list of the transcript lengths in case I wanted to use some other method in the future.
dim(mart_export) #227492 x 6 entries, because there are multiple transcripts for each gene
colnames(mart_export) #check what the column names are
length(unique(mart_export$`Gene stable ID`)) #64914 unique gene IDs
mMart<-mart_export[order(mart_export$`Gene stable ID`),] #I ordered by Gene ID just so I could check that things worked right
View(mMart) #you can see that for the first gene "TSPAN6" there are 5 possible transcripts
mMart<-as.data.frame(mMart)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using toString
#the result has two columns: Group.1 (the gene ID) and x (the aggregated values)
agg<-aggregate(mMart[,3], by = list(mMart[,1]), FUN = toString)
#aggregate the data in column 3 (transcript lengths) by Gene Stable ID (column 1) using median
agg1<-aggregate(mMart[,3], by = list(mMart[,1]), FUN = median)
tLengths<-agg1
#change the column names to something nice
colnames(tLengths)<-c("Gene Stable ID","Median transcript length")
#add the comma delimited list of transcript lengths to tLengths as a third column
#(both aggregates are grouped by the same gene IDs, so their rows are in the same order)
tLengths$'transcript lengths'<-agg$x
#check the results
View(tLengths)
#save the table.
write.csv(tLengths, file = "transcriptLengths.csv")
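If you want to check the aggregation logic before running it on the full export, here is a minimal sketch on a toy table. The data frame "mart_toy", its column names and the gene/transcript IDs are all made up for illustration and are not part of the BioMart export. It also uses merge() to join the two results by gene ID, which is a small change from what I did above (adding the column directly).

```r
# Toy stand-in for the BioMart export; IDs and lengths are made up.
mart_toy <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  transcript_id = c("txA1", "txA2", "txA3", "txB1", "txB2"),
  transcript_length = c(1000, 3000, 2000, 500, 700),
  stringsAsFactors = FALSE
)

# Median transcript length per gene.
med <- aggregate(transcript_length ~ gene_id, data = mart_toy, FUN = median)

# Comma-separated list of all transcript lengths per gene.
lst <- aggregate(transcript_length ~ gene_id, data = mart_toy, FUN = toString)

# Join on gene ID instead of relying on the two results having the same row order.
toy_lengths <- merge(med, lst, by = "gene_id", suffixes = c("_median", "_list"))
print(toy_lengths)
```

Using named columns in the formula also means the result columns get readable names, so there is no need to reach for something like `mMart[, 3]` afterwards.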