Example: how can the TCGA database be used to analyze the difference in LINC00152 expression between ovarian cancer and normal tissue? Using this question as the topic, the analysis process is recorded below.

I. Data download

a) Open the website home page (screenshot).

b) Click Launch Data Portal to enter the data download portal (screenshot). The data download interface has four menus: Projects, Exploration, Analysis, and Repository. Data is downloaded through the Repository menu (screenshot). The page is split into a left and a right side: the left side is the window where the user selects and filters data, and the right side shows statistics based on the user's selections. The selections on the left are divided into two categories: Cases and Files.
The purpose of our study is to compare the expression of a lincRNA between ovarian cancer and normal tissue, so we select Ovary under Cases in the left column and RNA-Seq under Files. The result of these selections is shown in the screenshot above.

c) Download the file list. After selecting the files, add them to the cart as shown in the image above (screenshot). Then click the cart in the upper right corner (screenshot). Click Sample Sheet to download a .tsv file listing the required files, here gdc_sample_sheet.2018-05-22.tsv, and put it in the corresponding directory. Opened in Notepad, the file looks as follows (screenshot).

d) Bulk download the files under Linux. Put the sample sheet in the /home/zdwu/rnaseq/11_source_data directory and download the data in bulk from that directory. The loop reads the sample sheet line by line, takes the first 36 characters of each line (the file UUID), prints the portal address of the file, and fetches the file with wget -c:

    cat gdc_sample_sheet.2018-05-22.tsv | while read line
    do
        echo https://portal.gdc.cancer.gov/files/${line:0:36}
        wget -c [...]/${line:0:36} -O [...]
    done

After the download completes, check the directory to confirm that the number of files is complete and the data is usable:

    ls -A | wc -l

II. Data analysis

a) Decompress the data on the command line to obtain readable files:

    zdwu@ubuntu:/home/zdwu/rnaseq/11_source_data/ovary$ gunzip *.counts.gz

b) Find the expression of LINC00152. The genes in the data downloaded from TCGA are all identified by Ensembl IDs, so the Ensembl ID that corresponds to LINC00152 has to be looked up on NCBI. The result is Ensembl:ENSG00000222041.

Note: there is only one gene here, so looking up the Ensembl ID by hand on NCBI is simple. For a large number of genes this becomes very tedious, and the IDs should instead be converted programmatically using an ID conversion file.
The gene ID conversion files can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, which provides gene2ensembl.gz, gene2accession.gz, gene2go.gz, and others. With these files, a small script can convert IDs in bulk.
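As an illustration of such a small script, here is a minimal sketch in bash and awk. It assumes that the decompressed gene2ensembl table is tab-separated with the taxonomy ID in column 1, the NCBI Gene ID in column 2, and the Ensembl gene ID in column 3. The function name geneid_to_ensembl and the input file names are hypothetical, and the sketch maps NCBI Gene IDs (not gene symbols) to Ensembl gene IDs.

```shell
#!/usr/bin/env bash
# Sketch: map NCBI Gene IDs to Ensembl gene IDs with a decompressed gene2ensembl table.
# Assumed layout (tab-separated): column 1 = tax_id, 2 = GeneID, 3 = Ensembl gene ID.

# geneid_to_ensembl MAPPING_FILE ID_LIST [TAX_ID]
#   MAPPING_FILE  decompressed gene2ensembl table
#   ID_LIST       text file with one NCBI Gene ID per line
#   TAX_ID        NCBI taxonomy ID, defaults to 9606 (human)
geneid_to_ensembl() {
    local mapping="$1" ids="$2" tax="${3:-9606}"
    awk -F'\t' -v tax="$tax" '
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }  # first file: requested Gene IDs
        /^#/ { next }                                     # skip the header line of the table
        $1 == tax && ($2 in wanted) && !seen[$2 SUBSEP $3]++ {
            print $2 "\t" $3                              # one line per unique ID pair
        }
    ' "$ids" "$mapping"
}
```

A possible call, assuming the table was unpacked with `gunzip gene2ensembl.gz` and the Gene IDs of interest are listed in a hypothetical file my_gene_ids.txt: `geneid_to_ensembl gene2ensembl my_gene_ids.txt > geneid_ensembl.tsv`. The table repeats a gene once per transcript, which is why duplicate pairs are filtered out.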
c) Combine the LINC00152 expression counts from multiple samples. For each counts file, the loop appends the file name and then the matching count line to the file ovary_linc00152:

    zdwu@ubuntu:/home/zdwu/rnaseq/11_source_data/ovary$ for file in *.counts; do echo $file >> ovary_linc00152; grep 'ENSG00000222041' $file >> ovary_linc00152; done

The resulting file, ovary_linc00152, is the input for the analysis.

Author: weixin_30295091
Source: weixin_30295091
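The integration in step c) can also be written as a small reusable function. This is a minimal sketch, not the author's exact command. It assumes each decompressed counts file has two tab-separated columns (gene ID, count), that the gene ID may carry a version suffix such as ".9", and that the files end in .counts. The function name collect_gene_counts is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: collect the count of one gene from every *.counts file in a directory.
# Assumed file layout: column 1 = Ensembl gene ID (optionally versioned), column 2 = count.

# collect_gene_counts GENE_ID [DIR]
#   Prints one line per file: file name, a tab, and the count of GENE_ID.
collect_gene_counts() {
    local gene="$1" dir="${2:-.}" file
    for file in "$dir"/*.counts; do
        [ -e "$file" ] || continue              # no matching files: skip the literal glob
        awk -F'\t' -v gene="$gene" -v name="$(basename "$file")" '
            { id = $1; sub(/\..*$/, "", id) }   # drop a version suffix such as ".9"
            id == gene { print name "\t" $2 }
        ' "$file"
    done
}
```

For example, `collect_gene_counts ENSG00000222041 /home/zdwu/rnaseq/11_source_data/ovary > ovary_linc00152.tsv` would produce one row per sample file. Matching on the exact ID in column 1, rather than grepping the whole line, avoids accidental hits on IDs that merely contain the search string.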