Problem Set #3

The last problem set is free form and gives you a chance to design a script using the data files on RNA-seq quantification and GENCODE from last week, or use some other data that you find interesting. There are four requirements for your analysis:

Use Python to process the data. The point is to use Python, not necessarily to conduct a significant scientific analysis.
Create at least one graphic with matplotlib, seaborn or plotly.
Use pandas for part of your data processing, more than just reading in a TSV.
Do all of this in a Jupyter notebook, although if you need to use a command line tool for part of the data processing that is okay.

Using docstring at the beginning of your script to provide a short description of the logic of your script as we’ll not necessarily know the data you are using and the point of the script. Include inline comments as appropriate. Include a description of the information you graphed as a comment near the graph generation code.

Submit three files: your .ipynb file with inline output, HTML version of the notebook, and upload at least one image your script created. Upload an image file created from the script, not a screen capture. We DO NOT want the data files, these are often very large. The notebook file will show us enough so we can see the intermediate processing.

-Mike

'''
I just want to know the different expression genes between sigmoid colon and stomach.
'''
import os as os
os.getcwd()

'D:\\my github\\shen\\content\\en\\post\\2019-12-15-python-fro-genomics-final-set'

os.listdir()

['.ipynb_checkpoints',
 'BIOS274-2019-PS2-description.pdf',
 'ENCFF016CBS.tsv',
 'example_URL_code.ipynb',
 'final_problem_xiaotao_shen.ipynb',
 'output_file',
 'Problem_Set2_Part1_Xiaotao_Shen.ipynb',
 'ps2_gencode_url.tsv',
 'ps2_gene_quant_URLs.tsv',
 'Xiaotao Shen code for problem set2.ipynb']

import numpy as np
import pandas as pd

ps2_gene_quant_url = pd.read_csv("ps2_gene_quant_URLs.tsv", sep='\t', header = None)
ps2_gene_quant_url

Read sigmoid colon and stomach gene expression data¶

sigmoid_colon_data = pd.read_csv(list(ps2_gene_quant_url.iloc[1])[2], sep = '\t')

stomach_data = pd.read_csv(list(ps2_gene_quant_url.iloc[8])[2], sep = '\t')

sigmoid_colon_data.head(5)

stomach_data.head(5)

Clean data¶

##only remain gene id and TPM and remove the genes with TPM = 0
sigmoid_colon_data = sigmoid_colon_data[['gene_id', 'TPM']]
sigmoid_colon_data = sigmoid_colon_data[sigmoid_colon_data.TPM > 0]

print(sigmoid_colon_data.shape)
sigmoid_colon_data.head(5)

(20750, 2)

##only remain gene id and TPM and remove the genes with TPM = 0
stomach_data = stomach_data[['gene_id', 'TPM']]
stomach_data = stomach_data[stomach_data.TPM > 0]

print(stomach_data.shape)
stomach_data.head(5)

(16461, 2)

Align stomach_data and sigmoid_colon_data¶

stomach_data.columns = ['gene_id', 'TPM_1']
print(stomach_data.columns)
sigmoid_colon_data.columns = ['gene_id', 'TPM_2']
print(sigmoid_colon_data.columns)

Index(['gene_id', 'TPM_1'], dtype='object')
Index(['gene_id', 'TPM_2'], dtype='object')

gene_data = stomach_data.join(sigmoid_colon_data.set_index('gene_id'), on='gene_id')

print(gene_data.shape)
gene_data.head(5)

(16461, 3)

Find different expression genes¶

import math as math
fold_change = gene_data.TPM_2/gene_data.TPM_1
log2_fold_change = [math.log2(x) for x in fold_change]

gene_data['fold_change'] = fold_change
gene_data['log2_fold_change'] = log2_fold_change

gene_data.head(5)

from seaborn as sns

  File "<ipython-input-21-886ad395f1de>", line 1
    from seaborn as sns
                  ^
SyntaxError: invalid syntax

sns.scatterplot(x="gene_id", y="log2_fold_change", data=gene_data)

d:\software\python\lib\site-packages\plotnine\scales\scale.py:93: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if cbook.iterable(self.breaks) and cbook.iterable(self.labels):
d:\software\python\lib\site-packages\plotnine\layer.py:449: UserWarning: geom_point : Removed 1071 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
d:\software\python\lib\site-packages\plotnine\utils.py:553: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  return cbook.iterable(var) and not is_string(var)

<ggplot: (-9223371844339231456)>

	0	1	2
0	ENCFF016CBS	omental fat pad	https://www.encodeproject.org/files/ENCFF016CB...
1	ENCFF365ZMW	sigmoid colon	https://www.encodeproject.org/files/ENCFF365ZM...
2	ENCFF408GFZ	subcutaneous adipose tissue	https://www.encodeproject.org/files/ENCFF408GF...
3	ENCFF505TUS	prostate gland	https://www.encodeproject.org/files/ENCFF505TU...
4	ENCFF633OSJ	suprapubic skin	https://www.encodeproject.org/files/ENCFF633OS...
5	ENCFF862LZL	heart left ventricle	https://www.encodeproject.org/files/ENCFF862LZ...
6	ENCFF863ERP	testis	https://www.encodeproject.org/files/ENCFF863ER...
7	ENCFF916ODF	vagina	https://www.encodeproject.org/files/ENCFF916OD...
8	ENCFF918KPC	stomach	https://www.encodeproject.org/files/ENCFF918KP...

	gene_id	transcript_id(s)	length
0	10904	10904	93.0
1	12954	12954	94.0
2	12956	12956	72.0
3	12958	12958	82.0
4	12960	12960	73.0

	gene_id	transcript_id(s)	length
0	10904	10904	93.0
1	12954	12954	94.0
2	12956	12956	72.0
3	12958	12958	82.0
4	12960	12960	73.0

	gene_id	TPM
649	ENSG00000000003.14	0.46
650	ENSG00000000005.5	0.01
651	ENSG00000000419.12	4.51
652	ENSG00000000457.13	0.78
653	ENSG00000000460.16	0.36

	gene_id	TPM
649	ENSG00000000003.14	0.49
651	ENSG00000000419.12	0.24
652	ENSG00000000457.13	0.28
653	ENSG00000000460.16	2.18
654	ENSG00000000938.12	0.15