March 2015 Feng Lab meeting¶
The Data Scientist’s Toolbox for ChIP-Seq and Beyond
Feng Lab Group meeting by Wayne in March.
A Google Doc for sharing today
3:40pm-5:30pm, March 19th, 2015. (2nd session; see first session info here; next session hands-on here)
Getting started¶
Preparation¶
Read
Tech prep
Be sure you have a modern, updated browser on your system, preferably Chrome or Firefox.
Register and do the follow-up activation at SourceLair.
Register for Sagemath Cloud.
Be sure to have a good text editor on your computer. It sounds like you may have been using Aquamacs in the past, so this shouldn’t be a problem. I highly recommend Sublime Text. However, for what we’ll be doing Thursday, even TextWrangler on a Mac will be sufficient. For those not on a Mac, I’d recommend Sublime Text, Notepad++, or jEdit.
Intro to technology¶
As a group, we’ll use two technologies today.
The idea behind using cloud-based tools is to make it easier to start coding up front; you can then change what you use as you develop your own coding workflow preferences. (Sorry for needing two, but finding a good interface that has all the desired features and works on the Upstate network is not easy.)
For today¶
ChIP-seq¶
Background on ChIP-seq in preparation for running through an analysis workflow next session.
I’ll sprinkle in some real-world examples of using some of the available tools, with at least one possibly being hands-on for those who wish to participate.
Examples from the Wild 1: Regular Expressions¶
NGS Analysis of ChIP-seq data with NUCwave¶
ChIP-Seq example at NUCwave site
S. cerevisiae reference genome was downloaded from SGD and FASTA headers for chromosome names were replaced with chrI-chrXVI.
Of course, there are only sixteen chromosomes in yeast, plus the mitochondrial genome, so this is not overly difficult to do by hand. But it is tedious, and it offers a good place to use regular expressions.
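The header renaming can be sketched with Python’s re module. This is only a minimal sketch, not the Sublime Text process demonstrated in the session. The header layout below (a [chromosome=I] tag inside each ">" line) is an assumed example of what SGD-style headers might look like, the sequence lines are made-up placeholders, and rename_headers is a hypothetical helper name. Check the headers in your own download before reusing the pattern.

```python
import re

# Assumed SGD-style headers with placeholder sequences (not real data).
fasta_text = """>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]
ACGTACGTACGT
>ref|NC_001134| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=II]
TTGGCCAATTGG
>ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [location=mitochondrion]
GGGGAAAACCCC
"""

# ^>                        a header line starts with ">"
# .*                        any run of characters on the same line
# \[chromosome=([IVX]+)\]   capture the Roman numeral (only I, V, X occur in I-XVI)
# .*$                       the rest of the header line
HEADER = re.compile(r"^>.*\[chromosome=([IVX]+)\].*$", re.MULTILINE)

def rename_headers(text):
    """Replace each matching header with '>chr' plus the captured numeral."""
    return HEADER.sub(r">chr\1", text)

renamed = rename_headers(fasta_text)
print(renamed)
```

In this sketch the mitochondrial header has no chromosome tag, so it is left untouched and would still need its own rule or a manual edit.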
Highly recommend the following combination for learning Regular Expressions, or Regex or Regexp as it is often called:
- Chapters 2 & 3 of the Practical Computing for Biologists book by Haddock and Dunn. The related Appendix 2 is freely available as part of the tables of Appendices from Practical Computing for Biologists book by Haddock and Dunn
- Regular Expressions Primer
- Regular Expressions 101: online regex editor and debugger tool (This seems best with the g global modifier on.)
First I’ll demonstrate doing this with Sublime Text using the process I already worked out.
So what are Regular Expressions? See Exploring with Regular Expressions 101.
I’ll demo wildcards, character sets, quantifiers and capturing.
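As a small illustration of those four ideas, here is a minimal Python sketch. The tab-separated line is an invented example (it is not taken from the NUCwave data), and the patterns are just one way to show each concept.

```python
import re

# Invented example line: chromosome, start, end, peak name.
line = "chrIV\t1200\t1850\tpeak_7"

# Wildcard: "." matches any single character except a newline.
wildcard_hits = re.findall(r"peak_.", line)

# Character set: [IVX] matches one character out of I, V, or X.
# Quantifier: "+" means one or more of the preceding item.
chrom_hits = re.findall(r"chr[IVX]+", line)

# Quantifier with the digit shorthand: \d+ is one or more digits.
number_hits = re.findall(r"\d+", line)

# Capturing: parentheses save parts of the match for later use.
match = re.search(r"(chr[IVX]+)\t(\d+)\t(\d+)", line)
chrom, start, end = match.groups()
length = int(end) - int(start)

# Captured groups can be reused in a replacement as \1, \2, \3.
reformatted = re.sub(r"(chr[IVX]+)\t(\d+)\t(\d+)", r"\1:\2-\3", line)

print(wildcard_hits, chrom_hits, number_hits)
print(chrom, length)
print(reformatted)
```

Note that re.sub replaces every match by default, which corresponds to having the g global modifier switched on in Regular Expressions 101.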
Finally, we’ll use Regular Expressions 101 to really follow what was going on in this example.
Examples from the Wild 2: IPython Notebooks¶
NGS Analysis of ChIP-seq data using IPython Notebooks to Explore¶
Determining Average ChIP-seq signal over promoters with Metaseq
“This example demonstrates the use of metaseq for performing a common task when analyzing ChIP-seq data: what is the average signal over transcription start sites (TSS) throughout the genome?”
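To make the idea of "average signal over TSSs" concrete, here is a toy sketch in plain Python. It does not use metaseq or its API. The signal values, TSS positions, and function names are all invented for illustration, and windows that would run off the end of the chromosome are not handled.

```python
# Toy per-base ChIP signal for one made-up chromosome (values are invented).
signal = [0, 0, 1, 2, 5, 9, 5, 2, 1, 0, 0, 1, 3, 8, 4, 1, 0, 0, 0, 0]

# Hypothetical TSS positions (0-based) with the strand of each gene.
tss_list = [(5, "+"), (13, "-")]

def window_around(signal, pos, strand, flank):
    """Return the signal from pos-flank to pos+flank, flipped for '-' strand genes."""
    window = signal[pos - flank : pos + flank + 1]
    return window[::-1] if strand == "-" else window

def average_profile(signal, tss_list, flank):
    """Average the strand-oriented windows position by position."""
    windows = [window_around(signal, pos, strand, flank) for pos, strand in tss_list]
    return [sum(column) / len(column) for column in zip(*windows)]

profile = average_profile(signal, tss_list, flank=3)
print(profile)  # [0.5, 1.5, 4.5, 8.5, 4.0, 1.5, 0.5]
```

Flipping the minus-strand windows keeps "upstream" on the left for every gene, which is why the averaged profile can be read relative to the direction of transcription.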
That looks interesting but what framework is being used to make and host this?
So what are IPython Notebooks?
They allow you to code interactively in your browser and take advantage of HTML and other special web features, including sharing online.
These are especially useful for exploring data and developing code or developing approaches to analyzing your data.
Titus Brown’s screencast and associated notebook illustrate much of this.
I’ll show two other notebooks I have made and run them interactively.
Q: You’ve run your notebook and populated the cells, now how can you share it with colleagues?
A: If you follow these steps, you or anyone else you share it with can see your notebook on the web.
- Upload your notebook file somewhere online. Github, or even simply a Gist, will work fine.
- Place the URL here and click ‘Go!’
Note that the notebooks shared in this form will not be interactive. You can, though, download them and run them locally.
The future ...¶
The IPython Notebook concept goes beyond Python, and the developers are now building a language-agnostic version of the Notebook as the Jupyter Project.
Back to Metaseq¶
The page here actually has a legend with some of the plots that describes additional aspects of the exploratory analyses done on the page Determining Average ChIP-seq signal over promoters with Metaseq.
Examples from the Wild 3: Git, Github, and Gists¶
Git, Github, and Gists¶
Quick tour of the Github site and Gists, since these are the most useful starting points for getting into the world of using git for version control.
See a section under ‘Going forward’ for additional resources.
Examples from the Wild 4: R, the Bioconductor Project for R, RStudio¶
Sources¶
The sources for the information used today came from those linked throughout the content.
However, certain sources deserve special highlighting as they were particularly useful in developing this presentation, contain a wealth of related resources, or are especially pertinent at this stage.
- Practical Computing for Biologists book by Haddock and Dunn
- Interactive notebooks: Sharing the code by Helen Shen. Nature. 2014 Nov 6;515(7525):151-2. doi: 10.1038/515151a. PMID: 25373681
- Programming tools: Adventures with R by Sylvia Tippmann. Nature. 2015 Jan 1;517(7532):109-10. doi: 10.1038/517109a. PMID: 25557714
- Feng et al. 2012. Identifying ChIP-seq enrichment using MACS. Nat Protoc. 2012 Sep;7(9). doi: 10.1038/nprot.2012.101.
- Titus Brown and colleagues’ Next-Gen Sequence Analysis Workshops; the most recent is the Next-Gen Sequence Analysis Workshop (2014) (http://angus.readthedocs.org/en/2014/). Particularly pertinent are the sections Istvan Albert’s 2012 ChIP-Seq lecture; Day 7: ChIP-seq: Peak Predictions and Cis-regulatory Element Annotations; and Using MEME to identify TF binding motif from ChIP-seq data and here.
- ChIP- and DNase-seq data analysis workshop 2014
Going forward¶
Look into¶
- Practical Computing for Biologists book by Haddock and Dunn
- Interactive notebooks: Sharing the code by Helen Shen. Nature. 2014 Nov 6;515(7525):151-2. doi: 10.1038/515151a. PMID: 25373681
- Programming tools: Adventures with R by Sylvia Tippmann. Nature. 2015 Jan 1;517(7532):109-10. doi: 10.1038/517109a. PMID: 25557714
- A two-year-old screencast intro to the IPython notebook by Titus Brown. (Skip to the three-minute mark since we aren’t necessarily interested in running it on Amazon Web Services right now.) A non-interactive version of the notebook he demonstrates is here.
- Another take on the wonders of the IPython notebook, from a blog.
- Orchestrating high-throughput genomic analysis with Bioconductor. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Nat Methods. 2015 Jan 29;12(2):115-21. doi: 10.1038/nmeth.3252. PMID: 25633503
- Programming: Pick up Python by Jeffrey M. Perkel. Nature. 2015 Feb 5;518:125–6. doi: 10.1038/518125a. PMID: 25653001
- March 2015 blog post suggesting mandatory primer courses for basic skills for students in cellular & molecular biology, genetics, and related subfields
- A History of Bioinformatics (in the Year 2039), a presentation by Titus Brown encouraging good data practices in his unique way
Regular Expressions¶
- Chapters 2 & 3 of the Practical Computing for Biologists book by Haddock and Dunn. The related Appendix 2 is freely available as part of the tables of Appendices from Practical Computing for Biologists book by Haddock and Dunn
- Regular Expressions Primer
- Regular Expressions 101: online regex editor and debugger tool (Seems best with the g global modifier on.)
- RegExr v2.0: online tool to learn, build, & test Regular Expressions
- regex tester
- Python Regular Expression Testing Tool
- To use regular expressions in Sublime Text, click the .* box next to Find What when the Find tool is open. See here for more information.
- In TextWrangler, the trick to activate Regular Expressions is to toggle on the Grep box under Matching in the Find and Replace panel. See here, about one and a half minutes into the video.
- BBEdit-TextWrangler_RegEx_Cheat_Sheet.txt
- See here under ‘Search and replace with special characters (regular expressions)’ for using Regular Expressions in Aquamacs.
IPython Notebook¶
- Interactive notebooks: Sharing the code by Helen Shen. Nature. 2014 Nov 6;515(7525):151-2. doi: 10.1038/515151a. PMID: 25373681
- A two-year-old screencast intro to the IPython notebook by Titus Brown. (Skip to the three-minute mark since we aren’t necessarily interested in running it on Amazon Web Services right now.) A non-interactive version of the notebook he demonstrates is here.
- Another take on the wonders of the IPython notebook, from a blog
- The future of the IPython Notebook is the Jupyter project
- Analyzing data with R in the IPython notebook
R and Bioconductor in general¶
- Programming tools: Adventures with R by Sylvia Tippmann. Nature. 2015 Jan 1;517(7532):109-10. doi: 10.1038/517109a. PMID: 25557714
- Orchestrating high-throughput genomic analysis with Bioconductor. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Nat Methods. 2015 Jan 29;12(2):115-21. doi: 10.1038/nmeth.3252. PMID: 25633503
Learning R¶
- See the box in the article Programming tools: Adventures with R by Sylvia Tippmann. Nature. 2015 Jan 1;517(7532):109-10. doi: 10.1038/517109a. PMID: 25557714.
- The Coursera courses in Johns Hopkins’ Data Science Specialization, in particular the R Programming and Getting and Cleaning Data courses. If you are brand new to this and don’t yet know how to use Github, The Data Scientist’s Toolbox would probably be helpful as a starting point.
- You do get a bit of flavor for the use of R in data analysis in the Coursera courses Bioinformatic Methods I and Bioinformatic Methods II
- Comparing Python and R for Data Science
- How to Transition from Excel to R: An Intro to R for Microsoft Excel Users
ChIP-seq data analysis¶
- Titus Brown and colleagues’ Next-Gen Sequence Analysis Workshops; the most recent is the Next-Gen Sequence Analysis Workshop (2014) (http://angus.readthedocs.org/en/2014/). Particularly pertinent are the sections Istvan Albert’s 2012 ChIP-Seq lecture; Day 7: ChIP-seq: Peak Predictions and Cis-regulatory Element Annotations; and Using MEME to identify TF binding motif from ChIP-seq data and here.
- ChIP- and DNase-seq data analysis workshop 2014
- Cis-regulatory Element Annotation System by Hyunjin Shin and Tao Liu from Xiaole Shirley Liu’s Lab
- ab initio motif finder MEME and the related MEME suite
- MEME-LaB wraps the popular ab initio motif finder in a web tool
- Motif enrichment tool. Blatti C, Sinha S. Nucleic Acids Res. 2014 Jul;42(Web Server issue):W20-5. doi: 10.1093/nar/gku456. Epub 2014 May 23. PMID: 24860165
- Motif-based analysis of large nucleotide data sets using MEME-ChIP
Plus see the ‘literature’ page in this collection of pages from the session.
R and ChIP-seq¶
I still need to add the other main ones I saw here.
Git and Github¶
Questions¶
- Try Google; it will probably lead you to one of my listed resources or...
- Biostars
- Stackoverflow for general scripting and computing
- SEQanswers - a high throughput sequencing community
- Try Twitter - for example this
Literature Selections for ChIP-seq¶
ChIP-Seq¶
-
We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric t
-
MACS is coded in Python, an increasingly popular programming language in bioinformatics, which is pre-loaded with the majority of UNIX, Linux, or Mac OS installations. MACS works in Python version 2.6 or 2.7, and version 2.6.5 is recommended. To run MACS in a 64-bit environment, Python for the 64-bit CPU should be installed.
-
Emphasizes adjusting for gene locus length and that two commonly used gene set enrichment methods, Fisher’s exact test and the binomial test implemented in Genomic Regions Enrichment of Annotations Tool (GREAT), can have highly inflated type I error rates and biases in ranking.
Bias issues¶
-
We analyzed ChIP-Seq peaks of the Sir2, Sir3, and Sir4 silencing proteins and discovered 238 unexpected euchromatic loci that exhibited enrichment of all three. Surprisingly, published ChIP-Seq datasets for the Ste12 transcription factor and the centromeric Cse4 protein indicated that these proteins were also enriched in the same euchromatic regions with the high Sir protein levels. The 238 loci, termed “hyper-ChIPable”, were in highly expressed regions with strong polymerase II and polymerase III enrichment signals, and the correlation between transcription level and ChIP enrichment was not limited to these 238 loci but extended genome-wide. ... Whereas ChIP is a broadly valuable technique, some published conclusions based upon ChIP procedures may merit reevaluation in light of these findings.
Insightful post on PubPeer related to this
Oh, also, if you are not aware of it, Vishy Iyer’s recent PLOS One paper finds the exact same artifact as we do. http://www.plosone.org/article/authors/info%3Adoi%2F10.1371%2Fjournal.pone.0083506;jsessionid=F590D75E9C265BA38D012211B9B97E33 And related to this discussion: http://www.biomedcentral.com/1471-2164/11/414 http://www.biomedcentral.com/1471-2164/14/254/abstrac http://www.biomedcentral.com/1471-2164/14/638 And this paper from Kevin Struhl: http://www.plosone.org/article/related/info%3Adoi%2F10.1371%2Fjournal.pone.0005029;jsessionid=93B6A4A5F2062E6B1F15E8997133060D
Below are other publications reporting the expression-associated ChIP artifact.
- Fan X, 2009 “Where Does Mediator Bind In Vivo?” (Work in S. cerevisiae questioning reports of pervasive genome-wide binding of the Mediator complex.)
- Waldminghaus T, 2010 “ChIP on Chip: surprising results are often artifacts” (Work in E. coli; also see arising correspondence Schindler D, 2013.)
- Park D, 2013 “Widespread Misinterpretable ChIP-seq Bias in Yeast” (Analysis very similar to ours.)
- Kasinathan S, 2014 “High-resolution mapping of transcription factor binding sites on native chromatin” (Questions specificity of standard ChIP in S. cerevisiae and at HOT regions of Drosophila. This work possibly provides a solution to the artifact with a modification of the ChIP technique.)
http://www.ncbi.nlm.nih.gov/pubmed/24173036#cm24173036_3919
Non-canonical protein-DNA interactions identified by ChIP are not artifacts. Bonocora RP, Fitzgerald DM, Stringer AM, Wade JT. BMC Genomics. 2013 Apr 15;14:254. doi: 10.1186/1471-2164-14-254. (Concerns the E. coli data.)
-
The resulting occupied regions of genomes from affinity-purified naturally isolated chromatin (ORGANIC) profiles of Saccharomyces cerevisiae Abf1 and Reb1 provide high-resolution maps that are accurate, as defined by the presence of known TF consensus motifs in identified binding sites, that are not biased toward accessible chromatin and that do not require input normalization.
Motif identification¶
Cis-regulatory Element Annotation System by Hyunjin Shin and Tao Liu from Xiaole Shirley Liu’s Lab
A tool designed to characterize genome-wide protein-DNA interaction patterns from ChIP-chip and ChIP-Seq of both sharp and broad binding factors. As a stand-alone extension of our web application CEAS (Cis-regulatory Element Annotation System), it provides statistics on ChIP enrichment at important genome features such as specific chromosome, promoters, gene bodies, or exons, and infers genes most likely to be regulated by a binding factor. CEAS also enables biologists to visualize the average ChIP enrichment signals over specific genomic features, allowing continuous and broad ChIP enrichment to be perceived which might be too subtle to detect from ChIP peaks alone.
ab initio motif finder MEME and the related MEME suite
MEME-LaB wraps the popular ab initio motif finder in a web tool
Motif-based analysis of large nucleotide data sets using MEME-ChIP