CSE5290: Algorithms for Bioinformatics
Course Objectives
- Familiarity with Computational Problems in Biology
- Applying algorithmic ideas
- See real-life computational challenges
- Improve understanding of algorithms
News:
The course is over. I thank you for your participation and contribution in this course. It was a great experience for me. I have put brief comments on your project in ePost, along with all the grades.
Administrivia
- Instructor: Suprakash Datta ( email: [lastname in lowercase] @ cse.yorku.ca , web: http://www.cse.yorku.ca/~[lastname in lowercase])
-
Textbook: [available at the University bookstore] An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner
August 2004, ISBN-10: 0-262-10106-8, ISBN-13: 978-0-262-10106-6
publisher site. and book site maintained by the authors.
-
Class hours: Tu-Th 13:00-14:30 at Ross S 537.
-
Office hours: Wed, 1-4 pm, or by appointment, at CSEB 3043.
References:
- Gusfield D (1997) Algorithms on Strings, Trees and Sequences
- A. Tozeren, S. Byers (2004):
New Biology for Engineers and Computer Scientists. Pearson Prentice Hall.
Evaluation:
- Midterm: 30%: October 27, in class
- HW: 30%
- Project: 40%
Your marks are here
Project:
- Choose a topic
- Read a few (typically 2-3) papers
- Critique them
- Implement with modifications/combination of ideas (the instructor will offer suggestions)
- Present an overview as well as your results.
Possible project topics:
- Probe design
- Protein-protein interactions
- Phylogenetic trees
- Clustering
- Exon prediction
- RNA secondary structure
- Retroviruses
List of topics and Lecture schedule (tentative):
- Introduction:
Lecture 1 (Sep 10, 09):
Administrivia: Objectives, course overview.
R-demo
My slides are here.
Some notes on installing and using R are here.
Please consider participating in the York Programming contests. Details are
here.
- Biology overview
Lecture 2: Slides are here.
A good article is here. If you are looking for more detailed information, read the book by Tozeren and Byers.
The genetic code is "almost" universal. There is more information and references
here.
Lecture 3: Biology overview - contd. Manipulating DNA. Measuring gene expression levels.
We will use slides from the previous class as well as these.
- Ch 4: Exhaustive search
Lecture 4: the Partial Digest problem.
My slides are here.
Lecture 5: The motif finding problem.
My slides are here.
- Ch 5: Greedy algorithms: genome rearrangements
Lecture 6: My slides are here.
- Ch 6: Dynamic programming: sequence alignment.
Lecture 7:
My slides are here.
Lecture 8: DP continued. Local and global alignment. We will use the last set of sides and (if time permits) these.
Lecture 9: DP continued. Multiple alignment. Slides from the last set will be used.
This wiki page contains a good summary of multiple alignment.
Gene finding slides.
- Ch 7: Divide and Conquer: more efficient alignment. My slides are here.
- Ch 8: Graph Algorithms: Sequence assembly algorithms. My slides are here.
- Ch 9: Pattern Matching: different variants. My slides are here.
BLAST: My slides are here.
- Ch 10: Clustering: My slides are here.
Note: The above slides were updated slightly on Nov 17.
The r-script I ran in class is here.
Phyllogenetic trees: My slides are here.
- Ch 11: Hidden Markov Models: Intro to Markov Chains and Hidden Markov Models. My slides are here.
The last set of slides on HMMs are here.
Assignments
- HW 1. R Basics: the pdf file is here .
Submit your R-code and results. If you like, you use the LaTeX source
here to cut-paste questions into your assignment.
Some hints and notes:
- learn about the sample() and seq() functions.
- As mentioned in class, you may lose points if you use non-vectorized code
when there is an easy way to vectorize it.
- The read.fasta method stores the sequence information in a data frame and
not a simple array. You can use the getsequence method to extract the squence.
Solutions to assignment 1 are in the file a1-sol.pdf in /cs/course/5290
on the departmental servers.
-
HW 2. Exon finding using Fourier spectra. The pdf file is here .
Submit your R-code and results. If you like, you use the LaTeX source
here to cut-paste questions into your assignment.
Note: a window function is a function that starts at x=0, goes up (to 1)
linearly to x=176 and then decreases linearly (to 0) to x=351.
Thus the function (when plotted) makes an isoceles triangle whose base is the
window size and height is 1. Instead of taking the FFT of the sequence
X[i]...X[i+351-1] for window i, you will take the FFT of the sequence
X[i]w[1],X[i+1]w[2],...X[i+351-1]w[351], where w() is the window function.
Here is a simple example of the sapply function.
-
HW 3: Dynamic programming. The pdf file is here .
For Q2, assume that you are allowed to take only one type of nucleotide
in each round. Also, if you do not see a pattern, write down the dynamic programming algorithm for the problem; you will get full credit if your algorithm
(as described) is correct.
-
HW 4: Suffix trees, clustering. The pdf file is here .
This is the last assignment.
The data file in the course directory has the URL for the dataset.
R resources
-
The starting point for R is its homepage.
- Read the introduction here. Then check out some of the documentation available here. If you are looking for a smaller document, here is one I like.
-
To search for R functionality on a certain topic, use google or this search engine.
-
Some tips on R are here.
-
Another resource on R is here.
-
The slides of my evening talk on dynamic programming are here.
Other links
While it is not required for the course, this is a good opportunity to
start using LaTeX. A good starting point is this page.