CSE5290: Algorithms for Bioinformatics

Course Objectives

Familiarity with Computational Problems in Biology
Applying algorithmic ideas
See real-life computational challenges
Improve understanding of algorithms

News:

The course is over. I thank you for your participation and contribution in this course. It was a great experience for me. I have put brief comments on your project in ePost, along with all the grades.

Administrivia

Instructor: Suprakash Datta ( email: [lastname in lowercase] @ cse.yorku.ca , web: http://www.cse.yorku.ca/~[lastname in lowercase])
Textbook: [available at the University bookstore] An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner
August 2004, ISBN-10: 0-262-10106-8, ISBN-13: 978-0-262-10106-6
publisher site. and book site maintained by the authors.
Class hours: Tu-Th 13:00-14:30 at Ross S 537.
Office hours: Wed, 1-4 pm, or by appointment, at CSEB 3043.

References:

Gusfield D (1997) Algorithms on Strings, Trees and Sequences
A. Tozeren, S. Byers (2004): New Biology for Engineers and Computer Scientists. Pearson Prentice Hall.

Evaluation:

Midterm: 30%: October 27, in class
HW: 30%
Project: 40%

Your marks are here
Project:

Choose a topic
Read a few (typically 2-3) papers
Critique them
Implement with modifications/combination of ideas (the instructor will offer suggestions)
Present an overview as well as your results.

Possible project topics:

Probe design
Protein-protein interactions
Phylogenetic trees
Clustering
Exon prediction
RNA secondary structure
Retroviruses

List of topics and Lecture schedule (tentative):

Introduction:

here

Biology overview
Lecture 2: Slides are here.
A good article is here. If you are looking for more detailed information, read the book by Tozeren and Byers.
The genetic code is "almost" universal. There is more information and references here.
Lecture 3: Biology overview - contd. Manipulating DNA. Measuring gene expression levels.
We will use slides from the previous class as well as these.
Ch 4: Exhaustive search Lecture 4: the Partial Digest problem.
My slides are here.
Lecture 5: The motif finding problem.
My slides are here.
Ch 5: Greedy algorithms: genome rearrangements
Lecture 6: My slides are here.
Ch 6: Dynamic programming: sequence alignment.
Lecture 7: My slides are here.
Lecture 8: DP continued. Local and global alignment. We will use the last set of sides and (if time permits) these.
Lecture 9: DP continued. Multiple alignment. Slides from the last set will be used.
This wiki page contains a good summary of multiple alignment.
Gene finding slides.
Ch 7: Divide and Conquer: more efficient alignment. My slides are here.
Ch 8: Graph Algorithms: Sequence assembly algorithms. My slides are here.
Ch 9: Pattern Matching: different variants. My slides are here.
BLAST: My slides are here.
Ch 10: Clustering: My slides are here.
Note: The above slides were updated slightly on Nov 17.
The r-script I ran in class is here.
Phyllogenetic trees: My slides are here.
Ch 11: Hidden Markov Models: Intro to Markov Chains and Hidden Markov Models. My slides are here.
The last set of slides on HMMs are here.

Assignments

HW 1. R Basics: the pdf file is here .
Submit your R-code and results. If you like, you use the LaTeX source here to cut-paste questions into your assignment.
Some hints and notes:
- learn about the sample() and seq() functions.
- As mentioned in class, you may lose points if you use non-vectorized code when there is an easy way to vectorize it.
- The read.fasta method stores the sequence information in a data frame and not a simple array. You can use the getsequence method to extract the squence.
Solutions to assignment 1 are in the file a1-sol.pdf in /cs/course/5290 on the departmental servers.
HW 2. Exon finding using Fourier spectra. The pdf file is here .
Submit your R-code and results. If you like, you use the LaTeX source here to cut-paste questions into your assignment.
Note: a window function is a function that starts at x=0, goes up (to 1) linearly to x=176 and then decreases linearly (to 0) to x=351. Thus the function (when plotted) makes an isoceles triangle whose base is the window size and height is 1. Instead of taking the FFT of the sequence X[i]...X[i+351-1] for window i, you will take the FFT of the sequence X[i]w[1],X[i+1]w[2],...X[i+351-1]w[351], where w() is the window function.
Here is a simple example of the sapply function.
HW 3: Dynamic programming. The pdf file is here .
For Q2, assume that you are allowed to take only one type of nucleotide in each round. Also, if you do not see a pattern, write down the dynamic programming algorithm for the problem; you will get full credit if your algorithm (as described) is correct.
HW 4: Suffix trees, clustering. The pdf file is here . This is the last assignment.
The data file in the course directory has the URL for the dataset.

R resources

The starting point for R is its homepage.
Read the introduction here. Then check out some of the documentation available here. If you are looking for a smaller document, here is one I like.
To search for R functionality on a certain topic, use google or this search engine.
Some tips on R are here.
Another resource on R is here.
The slides of my evening talk on dynamic programming are here.

Other links

While it is not required for the course, this is a good opportunity to start using LaTeX. A good starting point is this page.