CSE5290: Algorithms for Bioinformatics
Course Objectives
 Familiarity with Computational Problems in Biology
 Applying algorithmic ideas
 See reallife computational challenges
 Improve understanding of algorithms
News:
The course is over. I thank you for your participation and contribution in this course. It was a great experience for me. I have put brief comments on your project in ePost, along with all the grades.
Administrivia
 Instructor: Suprakash Datta ( email: [lastname in lowercase] @ cse.yorku.ca , web: http://www.cse.yorku.ca/~[lastname in lowercase])

Textbook: [available at the University bookstore] An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner
August 2004, ISBN10: 0262101068, ISBN13: 9780262101066
publisher site. and book site maintained by the authors.

Class hours: TuTh 13:0014:30 at Ross S 537.

Office hours: Wed, 14 pm, or by appointment, at CSEB 3043.
References:
 Gusfield D (1997) Algorithms on Strings, Trees and Sequences
 A. Tozeren, S. Byers (2004):
New Biology for Engineers and Computer Scientists. Pearson Prentice Hall.
Evaluation:
 Midterm: 30%: October 27, in class
 HW: 30%
 Project: 40%
Your marks are here
Project:
 Choose a topic
 Read a few (typically 23) papers
 Critique them
 Implement with modifications/combination of ideas (the instructor will offer suggestions)
 Present an overview as well as your results.
Possible project topics:
 Probe design
 Proteinprotein interactions
 Phylogenetic trees
 Clustering
 Exon prediction
 RNA secondary structure
 Retroviruses
List of topics and Lecture schedule (tentative):
 Introduction:
Lecture 1 (Sep 10, 09):
Administrivia: Objectives, course overview.
Rdemo
My slides are here.
Some notes on installing and using R are here.
Please consider participating in the York Programming contests. Details are
here.
 Biology overview
Lecture 2: Slides are here.
A good article is here. If you are looking for more detailed information, read the book by Tozeren and Byers.
The genetic code is "almost" universal. There is more information and references
here.
Lecture 3: Biology overview  contd. Manipulating DNA. Measuring gene expression levels.
We will use slides from the previous class as well as these.
 Ch 4: Exhaustive search
Lecture 4: the Partial Digest problem.
My slides are here.
Lecture 5: The motif finding problem.
My slides are here.
 Ch 5: Greedy algorithms: genome rearrangements
Lecture 6: My slides are here.
 Ch 6: Dynamic programming: sequence alignment.
Lecture 7:
My slides are here.
Lecture 8: DP continued. Local and global alignment. We will use the last set of sides and (if time permits) these.
Lecture 9: DP continued. Multiple alignment. Slides from the last set will be used.
This wiki page contains a good summary of multiple alignment.
Gene finding slides.
 Ch 7: Divide and Conquer: more efficient alignment. My slides are here.
 Ch 8: Graph Algorithms: Sequence assembly algorithms. My slides are here.
 Ch 9: Pattern Matching: different variants. My slides are here.
BLAST: My slides are here.
 Ch 10: Clustering: My slides are here.
Note: The above slides were updated slightly on Nov 17.
The rscript I ran in class is here.
Phyllogenetic trees: My slides are here.
 Ch 11: Hidden Markov Models: Intro to Markov Chains and Hidden Markov Models. My slides are here.
The last set of slides on HMMs are here.
Assignments
 HW 1. R Basics: the pdf file is here .
Submit your Rcode and results. If you like, you use the LaTeX source
here to cutpaste questions into your assignment.
Some hints and notes:
 learn about the sample() and seq() functions.
 As mentioned in class, you may lose points if you use nonvectorized code
when there is an easy way to vectorize it.
 The read.fasta method stores the sequence information in a data frame and
not a simple array. You can use the getsequence method to extract the squence.
Solutions to assignment 1 are in the file a1sol.pdf in /cs/course/5290
on the departmental servers.

HW 2. Exon finding using Fourier spectra. The pdf file is here .
Submit your Rcode and results. If you like, you use the LaTeX source
here to cutpaste questions into your assignment.
Note: a window function is a function that starts at x=0, goes up (to 1)
linearly to x=176 and then decreases linearly (to 0) to x=351.
Thus the function (when plotted) makes an isoceles triangle whose base is the
window size and height is 1. Instead of taking the FFT of the sequence
X[i]...X[i+3511] for window i, you will take the FFT of the sequence
X[i]w[1],X[i+1]w[2],...X[i+3511]w[351], where w() is the window function.
Here is a simple example of the sapply function.

HW 3: Dynamic programming. The pdf file is here .
For Q2, assume that you are allowed to take only one type of nucleotide
in each round. Also, if you do not see a pattern, write down the dynamic programming algorithm for the problem; you will get full credit if your algorithm
(as described) is correct.

HW 4: Suffix trees, clustering. The pdf file is here .
This is the last assignment.
The data file in the course directory has the URL for the dataset.
R resources

The starting point for R is its homepage.
 Read the introduction here. Then check out some of the documentation available here. If you are looking for a smaller document, here is one I like.

To search for R functionality on a certain topic, use google or this search engine.

Some tips on R are here.

Another resource on R is here.

The slides of my evening talk on dynamic programming are here.
Other links
While it is not required for the course, this is a good opportunity to
start using LaTeX. A good starting point is this page.