Machine Learning Made Easy
Seven talks for IEEE

               Jeff Edmonds

Jeff's ML Chapter
0) Python/Labs    slides   
1) Introduction to Machine Learning    (slides)    (video)
   - Defining key concepts for gradient descent
2) Algebra Review (slides) (video)
   - Dot-product for correlations, matrix multiplication, vectors spaces,
      & simple derivatives needed for forward and backward propagation
2') Some Practical Considerations (slides) (video)
   - Step size, local min, momentum, batches, formatting data,
      k-fold, dimensions, convolution, dropout,
      support vector machine, example
3.1) Generalizing from Training Data (slides) (video)
   - PAC learning needs probabilities, union bound,
      Chernoff bounds. VC Dimension
3.2) Learning in Poly Time (slides) (video York) (video Tor)
   - Over-parameterized neural networks starting in a random place learn an optimal solution fast.
   - Repeat stuff on generalizing
3.3) Jeff's "All" Critical Gradient Descent Solutions Are Optimal
       (Given Minimal Over-Parameterization) (slides) (paper)
   - # nodes in first level (M) > # training data (D)
4) Reinforcement Learning Game Tree / Markov Chains    (slides) (voice)
   - Each node of the game tree / Markov process is
      a max, a min, or a random node.
5) Dimension Reduction & Maximum Likelihood (slides) (video)
   - Maximum likelihood, variance, co-variance, ellipses,
      projection from high to low dimension, Bayesian, causality.
6) Other Models of Neural Networks (slides)
   - Generative adversarial networks (machines learning to compete).
   - Convolution, recurrent, & residual networks, .....
   - Trees, clustering
   - ChatGPT-type networks, ...?
7) Ethics (slides) (video)
   - Positive: Cool, work saving, ChatGPT, image and speech,
      financial, legal, medical experts, protein folding, entertainment, ...
   - Negative: losing our jobs and personal purpose, privacy,
      explainability, biases, military,
   - Will our (human) overlords be benevolent or evil?

IEEE Talks
- IEEE Toronto
- Women in Eng
- A Good Book
- Old talk on Youtube
- Teaching in Cameroon in 2020
- Taught for EECS2001 in 2019
- Some more fun topics

Bio:

Jeff Edmonds has been a computer science professor at York since
1995, after getting his bachelor's at Waterloo and his Ph.D. at the
University of Toronto. His background is theoretical computer
science. His passion is explaining fairly hard mathematics to
novices. He has never done anything practical in machine learning,
but he is eager to help you understand his musings about the topic.
https://www.eecs.yorku.ca/~jeff/

Seminar 0: Python/Labs: slides
   Discord Instructions, GitHub, Lab document, Jeff's Code

Seminar 1: Introduction to Machine Learning: slides video
   Though it is amazing, all it is is gradient descent!

Computers can now drive cars and find cancer in x-rays. For better
or worse, this will change the world (and the job market). Strangely,
designing these algorithms is not done by telling the computer what to
do, or even by understanding what the computer does. The computers
learn for themselves from lots and lots of data and lots of trial and
error. This learning process is more analogous to how brains evolved
over billions of years. The machine itself is a neural network, which
models both the brain and silicon and-or-not circuits, both of which
are great for computing. The only difference with neural networks is
that what they compute is determined by weights, and small changes in
these weights give you small changes in the result of the
computation. The process for finding an optimal setting of these
weights is analogous to finding the bottom of a valley. "Gradient
Descent" achieves this by using the local slope of the hill
(derivatives) to direct the travel down the hill, i.e., by making
small changes to the weights.
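As a minimal sketch of this valley-descending picture (not code from the talk), here is gradient descent on a one-dimensional "valley" f(w) = (w - 3)²; the function, starting point, and step size are all illustrative choices.

```python
# Minimal gradient-descent sketch: find the bottom of the valley
# f(w) = (w - 3)^2 using only the local slope f'(w) = 2(w - 3).

def gradient_descent(df, w, step=0.1, iters=100):
    """Repeatedly take a small step downhill along -df(w)."""
    for _ in range(iters):
        w = w - step * df(w)   # a small change to the weight
    return w

df = lambda w: 2 * (w - 3)     # slope of f(w) = (w - 3)^2
w_final = gradient_descent(df, w=0.0)
print(w_final)                 # close to 3, the bottom of the valley
```

A real neural network does the same thing, just with millions of weights at once and with the slope computed by back propagation.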

Seminar 2: Algebra Review: slides video
   How does one best think about all of these numbers?


An input data item, e.g., an image of a cat, is just a large tuple of
real values. As such, it can be thought of as a point in some high
dimensional vector space. Whether the image is of a cat or a dog
partitions this vector space into regions. Classifying your image
amounts to knowing which region the corresponding point is in.

The dot product of two vectors tells us: whether our data, scaled by
coefficients, meets a threshold; how much two lists of properties
correlate; the cosine of the angle between two directions; and which
side of a hyperplane your point is on.
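The point is that one and the same operation answers all of these questions. A small sketch with made-up numbers:

```python
# One dot product, several interpretations; the vectors are illustrative.
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x = [2.0, 1.0]          # a data point
w = [1.0, -1.0]         # coefficients / normal vector of a hyperplane

# 1) A weighted sum of the data compared against a threshold:
print(dot(x, w) >= 0.5)            # True: 2*1 + 1*(-1) = 1 meets 0.5

# 2) The cosine of the angle between the two directions:
cos = dot(x, w) / (math.hypot(*x) * math.hypot(*w))

# 3) Which side of the hyperplane {p : w.p = 0} the point is on:
side = "positive" if dot(x, w) > 0 else "negative"
print(side)
```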

A novice reading a machine learning paper might not realize that many
of the symbols are not real numbers but matrices. Hence the product
of two such symbols is a matrix multiplication. Computing the output
of your current neural network on each of your training data items
amounts to an alternation of such matrix multiplications with some
non-linear rounding of your numbers to be closer to 0-1 valued.
Similarly, back propagation computes the direction of steepest descent
using a similar alternation, except backwards.
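This "alternate a matrix multiplication with a non-linearity" view can be sketched in a few lines; the shapes, weights, and choice of sigmoid here are all made up for illustration.

```python
import numpy as np

# Forward-pass sketch: one matrix multiplication per layer, each
# followed by a non-linearity that squashes values toward 0/1.

rng = np.random.default_rng(0)
X  = rng.standard_normal((5, 4))   # 5 training items, 4 features each
W1 = rng.standard_normal((4, 3))   # layer-1 weights
W2 = rng.standard_normal((3, 1))   # layer-2 weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

H   = sigmoid(X @ W1)   # matrix product, then non-linear squashing
out = sigmoid(H @ W2)   # ... and again, one matmul per layer
print(out.shape)        # (5, 1): one output per training item
```

Back propagation runs the same alternation in reverse, multiplying by the transposed weight matrices and the derivatives of the non-linearity.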

The matrix way of thinking about a neural network also helps us
understand how a neural network effectively performs a sequence of
linear and non-linear transformations, changing the representation of
our input until it is one for which the answer can be determined
based on which side of a hyperplane your point is on.

Though people say that it is "obvious", it was never clear to me
which direction to head to get the steepest descent.

Seminar 3.1: Generalizing from Training Data: slides video
   Just because your learned machine works well on the training data,
   why should it generalize to work on examples never seen before?

There is some theory. If a machine is found that gives the correct
answers on the randomly chosen training data without simply
memorizing, then we can prove that, with high probability, this same
machine will also work (1-ε) well on never-before-seen instances drawn
from the same distribution. The easy proof requires ε²·D > m, where m
is the number of bits needed to describe your learned machine and D is
the number of training data items.
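To get a feel for the bound ε²·D > m, here is a back-of-the-envelope calculation (constants and log factors omitted, and the numbers are illustrative); integer arithmetic with ε = 1/inv_eps keeps it exact.

```python
# Back-of-the-envelope use of the bound eps^2 * D > m: how many
# training items D does the easy proof ask for?

def data_needed(m_bits, inv_eps):
    """Smallest integer D with (1/inv_eps)**2 * D > m_bits,
    computed in exact integer arithmetic (eps = 1/inv_eps)."""
    return m_bits * inv_eps**2 + 1

# A machine described by 10,000 bits, wanting eps = 1/20 = 0.05:
print(data_needed(10_000, 20))   # 4,000,001 training items
```

Note how quadratically painful a small ε is: halving ε quadruples the training data needed.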

This gap is necessary because, if the labels are random, then the
number of weights needed is more than the # of training data items
and the machine cannot generalize to unseen data.

A much harder proof (which we likely won't cover) requires only
D > VC, where VC is the VC-dimension (Vapnik-Chervonenkis) of your
machine. The second requirement is easier to meet because VC is less than m.

When the machine is described by real numbers, proofs that you can
generalize require the amount of training data to be exponential,
which is not very practical.

A better approach is to make your machine smooth, i.e., not very
sensitive to its input. The hope is that inputs that lie between
training inputs will then get answers that are between the training outputs.

Seminar 3.2: Learning in Poly Time: (slides) (video York) (video Tor)
   Over-parameterized neural networks starting in a random place learn an optimal solution fast.

Some have argued that the result is not interesting because it requires so many parameters.
But Jeff was shocked that there was any theory about it.

Let the number of parameters of your neural net be > (# training
data)^{O(1)}, i.e., table lookup overfits the data just fine. Start
with a random setting of the weights and do gradient descent. Jeff
assumed that there was no theory about how quickly it will converge
and whether it will get stuck in a local minimum. However, this paper
proves that, with high probability, it will converge in time
(# training data)^{O(1)} to epsilon error on the training data. The
weights change very little from their initial values. Hence, such
optimal neural networks look random and are as abundant as the
stars. This is why the weight space needs to be so much bigger than
the space of training data.

To get a guarantee that your machine generalizes to never-before-seen
instances, one needs the number of bit parameters of your neural net
to be < # training data. This gap is necessary because, if the labels
are random, then the number of weights needed is more than # training
data and the machine cannot generalize to unseen data.

Seminar 3.3: Jeff's
"All" Critical Gradient Descent Solutions Are Optimal
(Given Minimal Over-Parameterization)
(slides) (paper)
Theorem:
Consider any network on which there ∃ a layer ℓ'-1 for which # nodes (M) ≥ # training data (D).
   There are weights w, on which it computes the training data perfectly.
Consider any critical weights w, i.e., ∀m, |δError/δwm| ≤ O(ε).
   Consider any training input xd on which the output NNw(xd) is sensitive,
      i.e., ∃ an input element [xd]i or a weight wm in layer ℓ' or earlier that affects the output,
         |δNNw(xd)/δ[xd]i| or |δNNw(xd)/δwm| ≥ Ω(1).
   On this input, the answer is optimal |NNw(xd)-yd| ≤ O(ε)
A variety of activation and error functions are considered.
O(1) vs Infinitesimals: We assume all inputs, weights, node values, and parameters are O(1),
   and we have assumed that the sensitivity derivatives are Ω(1). In contrast, we assume ε is an infinitesimal.
Noisy Weights: Starting with critical weights w for which ∀m, |δError/δwm| ≤ O(ε), the proof adds O(ε) noise.
   Because the second derivative |δ²Error/δwm²| is considered O(1), ε changes in the weights do not change |δError/δwm| ≤ O(ε).
   Once we have the result |NNw(xd)-yd| ≤ O(ε), we remove the noise.
   Because |δNNw(xd)/δwm| is considered O(1), ε changes in the weights do not change NNw(xd) by more than O(ε).
Lower Bound: Avrim Blum and Ronald Rivest prove that this neural network optimization problem is NP-complete
   when # nodes per layer (M) << # training data (D).

Seminar 4: Reinforcement Learning Game Tree / Markov Chains slides video
   The goal of Reinforcement Learning is to get an agent to learn how
to solve some complex multi-step task,
   e.g., make a piña colada or win at Go.

At the risk of being non-standard, Jeff will tell you the way he
thinks about this topic. Both "Game Trees" and "Markov Chains"
represent the graph of states through which your agent will traverse a
path while completing the task. Suppose we could learn, for each such
state, a value measuring "how good" this state is for the agent. Then
completing the task in an optimal way would be easy. If our current
state is one within which our agent gets to choose the next action,
then she will choose the action that maximizes the value of the next
state. On the other hand, if our adversary gets to choose, he will
choose the action that minimizes this value. Finally, if our current
state is one within which the universe flips a coin, then each edge
leaving this state will be labeled with the probability of taking
it. Knowing that this is how the game is played, we can compute how
good each state is. A state in which the task is complete is worth
whatever reward the agent receives in that state. These values
somehow trickle backwards until we learn the value of the start
state. The computational challenge is that there are way more states
than we can ever look at.
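The max/min/chance recursion above can be sketched directly; the tiny tree below is made up, and real tasks have far too many states for this brute-force walk.

```python
# Value-of-a-state sketch: each node is a "max" (our agent chooses),
# a "min" (the adversary chooses), a "chance" node (the universe flips
# a coin), or a "leaf" (task complete, worth its reward).

def value(node):
    kind = node[0]
    if kind == "leaf":                       # reward for finishing here
        return node[1]
    if kind == "max":                        # agent picks the best action
        return max(value(c) for c in node[1])
    if kind == "min":                        # adversary picks the worst
        return min(value(c) for c in node[1])
    if kind == "chance":                     # expectation over coin flips
        return sum(p * value(c) for p, c in node[1])

tree = ("max", [
    ("chance", [(0.5, ("leaf", 10)), (0.5, ("leaf", 0))]),   # worth 5
    ("min",    [("leaf", 4), ("leaf", 9)]),                  # worth 4
])
print(value(tree))   # 5.0: the agent prefers the coin flip
```

The leaf values "trickle backwards" exactly as described: each parent's value is determined by its children's.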

Seminar 5: Dimension Reduction & Maximum Likelihood: slides video
   How to compress your data while retaining the key features.

A randomly chosen bit string cannot be compressed at all. But if there
is a pattern to it, e.g., it represents an image, then maybe it can be compressed.

Each pixel of an image is specified by one (or three) real numbers.
If an image has thousands/millions of pixels, then each of these acts
as a coordinate of the point where the image sits in a very high
dimensional space. A set of such images then corresponds to a set of
these points.

We can understand the pattern of points/images as follows. Maximum
Likelihood assumes that the given set of points/images was randomly
chosen according to a multi-dimensional normal distribution, and then
adjusts the parameters of this normal distribution in the way that
maximizes the probability of getting the images that we have. The
obtained parameters effectively fit an ellipse around the
points/images in this high dimensional space.

We then reduce the number of dimensions in our space by collapsing
this ellipse along its least significant axis. Projecting each
point/image to this lower dimensional space compresses the amount of
information needed to represent each image.
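A minimal sketch of this ellipse-collapsing idea (the data are made-up 3-D points that mostly vary in two directions; real images have thousands of coordinates):

```python
import numpy as np

# Fit the "ellipse" (mean + covariance) by maximum likelihood, then
# project away its least significant axis: 3-D points become 2-D.

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 3))
X += 0.01 * rng.standard_normal((200, 3))   # tiny thickness in 3rd axis

mu  = X.mean(axis=0)
cov = np.cov(X, rowvar=False)               # axes of the ellipse
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order

keep = eigvecs[:, 1:]                       # drop the smallest axis
X2   = (X - mu) @ keep                      # project 3-D -> 2-D
print(X2.shape)                             # (200, 2)
```

Because the dropped axis carried almost no variance, the 2-D points retain nearly all the information about the originals.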

- Bayesian Inference: Given the probability of a symptom given a
disease, we can compute the probability of a disease given a symptom.
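A worked example of this reversal with made-up numbers (a rare disease, a fairly reliable symptom):

```python
# Bayes' rule sketch: from P(symptom | disease) to P(disease | symptom).

p_disease           = 0.01    # prior probability of the disease
p_symptom_given_d   = 0.90    # P(symptom | disease)
p_symptom_given_not = 0.05    # P(symptom | no disease)

p_symptom = (p_symptom_given_d * p_disease
             + p_symptom_given_not * (1 - p_disease))
p_d_given_symptom = p_symptom_given_d * p_disease / p_symptom
print(round(p_d_given_symptom, 3))   # 0.154
```

Even with the symptom present, the disease is still unlikely, because the prior is so small; that counter-intuitive flip is exactly what Bayes' rule makes precise.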

Seminar 6: Generative Adversarial Networks: slides video
   Used for understanding and producing a random data item.

Suppose you have a distribution of random images of cats. Suppose
you want to learn a neural network that takes uniformly random bits as
input and outputs an image of a cat according to this same
distribution. One fun thing is that this neural network won't be
perfect, and hence it will output images of "cats" that it has never
seen before. Also, you can make small changes in the network's input
bits and see how they change the resulting image of a cat.

The way we do this is with Generative Adversarial Networks. This is
formed by having two competing agents. The task of the first agent,
as described above, is to output random images of cats. The task of
the second is to discern whether a given image was produced by the true
random distribution or by the first agent. By competing, they learn.

If we have more time in the talk then we will talk about Convolutional
& Recurrent Networks which are used for learning images and sound
that are invariant over location and time.

Some Practical Considerations slides video
- Decision Trees & Clustering: The basics.
- Practical: Step Size, Local Min, Momentum, Batches,
Formatting Data, k-fold, Support Vector Machine, Dimensions,
Convolution, Dropout

Seminar 7: Ethics: slides video
   How might machine learning make the world a better place?
   How might it make the world worse?
   I have some thoughts. Likely you do too.

Non-Machine Learning topics: Fun

Request: Jeff tends to talk too fast. Please help him go
pole pole (slowly).