---
title: "Introduction to Machine Learning"
author: "Raphael Cobe"
institute: "Advanced Institute for Artificial Intelligence"
date: "June, 2019"
output:
  xaringan::moon_reader:
    css: ["theme.css", "theme-fonts.css"]
    lib_dir: libs
    seal: false
    nature:
      ratio: "16:9"
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: center, middle
<img src="images/ai2_logo_256h-1.png" alt="drawing" style="width:15%;"/>
# Introduction to Machine Learning
### Raphael Cobe
### Advanced Institute for Artificial Intelligence
### June, 2019
---
class: center, middle
<img src="images/ai2_logo_full_128h-1.png" style="width:80%;"/>
# http://advancedinstitute.ai
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```
---
class: center, middle, inverse
# A little bit more of Machine Learning
#### Some of the slides were borrowed from Andrew Ng's Coursera course.
---
class: left
# Machine Learning
## Some example applications
* Database mining
+ Machine learning has recently become so big partly because of the huge amount of data being generated
+ Sources of data include
+ Web data (click-stream or click through data)
+ Mine to understand users better
+ Medical records
+ Electronic records -> turn records into knowledge
+ Biological data
+ Gene sequences, ML algorithms give a better understanding of human genome
+ Engineering info
+ Data from sensors, log reports, photos etc
---
class: left
# Machine Learning
## Some example applications
* Applications that **we cannot program by hand**
+ Autonomous vehicles
+ Handwriting recognition
+ Very inexpensive, and allows algorithms to automatically route envelopes through the post
+ Natural language processing (NLP)
+ Computer vision
* Self customizing programs
+ Netflix, Amazon, iTunes
+ Take users' info
+ Learn based on your behavior
* Understand human learning and the brain
+ If we can build systems that mimic (or try to mimic) how the brain works, this may push our own understanding of the associated neurobiology
---
class: left
# Machine Learning
## No well-defined definition
* Arthur Samuel (1959)
> Field of study that gives computers the ability to learn without being explicitly programmed
* Tom Mitchell (1999)
> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
* The checkers example:
+ E = 10,000s of games
+ T is playing checkers
+ P = whether you win or not
???
* Couple of examples of how people have tried to define it
* Samuel wrote a checkers-playing program
+ Had the program play 10000 games against itself
+ Work out which board positions were good and bad depending on wins/losses
---
class: left
# Machine Learning
## Kind of how humans evolve by learning
The Cassava Leaf in the North of Brazil
```{r, out.width = "500px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/maniva.jpg")
```
???
* Leaf of the Manioc
* Yuca
* Cassava roots, peels and leaves should not be consumed raw because they contain cyanogenic compounds;
* Used in the Brazilian maniçoba, a kind of stew;
* Learning how to cook it:
+ E = 7h cooking
+ T is eating manioc leaves
+ P if you live or not
---
class: left
# Machine Learning
## Several types of learning algorithms
* Supervised learning
+ Teach the computer how to do something, then let it use its new-found knowledge to do it
* Unsupervised learning
+ Let the computer learn how to do something, and use this to determine structure and patterns in data
* Reinforcement learning
* Recommender systems
---
class: left
# Supervised learning - introduction
* Probably the most common problem type in machine learning
* Starting with an example:
* How do we predict housing prices?
* Collect data regarding housing prices and how they relate to size in square feet
```{r, out.width = "500px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/lr1.png")
```
Example problem: "Given this data, if I have a house 750 square feet - how much can I expect to get?"
---
class: left
# Supervised learning - introduction
## What approaches can we use to solve this?
* Straight line through data
* Maybe $150,000
* Second order polynomial
* Maybe $200,000
* One thing we discuss later - how to choose a straight or a curved line?
* Each of these approaches represents a way of doing supervised learning
---
class: left
# Supervised learning - introduction
## What does this mean?
* We gave the algorithm a data set where a "right answer" was provided
* So we know actual prices for houses
* The idea is we can learn what makes the price a certain value from the training data
* The algorithm should then produce right answers for new data where we don't know the price already
* i.e. predict the price
## Regression problem
* Predict continuous valued output (price)
* No real discrete delineation
---
class: left
# Supervised learning - introduction
## Another Example
* Can we classify breast cancer as malignant or benign based on tumor size?
```{r, out.width = "500px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/lr2.png")
```
Can you estimate prognosis based on tumor size?
---
class: left
# Supervised learning - introduction
## Another Example continued
* This is an example of a **classification problem**
* Classify data into one of two discrete classes - no in between, either malignant or not
* In classification problems, the output can take on a discrete number of possible values, e.g. four values:
* 0 - benign
* 1 - type 1
* 2 - type 2
* 3 - type 3
---
class: left
# Supervised learning - introduction
## Another Example continued
* Other problems may have multiple attributes
* We may also, for example, know age and tumor size:
```{r, out.width = "500px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/lr3.png")
```
---
class: left
# Unsupervised learning - introduction
* Second major problem type
* In unsupervised learning, we get unlabeled data
* We are just told: here is a data set, can you find structure in it?
* One way of doing this would be to cluster the data into groups
* This is a **clustering algorithm**
---
class: left
# Unsupervised learning - introduction
## Clustering - examples
* Google news
* Groups news stories into cohesive groups
* Genomics
* Run algorithm to cluster individuals into types of people
* Organize computer clusters
* Identify potential weak spots or distribute workload effectively
* Social network analysis
* Customer data
Can you automatically generate structure?
Because we don't give it the answer, it's **unsupervised learning**
---
class: center, middle, inverse
# Linear Regression
---
class: left, middle
# Get Started
### Regression analysis is used to describe the relationship between:
* A single response variable: $Y$ ; and
* One or more predictor variables: $X_1$, $X_2$,..., $X_n$
+ $n = 1$: Simple Regression
+ $n > 1$: Multivariate Regression
---
class: left
# The cars R dataset: Speed and Stopping Distances of Cars
## Load the dataset into a variable
```{r, echo=TRUE}
cars.dataset <- cars
```
--
## Check the first lines of the dataset:
```{r, echo=TRUE}
head(cars.dataset)
```
---
# Check the summary of the data
```{r, echo=TRUE}
summary(cars.dataset)
```
---
# Check scatterplot
```{r, echo=TRUE, out.width = "400px", fig.align="center"}
plot(cars.dataset)
```
---
class: left, middle
# Linear Regression
* Model a continuous variable $Y$ ( $y_1, y_2, ..., y_m$ ) as a function of one or more $X^j$ variable(s);
$$y_i=\theta_0 + \theta_1X_i^1 + \theta_2X_i^2+...+\theta_nX_i^n$$
* Build a model to predict $\widehat{y}_i$ values when the $X^j$ features are known;
$$\widehat{y_i}=\theta_0 + \theta_1X_i^1 + \theta_2X_i^2+...+\theta_nX_i^n + \epsilon_i$$
* Up to an $\epsilon$ error.
---
class: middle
# Linear Regression
## Matrix Notation
$$\widehat{y_i}=\begin{bmatrix} \theta_0 & \theta_1 &\theta_2 & ... & \theta_n\end{bmatrix} \begin{bmatrix} X_i^0 \\ X_i^1 \\ X_i^2 \\ ... \\ X_i^n\end{bmatrix}+\epsilon_i$$
$$\widehat{y_i}=\Theta^TX_i + \epsilon_i$$
---
class: middle
# Linear Regression
## Matrix Notation
$$\widehat{y_i}=\begin{bmatrix}X_i^0 & X_i^1 & X_i^2 & ... & X_i^n\end{bmatrix} \begin{bmatrix}\theta_0 \\ \theta_1 \\ \theta_2 \\ ... \\ \theta_n\end{bmatrix} +\epsilon_i$$
$$\widehat{y_i}=[{X_i}]^T\Theta + \epsilon_i$$
---
class: middle
# Linear Regression
## Matrix Notation - For all observations:
$$\begin{bmatrix}\widehat{y_1} \\ \widehat{y_2} \\ ... \\ \widehat{y_m}\end{bmatrix} = \begin{bmatrix}X_1^0 & X_1^1 & ... & X_1^n \\ X_2^0 & X_2^1 & ... & X_2^n \\ ...& ... &... & ...\\X_m^0 & X_m^1 & ... & X_m^n \end{bmatrix} \begin{bmatrix}\theta_0 \\ \theta_1 \\ ... \\ \theta_n\end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ ... \\ \epsilon_m\end{bmatrix}$$
$$\widehat{Y}=X\Theta+\epsilon$$
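A minimal sketch of this notation in R, assuming the `cars` data used later in these slides (with `speed` as the single feature) and purely illustrative parameter values:
```{r, echo=TRUE}
X <- cbind(1, cars$speed)   # each row is [X_i^0, X_i^1], with X_i^0 = 1
theta <- c(-17, 4)          # illustrative values for theta_0 and theta_1
y_hat <- X %*% theta        # Y_hat = X Theta
head(y_hat)
```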
---
class: middle
# Simple Linear Regression
### Using a single predictor:
$$
\widehat{y_i}=\theta_0 + \theta_1X_i^1+\epsilon_i
$$
#### Where:
* $\theta_0$ (Intercept): point in which the line intercepts the y-axis;
* $\theta_1$ (Slope): increase in Y per unit change in X.
---
# Simple Linear Regression
### For the equation:
$$y = 10x+5$$
```{r, echo=FALSE, out.width = "400px", fig.align="center"}
# Line with intercept 5 and slope 10, matching y = 10x + 5
plot(x = c(0), y = c(5), xlim = c(0, 4), ylim = c(0, 50), xlab = "x", ylab = "y")
abline(a = 5, b = 10)
```
---
class:middle
# Simple Linear Regression
We want to find the equation of the line that *best* fits the data.
It means finding $\theta_0$ and $\theta_1$ such that the fitted values of $\widehat{y_i}$, given by
$$
\widehat{y_i} = \theta_0+\theta_1X_i^1+\epsilon_i
$$
are as *close as possible* to the observed values $y_i$, i.e., bring each $\epsilon_i$ **as close to 0 as possible**.
---
class:middle
# Simple Linear Regression
## Residuals
* The difference between the observed value $y_i$ and the fitted value $\widehat{y_i}$:
$$
e_i = y_i - \widehat{y_i}
$$
---
class:middle
# Simple Linear Regression
## Residuals
* The difference between the observed value $y_i$ and the fitted value $\widehat{y_i}$
```{r, out.width = "600px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/Regression1.png")
```
---
class:middle
#Simple Linear Regression
A usual way of calculating $\theta_0$ and $\theta_1$ is based on the minimization
of the **R**esidual **S**um of **S**quares (*RSS*):
$$\begin{eqnarray}
RSS &=& \frac{1}{2m}\sum_{i=0}^{m}[\epsilon_i]^2 \\
&=& \frac{1}{2m}\sum_{i=0}^{m}(\widehat{y}_i-y_i)^2\\
&=& \frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1-y_i)^2
\end{eqnarray}$$
**Also known as our Cost function!**
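As a minimal sketch in R, assuming the `cars` data from earlier and purely illustrative parameter values, the cost can be computed directly from this definition:
```{r, echo=TRUE}
# RSS cost for simple linear regression, using the 1/(2m) convention above
rss_cost <- function(theta0, theta1, x, y) {
  m <- length(y)
  sum((theta0 + theta1 * x - y)^2) / (2 * m)
}

rss_cost(-17, 4, cars$speed, cars$dist)  # illustrative theta values
```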
---
# Simple Linear Regression
Let's examine the error $\epsilon$ as a function of the $\theta_0$ and $\theta_1$ parameters:
```{r, out.width = "700px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/Regression2.png")
```
---
class: middle
# Simple Linear Regression
## The Gradient
The gradient is the generalization of the derivative to functions of several variables and is represented by the vector operator $\nabla$. We use it to minimize our cost function (*RSS*):
$$\begin{eqnarray}
\nabla RSS(\Theta)&=&\nabla[\frac{1}{2m}\sum_{i=0}^m(\widehat{y_i}-y_i)^2]\\
&=& \nabla[\frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1 - y_i)^2]\\
&=& [\frac{\partial}{\partial \theta_0}(\frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1-y_i)^2),\frac{\partial}{\partial \theta_1}(\frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1-y_i)^2)]
\end{eqnarray}$$
---
class: middle
# Simple Linear Regression
## The Gradient Continued
$$\begin{eqnarray}
\frac{\partial RSS}{\partial \theta_0} &=& \frac{\partial}{\partial \theta_0}(\frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1-y_i)^2)\\
&=& \frac{1}{m}\sum_{i=0}^{m}(\theta_0X_i^0 + \theta_1X_i^1-y_i)\end{eqnarray}$$
---
class: middle
# Simple Linear Regression
## The Gradient Continued
$$\begin{eqnarray}
\frac{\partial RSS}{\partial \theta_1} &=& \frac{\partial}{\partial \theta_1}(\frac{1}{2m}\sum_{i=0}^{m}(\theta_0 + \theta_1X_i^1-y_i)^2)\\
&=& \frac{1}{m}\sum_{i=0}^{m}((\theta_0X_i^0 + \theta_1X_i^1-y_i)X_i^1)\end{eqnarray}$$
---
class: middle
# Simple Linear Regression
## The Analytical Solution:
$$\begin{eqnarray}
\frac{\partial RSS}{\partial \theta_0} &=& 0\\
\theta_0 &=& \frac{\sum_{i=0}^{m}y_i}{m} - \theta_1\frac{\sum_{i=0}^{m}X_i^1}{m}\\
\end{eqnarray}$$
---
class: middle
# Simple Linear Regression
## The Analytical Solution
$$\begin{eqnarray}
\frac{\partial RSS}{\partial \theta_1} &=& 0\\
0 &=& \sum_{i=0}^{m}y_iX_i^1 - \sum_{i=0}^{m}\theta_0X_i^1 - \sum_{i=0}^{m}\theta_1(X_i^1)^2
\end{eqnarray}$$
--
Substituting $\theta_0$... and doing some math:
$$\begin{eqnarray}
\theta_1 &=& \frac{\sum_{i=0}^{m}y_iX_i^1-\frac{\sum_{i=0}^{m}y_i\sum_{i=0}^{m}X_i^1}{m}}{\sum_{i=0}^{m}(X_i^1)^2-\frac{(\sum_{i=0}^{m}(X_i^1))^2}{m}}
\end{eqnarray}$$
Which is the same as $\frac{Cov(X^1,Y)}{Var(X^1)}$.
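A minimal check of this closed form in R, again on the `cars` data; the estimates should agree with what `lm()` reports:
```{r, echo=TRUE}
x <- cars$speed
y <- cars$dist
theta1 <- cov(x, y) / var(x)           # slope = Cov(X^1, Y) / Var(X^1)
theta0 <- mean(y) - theta1 * mean(x)   # intercept from the previous slide
c(theta0, theta1)
coef(lm(dist ~ speed, data = cars))    # reference fit
```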
---
class: middle
# Simple Linear Regression
## The Analytical Solution
By transforming the previous equation into its Vectorial form:
$$\Theta = (X^TX)^{-1}X^TY$$
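The same solution in vector form, as a short sketch using `solve()` on the normal equations (fine for small problems; see the drawbacks on the next slide):
```{r, echo=TRUE}
X <- cbind(1, cars$speed)                  # design matrix with an intercept column
Y <- cars$dist
Theta <- solve(t(X) %*% X) %*% t(X) %*% Y  # (X'X)^{-1} X'Y
Theta
```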
---
class: top
# Simple Linear Regression
## Drawbacks of using the Analytical Solution
* $X^TX$ is not always invertible;
* Complexity of the inverse is of the order $O(n^3)$:
+ If the number of features is high, it can become **computationally very expensive**;
--
+ Consumes a lot of memory:
--
Imagine a dataset containing $10^5$ features and $10^6$ observations; in this case $X^TX$ would be a $10^5 \times 10^5$ matrix with $10^{10}$ floating-point numbers, which at 8 bytes per number gives **80 gigabytes**, and with the $O(n^3)$ inversion it gets a lot worse;
---
class: middle
# Simple Linear Regression
## The Gradient Descent
Iteratively correct the $\Theta$ weight matrix with:
$$\begin{eqnarray}
\theta_0^{(t+1)} &=& \theta_0^{(t)} - \alpha \frac{\partial RSS}{\partial \theta_0}\\
\theta_1^{(t+1)} &=& \theta_1^{(t)} - \alpha \frac{\partial RSS}{\partial \theta_1}
\end{eqnarray}$$
Where $\alpha$ is the learning rate, i.e. the step size towards the minimum of the cost function;
---
class: middle
# Simple Linear Regression
## The Gradient Descent Algorithm
$$while\ ||\frac{\partial RSS}{\partial \Theta}|| > v \\
\theta_0^{(t+1)} = \theta_0^{(t)} - \frac{\alpha}{m}\sum_{i=0}^{m}(\theta_0X_i^0 + \theta_1X_i^1-y_i)\\
\theta_1^{(t+1)} = \theta_1^{(t)} - \frac{\alpha}{m}\sum_{i=0}^{m}(\theta_0X_i^0 + \theta_1X_i^1-y_i)X_i^1$$
Where
$$\left\|\frac{\partial RSS}{\partial \Theta}(\Theta^{(t)})\right\|=\sqrt{\left(\frac{\partial RSS}{\partial \theta_0}(\Theta^{(t)})\right)^{2}+\left(\frac{\partial RSS}{\partial \theta_1}(\Theta^{(t)})\right)^{2}}$$
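A minimal gradient-descent sketch in R on the `cars` data; the learning rate, tolerance and iteration cap are illustrative choices, not tuned values (scaling the feature would allow a larger step size and much faster convergence):
```{r, echo=TRUE}
gradient_descent <- function(x, y, alpha = 0.005, tol = 1e-6, max_iter = 1e5) {
  m <- length(y)
  theta <- c(0, 0)                           # start from theta0 = theta1 = 0
  for (t in seq_len(max_iter)) {
    err   <- theta[1] + theta[2] * x - y     # prediction errors for current theta
    grad  <- c(sum(err) / m, sum(err * x) / m)
    theta <- theta - alpha * grad            # step against the gradient
    if (sqrt(sum(grad^2)) < tol) break       # stop when the gradient norm is small
  }
  theta
}

gradient_descent(cars$speed, cars$dist)      # close to the analytical solution
```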
---
class: middle
# Linear Regression
## The general case
$$\begin{eqnarray}
\nabla RSS(\Theta) &=& \nabla[\frac{1}{2}(X\Theta-y)^T (X\Theta-y)]\\
&=& X^T(X\Theta-y)
\end{eqnarray}$$
Ref.: $\frac{d}{d\theta}(x\theta-y)^2 = 2x(x\theta-y)$
---
class: middle
# Linear Regression
## The general case
Feature by feature update:
$$\frac{\partial RSS}{\partial \theta_j} = \sum_{i=0}^{m}(\theta_0X_i^0 + \theta_1X_i^1 +\ ...\ + \theta_nX_i^n - y_i)\,X_i^j$$
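In matrix form all the $\theta_j$ can be updated at once. A minimal sketch in R, again building a small design matrix from `cars`, averaging the gradient over the $m$ observations (to match the $\frac{1}{2m}$ cost convention used earlier) and using an illustrative step size:
```{r, echo=TRUE}
X <- cbind(1, cars$speed)   # the same code works for any number of feature columns
y <- cars$dist
m <- nrow(X)
theta <- c(0, 0)

grad  <- t(X) %*% (X %*% theta - y) / m  # all partial derivatives at once
theta <- theta - 0.005 * grad            # one gradient-descent step
```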
---
class: middle
# Assessing the Model quality
## Residual Standard Error - *RSE*:
* Derived from the Residual Sum of Squares - RSS;
* Associated with each observation is an error term $\epsilon$:
$$y_i = \theta_0+\theta_1X_i^1+\epsilon_i$$
* Even if we knew the true regression line, we would not be able to perfectly predict $Y$ from $X$;
* The RSE is an estimate of the standard deviation of $\epsilon$;
* **The average amount that the response will deviate from the true regression line**
---
class: middle
# Assessing the Model quality
## Residual Standard Error - *RSE*:
* a measure of the lack of fit of the model to the data;
* If the predictions obtained using the model are very close to the true outcome:
+ RSE will be small, and we can conclude that the model fits the data very well;
* If $\widehat{y}_i$ is very far from $y_i$ for one or more observations, then:
+ The RSE may be quite large, indicating that the model doesn't fit the data well;
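A small sketch of computing the RSE in R for a fit on `cars`, using the usual $m-2$ residual degrees of freedom for a two-parameter model; it matches the value reported by `summary()`:
```{r, echo=TRUE}
fit <- lm(dist ~ speed, data = cars)
rss_fit <- sum(residuals(fit)^2)         # residual sum of squares
rse <- sqrt(rss_fit / df.residual(fit))  # df = m - 2 (intercept and slope)
c(manual = rse, from_summary = summary(fit)$sigma)
```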
---
class: middle
# Assessing the Model quality
## The $R^2$
* Provides an alternative measure to RSE;
* "*Unitless*";
* The proportion of variance explained;
* Always takes on a value between 0 and 1;
* Independent of the scale of $Y$;
---
class: middle
# Assessing the Model quality
## The $R^2$
* Statistic close to 1:
+ A large proportion of the variability in the response has been explained by the regression.
* A value near 0:
+ Indicates that the regression did not explain much of the variability in the response;
* It can still be challenging to determine what a good $R^2$ value is;
+ It depends on the application;
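A matching sketch for $R^2$ on the same fit, computed as the proportion of variance explained and compared with the value `summary()` reports:
```{r, echo=TRUE}
fit <- lm(dist ~ speed, data = cars)
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
rss_fit <- sum(residuals(fit)^2)             # residual sum of squares
c(manual = 1 - rss_fit / tss, from_summary = summary(fit)$r.squared)
```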
---
class: center, middle, inverse
# Demo time!
---
class: left
# The Exercises
The barrel calorimeter
```{r, out.width = "500px", fig.align="center", echo=FALSE}
knitr::include_graphics("images/cern_graphic.png")
```
---
class: center, middle
# Thanks!
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).