
Commit 5dfd2a5

Author: KathiBrown
added polynomial regression, parts of hadoop rollout
1 parent cbc2baf commit 5dfd2a5

7 files changed (+143, -4 lines)

21_evaluation_of_hadoop_distros.tex

Lines changed: 37 additions & 0 deletions
@@ -376,3 +376,40 @@ \subsection{Recommendation}
376376
\emph{OutOfMemoryExceptions}. This is because Hadoop and every single service on top of it needs resources, and these can
377377
quickly bring a commodity computer with standard equipment to its knees. In any case, the HDP cluster is
378378
ready for big data tasks and will be extremely helpful to us in the further course of the project. With Apache Spark, even better performance values might be achieved due to its \glqq in-memory\grqq engine. The test cluster with the four VMs also showed which special features and requirements you should pay attention to (e.g. operating system version).
379+
\subsection{Hadoop Rollout}
380+
For the Hadoop rollout, ten fat clients, each with 32 GB RAM, a 2 TB HDD and a 128 GB SSD,
381+
were used. The nodes were connected via a 48-port Gigabit switch and a routable gateway,
382+
so that all nodes were in a single local domain. For various reasons (e.g. security) the installation
383+
of a hypervisor (Proxmox) proved to be a good solution, as it allows any number of VMs to be
384+
generated and configured (figure \ref{fig:figure1_proxmox}).
385+
\begin{figure}[H]
386+
\hspace{-2.8cm}
387+
\includegraphics[width=1.4\textwidth]{img/figure1_proxmox}
388+
\captionof{figure}{Proxmox user interface}\label{fig:figure1_proxmox}
389+
\end{figure}
390+
A VM with Ubuntu Server 18.04.2 LTS was installed on each node. Each VM was allocated
391+
256 GB of storage plus reserved space for the Logical Volume Manager (LVM). Furthermore,
392+
sparse files were enabled and VirtIO SCSI controllers were configured for the individual VMs, which
393+
should ensure maximum VM performance \cite{RN1}.\\\\
394+
For the first rollout attempt, six cluster nodes were configured for Hadoop. The remaining four nodes
395+
were later added to the existing Hadoop cluster via the Ambari web interface. The chosen Hadoop stack
396+
was HDP from Hortonworks, as already explained in the Hadoop evaluation part of this work. In
397+
addition, Webmin was installed on the master node, which offers extensive monitoring services
398+
on the node. For example, the current CPU load, memory usage, kernel information and much
399+
more can be displayed. Figure \ref{fig:figure2_webmin} shows the Webmin interface. Although monitoring services
400+
are also offered by Ambari Metrics, they run on top of Hadoop and are only available when
401+
Ambari itself is running \cite{RN2}.
402+
\begin{figure}[H]
403+
\hspace{-3.2cm}
404+
\includegraphics[width=1.5\textwidth]{img/figure2_webmin}
405+
\captionof{figure}{Webmin dashboard}\label{fig:figure2_webmin}
406+
\end{figure}
407+
Before the Ambari wizard can install HDP, some pre-configuration has to be done on each VM.
408+
On the one hand, \glqq ulimit\grqq (the maximum number of open files) must be set to at least 10000, because Ambari installs several thousand
409+
dependencies. On the other hand, password-less SSH authentication is necessary so that the
410+
master node can connect to its worker nodes without entering a password. In addition, the master
411+
node must be able to execute \glqq sudo\grqq commands without entering a password. This can be done
412+
by editing the sudoers file via \glqq visudo\grqq and adding \glqq username ALL=(ALL) NOPASSWD:ALL\grqq . Another
413+
necessity is to add the IP addresses and hostnames of all cluster nodes to \glqq /etc/hosts\grqq . This
414+
must also be done on each node individually. The following table shows the current host
415+
configuration of a worker (slave) node:

23_data_prediction.tex

Lines changed: 41 additions & 1 deletion
@@ -183,4 +183,44 @@ \subsubsection{Result}\label{sec:resultmlp}
183183
\includegraphics[width=1.4\textwidth]{img/mlpquantile_least}\label{fig:mlpquantile_least}
184184
\captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
185185
\end{figure}
186-
The accuracy of the prediction of the least used station measures a \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
186+
The accuracy of the prediction of the least used station measures a \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
187+
\subsection{Modelling (Polynomial Regression)}\label{poly}
188+
Since the data does not exhibit a linear relationship, a linear regression will not be helpful.
189+
For example, the regressand \glqq Rented Bikes\grqq does not correlate linearly with temperature, as even on rainy
190+
days there is a slight chance that more people rent a bike than on a sunny day, due to a special
191+
holiday. Therefore, polynomial regression was selected as a machine learning method for the
192+
prediction of rented bikes at a station.\\\\
193+
Polynomial regression is a form of regression analysis \cite{RN9}. In fact, it is just a modified version of a
194+
linear regression. This means that the relationship between the independent variable x and the dependent variable y is modelled
195+
as an nth-degree polynomial in x \cite{RN9}.\\\\
196+
More formally, the polynomial regression can be expressed as follows:
197+
$$Y=\beta_0+\beta_1 x+\beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n$$
198+
where n is the degree of the polynomial.\\\\
199+
With the scikit-learn library in Python, a data scientist can import the class \glqq PolynomialFeatures\grqq from
200+
\glqq sklearn.preprocessing\grqq , which generates polynomial features up to a chosen degree. For example,
201+
one could apply \glqq poly = PolynomialFeatures(degree = 3)\grqq to get a polynomial
202+
regression of degree three. This should improve the accuracy, as the underlying data has
203+
no linear relationships but possibly higher-order ones. Furthermore, the higher the degree,
204+
the better the accuracy should be. Unfortunately, the computation time grows rapidly with the degree. A degree of
205+
4 already took several hours to compute and was only slightly better than a regression of degree
206+
three.\\\\
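Purely as an illustrative sketch (not the original code of this work; the file name and feature columns are placeholders), such a degree-3 polynomial regression could be set up with scikit-learn as follows:
\begin{verbatim}
# Illustrative sketch only; file name and feature columns are placeholders.
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("station_usage.csv")
X = df[["temperature", "wind_speed", "season"]]
y = df["rented_bikes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# PolynomialFeatures expands X into all terms up to the given degree,
# LinearRegression then fits the coefficients beta_0 ... beta_n.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("RMSE:", round(rmse, 2))
\end{verbatim}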
207+
The \acs{rmse} for a degree of 4 was around 48.4, which, compared to the other tested
208+
models, is not bad, but also not the best one.\\\\
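For reference, the \acs{rmse} used here denotes the usual root mean square error between the predicted values $\hat{y}_i$ and the observed values $y_i$:
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$$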
209+
Figure \ref{fig:figure9_polynomial_features} shows the plots of each feature against the prediction (rental usage). It turns out
210+
that the feature Season, for example, has no great influence on the use of bicycles, although
211+
there was apparently an outlier in spring with 800 rented bicycles. It also becomes apparent
212+
that bicycles are rented more frequently at low wind speeds than at high wind speeds. Rentals at average
213+
temperatures between 30 and 70 degrees Fahrenheit are particularly high. Unfortunately, the addition of
214+
the past data has not caused much change in accuracy, as shown in figure \ref{fig:figure9_polynomial_features}.\\\\
215+
Figure \ref{fig:figure10_polynomial_prediction} shows a plot of the tested data (prediction) with the feature \glqq Daily Weather\grqq . The plot looks relatively good; except for a few single outliers at 2 (partly-cloudy-day), the prediction of the
216+
tested data matches the training data.
217+
\begin{figure}[H]
218+
\hspace{-2.8cm}
219+
\includegraphics[width=1.4\textwidth]{img/figure9_polynomial_features}
220+
\captionof{figure}{Plot of polynomial regression (features)}\label{fig:figure9_polynomial_features}
221+
\end{figure}
222+
\begin{figure}[H]
223+
\hspace{-2.4cm}
224+
\includegraphics[width=1.3\textwidth]{img/figure10_polynomial_prediction}
225+
\captionof{figure}{Polynomial Prediction of rental usage on daily weather}\label{fig:figure10_polynomial_prediction}
226+
\end{figure}

24_hourly.tex

Lines changed: 59 additions & 0 deletions
@@ -39,6 +39,65 @@ \subsection{Data Preparation}
3939
\includegraphics[width=1.2\textwidth]{img/figure7_weather_df}\label{fig:figure7_weather_df}
4040
\captionof{figure}{Weather dataframe after transformation for two hours}\label{fig:figure7_weather_df}
4141
\end{figure}
42+
This (figure \ref{fig:figure7_weather_df}), however, is more like a jumping window technique, as it uses days as the start date
43+
and not the single hours of each day. Nevertheless, the structure of the data frame from figure \ref{fig:figure7_weather_df} can be used as a basis for applying the sliding window method.\\\\
44+
However, before the sliding window procedure could be implemented, the \glqq Start Date\grqq from figure \ref{fig:figure7_weather_df} had to be converted to an hourly format, whereby the individual hour columns had to be
45+
combined so that only one \glqq current column\grqq existed. For example, \glqq hum00\grqq and \glqq hum01\grqq should
46+
become something like \glqq Current Humidity\grqq , which shows the humidity value for each hour. This
47+
means that further data preparation steps are necessary.\\\\
48+
First, the exact timestamps of the bicycle data were read in from the provided TfL website and
49+
rounded down to an hourly level. For example, \glqq 23:38\grqq became \glqq 23:00\grqq ; minutes and seconds were
50+
ignored. Arithmetic rounding (half-up mode) did not make sense, because then there would be
51+
overlaps with the following day when 23 o'clock is rounded up. However, this also means that a
52+
slight distortion must be assumed. For example, many bicycles could be rented at a bicycle station
53+
at 17:52 and significantly fewer after 18 o'clock. Then one would notice a peak at 17 o'clock and a
54+
low usage at 18 o'clock, although in fact more bicycles were rented around 18 o'clock.\\\\
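Purely as a minimal sketch (the data frame name \glqq usage\_df\grqq and the column \glqq Start Date\grqq are assumptions for illustration), such a truncation to the full hour can be expressed in PySpark with \glqq date\_trunc\grqq :
\begin{verbatim}
from pyspark.sql import functions as F

# Truncate every rental timestamp to the full hour, e.g. 23:38 -> 23:00
# (sketch; "usage_df" and "Start Date" are placeholder names).
usage_df = usage_df.withColumn("Hour", F.date_trunc("hour", F.col("Start Date")))
\end{verbatim}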
55+
Another problem was that usage data for the bicycle stations was not available for every hour.
56+
However, this data quality problem could be solved relatively easily. The missing hours can be
57+
determined by a user-defined function. Since every day has 24 hours, this can be calculated
58+
manually as follows (figure \ref{fig:listing6}):
59+
\begin{figure}[H]
60+
\hspace{-1.6cm}
61+
\includegraphics[width=1.2\textwidth]{img/listing6}
62+
\captionof{figure}{Adding missing hours on usage data}\label{fig:listing6}
63+
\end{figure}
64+
A prerequisite of this method from figure \ref{fig:listing6} is that at least one maximum value and one minimum
65+
value exist that serve as boundaries. If no data was collected on a day at a station, null values
66+
would appear. This does not happen, however, and even if it did, it would be removed at a later
67+
stage, as such data does not provide any benefit for training a machine learning model.\\\\
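The actual implementation is the one shown in figure \ref{fig:listing6}; purely as an illustrative sketch, a user-defined function that fills the missing hours of one day and station could look like this:
\begin{verbatim}
# Illustrative sketch only, not the listing from the figure above.
def fill_missing_hours(hourly_counts):
    """hourly_counts: dict mapping hour (0-23) to the rental count of one
    station on one day; recorded hours stay unchanged."""
    # Every day has 24 hours; hours without a record get a usage of 0.
    return {hour: hourly_counts.get(hour, 0) for hour in range(24)}
\end{verbatim}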
68+
Another problem was how to transform the individual hour columns of the weather data so that
69+
only one column exists instead of e.g. 24 columns, whereby this new column should contain all
70+
values of the 24 columns in the correct order. Fortunately, the function \glqq explode\grqq exists in PySpark
71+
and is provided via the module \glqq pyspark.sql.functions\grqq . With \glqq explode\grqq , an array can be passed,
72+
which then returns a new row for each element of the array \cite{RN8}. The corresponding code excerpt
73+
looks as follows:
74+
\begin{figure}[H]
75+
\hspace{-1.6cm}
76+
\includegraphics[width=1.2\textwidth]{img/listing7}
77+
\captionof{figure}{Further transformation of weather data (excerpt)}\label{fig:listing7}
78+
\end{figure}
79+
The code example from figure \ref{fig:listing7} is used to transform the temperature columns, which are
80+
combined into a single column as described above. Thus the weather data frame is prepared to
81+
the extent that every hour of a day is assigned a specific hourly weather value. Lines 35--39 show the use of the PySpark function \glqq lag\grqq , which over a selected \glqq window\grqq returns the value at the starting
82+
boundary \cite{RN8}. If one increments the offset at this point, one gets the previous value from the
83+
window. Depending on how far one wants to go into the past, the window is moved over the data
84+
set and thus a sliding window is achieved. This can also be adapted for the future. With the function
85+
\glqq lead\grqq , the value at the ending boundary of a \glqq window\grqq is retrieved \cite{RN8}. Again, the future sliding window is only
86+
applied to the usage (target variable) and not to the weather data or other columns.\\\\
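Purely as an illustrative sketch (not the excerpt from figure \ref{fig:listing7}; all data frame and column names are placeholders, and \glqq posexplode\grqq is used instead of plain \glqq explode\grqq so that the hour index is kept), the combination of exploding the hourly columns and applying \glqq lag\grqq /\glqq lead\grqq over a window could look like this:
\begin{verbatim}
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Turn the per-hour columns (only two shown here) into one row per hour;
# posexplode also returns the position, i.e. the hour of the day.
hourly = weather_df.select(
    "date",
    F.posexplode(F.array("temp00", "temp01")).alias("hour", "current_temperature"),
)

# Window ordered by time (no partitioning in this small sketch).
w = Window.orderBy("date", "hour")

# Past sliding window: temperature of the previous three hours via lag().
for i in range(1, 4):
    hourly = hourly.withColumn(
        "temperature_minus_{}h".format(i), F.lag("current_temperature", i).over(w)
    )

# Future sliding window: applied only to the target variable "usage"
# (assuming that column has been joined into the data frame by this point).
hourly = hourly.withColumn("usage_plus_1h", F.lead("usage", 1).over(w))
\end{verbatim}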
87+
The prepared hourly temperature data with the sliding window finally looks as follows:
88+
\begin{figure}[H]
89+
\centering
90+
\includegraphics[width=0.8\textwidth]{img/figure8_temperature_df}\label{fig:figure8_temperature_df}
91+
\captionof{figure}{Temperature dataframe after transformation for 3 hours (past)}\label{fig:figure8_temperature_df}
92+
\end{figure}
93+
In this example from figure \ref{fig:figure8_temperature_df} it is clearly recognizable that the initial rows contain \glqq null\grqq values, which is due to the fact that no weather data was available before 04.01.2015 (or at least they
94+
were not fetched from Dark Sky). The number of null values increases with the number of selected
95+
hours into the past. The same applies to the last rows of the Spark data frame, where null values
96+
may appear for the \glqq future usage\grqq columns. Therefore, the first 20 and the last 20 rows of the
97+
data frame are removed at the end.\\\\
98+
With the described PySpark transformations it was possible to create dynamic hourly data frames,
99+
which can then be used in the next step of the modeling phase. A corresponding test file can be
100+
found on GitHub in the folder \glqq data preparation\grqq .
42101
\subsubsection{Holidays}
43102

44103
The bank holidays to be added are day-based, so they should be set on a 24-hour window.

DSPRReport.pdf

510 KB
Binary file not shown.

bibtex/library.bib

Lines changed: 6 additions & 3 deletions
@@ -347,7 +347,7 @@ @inproceedings{riedmiller1993direct
347347
}
348348
@article{RN2,
349349
author = {Hortonworks},
350-
title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/ch_using_ambari_metrics.html},
350+
title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk{\_}ambari-operations/content/ch{\_}using{\_}ambari{\_}metrics.html},
351351
year = {2019},
352352
type = {Journal Article}
353353
}
@@ -390,8 +390,11 @@ @article{RN5
390390
}
391391

392392
@misc{RN1,
393-
year = {2019},
394-
type = {Generic}
393+
author = {OVirt},
394+
title = {{Virtio-SCSI}},
395+
url = {https://ovirt.org/develop/release-management/features/storage/virtio-scsi.html},
396+
urldate = {2019-06-20},
397+
year = {2019}
395398
}
396399

397400
@article{RN3,

img/listing6.png

24.1 KB

img/listing7.png

72.4 KB
