
Commit 5dfd2a5

Author: KathiBrown
added polynomial regression, parts of hadoop rollout
1 parent cbc2baf commit 5dfd2a5

7 files changed (+143, -4 lines)

21_evaluation_of_hadoop_distros.tex

Lines changed: 37 additions & 0 deletions
@@ -376,3 +376,40 @@ \subsection{Recommendation}
376376
\emph{OutOfMemoryExceptions}. This is because Hadoop and every single service on top of it needs resources, and these can
377377
quickly bring a commodity computer with standard equipment to its knees. In any case, the HDP cluster is
378378
ready for big data tasks and will be extremely helpful to us in the further course of the project. With Apache Spark, even better performance values might be achieved due to its \glqq in-memory\grqq engine. The test cluster with the four VMs also showed which special features and requirements you should pay attention to (e.g. operating system version).
379+
\subsection{Hadoop Rollout}
380+
For the Hadoop rollout, ten fat clients, each with 32 GB RAM, a 2 TB HDD and a 128 GB SSD,
381+
were used. The nodes were connected via a 48-port Gigabit switch and a routable gateway,
382+
so that all nodes were in a single local domain. For various reasons (e.g. security) the installation
383+
of a hypervisor (Proxmox) proved to be a good solution, as it allows any number of VMs to be
384+
generated and configured (figure \ref{fig:figure1_proxmox}).
385+
\begin{figure}[H]
386+
\hspace{-2.8cm}
387+
\includegraphics[width=1.4\textwidth]{img/figure1_proxmox}
388+
\captionof{figure}{Proxmox user interface}\label{fig:figure1_proxmox}
389+
\end{figure}
390+
A VM with Ubuntu Server 18.04.2 LTS was installed on each node. Each VM was allocated
391+
256 GB of storage plus reserved space for the Logical Volume Manager (LVM). Furthermore,
392+
sparse files were enabled and VirtIO SCSI controllers were configured for the individual VMs, which
393+
should ensure maximum VM performance \cite{RN1}.\\\\
394+
For the first rollout attempt, six cluster nodes were configured for Hadoop. The remaining four nodes
395+
were later added to the existing Hadoop cluster via the Ambari web interface. The chosen Hadoop stack
396+
was HDP from Hortonworks, as already explained in the Hadoop evaluation part of this work. In
397+
addition, Webmin was installed on the master node, which offers extensive monitoring services
398+
on the node. For example, the current CPU load, memory usage, kernel information and much
399+
more can be displayed. Figure \ref{fig:figure2_webmin} shows the Webmin interface. Although monitoring services
400+
are also offered by Ambari Metrics, they run on top of Hadoop and are only available when
401+
Ambari itself is running \cite{RN2}.
402+
\begin{figure}[H]
403+
\hspace{-3.2cm}
404+
\includegraphics[width=1.5\textwidth]{img/figure2_webmin}
405+
\captionof{figure}{Webmin dashboard}\label{fig:figure2_webmin}
406+
\end{figure}
407+
Before the Ambari wizard can install HDP, some pre-configuration has to be done on each VM.
408+
On the one hand, \glqq ulimit\grqq (the maximum number of open files) must be set to at least 10000, because Ambari installs several thousand
409+
dependencies. On the other hand, password-less SSH authentication is necessary so that the
410+
master node can connect to its worker nodes without entering a password. In addition, the master
411+
node must be able to execute \glqq sudo\grqq commands without entering a password. This can be done
412+
by editing the sudoers file via \glqq visudo\grqq and adding \glqq username ALL=(ALL) NOPASSWD:ALL\grqq . Another
413+
necessity is to add the IP addresses and hostnames of all cluster nodes to \glqq /etc/hosts\grqq . This
414+
must also be done on each node individually. The following table shows the current host
415+
configuration of a worker (slave) node:

23_data_prediction.tex

Lines changed: 41 additions & 1 deletion
@@ -183,4 +183,44 @@ \subsubsection{Result}\label{sec:resultmlp}
183183
\includegraphics[width=1.4\textwidth]{img/mlpquantile_least}\label{fig:mlpquantile_least}
184184
\captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
185185
\end{figure}
186-
The accuracy of the prediction of the least used station measures a \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
186+
The accuracy of the prediction of the least used station measures a \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
187+
\subsection{Modelling (Polynomial Regression)}\label{poly}
188+
Since the data does not exhibit a linear relationship, a linear regression will not be helpful.
189+
For example, the regressand \glqq Rented Bikes\grqq does not correlate linearly with temperature, as even on rainy
190+
days there is a slight chance that more people rent a bike than on a sunny day, due to a special
191+
holiday. Therefore, polynomial regression was selected as a machine learning method for the
192+
prediction of rented bikes at a station.\\\\
193+
Polynomial regression is a form of regression analysis \cite{RN9}. In fact, it is just a modified version of a
194+
linear regression. This means that the relationship between the independent variable x and the dependent variable y is modelled
195+
as an nth-degree polynomial in x \cite{RN9}.\\\\
196+
More formally, the polynomial regression can be expressed as follows:
197+
$$Y=\beta_0+\beta_1 x+\beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n$$
198+
where n is the degree of the polynomial.\\\\
199+
With the scikit-learn library in Python, a data scientist can import the class \glqq PolynomialFeatures\grqq from
200+
\glqq sklearn.preprocessing\grqq , which generates polynomial features up to a chosen degree. For example,
201+
one could apply \glqq poly = PolynomialFeatures(degree = 3)\grqq to get a polynomial
202+
regression of degree three. This should improve the accuracy, as the underlying data has
203+
no linear relationships but possibly higher-order ones. Furthermore, the higher the degree,
204+
the better the accuracy should be. Unfortunately, the computation time grows rapidly with the degree. A degree of
205+
4 already took several hours to compute and was only slightly better than a regression of degree
206+
three.\\\\
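Purely as an illustrative sketch (not the original code of this work; the file name and feature columns are placeholders), such a degree-3 polynomial regression could be set up with scikit-learn as follows:
\begin{verbatim}
# Illustrative sketch only; file name and feature columns are placeholders.
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("station_usage.csv")
X = df[["temperature", "wind_speed", "season"]]
y = df["rented_bikes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# PolynomialFeatures expands X into all terms up to the given degree,
# LinearRegression then fits the coefficients beta_0 ... beta_n.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("RMSE:", round(rmse, 2))
\end{verbatim}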
207+
The \acs{rmse} for a degree of 4 was around 48.4, which, compared to the other tested
208+
models, is not bad, but also not the best one.\\\\
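For reference, the \acs{rmse} used here denotes the usual root mean square error between the predicted values $\hat{y}_i$ and the observed values $y_i$:
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$$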
209+
Figure \ref{fig:figure9_polynomial_features} shows the plots of each feature against the prediction (rental usage). It turns out
210+
that the feature Season, for example, has no great influence on the use of bicycles, although
211+
there was apparently an outlier in spring with 800 rented bicycles. It also becomes apparent
212+
that bicycles are rented more frequently at low wind speeds than at high wind speeds. Rentals at average
213+
temperatures between 30 and 70 degrees Fahrenheit are particularly high. Unfortunately, the addition of
214+
the past data has not caused much change in accuracy, as shown in figure \ref{fig:figure9_polynomial_features}.\\\\
215+
Figure \ref{fig:figure10_polynomial_prediction} shows a plot of the tested data (prediction) with the feature \glqq Daily Weather\grqq . The plot looks relatively good; except for a few single outliers at 2 (partly-cloudy-day), the prediction of the
216+
tested data matches the training data.
217+
\begin{figure}[H]
218+
\hspace{-2.8cm}
219+
\includegraphics[width=1.4\textwidth]{img/figure9_polynomial_features}
220+
\captionof{figure}{Plot of polynomial regression (features)}\label{fig:figure9_polynomial_features}
221+
\end{figure}
222+
\begin{figure}[H]
223+
\hspace{-2.4cm}
224+
\includegraphics[width=1.3\textwidth]{img/figure10_polynomial_prediction}
225+
\captionof{figure}{Polynomial Prediction of rental usage on daily weather}\label{fig:figure10_polynomial_prediction}
226+
\end{figure}

24_hourly.tex

Lines changed: 59 additions & 0 deletions
@@ -39,6 +39,65 @@ \subsection{Data Preparation}
3939
\includegraphics[width=1.2\textwidth]{img/figure7_weather_df}\label{fig:figure7_weather_df}
4040
\captionof{figure}{Weather dataframe after transformation for two hours}\label{fig:figure7_weather_df}
4141
\end{figure}
42+
This (figure \ref{fig:figure7_weather_df}), however, is more like a jumping window technique, as it uses days as the start date
43+
and not the single hours of each day. Nevertheless, the structure of the data frame from figure \ref{fig:figure7_weather_df} can be used as a basis for applying the sliding window method.\\\\
44+
However, before the sliding window procedure could be implemented, the \glqq Start Date\grqq from figure \ref{fig:figure7_weather_df} had to be converted to an hourly format, whereby the individual hour columns had to be
45+
combined so that only one \glqq current column\grqq existed. For example, \glqq hum00\grqq and \glqq hum01\grqq should
46+
become something like \glqq Current Humidity\grqq , which shows the humidity value for each hour. This
47+
means that further data preparation steps are necessary.\\\\
48+
First, the exact timestamps of the bicycle data were read in from the provided TfL website and
49+
rounded down to an hourly level. For example, \glqq 23:38\grqq became \glqq 23:00\grqq ; minutes and seconds were
50+
ignored. Arithmetic rounding (half-up mode) did not make sense, because then there would be
51+
overlaps with the following day when 23 o'clock is rounded up. However, this also means that a
52+
slight distortion must be assumed. For example, many bicycles could be rented at a bicycle station
53+
at 17:52 and significantly fewer after 18 o'clock. Then one would notice a peak at 17 o'clock and a
54+
low usage at 18 o'clock, although in fact more bicycles were rented around 18 o'clock.\\\\
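Purely as a minimal sketch (the data frame name \glqq usage\_df\grqq and the column \glqq Start Date\grqq are assumptions for illustration), such a truncation to the full hour can be expressed in PySpark with \glqq date\_trunc\grqq :
\begin{verbatim}
from pyspark.sql import functions as F

# Truncate every rental timestamp to the full hour, e.g. 23:38 -> 23:00
# (sketch; "usage_df" and "Start Date" are placeholder names).
usage_df = usage_df.withColumn("Hour", F.date_trunc("hour", F.col("Start Date")))
\end{verbatim}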
55+
Another problem was that usage data for the bicycle stations was not available for every hour.
56+
However, this data quality problem could be solved relatively easily. The missing hours can be
57+
determined by a user-defined function. Since every day has 24 hours, this can be calculated
58+
manually as follows (figure \ref{fig:listing6}):
59+
\begin{figure}[H]
60+
\hspace{-1.6cm}
61+
\includegraphics[width=1.2\textwidth]{img/listing6}
62+
\captionof{figure}{Adding missing hours on usage data}\label{fig:listing6}
63+
\end{figure}
64+
A prerequisite of this method from figure \ref{fig:listing6} is that at least one maximum value and one minimum
65+
value exist that serve as boundaries. If no data was collected on a day at a station, null values
66+
would appear. This does not happen, however, and even if it did, it would be removed at a later
67+
stage, as such data does not provide any benefit for training a machine learning model.\\\\
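The actual implementation is the one shown in figure \ref{fig:listing6}; purely as an illustrative sketch, a user-defined function that fills the missing hours of one day and station could look like this:
\begin{verbatim}
# Illustrative sketch only, not the listing from the figure above.
def fill_missing_hours(hourly_counts):
    """hourly_counts: dict mapping hour (0-23) to the rental count of one
    station on one day; recorded hours stay unchanged."""
    # Every day has 24 hours; hours without a record get a usage of 0.
    return {hour: hourly_counts.get(hour, 0) for hour in range(24)}
\end{verbatim}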
68+
Another problem was how to transform the individual hour columns of the weather data so that
69+
only one column exists instead of e.g. 24 columns, whereby this new column should contain all
70+
values of the 24 columns in the correct order. Fortunately, the function \glqq explode\grqq exists in PySpark
71+
and is provided via the module \glqq pyspark.sql.functions\grqq . With \glqq explode\grqq , an array can be passed,
72+
which then returns a new row for each element of the array \cite{RN8}. The corresponding code excerpt
73+
looks as follows:
74+
\begin{figure}[H]
75+
\hspace{-1.6cm}
76+
\includegraphics[width=1.2\textwidth]{img/listing7}
77+
\captionof{figure}{Further transformation of weather data (excerpt)}\label{fig:listing7}
78+
\end{figure}
79+
The code example from figure \ref{fig:listing7} is used to transform the temperature columns, which are
80+
combined into a single column as described above. Thus the weather data frame is prepared to
81+
the extent that every hour of a day is assigned a specific hourly weather value. Lines 35--39 show the use of the PySpark function \glqq lag\grqq , which over a selected \glqq window\grqq returns the value at the starting
82+
boundary \cite{RN8}. If one increments the offset at this point, one gets the previous value from the
83+
window. Depending on how far one wants to go into the past, the window is moved over the data
84+
set and thus a sliding window is achieved. This can also be adapted for the future. With the function
85+
\glqq lead\grqq , the value at the ending boundary of a \glqq window\grqq is retrieved \cite{RN8}. Again, the future sliding window is only
86+
applied to the usage (target variable) and not to the weather data or other columns.\\\\
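Purely as an illustrative sketch (not the excerpt from figure \ref{fig:listing7}; all data frame and column names are placeholders, and \glqq posexplode\grqq is used instead of plain \glqq explode\grqq so that the hour index is kept), the combination of exploding the hourly columns and applying \glqq lag\grqq /\glqq lead\grqq over a window could look like this:
\begin{verbatim}
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Turn the per-hour columns (only two shown here) into one row per hour;
# posexplode also returns the position, i.e. the hour of the day.
hourly = weather_df.select(
    "date",
    F.posexplode(F.array("temp00", "temp01")).alias("hour", "current_temperature"),
)

# Window ordered by time (no partitioning in this small sketch).
w = Window.orderBy("date", "hour")

# Past sliding window: temperature of the previous three hours via lag().
for i in range(1, 4):
    hourly = hourly.withColumn(
        "temperature_minus_{}h".format(i), F.lag("current_temperature", i).over(w)
    )

# Future sliding window: applied only to the target variable "usage"
# (assuming that column has been joined into the data frame by this point).
hourly = hourly.withColumn("usage_plus_1h", F.lead("usage", 1).over(w))
\end{verbatim}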
87+
The prepared hourly temperature data with the sliding window finally looks as follows:
88+
\begin{figure}[H]
89+
\centering
90+
\includegraphics[width=0.8\textwidth]{img/figure8_temperature_df}\label{fig:figure8_temperature_df}
91+
\captionof{figure}{Temperature dataframe after transformation for 3 hours (past)}\label{fig:figure8_temperature_df}
92+
\end{figure}
93+
In this example from figure \ref{fig:figure8_temperature_df} it is clearly recognizable that the initial rows contain \glqq null\grqq values, which is due to the fact that no weather data was available before 04.01.2015 (or at least they
94+
were not fetched from Dark Sky). The number of null values increases with the number of selected
95+
hours into the past. The same applies to the last rows of the Spark data frame, where null values
96+
may appear for the \glqq future usage\grqq columns. Therefore, the first 20 and the last 20 rows of the
97+
data frame are removed at the end.\\\\
98+
With the described PySpark transformations it was possible to create dynamic hourly data frames,
99+
which can then be used in the next step of the modeling phase. A corresponding test file can be
100+
found on GitHub in the folder \glqq data preparation\grqq .
42101
\subsubsection{Holidays}
43102

44103
The bank holidays to be added are day-based, so they should be set on a 24-hour window.

DSPRReport.pdf

510 KB
Binary file not shown.

bibtex/library.bib

Lines changed: 6 additions & 3 deletions
@@ -347,7 +347,7 @@ @inproceedings{riedmiller1993direct
347347
}
348348
@article{RN2,
349349
author = {Hortonworks},
350-
title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/ch_using_ambari_metrics.html},
350+
title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk{\_}ambari-operations/content/ch{\_}using{\_}ambari{\_}metrics.html},
351351
year = {2019},
352352
type = {Journal Article}
353353
}
@@ -390,8 +390,11 @@ @article{RN5
390390
}
391391

392392
@misc{RN1,
393-
year = {2019},
394-
type = {Generic}
393+
author = {OVirt},
394+
title = {{Virtio-SCSI}},
395+
url = {https://ovirt.org/develop/release-management/features/storage/virtio-scsi.html},
396+
urldate = {2019-06-20},
397+
year = {2019}
395398
}
396399

397400
@article{RN3,

img/listing6.png

24.1 KB

img/listing7.png

72.4 KB
