\emph{OutOfMemory} exceptions. This is because Hadoop and every single service on top of it needs resources, and these can quickly bring a commodity computer with standard equipment to its knees. In any case, the HDP cluster is ready for big data tasks and will be extremely helpful to us in the further course of the project. With Apache Spark, even better performance values might be achieved due to its \glqq in-memory\grqq{} engine. The test cluster with the four VMs also showed which special features and requirements one should pay attention to (e.g. the operating system version).
\subsection{Hadoop Rollout}
For the Hadoop rollout, ten fat clients, each with 32 GB RAM, a 2 TB HDD and a 128 GB SSD, have been used. The nodes were connected via a 48-port Gigabit switch and a routable gateway, so that all nodes were located in a single local domain. For various reasons (e.g. security), the installation of a hypervisor (Proxmox) has proven to be successful, with which any number of VMs can be created and configured (figure \ref{fig:figure1_proxmox}).
\captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
\end{figure}
The prediction for the least used station reaches an \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
With the scikit-learn library in Python, a data scientist can import \glqq PolynomialFeatures\grqq{} from \glqq sklearn.preprocessing\grqq{}, which expands the input features into higher-degree polynomial features. For example, one could apply \glqq poly = PolynomialFeatures(degree = 3)\grqq{} to obtain a polynomial regression of degree three. This should improve the accuracy, as our underlying data has no linear relationships but possibly higher-order ones. Furthermore, the higher the degree, the better the accuracy should be. Unfortunately, the computation time grows rapidly with the degree: a degree of 4 already took several hours to compute and was only slightly better than a regression of degree three.\\\\
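The following is a minimal sketch of this approach with synthetic placeholder data (the feature matrix, target values and variable names are illustrative and not taken from the project code):
\begin{verbatim}
# Sketch: polynomial regression of degree 3 with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic placeholder data instead of the real feature matrix.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))
y = 3 * X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Expand the features to all polynomial terms up to degree 3 and fit
# an ordinary linear regression on the expanded feature set.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.3f}")
\end{verbatim}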
The \acs{rmse} for a degree of 4 was around 48.4, which, compared to the other tested models, is not really bad, but perhaps also not the best.\\\\
Figure \ref{fig:figure9_polynomial_features} shows the plots of each feature against the prediction (rented usage). It turns out that the feature Season, for example, has no major influence on the use of bicycles, although there was apparently an outlier in spring with 800 rented bicycles. It also becomes apparent that bicycles are rented more frequently at low wind speeds than at high wind speeds. Usage is particularly high at average temperatures between 30 and 70 degrees Fahrenheit. Unfortunately, the addition of the past data has not caused much change in accuracy, as shown in figure \ref{fig:figure9_polynomial_features}.\\\\
Figure \ref{fig:figure10_polynomial_prediction} shows a plot of the test data (prediction) with the feature \glqq Daily Weather\grqq{}. The plot looks relatively good, except for a few single outliers at 2 (partly-cloudy-day), the prediction of the
\captionof{figure}{Weather dataframe after transformation for two hours}\label{fig:figure7_weather_df}
\end{figure}
This (figure \ref{fig:figure7_weather_df}), however, is more like a jumping window technique, as it uses days as the start date and not the single hours of each day. Nevertheless, the structure of the data frame from figure \ref{fig:figure7_weather_df} can be used as a basis for applying the sliding window method.\\\\
However, before the sliding window procedure could be implemented, the \glqq Start Date\grqq{} from figure \ref{fig:figure7_weather_df} had to be converted to an hourly format, whereby the individual hour columns had to be combined so that only one \glqq current\grqq{} column existed. For example, \glqq hum00\grqq{} and \glqq hum01\grqq{} should become something like \glqq Current Humidity\grqq{}, which shows the humidity value of each hour. This means that further data preparation steps are necessary.\\\\
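A minimal PySpark sketch of how the hourly columns could be unpivoted into one \glqq current\grqq{} column (the tiny example data frame and the column selection are illustrative, not the project's actual code):
\begin{verbatim}
# Sketch: unpivot hourly humidity columns (hum00, hum01, ...) into a
# single "Current Humidity" column with an explicit hour column.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative frame; the real one has hum00 ... hum23.
weather_df = spark.createDataFrame(
    [("04/01/2015", 0.81, 0.79), ("05/01/2015", 0.75, 0.77)],
    ["Start Date", "hum00", "hum01"],
)
hour_cols = ["hum00", "hum01"]

hourly_df = weather_df.select(
    "Start Date",
    F.posexplode(F.array(*[F.col(c) for c in hour_cols]))
     .alias("Hour", "Current Humidity"),
)
hourly_df.show()
\end{verbatim}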
First, the exact timestamps of the bicycle data were read in from the provided TfL website and rounded down to an hourly level. For example, \glqq 23:38\grqq{} became \glqq 23:00\grqq{}; minutes and seconds were ignored. Arithmetic rounding (round-half-up) did not make sense, because there would then be overlaps with the following day whenever 23 o'clock is rounded up. However, this also means that a slight distortion must be accepted. For example, many bicycles could be rented at a bicycle station at 17:52 and significantly fewer after 18 o'clock. One would then see a peak at 17 o'clock and low usage at 18 o'clock, although in fact more bicycles were rented around 18 o'clock.\\\\
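In PySpark, such a truncation to full hours could look like the following sketch (the timestamps and column names are placeholders):
\begin{verbatim}
# Sketch: floor rental timestamps to the full hour.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

rentals_df = spark.createDataFrame(
    [("2015-01-04 23:38:12",), ("2015-01-04 17:52:40",)],
    ["Start Date"],
)

# date_trunc("hour", ...) drops minutes and seconds, e.g. 23:38 -> 23:00.
rentals_df = rentals_df.withColumn(
    "Start Hour", F.date_trunc("hour", F.to_timestamp("Start Date"))
)
rentals_df.show(truncate=False)
\end{verbatim}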
Another problem was that usage data for the bicycle stations was not available for every hour. However, this data quality problem could be solved relatively easily. The missing hours can be determined by a user-defined function. Since every day has 24 hours, this can be calculated manually as follows (figure \ref{fig:listing6}):
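The actual implementation is shown in figure \ref{fig:listing6}; as a rough, simplified illustration of the idea (not the project's listing), such a function could compare the 24 expected hours of a day with the hours that actually occur:
\begin{verbatim}
# Simplified illustration: which of the 24 hours of a day are missing?
def missing_hours(present_hours):
    """Return the hours (0-23) of a day without any usage records."""
    return sorted(set(range(24)) - set(present_hours))

# Example: records exist only for 0-5 and 7 o'clock.
print(missing_hours([0, 1, 2, 3, 4, 5, 7]))  # -> [6, 8, 9, ..., 23]
\end{verbatim}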
\captionof{figure}{Further transformation of weather data (excerpt)}\label{fig:listing7}
\end{figure}
The code example from figure \ref{fig:listing7} is used to transform the temperature columns, which are combined into a single column as described above. Thus the weather data frame is prepared to the extent that every hour of a day belongs to a specific hourly weather value. Lines 35--39 show the use of the PySpark function \glqq lag\grqq{}, which over a selected \glqq window\grqq{} takes the starting boundary \cite{RN8}. If one increments the index at this point, one gets the previous value from the window. Depending on how far one wants to go into the past, the window is moved over the data set and thus a sliding window is achieved. This can also be adapted for the future: with the function \glqq lead\grqq{} the ending boundary of a \glqq window\grqq{} is retrieved \cite{RN8}. Again, the future sliding window is only applied to the usage (target variable) and not to the weather data or other columns.\\\\
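A small, self-contained sketch of this sliding-window construction with \glqq lag\grqq{} and \glqq lead\grqq{} (toy data and column names, not the code from listing \ref{fig:listing7}):
\begin{verbatim}
# Sketch: past weather values via lag(), future usage values via lead().
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 5.0, 12), (2, 5.5, 9), (3, 6.0, 15), (4, 6.5, 20)],
    ["hour_index", "current_temp", "usage"],
)

# Order the window by time; lag() looks backwards, lead() forwards.
# (No partitioning here, which is fine for a small toy example.)
w = Window.orderBy("hour_index")

df = (
    df.withColumn("temp_minus_1h", F.lag("current_temp", 1).over(w))
      .withColumn("temp_minus_2h", F.lag("current_temp", 2).over(w))
      # The future window is only applied to the target variable (usage).
      .withColumn("usage_plus_1h", F.lead("usage", 1).over(w))
)
df.show()
\end{verbatim}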
The prepared hourly temperature data with the sliding window finally looks as follows:
\captionof{figure}{Temperature dataframe after transformation for 3 hours (past)}\label{fig:figure8_temperature_df}
\end{figure}
In this example from figure \ref{fig:figure8_temperature_df} it is clearly recognizable that the initial rows contain \glqq null\grqq{} values, which is due to the fact that no weather data was available before 04.01.2015 (or at least it was not fetched from Dark Sky). The number of null values increases with the number of selected hours into the past. The same applies to the last rows of the Spark data frame, where null values may appear in the \glqq future usage\grqq{} columns. Therefore the first 20 and the last 20 rows of the data frame are removed at the end.\\\\
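Continuing the toy data frame from the previous sketch, an equivalent and simple way to perform this cleanup is to drop every row in which one of the lag/lead columns is null (a sketch of the idea, not the project's exact removal of the first and last 20 rows):
\begin{verbatim}
# Sketch: drop the boundary rows whose sliding-window columns are null.
lag_lead_cols = ["temp_minus_1h", "temp_minus_2h", "usage_plus_1h"]
cleaned_df = df.na.drop(subset=lag_lead_cols)
cleaned_df.show()
\end{verbatim}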
With the described PySpark transformations it was possible to create dynamic hourly data frames, which can then be used in the next step of the modeling phase. A corresponding test file can be found on GitHub in the folder data preparation.
\subsubsection{Holidays}
The bank holidays to be added are day-based, so they should be mapped onto 24-hour windows.
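A sketch of how such day-based holidays could be expanded to 24-hour windows in PySpark (the holiday dates and column names are illustrative):
\begin{verbatim}
# Sketch: expand day-based bank holidays into one row per hour (0-23),
# so they can be joined onto the hourly data frame as a holiday flag.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

holidays_df = spark.createDataFrame(
    [("2015-01-01",), ("2015-04-03",)], ["holiday_date"]
)
hours_df = spark.range(24).withColumnRenamed("id", "hour")

hourly_holidays_df = (
    holidays_df.crossJoin(hours_df).withColumn("is_holiday", F.lit(1))
)
hourly_holidays_df.show(48)
\end{verbatim}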