
Commit 15283fc

Author: KathiBrown
Commit message: finished pascals part
1 parent 5dfd2a5 commit 15283fc

7 files changed (+90 / -20 lines)

01_introduction.tex

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 % vim:ft=tex
 
 \section{Introduction}
-
 No matter which area you work in, there is a high chance that data science will be there too. It is one of the fastest-moving trends of recent years, driven by the unimaginable amount of data we produce every day. We could gain a lot of knowledge from this data, but to do so we must analyze it. That is where data science comes into play. With the help of data science we can not only analyze the past but also predict the future.\\\\
 Let's say we have a bike rental station that rents out bikes for a certain amount of time. We track the duration of each rental, the time of rental, the routes the bikes take and a lot more information. By processing all this data it is possible to predict the future usage of the bikes and the most frequented routes at specific times.
 That is the main goal of this project. This project report deals with tasks that can be summarized as \emph{preparatory steps} towards this final goal. The project duration is divided into two semesters; the tasks of the first part are discussed in this paper.

21_evaluation_of_hadoop_distros.tex

Lines changed: 73 additions & 1 deletion
@@ -412,4 +412,76 @@ \subsection{Hadoop Rollout}
 by editing the sudoers file via \glqq visudo\grqq and adding \glqq username ALL=(ALL) NOPASSWD:ALL\grqq . Another
 necessity is to add the IP addresses and hostnames of all cluster nodes under \glqq /etc/hosts\grqq . This
 must also be done on each node individually. The following table shows the current host
-configurations of a worker respectively slave node:
+configuration of a worker (slave) node:
+\begin{table}[H]
+\centering
+\begin{tabular}{|l|l|}
+\hline
+\textbf{IP-Address} & \textbf{List of hostnames} \\ \hline
+10.64.180.163 & \begin{tabular}[c]{@{}l@{}}hortonworks-01.dasc.cs.thu.de\\ hortonworks-01\end{tabular} \\ \hline
+10.64.83.106 & \begin{tabular}[c]{@{}l@{}}hortonworks-02.dasc.cs.thu.de\\ hortonworks-02\end{tabular} \\ \hline
+10.64.79.161 & \begin{tabular}[c]{@{}l@{}}hortonworks-03.dasc.cs.thu.de\\ hortonworks-03\end{tabular} \\ \hline
+10.64.227.154 & \begin{tabular}[c]{@{}l@{}}hortonworks-04.dasc.cs.thu.de\\ hortonworks-04\end{tabular} \\ \hline
+10.64.159.100 & \begin{tabular}[c]{@{}l@{}}hortonworks-05.dasc.cs.thu.de\\ hortonworks-05\end{tabular} \\ \hline
+10.64.204.57 & \begin{tabular}[c]{@{}l@{}}hortonworks-06.dasc.cs.thu.de\\ hortonworks-06\end{tabular} \\ \hline
+\end{tabular}
+\caption{/etc/hosts of a worker node}
+\label{tab:hadooprollout}
+\end{table}
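Since every node must be able to resolve every other node by name, a quick check of the entries above can save debugging time before the rollout. The following is a minimal sketch, not taken from the report, that simply resolves the hostnames listed in the table:

# check_hosts.py -- minimal sanity check of the /etc/hosts entries (sketch, not from the report)
import socket

# hostnames as listed in the table above
hosts = [f"hortonworks-{i:02d}.dasc.cs.thu.de" for i in range(1, 7)]

for host in hosts:
    try:
        ip = socket.gethostbyname(host)   # resolved via /etc/hosts (or DNS)
        print(f"{host} -> {ip}")
    except socket.gaierror as err:
        print(f"{host} could not be resolved: {err}")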
+Databases (Hive, Ranger, Druid, ...) are created automatically by the Ambari Wizard as PostgreSQL
+databases. Java as well as a corresponding JDBC connector are required for the individual
+services to execute database statements. The node \glqq hortonworks-01\grqq (see table \ref{tab:hadooprollout}) is both a
+master and a worker node. On the master node the Ambari Wizard installs the Ambari server. The Ambari Agents, which are required for communication in the cluster, are installed on each worker
+node. Afterwards, the Ambari Wizard can be called under the following URL:\\
+\emph{hortonworks-01.dasc.cs.thu.de:8080}.\\
+The installation routine guides the user through all necessary steps. It
+is important that the Ambari Agents are installed on the individual worker nodes, otherwise the
+Ambari Wizard cannot add the nodes to the cluster. The most important step is probably the
+selection of the HDP services. Similar to the evaluation step with the VMs described in section \ref{intallhadoop},
+the identical service packages were selected, i.e. YARN+MapReduce2, Tez, Hive, HBase, Pig,
+ZooKeeper, Ambari Metrics, Spark2, Zeppelin and SmartSense. In addition, some new services were
+added for the production cluster: Storm, Accumulo, Infra Solr, Atlas and Kafka. Since the cluster
+is to become a long-lived high-performance cluster, it seems reasonable to roll out the playground
+for distributed streaming platforms like Kafka or Storm, which can be used for streaming analytics
+use cases.\\\\
+The selection of services that run on top of Hadoop is an important part of the Hadoop cluster
+setup process. The following services have been chosen from the HDP stack: \\
+All six virtualized nodes are DataNodes (workers) and each runs a YARN NodeManager (which takes care of the resource
+distribution and monitoring of a node).\\
+Furthermore, all master components run on \glqq hortonworks-01\grqq . However, the cluster is fault-tolerant, so if the master node fails, the worker \glqq hortonworks-02\grqq
+becomes the active master. This is made possible by the secondary NameNode service that runs
+on another worker node.\\\\
+Once the individual accounts have been created for the different services, a final review is
+performed by Ambari before the cluster is started. A useful feature of Ambari is the option to download the
+complete configuration as a JSON template \cite{RN3}. This makes another Hadoop installation much
+easier because the template can be reused.\\\\
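The blueprint export mentioned above can also be scripted. The sketch below is illustrative only: it assumes the Ambari REST blueprint endpoint, a placeholder cluster name and the default admin credentials, none of which are taken from the report.

# export_blueprint.py -- sketch of exporting the cluster configuration as a JSON blueprint
# via the Ambari REST API; cluster name and credentials are placeholders.
import json
import requests

AMBARI_URL = "http://hortonworks-01.dasc.cs.thu.de:8080"
CLUSTER = "hadoopcluster"          # placeholder, not the actual cluster name

resp = requests.get(
    f"{AMBARI_URL}/api/v1/clusters/{CLUSTER}?format=blueprint",
    auth=("admin", "admin"),       # default Ambari credentials, change in production
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()

with open("cluster_blueprint.json", "w") as fh:
    json.dump(resp.json(), fh, indent=2)   # reusable template for further installations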
+This time, the installation process went through without any problems. Hortonworks (HDP 3.1.0)
+recently added support for Ubuntu 18.04.02 LTS, which makes tweaking of the operating system
+superfluous. Initializing and starting the Hadoop services may take a few hours, depending on the size
+of the cluster. The production cluster now runs on Ambari 2.7.3.0 and HDP 3.1.0. By default, Zeppelin
+comes with a Python 2 kernel. However, it is possible to switch to the Python 3 kernel (IPython
+with Python 3.x).\\\\
+The Ambari interface is intuitive to use. A first look at Ambari Metrics showed that the services
+worked properly and all workers in the cluster were active. Only YARN Registry DNS did not seem
+to start; it failed with a connection refused error, and Hadoop relies heavily on a functioning DNS
+server \cite{RN4}. However, changing the YARN Registry DNS binding port from \glqq 53\grqq to \glqq 5300\grqq solved the problem.\\ Remark: the same issue occurred in the evaluation part, where a port conflict prevented a successful start of the Ambari server.
+\\\\
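Such port conflicts can be spotted up front by testing whether the intended port is still bindable on the node. The snippet below is a generic sketch, not taken from the report; note that ports below 1024, such as 53, additionally require root privileges.

# check_port.py -- sketch: test whether a TCP port can still be bound on this node.
# Port 53 is the default YARN Registry DNS port; 5300 is the alternative used above.
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False   # already in use or not permitted (ports < 1024 need root)

for port in (53, 5300):
    print(f"port {port}: {'free' if port_is_free(port) else 'in use / not bindable'}")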
+The physical hardware configuration of the cluster consists of 10 fat nodes, as figure \ref{fig:figure3_hadoop} shows.
+\begin{figure}[H]
+\centering
+\includegraphics[width=0.6\textwidth]{img/figure3_hadoop}
+\captionof{figure}{Physical Hadoop cluster}\label{fig:figure3_hadoop}
+\end{figure}
+These nodes were locally connected with a switch and a gateway so that all nodes could
+communicate with each other. Unfortunately, access from outside the intranet was not possible, as the
+necessary infrastructure measures on the part of the data center are still pending. In principle,
+however, it would be possible to access the cluster from outside using a VPN and a reliable
+authentication method.\\\\
+First tests with the six configured Hadoop nodes were carried out successfully. These tests were
+based on the Zeppelin notebooks from the previous data profiling chapter \ref{dp1}, which already worked
+in the virtual cluster. Compared to the virtual cluster, the execution was this time much
+faster, because more RAM was available and more workers (six instead of four) were used.\\\\
+Thus the cluster (figure \ref{fig:figure3_hadoop}) is in an operational and fully configured state. Of course, it is possible
+that a service fails on its own or no longer runs properly over time. In the evaluation phase, for
+example, it was observed that the YARN Timeline service fails more frequently. Usually, however, a
+restart of the corresponding service via the Ambari interface is sufficient. Most Hadoop services also run autonomously, i.e. a corrupt service cannot block other running services (exception: HDFS). With the new, ready-to-use Hadoop cluster, further data profiling of the bicycle data can now be performed in the cluster.

23_data_prediction.tex

Lines changed: 12 additions & 12 deletions
@@ -4,7 +4,7 @@ \subsection{Data Profiling Part II}\label{dp2}
 As already indicated in Data Profiling Part 1 in chapter \ref{dp1}, the next step is to display the use of the routes at
 different times between the individual bicycle stations on a map. Since the Python package
 "folium" uses leaflet maps based on Javascript \cite{RN5}, the plotting of the routes on an hourly level is
-not performant, because too many polylines have to be drawn and the map can no longer be
+not performant, because too many poly lines have to be drawn and the map can no longer be
 efficiently displayed in the browser. Therefore only the top 10\% most used routes were plotted on
 the folium map. Another restriction was the aggregated granularity on a daily basis. This means
 that the plotted map always showed the route usage for day x. With the folium plugin
@@ -20,11 +20,11 @@ \subsection{Data Profiling Part II}\label{dp2}
 \includegraphics[width=1.2\textwidth]{img/listing2}\label{fig:listing2}
 \captionof{figure}{Route usage over time plot}\label{fig:listing2}
 \end{figure}
-As the code from figure \ref{fig:listing2} shows, the polylines on the map have been added iteratively. The
-weight parameter can be used to determine the thickness of the polyline. Since these should look
+As the code from figure \ref{fig:listing2} shows, the poly lines on the map have been added iteratively. The
+weight parameter can be used to determine the thickness of the poly line. Since these should look
 as dynamic as possible on the map, the weight has been standardized. The fixed stations were
 initially added, but no duplicate stations were plotted. A disadvantage of the plugin is that it needs
-the data in a JSON format. Therefore the coordinates for the single points of a polyline as well as
+the data in a JSON format. Therefore the coordinates for the single points of a poly line as well as
 the time series data had to be converted into a compatible format. As can be seen from the Python
 code, the coordinates must be given the type \glqq LineString\grqq. A LineString is defined as a sequence
 of uniquely assignable points. In this case, the longitude and latitude previously requested using
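To make the required format concrete, the following is a minimal, self-contained sketch of feeding one LineString feature to folium's TimestampedGeoJson plugin; the coordinates, timestamps and weight are invented for illustration and are not taken from the project's notebooks.

# folium_routes_sketch.py -- illustrative only; coordinates, times and weights are made up.
import folium
from folium.plugins import TimestampedGeoJson

m = folium.Map(location=[51.5074, -0.1278], zoom_start=13)  # central London

# one route as a GeoJSON feature: a LineString is a sequence of [lon, lat] points
feature = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[-0.1246, 51.5308], [-0.1340, 51.5250]],  # invented example route
    },
    "properties": {
        "times": ["2019-04-01T08:00:00", "2019-04-01T09:00:00"],  # one timestamp per point
        "style": {"color": "red", "weight": 4},                   # weight ~ normalized usage
    },
}

TimestampedGeoJson(
    {"type": "FeatureCollection", "features": [feature]},
    period="PT1H",        # hourly steps
    add_last_point=False,
).add_to(m)

m.save("route_usage.html")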
@@ -39,15 +39,15 @@ \subsection{Data Profiling Part II}\label{dp2}
 From figure \ref{fig:figure4_folium_plot1} it can be seen that on this day there was a moderate use of Santander Bicycles in
 inner London. Especially the hubs such as Kings Cross or Hyde Park were obviously used very
 often with Santander Bicycles on this spring day. Furthermore, the map with the thickness of the
-polylines shows how often this route was used in relation to the total use of all routes. The red
+poly lines shows how often a route was used in relation to the total use of all routes. The red
 colored routes from figure \ref{fig:figure4_folium_plot1} are the actually used routes on this particular day, while the blue
 routes are the \glqq inactive\grqq ones. The bubbles with the numbers represent the respective rental
 stations. These have only been aggregated to provide a clearer representation. If the map is
 zoomed in (figure \ref{fig:figure4_folium_plot1}), the granularity is refined and the markers of the individual rental station
 locations are displayed. The color of these bubbles correlates with the number of aggregated
 stations in the vicinity. This method also has the advantage that it is easy to see where most rental
 stations are located. In fact, with 112 stations (see figure \ref{fig:figure4_folium_plot1}), the inner districts connected by the
-Waterloo Bridge and the London Blackfriars Bridge form the centre of most stations which are
+Waterloo Bridge and the London Blackfriars Bridge form the center of most stations which are
 close by. It should be noted, however, that new rental stations are constantly being added (and
 sometimes removed again), and a new data extract could result in a different picture. Therefore this assumption is valid for the selected date from figure \ref{fig:figure4_folium_plot1}, but not for today or in two years. A useful
 feature that comes along with the folium plugin is the automatic data display sequence \cite{RN5}. When
@@ -128,7 +128,7 @@ \subsection{Feature Engineering Kings Cross}\label{king}
 future data was only added for the usage of the rental station. The string values were later
 transformed into numerical values since machine learning methods always require numerical values
 instead of string values. Therefore a proper schema was defined, which is described in more detail
-in the modelling section.\\\\
+in the modeling section.\\\\
 With the processed bicycle data from \ref{fig:figure6_kings_cross_df}, machine learning can already be used to predict,
 for example, the daily usage of bicycles (i.e. how many bicycles will be borrowed tomorrow starting
 from today?). The script \glqq Feature Engineering Kings Cross.ipynb\grqq contains the preparation
@@ -184,27 +184,27 @@ \subsubsection{Result}\label{sec:resultmlp}
 \captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
 \end{figure}
 The prediction for the least used station achieves an \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
-\subsection{Modelling (Polynomial Regression)}\label{poly}
+\subsection{Modeling (Polynomial Regression)}\label{poly}
 Since we don't have a linear relationship in the data, a linear regression will not be helpful.
 For example, the regressand \glqq Rented Bikes\grqq does not correlate linearly with the temperature, as even on rainy
 days there is a slight chance that more people rent a bike than on a sunny day, for instance due to a special
 holiday. Therefore polynomial regression was selected as a machine learning method for the
 prediction of rented bikes at a station.\\\\
 Polynomial regression belongs to the regression methods \cite{RN9}. In fact, it is just a modified version of a
-linear regression. This means the independent variable x and the dependent variable y is modelled
-as an nth degress (so-called polynomial) in x \cite{RN9}.\\\\
+linear regression. This means the relationship between the independent variable x and the dependent variable y is modeled
+as an nth degree polynomial in x \cite{RN9}.\\\\
 In a more formal way, the polynomial regression can be expressed as follows:
 $$Y=\beta_0+\beta_1 x+\beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n$$
 Where n is the degree of the regression.\\\\
-With the scitkit-library in python a data scientist can import the function \glqq PolynomialFeatures\grqq from
+With the scikit-learn library in Python a data scientist can import the class \glqq PolynomialFeatures\grqq from
 \glqq sklearn.preprocessing\grqq which transforms the input features into higher-degree polynomial features. For example
 one could apply \glqq poly = PolynomialFeatures(degree = 3)\grqq to get a polynomial
 regression of degree three. This should improve the accuracy as our underlying data has
 no linear relationships but maybe higher-degree ones. Furthermore, the higher the degree
 the better the accuracy should be. Unfortunately the computation time increases drastically with the degree. A degree of
 4 already took several hours to compute and was only slightly better than a regression of degree
 three.\\\\
-The RSME error on a degree of 4 was around 48,4, which is in comparison to the other tested
+The RMSE at degree 4 was around 48.4, which is, compared to the other tested
 ones, not really bad but maybe also not the best one.\\\\
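A minimal scikit-learn sketch of this approach is shown below; the data loading and feature columns are placeholders and do not correspond to the project's actual notebook code.

# poly_regression_sketch.py -- generic sketch; file name and columns are placeholders.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("kings_cross_features.csv")          # hypothetical prepared feature table
X = df[["temperature", "humidity", "weekday", "season"]]
y = df["rented_bikes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# degree=3 expands the features into all polynomial terms up to the third degree
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE at degree 3: {rmse:.1f}")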
 Figure \ref{fig:figure9_polynomial_features} shows the different plots of each feature and the prediction (rented usage). It turns out
 that the feature Season, for example, has no great influence on the use of bicycles, whereby

24_hourly.tex

Lines changed: 3 additions & 3 deletions
@@ -8,10 +8,10 @@ \subsection{Data Preparation}
 increase significantly on an hourly granularity. Therefore the normal use of Anaconda and Jupyter
 on a local computer may not be sufficient due to low physical memory. An ideal use case for our
 newly installed Hadoop cluster! There, PySpark can be used, as already done in Data Profiling
-Part 1 in chapter \ref{dp1} to manage \glqq bg data\grqq transformations. Unfortunately the behaviour and syntax of PySpark
+Part 1 in chapter \ref{dp1}, to manage \glqq big data\grqq transformations. Unfortunately the behavior and syntax of PySpark
 is sometimes a little more complicated than Pandas. For example, in PySpark it is not easily
 possible to iterate over rows, since the data frame is distributed over the worker nodes and thus it
-only allows columnwise operations. Moreover, Pandas operations such as „iloc“ are not available
+only allows column-wise operations. Moreover, Pandas operations such as \glqq iloc\grqq are not available
 in PySpark. But the API also comes with some advantages, e.g. it is quite performant at big data
 scale (i.e. it can easily process several million records) and it has an SQL approach.
 Functions like \glqq select\grqq , \glqq where\grqq and \glqq filter\grqq are syntactically close to SQL as we know it from MySQL
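As an illustration of this SQL-like, column-wise style, a short PySpark sketch follows; the file path and column names are hypothetical and not taken from the project's Zeppelin notebooks.

# pyspark_sketch.py -- illustrative only; path and column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bike-usage").getOrCreate()

df = spark.read.csv("/data/journeys.csv", header=True, inferSchema=True)

# column-wise, SQL-like operations instead of row iteration
hourly = (
    df.where(F.col("duration") > 0)                                       # filter invalid rentals
      .select("start_station", F.hour(F.to_timestamp("start_date")).alias("hour"))
      .groupBy("start_station", "hour")
      .count()                                                            # rentals per station and hour
      .orderBy(F.desc("count"))
)
hourly.show(10)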
@@ -27,7 +27,7 @@ \subsection{Data Preparation}
 \includegraphics[width=1.2\textwidth]{img/listing5}\label{fig:listing5}
 \captionof{figure}{Transformation of humidity columns (excerpt)}\label{fig:listing5}
 \end{figure}
-he excerpt from \ref{fig:listing5} shows that \glqq structTypes\grqq can be used to search the individual sublists
+The excerpt from \ref{fig:listing5} shows that \glqq StructTypes\grqq can be used to search the individual sub lists
 of \glqq hourly weather\grqq .
 The complete script is contained in the Zeppelin notebook \glqq DFGeneration.json\grqq and can also be found on GitHub.
 \\\\
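A nested column of this kind can be flattened in PySpark roughly as sketched below; the assumed schema (an hourly_weather array of structs with time and humidity fields) is illustrative and may differ from the actual data.

# nested_struct_sketch.py -- the schema is assumed for illustration; adjust to the real data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-weather").getOrCreate()

df = spark.read.json("/data/weather.json")   # hypothetical source with nested hourly data

# explode the array of hourly structs, then pull out individual fields
hourly = (
    df.select("date", F.explode("hourly_weather").alias("hw"))
      .select("date", F.col("hw.time").alias("hour"), F.col("hw.humidity").alias("humidity"))
)
hourly.show(5)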

DSPRReport.pdf

1.54 MB
Binary file not shown.

DSPRReport.tex

Lines changed: 0 additions & 1 deletion
@@ -101,7 +101,6 @@ \chapter{Project Presentation and Organization}
 \input{./03_task_description_2.tex}
 \input{./04_collaboration_technologies.tex}
 \input{./05_responsabilities.tex}
-\input{./07_conclusion_2.tex}
 \chapter{Postal Code Database (Nominatim and Graphhopper)}
 \input{./11_nominatim.tex}
 \input{./13_graphhopper.tex}

bibtex/library.bib

Lines changed: 2 additions & 2 deletions
@@ -354,7 +354,7 @@ @article{RN2
 
 @article{RN4,
 author = {Hortonworks},
-title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/check_dns.html},
+title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk{\_}ambari-installation/content/check{\_}dns.html},
 year = {2019},
 type = {Journal Article}
 }
@@ -399,7 +399,7 @@ @misc{RN1
 
 @article{RN3,
 author = {Sposetti, Jeff},
-title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-Step1:CreateBlueprint},
+title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints{\#}Blueprints-Step1:CreateBlueprint},
 year = {2017},
 type = {Journal Article}
 }
