% 01_introduction.tex
% vim:ft=tex
\section{Introduction}
No matter which area you work in, there is a good chance that data science is already there too. It has been one of the fastest-moving trends of recent years, driven by the unimaginable amount of data we produce every day. We could gain a lot of knowledge from this data, but to do so we must analyze it. That is where data science comes into play. With the help of data science we can not only analyze the past, we can also predict the future.\\\\
Let us say we have a bike rental station that rents out bikes for a certain amount of time. We track the duration of each rental, the time of rental, the routes the bikes took, and much more. By processing all this data it is possible to predict the future usage of the bikes and the most frequented routes at specific times.
That is the main goal of this project. This project report deals with tasks that can be summarized as \emph{preparatory steps} towards this final goal. The project duration is divided into two semesters; the tasks of the first part are discussed in this paper.
Databases (Hive, Ranger, Druid, ...) are created automatically by the Ambari Wizard as PostgreSQL databases. Java as well as a corresponding JDBC connector are required for the individual services to execute database statements. The node \glqq hortonworks-01\grqq\ (see table \ref{tab:hadooprollout}) is both a master and a worker node. On the master node the Ambari Wizard installs the Ambari server. The Ambari Agents, which are required for communication in the cluster, are installed on each worker node. Afterwards, the Ambari Wizard can be reached at the following URL:\\
\emph{hortonworks-01.dasc.cs.thu.de:8080}.\\
The installation routine guides the user through all necessary steps. It is important that the Ambari Agents are installed on the individual worker nodes, otherwise the Ambari Wizard cannot add the nodes to the cluster. The most important step is probably the selection of the HDP services. As in the evaluation step with the VMs described in section \ref{intallhadoop}, the identical service packages were selected, i.e. YARN+MapReduce2, Tez, Hive, HBase, Pig, ZooKeeper, Ambari Metrics, Spark2, Zeppelin and SmartSense. In addition, some new services were added for the production cluster: Storm, Accumulo, Infra Solr, Atlas and Kafka. Since the cluster is to become a long-lived high-performance cluster, it seemed reasonable to roll out a playground for distributed streaming platforms like Kafka or Storm, which can be used for streaming-analytics use cases.\\\\
The selection of the services that run on top of Hadoop is an important part of the Hadoop cluster setup process; the services named above have been chosen from the HDP stack. All six virtualized nodes are DataNodes (workers), and each runs a YARN NodeManager (which takes care of the resource distribution and monitoring of a node).\\
Furthermore, all master components run on \glqq hortonworks-01\grqq . However, the cluster is fault-tolerant: if the master node fails, the worker \glqq hortonworks-02\grqq\ becomes the active master. This is made possible by the secondary NameNode service that runs on another worker node.\\\\
Once the individual accounts have been created for the different services, a final review is performed by Ambari before the cluster is started. A useful feature of Ambari is the option to download the complete configuration as a JSON template (an Ambari Blueprint) \cite{RN3}. This makes another Hadoop installation much easier because the template can be reused.\\\\
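Such a blueprint export can then be processed programmatically when preparing the next installation. The following Python sketch only illustrates the idea: the sample JSON is a made-up toy blueprint, not the cluster's actual export, and the parsing helper is our own illustration rather than part of Ambari.

```python
import json

# Toy stand-in for a Blueprint export (Ambari can export one via
# GET http://<ambari-host>:8080/api/v1/clusters/<cluster>?format=blueprint);
# the host groups and components below are made-up placeholders.
sample_blueprint = json.loads("""
{
  "Blueprints": {"stack_name": "HDP", "stack_version": "3.1"},
  "host_groups": [
    {"name": "master", "components": [{"name": "NAMENODE"}, {"name": "AMBARI_SERVER"}]},
    {"name": "worker", "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]}
  ]
}
""")

def components_per_host_group(blueprint):
    """Map each host group to the list of service components it hosts."""
    return {hg["name"]: [c["name"] for c in hg["components"]]
            for hg in blueprint["host_groups"]}

roles = components_per_host_group(sample_blueprint)
print(roles["worker"])  # ['DATANODE', 'NODEMANAGER']
```

Reapplying such a template to a fresh cluster then only requires mapping the host groups to the new machine names.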
This time, the installation process completed without any problems. Hortonworks HDP 3.1.0 recently added support for Ubuntu 18.04.02 LTS, which makes tweaking of the operating system superfluous. Initializing and starting the Hadoop services may take a few hours, depending on the size of the cluster. The production cluster now runs on Ambari 2.7.3.0 and HDP 3.1.0. By default, Zeppelin comes with a Python 2 kernel. However, it is possible to switch to the Python 3 kernel (IPython with Python 3.x).\\\\
The Ambari interface is intuitive to use. A first look at Ambari Metrics showed that the services worked properly and all workers in the cluster were active. Only YARN Registry DNS did not seem to start, failing with a connection-refused error; Hadoop relies heavily on a functioning DNS server \cite{RN4}. However, changing the YARN Registry DNS binding port from \glqq 53\grqq\ to \glqq 5300\grqq\ solved the problem.\\
Remark: the same issue occurred in the evaluation part, where a port conflict prevented a successful start of the Ambari server.\\\\
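The underlying cause of such port conflicts is easy to reproduce: port 53 is usually already held by a local resolver, and ports below 1024 require root privileges in the first place. A small Python sketch, independent of Hadoop and not part of the project's setup scripts, for checking whether a candidate port can actually be bound:

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; an OSError means it is already taken
    (or privileged when running as a non-root user)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# e.g. check the alternative Registry DNS port chosen above
print(port_is_free(5300))
```

A `False` result for port 53 on a node with a running local DNS resolver reproduces exactly the connection-refused symptom observed in Ambari.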
The physical hardware configuration of the cluster consists of 10 fat nodes, as figure \ref{fig:figure3_hadoop} shows. These nodes were locally connected with a switch and a gateway so that all nodes could communicate with each other. Unfortunately, access from outside the intranet was not possible, as the necessary infrastructure measures on the part of the data center are still pending. In principle, however, it would be possible to access the cluster from outside using a VPN and a reliable authentication method.\\\\
First tests with the six configured Hadoop nodes could be carried out successfully. These tests were
481
+
based on the Zeppelin notebooks from the previous data profiling chapter \ref{dp1}, which already worked
482
+
successfully in the virtual cluster. Compared to the virtual cluster this time the execution was much
483
+
faster, because more RAM was available and more workers (six! instead of four) were used.\\\\
484
+
Thus the cluster (figure \ref{fig:figure3_hadoop}) is in an operational and ready configured state. Of course, it is possible
485
+
that a service may fail on its own or no longer run properly over time. In the evaluation phase, for
486
+
example, it was shown that the YARN Timeline service fails more frequently. Usually, however, a
487
+
restart of the corresponding service via the Ambari interface is sufficient. Most Hadoop services also run autonomously, i.e. a corrupt service cannot block other running services (exception: HDFS). With the new ready to use Hadoop cluster, further data profiling action of the bicycle data can now be performed in the cluster.
\captionof{figure}{Route usage over time plot}\label{fig:listing2}
\end{figure}
As the code from figure \ref{fig:listing2} shows, the polylines on the map have been added iteratively. The weight parameter can be used to determine the thickness of a polyline. Since these should look
as dynamic as possible on the map, the weight has been standardized. The fixed stations were
initially added, but no duplicate stations were plotted. A disadvantage of the plugin is that it needs
the data in a JSON format. Therefore the coordinates for the single points of a polyline as well as
the time series data had to be converted into a compatible format. As can be seen from the Python
code, the coordinates must be given the type \glqq LineString\grqq. A LineString is defined as a sequence
of uniquely assignable points. In this case, the longitude and latitude previously requested using
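The conversion into the plugin-compatible format can be sketched as follows. The helper function, the coordinates, the timestamps and the normalization range are made-up illustrations rather than the project's original notebook code; the folium time-series plugin consumes GeoJSON features whose geometry is of type \glqq LineString\grqq\ and whose properties carry the timestamps.

```python
# Sketch of the JSON conversion described above (illustrative, not the
# project's original code). The plugin expects GeoJSON "LineString"
# features; weight is normalized here to [1, 10] as one possible scheme.

def route_to_feature(coords, times, usage, max_usage):
    """Build one GeoJSON LineString feature with a usage-scaled line weight."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": coords,      # [[lon, lat], ...] -- GeoJSON order
        },
        "properties": {
            "times": times,             # one ISO timestamp per coordinate
            "style": {"weight": 1 + 9 * usage / max_usage},
        },
    }

# Made-up example: a two-point route used 30 times out of a 60-use maximum.
feature = route_to_feature(
    coords=[[-0.1276, 51.5072], [-0.1246, 51.5007]],
    times=["2017-04-01T10:00:00", "2017-04-01T10:15:00"],
    usage=30, max_usage=60,
)
print(feature["properties"]["style"]["weight"])  # 5.5
```

A list of such features, wrapped in a FeatureCollection, is what gets handed to the plugin when the polylines are added iteratively.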
From figure \ref{fig:figure4_folium_plot1} it can be seen that on this day there was moderate use of Santander Bicycles in inner London. Especially the hubs such as Kings Cross or Hyde Park were obviously used very often with Santander Bicycles on this spring day. Furthermore, the thickness of the polylines on the map shows how often a route was used in relation to the total use of all routes. The red colored routes from figure \ref{fig:figure4_folium_plot1} are the routes actually used on this particular day, while the blue routes are the \glqq inactive\grqq\ ones. The bubbles with the numbers represent the respective rental stations. These have only been aggregated to provide a clearer representation. If the map is zoomed in (figure \ref{fig:figure4_folium_plot1}), the granularity is refined and the markers of the individual rental station locations are displayed. The color of these bubbles correlates with the number of aggregated stations in the vicinity. This method also has the advantage that it is easy to see where most rental stations are located. In fact, with 112 stations (see figure \ref{fig:figure4_folium_plot1}), the inner districts connected by the Waterloo Bridge and the London Blackfriars Bridge form the center of most nearby stations. It should be noted, however, that new rental stations are constantly being added (and given up again!), and a new data extract could result in a different picture. Therefore this observation is valid for the selected date from figure \ref{fig:figure4_folium_plot1}, but not for today or in two years. A useful feature that comes along with the folium plugin is the automatic data display sequence \cite{RN5}. When
\captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
\end{figure}
The prediction accuracy for the least used station measures an \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
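For reference, the \acs{rmse} metric itself is straightforward to compute: the square root of the mean squared residual between predicted and observed values. The sketch below uses made-up rental counts, not the Farringdon Street data.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of the squared residuals."""
    assert len(y_true) == len(y_pred)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy usage with invented hourly rental counts (residuals -1, 2, 0, -2):
print(round(rmse([10, 12, 9, 14], [11, 10, 9, 16]), 3))  # 1.5
```

Because the error is reported in the same unit as the target (rentals), an \acs{rmse} of 9.311 means the prediction is off by roughly nine rentals on average.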
% bibtex/library.bib
@article{RN4,
author = {Hortonworks},
title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk{\_}ambari-installation/content/check{\_}dns.html},
year = {2019},
type = {Journal Article}
}
@article{RN3,
author = {Sposetti, Jeff},
title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints{\#}Blueprints-Step1:CreateBlueprint},
}