
Commit 15283fc

Author: KathiBrown
Commit message: finished pascals part
1 parent 5dfd2a5 commit 15283fc

7 files changed (+90 / -20 lines)

01_introduction.tex

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 % vim:ft=tex
 
 \section{Introduction}
-
 No matter which area you work in, there is a high chance that data science will be there too. It is one of the fastest-moving trends of recent years, driven by the unimaginable amount of data we produce every day. We could gain a lot of knowledge from this data, but to do so we must analyze it. That is where data science comes into play. With the help of data science we can not only analyze the past but also predict the future.\\\\
 Let's say we have a bike rental station that rents out bikes for a certain amount of time. We track the duration of each rental, the time of rental, the routes the bikes take and a lot more information. By processing all this data it is possible to predict the future usage of the bikes and the most frequented routes at specific times.
 That is the main goal of this project. This project report deals with tasks that can be summarized as \emph{preparatory steps} towards this final goal. The project duration is divided into two semesters; the tasks of the first part are discussed in this paper.

21_evaluation_of_hadoop_distros.tex

Lines changed: 73 additions & 1 deletion
@@ -412,4 +412,76 @@ \subsection{Hadoop Rollout}
 by editing the sudoers file via \glqq visudo\grqq and adding \glqq username ALL=(ALL) NOPASSWD:ALL\grqq . Another
 necessity is to add the IP addresses and hostnames of all cluster nodes under \glqq /etc/hosts\grqq . This
 must also be done on each node individually. The following table shows the current host
-configurations of a worker respectively slave node:
+configuration of a worker (slave) node:
+\begin{table}[H]
+\centering
+\begin{tabular}{|l|l|}
+\hline
+\textbf{IP-Address} & \textbf{List of hostnames} \\ \hline
+10.64.180.163 & \begin{tabular}[c]{@{}l@{}}hortonworks-01.dasc.cs.thu.de\\ hortonworks-01\end{tabular} \\ \hline
+10.64.83.106 & \begin{tabular}[c]{@{}l@{}}hortonworks-02.dasc.cs.thu.de\\ hortonworks-02\end{tabular} \\ \hline
+10.64.79.161 & \begin{tabular}[c]{@{}l@{}}hortonworks-03.dasc.cs.thu.de\\ hortonworks-03\end{tabular} \\ \hline
+10.64.227.154 & \begin{tabular}[c]{@{}l@{}}hortonworks-04.dasc.cs.thu.de\\ hortonworks-04\end{tabular} \\ \hline
+10.64.159.100 & \begin{tabular}[c]{@{}l@{}}hortonworks-05.dasc.cs.thu.de\\ hortonworks-05\end{tabular} \\ \hline
+10.64.204.57 & \begin{tabular}[c]{@{}l@{}}hortonworks-06.dasc.cs.thu.de\\ hortonworks-06\end{tabular} \\ \hline
+\end{tabular}
+\caption{/etc/hosts of a worker node}
+\label{tab:hadooprollout}
+\end{table}
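Since every node must be able to resolve every other node by name, a quick check of the entries above can save debugging time before the rollout. The following is a minimal sketch, not taken from the report, that simply resolves the hostnames listed in the table:

# check_hosts.py -- minimal sanity check of the /etc/hosts entries (sketch, not from the report)
import socket

# hostnames as listed in the table above
hosts = [f"hortonworks-{i:02d}.dasc.cs.thu.de" for i in range(1, 7)]

for host in hosts:
    try:
        ip = socket.gethostbyname(host)   # resolved via /etc/hosts (or DNS)
        print(f"{host} -> {ip}")
    except socket.gaierror as err:
        print(f"{host} could not be resolved: {err}")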
+Databases (Hive, Ranger, Druid, ...) are created automatically by the Ambari Wizard as PostgreSQL
+databases. Java as well as a corresponding JDBC connector are required for the individual
+services to execute database statements. The node \glqq hortonworks-01\grqq (see table \ref{tab:hadooprollout}) is both a
+master and a worker node. On the master node the Ambari Wizard installs the Ambari server. The Ambari Agents, which are required for communication in the cluster, are installed on each worker
+node. Afterwards, the Ambari Wizard can be called under the following URL:\\
+\emph{hortonworks-01.dasc.cs.thu.de:8080}.\\
+The installation routine guides the user through all necessary steps. It
+is important that the Ambari Agents are installed on the individual worker nodes, otherwise the
+Ambari Wizard cannot add the nodes to the cluster. The most important step is probably the
+selection of the HDP services. Similar to the evaluation step with the VMs described in section \ref{intallhadoop},
+the identical service packages were selected, i.e. YARN+MapReduce2, Tez, Hive, HBase, Pig,
+ZooKeeper, Ambari Metrics, Spark2, Zeppelin and SmartSense. In addition, some new services were
+added for the production cluster: Storm, Accumulo, Infra Solr, Atlas and Kafka. Since the cluster
+is to become a long-lived high-performance cluster, it seems reasonable to roll out the playground
+for distributed streaming platforms like Kafka or Storm, which can be used for streaming analytics
+use cases.\\\\
+The selection of services that run on top of Hadoop is an important part of the Hadoop cluster
+setup process. The following services have been chosen from the HDP stack: \\
+All six virtualized nodes are DataNodes (workers) and each runs a YARN NodeManager (which takes care of the resource
+distribution and monitoring of a node).\\
+Furthermore, all master components run on \glqq hortonworks-01\grqq . However, the cluster is fault-tolerant, so if the master node fails, the worker \glqq hortonworks-02\grqq
+becomes the active master. This is made possible by the secondary NameNode service that runs
+on another worker node.\\\\
+Once the individual accounts have been created for the different services, a final review is
+performed by Ambari before the cluster is started. A useful feature of Ambari is the option to download the
+complete configuration as a JSON template \cite{RN3}. This makes another Hadoop installation much
+easier because the template can be reused.\\\\
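The blueprint export mentioned above can also be scripted. The sketch below is illustrative only: it assumes the Ambari REST blueprint endpoint, a placeholder cluster name and the default admin credentials, none of which are taken from the report.

# export_blueprint.py -- sketch of exporting the cluster configuration as a JSON blueprint
# via the Ambari REST API; cluster name and credentials are placeholders.
import json
import requests

AMBARI_URL = "http://hortonworks-01.dasc.cs.thu.de:8080"
CLUSTER = "hadoopcluster"          # placeholder, not the actual cluster name

resp = requests.get(
    f"{AMBARI_URL}/api/v1/clusters/{CLUSTER}?format=blueprint",
    auth=("admin", "admin"),       # default Ambari credentials, change in production
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()

with open("cluster_blueprint.json", "w") as fh:
    json.dump(resp.json(), fh, indent=2)   # reusable template for further installations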
+This time, the installation process went through without any problems. Hortonworks (HDP 3.1.0)
+recently added support for Ubuntu 18.04.02 LTS, which makes tweaking of the operating system
+superfluous. Initializing and starting the Hadoop services may take a few hours, depending on the size
+of the cluster. The production cluster now runs on Ambari 2.7.3.0 and HDP 3.1.0. By default, Zeppelin
+comes with a Python 2 kernel. However, it is possible to switch to the Python 3 kernel (IPython
+with Python 3.x).\\\\
+The Ambari interface is intuitive to use. A first look at Ambari Metrics showed that the services
+worked properly and all workers in the cluster were active. Only YARN Registry DNS did not seem
+to start; it failed with a connection refused error, and Hadoop relies heavily on a functioning DNS
+server \cite{RN4}. However, changing the YARN Registry DNS binding port from \glqq 53\grqq to \glqq 5300\grqq solved the problem.\\ Remark: the same issue occurred in the evaluation part, where a port conflict prevented a successful start of the Ambari server.
+\\\\
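Such port conflicts can be spotted up front by testing whether the intended port is still bindable on the node. The snippet below is a generic sketch, not taken from the report; note that ports below 1024, such as 53, additionally require root privileges.

# check_port.py -- sketch: test whether a TCP port can still be bound on this node.
# Port 53 is the default YARN Registry DNS port; 5300 is the alternative used above.
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False   # already in use or not permitted (ports < 1024 need root)

for port in (53, 5300):
    print(f"port {port}: {'free' if port_is_free(port) else 'in use / not bindable'}")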
+The physical hardware configuration of the cluster consists of 10 fat nodes, as figure \ref{fig:figure3_hadoop} shows.
+\begin{figure}[H]
+\centering
+\includegraphics[width=0.6\textwidth]{img/figure3_hadoop}
+\captionof{figure}{Physical Hadoop cluster}\label{fig:figure3_hadoop}
+\end{figure}
+These nodes were locally connected with a switch and a gateway so that all nodes could
+communicate with each other. Unfortunately, access from outside the intranet was not possible, as the
+necessary infrastructure measures on the part of the data center are still pending. In principle,
+however, it would be possible to access the cluster from outside using a VPN and a reliable
+authentication method.\\\\
+First tests with the six configured Hadoop nodes were carried out successfully. These tests were
+based on the Zeppelin notebooks from the previous data profiling chapter \ref{dp1}, which already worked
+in the virtual cluster. Compared to the virtual cluster, the execution was this time much
+faster, because more RAM was available and more workers (six instead of four) were used.\\\\
+Thus the cluster (figure \ref{fig:figure3_hadoop}) is in an operational and fully configured state. Of course, it is possible
+that a service fails on its own or no longer runs properly over time. In the evaluation phase, for
+example, it was observed that the YARN Timeline service fails more frequently. Usually, however, a
+restart of the corresponding service via the Ambari interface is sufficient. Most Hadoop services also run autonomously, i.e. a corrupt service cannot block other running services (exception: HDFS). With the new, ready-to-use Hadoop cluster, further data profiling of the bicycle data can now be performed in the cluster.

23_data_prediction.tex

Lines changed: 12 additions & 12 deletions
@@ -4,7 +4,7 @@ \subsection{Data Profiling Part II}\label{dp2}
 As already indicated in Data Profiling Part 1 in chapter \ref{dp1}, the next step is to display the use of the routes at
 different times between the individual bicycle stations on a map. Since the Python package
 "folium" uses leaflet maps based on Javascript \cite{RN5}, the plotting of the routes on an hourly level is
-not performant, because too many polylines have to be drawn and the map can no longer be
+not performant, because too many poly lines have to be drawn and the map can no longer be
 efficiently displayed in the browser. Therefore only the top 10\% most used routes were plotted on
 the folium map. Another restriction was the aggregated granularity on a daily basis. This means
 that the plotted map always showed the route usage for day x. With the folium plugin
@@ -20,11 +20,11 @@ \subsection{Data Profiling Part II}\label{dp2}
 \includegraphics[width=1.2\textwidth]{img/listing2}\label{fig:listing2}
 \captionof{figure}{Route usage over time plot}\label{fig:listing2}
 \end{figure}
-As the code from figure \ref{fig:listing2} shows, the polylines on the map have been added iteratively. The
-weight parameter can be used to determine the thickness of the polyline. Since these should look
+As the code from figure \ref{fig:listing2} shows, the poly lines on the map have been added iteratively. The
+weight parameter can be used to determine the thickness of the poly line. Since these should look
 as dynamic as possible on the map, the weight has been standardized. The fixed stations were
 initially added, but no duplicate stations were plotted. A disadvantage of the plugin is that it needs
-the data in a JSON format. Therefore the coordinates for the single points of a polyline as well as
+the data in a JSON format. Therefore the coordinates for the single points of a poly line as well as
 the time series data had to be converted into a compatible format. As can be seen from the Python
 code, the coordinates must be given the type \glqq LineString\grqq. A LineString is defined as a sequence
 of uniquely assignable points. In this case, the longitude and latitude previously requested using
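To make the required format concrete, the following is a minimal, self-contained sketch of feeding one LineString feature to folium's TimestampedGeoJson plugin; the coordinates, timestamps and weight are invented for illustration and are not taken from the project's notebooks.

# folium_routes_sketch.py -- illustrative only; coordinates, times and weights are made up.
import folium
from folium.plugins import TimestampedGeoJson

m = folium.Map(location=[51.5074, -0.1278], zoom_start=13)  # central London

# one route as a GeoJSON feature: a LineString is a sequence of [lon, lat] points
feature = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[-0.1246, 51.5308], [-0.1340, 51.5250]],  # invented example route
    },
    "properties": {
        "times": ["2019-04-01T08:00:00", "2019-04-01T09:00:00"],  # one timestamp per point
        "style": {"color": "red", "weight": 4},                   # weight ~ normalized usage
    },
}

TimestampedGeoJson(
    {"type": "FeatureCollection", "features": [feature]},
    period="PT1H",        # hourly steps
    add_last_point=False,
).add_to(m)

m.save("route_usage.html")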
@@ -39,15 +39,15 @@ \subsection{Data Profiling Part II}\label{dp2}
 From figure \ref{fig:figure4_folium_plot1} it can be seen that on this day there was a moderate use of Santander Bicycles in
 inner London. Especially the hubs such as Kings Cross or Hyde Park were obviously used very
 often with Santander Bicycles on this spring day. Furthermore, the map with the thickness of the
-polylines shows how often this route was used in relation to the total use of all routes. The red
+poly lines shows how often a route was used in relation to the total use of all routes. The red
 colored routes from figure \ref{fig:figure4_folium_plot1} are the actually used routes on this particular day, while the blue
 routes are the \glqq inactive\grqq ones. The bubbles with the numbers represent the respective rental
 stations. These have only been aggregated to provide a clearer representation. If the map is
 zoomed in (figure \ref{fig:figure4_folium_plot1}), the granularity is refined and the markers of the individual rental station
 locations are displayed. The color of these bubbles correlates with the number of aggregated
 stations in the vicinity. This method also has the advantage that it is easy to see where most rental
 stations are located. In fact, with 112 stations (see figure \ref{fig:figure4_folium_plot1}), the inner districts connected by the
-Waterloo Bridge and the London Blackfriars Bridge form the centre of most stations which are
+Waterloo Bridge and the London Blackfriars Bridge form the center of most stations which are
 close by. It should be noted, however, that new rental stations are constantly being added (and
 sometimes removed again), and a new data extract could result in a different picture. Therefore this assumption is valid for the selected date from figure \ref{fig:figure4_folium_plot1}, but not for today or in two years. A useful
 feature that comes along with the folium plugin is the automatic data display sequence \cite{RN5}. When
@@ -128,7 +128,7 @@ \subsection{Feature Engineering Kings Cross}\label{king}
 future data was only added for the usage of the rental station. The string values were later
 transformed into numerical values since machine learning methods always require numerical values
 instead of string values. Therefore a proper schema was defined, which is described in more detail
-in the modelling section.\\\\
+in the modeling section.\\\\
 With the processed bicycle data from \ref{fig:figure6_kings_cross_df}, machine learning can already be used to predict,
 for example, the daily usage of bicycles (i.e. how many bicycles will be borrowed tomorrow starting
 from today?). The script \glqq Feature Engineering Kings Cross.ipynb\grqq contains the preparation
@@ -184,27 +184,27 @@ \subsubsection{Result}\label{sec:resultmlp}
 \captionof{figure}{Accuracy with recommended features and scaling by the QuantileTransformer}\label{fig:mlpquantile_least}
 \end{figure}
 The prediction for the least used station achieves an \acs{rmse} of 9.311. This shows that the model works not only for highly frequented stations but also for stations like Farringdon Street with only 145 records.
-\subsection{Modelling (Polynomial Regression)}\label{poly}
+\subsection{Modeling (Polynomial Regression)}\label{poly}
 Since we don't have a linear relationship in the data, a linear regression will not be helpful.
 For example, the regressand \glqq Rented Bikes\grqq does not correlate linearly with the temperature, as even on rainy
 days there is a slight chance that more people rent a bike than on a sunny day, for instance due to a special
 holiday. Therefore polynomial regression was selected as a machine learning method for the
 prediction of rented bikes at a station.\\\\
 Polynomial regression belongs to the regression methods \cite{RN9}. In fact, it is just a modified version of a
-linear regression. This means the independent variable x and the dependent variable y is modelled
-as an nth degress (so-called polynomial) in x \cite{RN9}.\\\\
+linear regression. This means the relationship between the independent variable x and the dependent variable y is modeled
+as an nth degree polynomial in x \cite{RN9}.\\\\
 In a more formal way, the polynomial regression can be expressed as follows:
 $$Y=\beta_0+\beta_1 x+\beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n$$
 Where n is the degree of the regression.\\\\
-With the scitkit-library in python a data scientist can import the function \glqq PolynomialFeatures\grqq from
+With the scikit-learn library in Python a data scientist can import the class \glqq PolynomialFeatures\grqq from
 \glqq sklearn.preprocessing\grqq which transforms the input features into higher-degree polynomial features. For example
 one could apply \glqq poly = PolynomialFeatures(degree = 3)\grqq to get a polynomial
 regression of degree three. This should improve the accuracy as our underlying data has
 no linear relationships but maybe higher-degree ones. Furthermore, the higher the degree
 the better the accuracy should be. Unfortunately the computation time increases drastically with the degree. A degree of
 4 already took several hours to compute and was only slightly better than a regression of degree
 three.\\\\
-The RSME error on a degree of 4 was around 48,4, which is in comparison to the other tested
+The RMSE at degree 4 was around 48.4, which is, compared to the other tested
 ones, not really bad but maybe also not the best one.\\\\
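A minimal scikit-learn sketch of this approach is shown below; the data loading and feature columns are placeholders and do not correspond to the project's actual notebook code.

# poly_regression_sketch.py -- generic sketch; file name and columns are placeholders.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("kings_cross_features.csv")          # hypothetical prepared feature table
X = df[["temperature", "humidity", "weekday", "season"]]
y = df["rented_bikes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# degree=3 expands the features into all polynomial terms up to the third degree
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE at degree 3: {rmse:.1f}")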
 Figure \ref{fig:figure9_polynomial_features} shows the different plots of each feature and the prediction (rented usage). It turns out
 that the feature Season, for example, has no great influence on the use of bicycles, whereby

24_hourly.tex

Lines changed: 3 additions & 3 deletions
@@ -8,10 +8,10 @@ \subsection{Data Preparation}
 increase significantly on an hourly granularity. Therefore the normal use of Anaconda and Jupyter
 on a local computer may not be sufficient due to low physical memory. An ideal use case for our
 newly installed Hadoop cluster! There, PySpark can be used, as already done in Data Profiling
-Part 1 in chapter \ref{dp1} to manage \glqq bg data\grqq transformations. Unfortunately the behaviour and syntax of PySpark
+Part 1 in chapter \ref{dp1}, to manage \glqq big data\grqq transformations. Unfortunately the behavior and syntax of PySpark
 is sometimes a little more complicated than Pandas. For example, in PySpark it is not easily
 possible to iterate over rows, since the data frame is distributed over the worker nodes and thus it
-only allows columnwise operations. Moreover, Pandas operations such as „iloc“ are not available
+only allows column-wise operations. Moreover, Pandas operations such as \glqq iloc\grqq are not available
 in PySpark. But the API also comes with some advantages, e.g. it is quite performant at big data
 scale (i.e. it can easily process several million records) and it has an SQL approach.
 Functions like \glqq select\grqq , \glqq where\grqq and \glqq filter\grqq are syntactically close to SQL as we know it from MySQL
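As an illustration of this SQL-like, column-wise style, a short PySpark sketch follows; the file path and column names are hypothetical and not taken from the project's Zeppelin notebooks.

# pyspark_sketch.py -- illustrative only; path and column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bike-usage").getOrCreate()

df = spark.read.csv("/data/journeys.csv", header=True, inferSchema=True)

# column-wise, SQL-like operations instead of row iteration
hourly = (
    df.where(F.col("duration") > 0)                                       # filter invalid rentals
      .select("start_station", F.hour(F.to_timestamp("start_date")).alias("hour"))
      .groupBy("start_station", "hour")
      .count()                                                            # rentals per station and hour
      .orderBy(F.desc("count"))
)
hourly.show(10)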
@@ -27,7 +27,7 @@ \subsection{Data Preparation}
 \includegraphics[width=1.2\textwidth]{img/listing5}\label{fig:listing5}
 \captionof{figure}{Transformation of humidity columns (excerpt)}\label{fig:listing5}
 \end{figure}
-he excerpt from \ref{fig:listing5} shows that \glqq structTypes\grqq can be used to search the individual sublists
+The excerpt from \ref{fig:listing5} shows that \glqq StructTypes\grqq can be used to search the individual sub lists
 of \glqq hourly weather\grqq .
 The complete script is contained in the Zeppelin notebook \glqq DFGeneration.json\grqq and can also be found on GitHub.
 \\\\
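A nested column of this kind can be flattened in PySpark roughly as sketched below; the assumed schema (an hourly_weather array of structs with time and humidity fields) is illustrative and may differ from the actual data.

# nested_struct_sketch.py -- the schema is assumed for illustration; adjust to the real data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-weather").getOrCreate()

df = spark.read.json("/data/weather.json")   # hypothetical source with nested hourly data

# explode the array of hourly structs, then pull out individual fields
hourly = (
    df.select("date", F.explode("hourly_weather").alias("hw"))
      .select("date", F.col("hw.time").alias("hour"), F.col("hw.humidity").alias("humidity"))
)
hourly.show(5)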

DSPRReport.pdf

1.54 MB
Binary file not shown.

DSPRReport.tex

Lines changed: 0 additions & 1 deletion
@@ -101,7 +101,6 @@ \chapter{Project Presentation and Organization}
 \input{./03_task_description_2.tex}
 \input{./04_collaboration_technologies.tex}
 \input{./05_responsabilities.tex}
-\input{./07_conclusion_2.tex}
 \chapter{Postal Code Database (Nominatim and Graphhopper)}
 \input{./11_nominatim.tex}
 \input{./13_graphhopper.tex}

bibtex/library.bib

Lines changed: 2 additions & 2 deletions
@@ -354,7 +354,7 @@ @article{RN2
 
 @article{RN4,
 author = {Hortonworks},
-title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/check_dns.html},
+title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk{\_}ambari-installation/content/check{\_}dns.html},
 year = {2019},
 type = {Journal Article}
 }
@@ -399,7 +399,7 @@ @misc{RN1
 
 @article{RN3,
 author = {Sposetti, Jeff},
-title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-Step1:CreateBlueprint},
+title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints{\#}Blueprints-Step1:CreateBlueprint},
 year = {2017},
 type = {Journal Article}
 }
