
Commit cbc2baf
committed by KathiBrown

merged data profiling part 2, feature engineering Kings Cross and partly data preparation on an hourly basis

1 parent: 5be3ef3

22 files changed: +306 -11 lines changed

22_data_profiling.tex: +1 -1

@@ -1,6 +1,6 @@
 % vim:ft=tex

-\section{Data Profiling}
+\section{Data Profiling}\label{dp1}

 Data profiling is the process of reviewing source data, understanding its structure, content and interrelationships, and identifying potential for data projects. For the project, the Santander Bicycle data will be profiled more closely \citep{TFL2019}. For this I mainly used Zeppelin notebooks and the HDP cluster with four worker nodes initialized in chapter \ref{intallhadoop}. Zeppelin works similarly to Jupyter notebooks and also supports magic commands. However, Zeppelin stores its notebooks in .json format, while Jupyter notebooks (Python) use \textbf{.ipynb}. A direct import of Zeppelin notebooks into Jupyter is therefore not possible and
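Even though a direct import is not possible, a Zeppelin note is plain JSON, so its paragraph sources can at least be extracted with a few lines of Python when code has to be moved into a Jupyter environment. A minimal sketch, assuming Zeppelin's usual note.json layout (a top-level "paragraphs" list whose entries keep their source in a "text" field); the file path is hypothetical:

```python
import json

# Extract the source of every paragraph from a Zeppelin note so it can
# be reused elsewhere, e.g. pasted into a Jupyter notebook.
# Assumes the usual Zeppelin layout: {"paragraphs": [{"text": "..."}, ...]}
with open("notebook/note.json") as f:   # hypothetical path
    note = json.load(f)

for i, paragraph in enumerate(note.get("paragraphs", [])):
    source = paragraph.get("text", "")
    print(f"--- paragraph {i} ---")
    print(source)
```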

23_data_prediction.tex: +139 -6

(Large diffs are not rendered by default.)

24_hourly.tex: +36 -4

@@ -3,10 +3,42 @@
 \section{Prediction on Hourly Basis}

 \subsection{Data Preparation}
-
-% Sliding windows, ...
-% Aggregation stuff...
-
+Since the long-term goal was to predict the usage on an hourly basis, some further data
+transformation steps are necessary to achieve this goal. Furthermore, the number of data records
+increases significantly at an hourly granularity. Therefore the normal use of Anaconda and Jupyter
+on a local computer may not be sufficient due to limited physical memory. An ideal use case for our
+newly installed Hadoop cluster! There, PySpark can be used, as already done for Data Profiling
+Part 1 in chapter \ref{dp1}, to manage \glqq big data\grqq{} transformations. Unfortunately, the behaviour and syntax of PySpark
+are sometimes a little more complicated than Pandas. For example, in PySpark it is not easily
+possible to iterate over rows, since the data frame is distributed over the worker nodes and thus
+only allows column-wise operations. Moreover, Pandas operations such as \glqq iloc\grqq{} are not available
+in PySpark. But the API also comes with some advantages, e.g. it is quite performant at big data
+scale (i.e. it can easily process several million records) and it has an SQL approach.
+Functions like \glqq select\grqq{}, \glqq where\grqq{} and \glqq filter\grqq{} are syntactically close to SQL as known from MySQL
+and other database management systems.\\\\
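As an aside to the paragraph above, a minimal sketch of this SQL-like style; the toy data and column names below are invented for the example and do not come from the project data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-like-api").getOrCreate()

# Toy frame standing in for the bicycle data; station/hour/rentals are
# invented column names for this example.
df = spark.createDataFrame(
    [("Kings Cross", 8, 42), ("Waterloo", 8, 17)],
    ["station", "hour", "rentals"],
)

# select/filter read almost like the equivalent SQL statement ...
df.select("station", "rentals").filter(F.col("rentals") > 20).show()

# ... and the same query can literally be written as SQL.
df.createOrReplaceTempView("rentals")
spark.sql("SELECT station, rentals FROM rentals WHERE rentals > 20").show()
```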
+The structure of the weather data is inconsistent due to \glqq hourly weather\grqq{}. While the other columns
+only contain simple values, the column \glqq hourly weather\grqq{} contains nested JSON lists. This means
+that these nested lists must somehow become normal columns. This was a little more complicated
+than expected, but not impossible. Fortunately, PySpark allows one to define \glqq schemas\grqq{} that are
+used as a kind of blueprint by Spark to read the Spark data frame. With the following code one
+can already create normal columns from the JSON lists:
+\begin{figure}[H]
+\hspace{-1.6cm}
+\includegraphics[width=1.2\textwidth]{img/listing5}
+\captionof{figure}{Transformation of humidity columns (excerpt)}\label{fig:listing5}
+\end{figure}
+The excerpt in \ref{fig:listing5} shows that \glqq StructTypes\grqq{} can be used to parse the individual sublists
+of \glqq hourly weather\grqq{}.
+The complete script is contained in the Zeppelin notebook \glqq DFGeneration.json\grqq{} and can also be found on GitHub.
+\\\\
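The listing referenced above is only included as an image, so as a rough idea of the technique, here is a minimal sketch using a StructType schema together with from_json and explode; the field names (time, humidity) and the toy row are assumptions for illustration, not the actual Dark Sky layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType

spark = SparkSession.builder.appName("flatten-hourly-weather").getOrCreate()

# Schema acting as the "blueprint" for one entry of the nested JSON list.
# The field names here are assumptions for this sketch.
hour_schema = StructType([
    StructField("time", LongType()),
    StructField("humidity", DoubleType()),
])

# Toy row: one day with a JSON string of two hourly observations.
df = spark.createDataFrame(
    [("2019-01-01", '[{"time": 0, "humidity": 0.8}, {"time": 1, "humidity": 0.7}]')],
    ["date", "hourly_weather"],
)

# Parse the JSON string into an array of structs, explode it into one row
# per hour, then promote the struct fields to ordinary columns.
flat = (
    df.withColumn("hours", F.from_json("hourly_weather", ArrayType(hour_schema)))
      .withColumn("hour", F.explode("hours"))
      .select("date", "hour.time", "hour.humidity")
)
flat.show()
```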
+The described transformation creates a new column with the corresponding value for each hour.
+The script works dynamically. For example, the user can look at the data for two hours from today,
+which then looks like this:
+\begin{figure}[H]
+\hspace{-1.6cm}
+\includegraphics[width=1.2\textwidth]{img/figure7_weather_df}
+\captionof{figure}{Weather dataframe after transformation for two hours}\label{fig:figure7_weather_df}
+\end{figure}
 \subsubsection{Holidays}

 The bank holidays to be added are day-based, so they should be spread over 24-hour windows.
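How the day-based flags are spread over the hourly rows is not shown in this diff; one straightforward way, sketched here with invented data, is to reduce each hourly timestamp to its calendar date and left-join against the holiday list, so that all 24 hourly rows of a holiday receive the flag:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("holiday-windows").getOrCreate()

# Invented hourly rows and bank-holiday dates for the sketch; in the
# report the hourly rows come from the transformation described above.
hourly = spark.createDataFrame(
    [("2019-12-25 08:00:00",), ("2019-12-27 08:00:00",)], ["ts"]
).withColumn("ts", F.to_timestamp("ts"))

holidays = spark.createDataFrame(
    [("2019-12-25",)], ["holiday_date"]
).withColumn("holiday_date", F.to_date("holiday_date"))

# Reduce each timestamp to its calendar day and left-join, so every one
# of the 24 hourly rows of a holiday gets flagged.
flagged = (
    hourly.withColumn("date", F.to_date("ts"))
          .join(holidays, F.col("date") == F.col("holiday_date"), "left")
          .withColumn("is_holiday", F.col("holiday_date").isNotNull())
          .drop("holiday_date")
)
flagged.show()
```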

DSPRReport.pdf: 1.82 MB (binary file not shown)

bibtex/library.bib: +65

@@ -345,3 +345,68 @@ @inproceedings{riedmiller1993direct
   year={1993},
   organization={San Francisco}
 }
+@article{RN2,
+   author = {Hortonworks},
+   title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/ch_using_ambari_metrics.html},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN4,
+   author = {Hortonworks},
+   title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/check_dns.html},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN6,
+   author = {Lydall, Ross},
+   title = {Boris Johnson's bike hire scheme gets a £25m bonus from Barclays. Available on: https://web.archive.org/web/20100913111233/http://www.thisislondon.co.uk/standard/article-23839406-boris-bike-hire-scheme-gets-a-pound-25m-bonus-from-barclays.do},
+   year = {2010},
+   type = {Journal Article}
+}
+
+@article{RN8,
+   author = {n.d.},
+   title = {Apache PySpark Documentation. Available on: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN7,
+   author = {n.d.},
+   title = {Dark Sky Weather API. Available on: https://darksky.net/dev/docs/faq},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN5,
+   author = {n.d.},
+   title = {Folium package. Available on: https://python-visualization.github.io/folium/},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@misc{RN1,
+   year = {2019},
+   type = {Generic}
+}
+
+@article{RN3,
+   author = {Sposetti, Jeff},
+   title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-Step1:CreateBlueprint},
+   year = {2017},
+   type = {Journal Article}
+}
+
+@article{RN9,
+   author = {Wang, Yongqiao and Li, Lishuai and Dang, Chuangyin},
+   title = {Calibrating Classification Probabilities with Shape-restricted Polynomial Regression},
+   journal = {IEEE transactions on pattern analysis and machine intelligence},
+   ISSN = {0162-8828},
+   year = {2019},
+   type = {Journal Article}
+}
255 KB

img/figure1_proxmox.png: 155 KB
img/figure2_webmin.png: 98.1 KB
img/figure3_hadoop.png: 1.67 MB
img/figure4_folium_plot1.png: 907 KB
img/figure5_folium_plot2.png: 1 MB
img/figure6_kings_cross_df.png: 26.2 KB
img/figure7_weather_df.png: 17.6 KB
img/figure8_temperature_df.png: 14.3 KB
img/figure9_polynomial_features.png: 152 KB
img/listing1.png: 57.3 KB
img/listing2.png: 52.5 KB
img/listing3.png: 54.6 KB
img/listing4.png: 29.9 KB
img/listing5.png: 28.5 KB

pascal_report/ds2_pascal_report.pdf: 1.6 MB (binary file not shown)

pascal_report/references.txt: +65
(new file; the entries are identical to those added to bibtex/library.bib above)
