
Commit cbc2baf
committed by KathiBrown

merged data profiling part 2, feature engineering Kings Cross and partly data preparation on an hourly basis

1 parent: 5be3ef3

22 files changed: +306 -11 lines changed

22_data_profiling.tex: +1 -1

@@ -1,6 +1,6 @@
 % vim:ft=tex

-\section{Data Profiling}
+\section{Data Profiling}\label{dp1}

 Data profiling is the process of reviewing source data, understanding its structure, content and interrelationships, and identifying potential for data projects. For the project, the Santander Bicycle data will be profiled more closely \citep{TFL2019}. For this I mainly used Zeppelin notebooks and the HDP cluster with four worker nodes initialized in chapter \ref{intallhadoop}. Zeppelin works similarly to Jupyter notebooks and also supports magic commands. However, Zeppelin stores its notebooks in .json format, while Jupyter notebooks (Python) use \textbf{.ipynb}. A direct import of Zeppelin notebooks into Jupyter is therefore not possible and
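Even though a direct import is not possible, a Zeppelin note is plain JSON, so its paragraph sources can at least be extracted with a few lines of Python when code has to be moved into a Jupyter environment. A minimal sketch, assuming Zeppelin's usual note.json layout (a top-level "paragraphs" list whose entries keep their source in a "text" field); the file path is hypothetical:

```python
import json

# Extract the source of every paragraph from a Zeppelin note so it can
# be reused elsewhere, e.g. pasted into a Jupyter notebook.
# Assumes the usual Zeppelin layout: {"paragraphs": [{"text": "..."}, ...]}
with open("notebook/note.json") as f:   # hypothetical path
    note = json.load(f)

for i, paragraph in enumerate(note.get("paragraphs", [])):
    source = paragraph.get("text", "")
    print(f"--- paragraph {i} ---")
    print(source)
```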

23_data_prediction.tex: +139 -6

(Large diffs are not rendered by default.)

24_hourly.tex: +36 -4

@@ -3,10 +3,42 @@
 \section{Prediction on Hourly Basis}

 \subsection{Data Preparation}
-
-% Sliding windows, ...
-% Aggregation stuff...
-
+Since the long-term goal was to predict the usage on an hourly basis, some further data
+transformation steps are necessary to achieve this goal. Furthermore, the number of data records
+increases significantly at an hourly granularity. Therefore the normal use of Anaconda and Jupyter
+on a local computer may not be sufficient due to limited physical memory. An ideal use case for our
+newly installed Hadoop cluster! There, PySpark can be used, as already done for Data Profiling
+Part 1 in chapter \ref{dp1}, to manage \glqq big data\grqq{} transformations. Unfortunately, the behaviour and syntax of PySpark
+are sometimes a little more complicated than Pandas. For example, in PySpark it is not easily
+possible to iterate over rows, since the data frame is distributed over the worker nodes and thus
+only allows column-wise operations. Moreover, Pandas operations such as \glqq iloc\grqq{} are not available
+in PySpark. But the API also comes with some advantages, e.g. it is quite performant at big data
+scale (i.e. it can easily process several million records) and it has an SQL approach.
+Functions like \glqq select\grqq{}, \glqq where\grqq{} and \glqq filter\grqq{} are syntactically close to SQL as known from MySQL
+and other database management systems.\\\\
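As an aside to the paragraph above, a minimal sketch of this SQL-like style; the toy data and column names below are invented for the example and do not come from the project data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-like-api").getOrCreate()

# Toy frame standing in for the bicycle data; station/hour/rentals are
# invented column names for this example.
df = spark.createDataFrame(
    [("Kings Cross", 8, 42), ("Waterloo", 8, 17)],
    ["station", "hour", "rentals"],
)

# select/filter read almost like the equivalent SQL statement ...
df.select("station", "rentals").filter(F.col("rentals") > 20).show()

# ... and the same query can literally be written as SQL.
df.createOrReplaceTempView("rentals")
spark.sql("SELECT station, rentals FROM rentals WHERE rentals > 20").show()
```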
+The structure of the weather data is inconsistent due to \glqq hourly weather\grqq{}. While the other columns
+only contain simple values, the column \glqq hourly weather\grqq{} contains nested JSON lists. This means
+that these nested lists must somehow become normal columns. This was a little more complicated
+than expected, but not impossible. Fortunately, PySpark allows one to define \glqq schemas\grqq{} that are
+used as a kind of blueprint by Spark to read the Spark data frame. With the following code one
+can already create normal columns from the JSON lists:
+\begin{figure}[H]
+\hspace{-1.6cm}
+\includegraphics[width=1.2\textwidth]{img/listing5}
+\captionof{figure}{Transformation of humidity columns (excerpt)}\label{fig:listing5}
+\end{figure}
+The excerpt in \ref{fig:listing5} shows that \glqq StructTypes\grqq{} can be used to parse the individual sublists
+of \glqq hourly weather\grqq{}.
+The complete script is contained in the Zeppelin notebook \glqq DFGeneration.json\grqq{} and can also be found on GitHub.
+\\\\
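The listing referenced above is only included as an image, so as a rough idea of the technique, here is a minimal sketch using a StructType schema together with from_json and explode; the field names (time, humidity) and the toy row are assumptions for illustration, not the actual Dark Sky layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType

spark = SparkSession.builder.appName("flatten-hourly-weather").getOrCreate()

# Schema acting as the "blueprint" for one entry of the nested JSON list.
# The field names here are assumptions for this sketch.
hour_schema = StructType([
    StructField("time", LongType()),
    StructField("humidity", DoubleType()),
])

# Toy row: one day with a JSON string of two hourly observations.
df = spark.createDataFrame(
    [("2019-01-01", '[{"time": 0, "humidity": 0.8}, {"time": 1, "humidity": 0.7}]')],
    ["date", "hourly_weather"],
)

# Parse the JSON string into an array of structs, explode it into one row
# per hour, then promote the struct fields to ordinary columns.
flat = (
    df.withColumn("hours", F.from_json("hourly_weather", ArrayType(hour_schema)))
      .withColumn("hour", F.explode("hours"))
      .select("date", "hour.time", "hour.humidity")
)
flat.show()
```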
+The described transformation creates a new column with the corresponding value for each hour.
+The script works dynamically. For example, the user can look at the data for two hours from today,
+which then looks like this:
+\begin{figure}[H]
+\hspace{-1.6cm}
+\includegraphics[width=1.2\textwidth]{img/figure7_weather_df}
+\captionof{figure}{Weather dataframe after transformation for two hours}\label{fig:figure7_weather_df}
+\end{figure}
 \subsubsection{Holidays}

 The bank holidays to be added are day-based, so they should be spread over 24-hour windows.
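How the day-based flags are spread over the hourly rows is not shown in this diff; one straightforward way, sketched here with invented data, is to reduce each hourly timestamp to its calendar date and left-join against the holiday list, so that all 24 hourly rows of a holiday receive the flag:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("holiday-windows").getOrCreate()

# Invented hourly rows and bank-holiday dates for the sketch; in the
# report the hourly rows come from the transformation described above.
hourly = spark.createDataFrame(
    [("2019-12-25 08:00:00",), ("2019-12-27 08:00:00",)], ["ts"]
).withColumn("ts", F.to_timestamp("ts"))

holidays = spark.createDataFrame(
    [("2019-12-25",)], ["holiday_date"]
).withColumn("holiday_date", F.to_date("holiday_date"))

# Reduce each timestamp to its calendar day and left-join, so every one
# of the 24 hourly rows of a holiday gets flagged.
flagged = (
    hourly.withColumn("date", F.to_date("ts"))
          .join(holidays, F.col("date") == F.col("holiday_date"), "left")
          .withColumn("is_holiday", F.col("holiday_date").isNotNull())
          .drop("holiday_date")
)
flagged.show()
```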

DSPRReport.pdf: 1.82 MB (binary file not shown)

bibtex/library.bib: +65

@@ -345,3 +345,68 @@ @inproceedings{riedmiller1993direct
   year={1993},
   organization={San Francisco}
 }
+@article{RN2,
+   author = {Hortonworks},
+   title = {Apache Ambari Operations. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/ch_using_ambari_metrics.html},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN4,
+   author = {Hortonworks},
+   title = {Check DNS and NSCD. Available on: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/check_dns.html},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN6,
+   author = {Lydall, Ross},
+   title = {Boris Johnson's bike hire scheme gets a £25m bonus from Barclays. Available on: https://web.archive.org/web/20100913111233/http://www.thisislondon.co.uk/standard/article-23839406-boris-bike-hire-scheme-gets-a-pound-25m-bonus-from-barclays.do},
+   year = {2010},
+   type = {Journal Article}
+}
+
+@article{RN8,
+   author = {n.d.},
+   title = {Apache PySpark Documentation. Available on: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN7,
+   author = {n.d.},
+   title = {Dark Sky Weather API. Available on: https://darksky.net/dev/docs/faq},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@article{RN5,
+   author = {n.d.},
+   title = {Folium package. Available on: https://python-visualization.github.io/folium/},
+   year = {2019},
+   type = {Journal Article}
+}
+
+@misc{RN1,
+   year = {2019},
+   type = {Generic}
+}
+
+@article{RN3,
+   author = {Sposetti, Jeff},
+   title = {Ambari Blueprints. Available on: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-Step1:CreateBlueprint},
+   year = {2017},
+   type = {Journal Article}
+}
+
+@article{RN9,
+   author = {Wang, Yongqiao and Li, Lishuai and Dang, Chuangyin},
+   title = {Calibrating Classification Probabilities with Shape-restricted Polynomial Regression},
+   journal = {IEEE transactions on pattern analysis and machine intelligence},
+   ISSN = {0162-8828},
+   year = {2019},
+   type = {Journal Article}
+}
255 KB

img/figure1_proxmox.png: 155 KB
img/figure2_webmin.png: 98.1 KB
img/figure3_hadoop.png: 1.67 MB
img/figure4_folium_plot1.png: 907 KB
img/figure5_folium_plot2.png: 1 MB
img/figure6_kings_cross_df.png: 26.2 KB
img/figure7_weather_df.png: 17.6 KB
img/figure8_temperature_df.png: 14.3 KB
img/figure9_polynomial_features.png: 152 KB
img/listing1.png: 57.3 KB
img/listing2.png: 52.5 KB
img/listing3.png: 54.6 KB
img/listing4.png: 29.9 KB
img/listing5.png: 28.5 KB

pascal_report/ds2_pascal_report.pdf: 1.6 MB (binary file not shown)

pascal_report/references.txt: +65
(new file; the entries are identical to those added to bibtex/library.bib above)
