-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLesson-Warehouse.qmd
109 lines (81 loc) · 6.08 KB
/
Lesson-Warehouse.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
---
title: "Biostat 823 - Data Warehousing"
author: "Hilmar Lapp"
institute: "Duke University, Department of Biostatistics & Bioinformatics"
date: "Sep 12, 2024"
format:
revealjs:
slide-number: true
editor: visual
knitr:
opts_chunk:
echo: TRUE
---
## OLTP vs OLAP
| OLTP <br/>[(Online Transaction Processing)]{style="font-size:smaller"} | OLAP <br/>[(Online Analytical Processing)]{style="font-size:smaller"} |
|------------------------------------|------------------------------------|
| Real-time data transactions | Data analysis (BI) |
| Fast response times, protected data integrity | Dataset generation for question answering |
| frequent INSERT, UPDATE, DELETE | SELECT, frequently aggregated |
| Highly normalized design | Denormalized design |
| Constantly updated | Read-only, periodically refreshed |
## Data Warehousing
- Data Warehouses support OLAP use-cases and applications.
- Typically built by ETL or ELT process(es)
- Often from multiple source databases
- Denormalized database schema design based on fact and dimension tables
## Extract--Transform--Load (ETL)
- Often uses staging tables and/or databases
- Transformation in external software or in SQL
- Note that tables can be created directly from query:
``` sql
CREATE TABLE tablename AS SELECT ...
```
## Datamarts vs Data Warehouse {.smaller}
- Datamart is a type of Data Warehouse
- Datamarts are typically more narrow in data scope
- Typically focused on one subject area
- Easier and less time-consuming to build
- Less memory and storage-intensive than a full Data Warehouse 
## Star Schema {.smaller}
- Uses one (central) Fact table and multiple Dimension tables
- Fact table houses quantitative metrics or measurements
- Typically aggregated, based on chosen *granularity* (of dimension(s))
- Foreign key to each dimension {width="40%" style="float: right"}
- Dimension tables house the dimensions along which to aggregate or slice facts
- Time, location, people, etc
- Typically denormalized
- Each dimension table has 1:n relationship to the fact table
- Fact table normally has orders of magnitude more rows than dimension tables
- In essence, this represents a data cube
- Fact table is the cells; dimension tables are the dimensions ("axes") of the cube
## Star Schema example (generic)
{fig-align="center" width="65%"}
::: aside
This and previous figure from [Microsoft, *Understand star schema and the importance for Power BI*](https://learn.microsoft.com/en-us/power-bi/guidance/star-schema)
:::
## Clinical data warehouse example
](images/SQL/Clinical-star-schema-7796957-fig-1-source-large.gif)
## Snowflake Schema
- A *Snowflake Schema* is a Star Schema with more denormalized dimension tables
](https://upload.wikimedia.org/wikipedia/commons/7/73/Snowflake-schema-example.png){width="60%"}
## Reverse Star Schema
- Deployed for descriptive or observational facts
- Quantitative metrics often are not meaningful
- Typically multiple related facts ("dimension" table rows) for the central fact , BioMart: a data federation framework for large collaborative projects. doi:10.1093/database/bar038"){width="40%" style="float: right"}
- Queries often involve attributes of the central fact
- Fact table has 1:n ("reversed") relationship to each "dimension"
- Hence, fact table is usually smaller than any of the dimension tables
## Data Warehouses in Bioinformatics
- Many biomedical and biological data warehouse systems follow the Reverse Star Schema model
](images/SQL/Ensmart.jpg)
## Genomics Data Warehouse systems {.smaller}
- [BioMart](https://github.com/biomart/BioMart) (successor of EnsMart)
- Zhang *et al* (2011) BioMart: a data federation framework for large collaborative projects, Database, doi:[10.1093/database/bar038](https://doi.org/10.1093/database/bar038)
- Underlying the [data mining UI and API for Ensembl](https://www.ensembl.org/info/data/biomart/) (see also [fully normalized Ensembl data model](https://www.ensembl.org/info/docs/api/core/core_schema.html))
- [Intermine](http://intermine.org)
- Smith *et al* (2012) [InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data](https://doi.org/10.1093/bioinformatics/bts577). Bioinformatics **28**(23):3163-3165
- [Uses a configurable object-relational model](http://intermine.org/im-docs/docs/data-model/overview), implemented in a PostgreSQL database
- Data access is through APIs
- Underlying the [data mining API and UI for Drosophila](https://www.flymine.org/), and deployed for [dozens of other model organism databases](https://en.wikipedia.org/wiki/InterMine#Current_projects_(not_exhaustive_list))
- See also Triplet and Butler (2014) [A review of genomic data warehousing systems](https://doi.org/10.1093/bib/bbt031). Briefings in Bioinformatics. **15**(4):471--483