-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy path02-data-structures.Rmd
More file actions
408 lines (295 loc) · 14.1 KB
/
02-data-structures.Rmd
File metadata and controls
408 lines (295 loc) · 14.1 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
# Data Structures
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = '', prompt = FALSE, collapse = FALSE)
```
This chapter compares and contrasts data structures in Python and R.
## One-dimensional data
A one-dimensional data structure can be visualized as a column in a spreadsheet or as a list of values.
#### Python {-}
There are many ways to organize one-dimensional data in Python. Three of the most common one-dimensional data structures are lists, **NumPy** arrays, and **pandas** Series. All three are ordered and mutable, and they can contain data of different types.
Lists in Python do not need to be explicitly declared; they are indicated by the use of square brackets.
```{python}
l = [1, 2, 3, 'hello']
```
Values in lists can be accessed by using square brackets. Python indexing begins at 0, so to extract the first element, we would use the index 0. Python also allows for negative indexing; using an index of -1 will return the last value in the list. Indexing a range in Python is _not_ inclusive of the last index.
```{python}
# Extract first element
l[0]
```
```{python}
# Extract last element
l[-1]
```
```{python}
# Extract second and third elements
l[1:3]
```
**NumPy** arrays, on the other hand, need to be declared using the `numpy.array()` function, and the **NumPy** package needs to be imported.
```{python}
import numpy as np
arr = np.array([1, 2, 3, 'hello'])
print(arr)
```
Accessing data in a **NumPy** array is the same as indexing a list.
```{python}
# Extract first element
arr[0]
# Extract last element
arr[-1]
# Extract second and third elements
arr[1:3]
```
**pandas** Series also need to be declared using the `pandas.Series()` function. Like **NumPy**, the **pandas** package must be imported as well. The **pandas** package is built on **NumPy**, so we can input data into a **pandas** Series using a **NumPy** array. We can extract data from a Series by indexing, similar to indexing a list or **NumPy** array.
```{python}
import pandas as pd
import numpy as np
data = np.array([1, 2, 3, "hello"])
ser1 = pd.Series(data)
print(ser1)
```
```{python}
# Extract first element
ser1[0]
```
```{python}
# Extract second and third elements
ser1[1:3]
```
Similarly, we can use `iloc[]` (note the square brackets) to subset elements based on integer position.
```{python}
# Extract second and third elements
ser1.iloc[1:3]
```
We can relabel the indices of the Series to whatever we like using the `index` attribute within the `Series()` function.
```{python}
import pandas as pd
import numpy as np
ser2 = pd.Series(data, index = ['a', 'b', 'c', 'd'])
print(ser2)
```
We can then use our own specified indices to select and index our data. Indexing with our labels can be done in two ways; one approach uses the `.loc[]` function and is similar to indexing arrays and lists with square brackets; the other approach follows this form: `Series.label_name`. Note that ranges referenced in `loc[]` are _inclusive on both ends_; normally, Python interprets ranges as having open upper ends (e.g., `print(["a", "b", "c"][0:2])` returns `["a", "b"]`.)
```{python}
# Extract element in row b
ser2.loc["b"]
```
```{python}
# Extract elements from row b to the end
ser2.loc["b":]
```
```{python}
# Extract element in row "d"
ser2.d
```
```{python}
# Extract element in row "b"
ser2.b
```
Mathematical operations cannot be carried out on lists, but they can be carried out on **NumPy** arrays and **pandas** Series. In general, lists are better for short data sets that won't be the targets of mathematical operations. **NumPy** arrays and **pandas** Series are better for long data sets and for data sets that will be operated on mathematically.
#### R {-}
In R a one-dimensional data structure is called a _vector_. We can create a vector using the `c()` function. A vector in R can only contain one type of data (all numbers, all strings, etc). The columns of data frames---a data structure discussed in section 2.2 below---are vectors. If multiple types of data are put into a vector, the data will be coerced according to the hierarchy `logical` < `integer` < `double` < `complex` < `character`. This means if you mix, say, integers and character data, all the data will be coerced to character.
```{r}
x1 <- c(23, 43, 55)
x1
# All values coerced to character
x2 <- c(23, 43, 'hi')
x2
```
Values in a vector can be accessed by position using indexing brackets. R indexes elements of a vector starting at 1. Index values are inclusive. For example, a_vec[`2:3`] selects the second and third elements of `a_vec`.
```{r}
# Extract the second value
x1[2]
# Extract the second and third values
x1[2:3]
```
## Two-dimensional data
Two-dimensional data are rectangular in nature, consisting of rows and columns. These can be the type of data you might find in a spreadsheet with a mix of data types in the columns; they can also be matrices as you might encounter in matrix algebra.
#### Python {-}
In Python, two common two-dimensional data structures are the **NumPy** array (introduced above in its one-dimensional form) and the **pandas** DataFrame.
A two-dimensional **NumPy** array is made in a similar way to the one-dimensional array using the `numpy.array()` function.
```{python}
import numpy as np
arr2d = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
print(arr2d)
```
Data can be selected from a two-dimensional **NumPy** array via `[row, column]` indexing:
```{python}
import numpy as np
# Extract first element
arr2d[0, 0]
# Extract last element
arr2d[-1, -1]
# Extract second and third columns (all rows)
arr2d[:, 1:3]
```
A **pandas** DataFrame is made using the `pandas.DataFrame()` function, and it shares many functional similarities with the **pandas** Series.
```{python}
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data)
print(df)
```
To select certain rows and columns from a **pandas** DataFrame based on their integer position, we can use `iloc[]` (as we did with the **pandas** Series above). The values inside the brackets following `iloc` reflect the rows and the columns, respectively.
```{python}
# Extract first element (first row, first column)
df.loc[0, 0]
```
```{python}
# Extract first row, all columns
df.loc[0, :]
```
```{python}
# Extract second column, all rows
df.loc[:, 1]
```
As we did with the **pandas** Series, we can change the indices and the column names of the DataFrame and use those to select certain data. We change the indices using the `index` attribute in `pandas.DataFrame()`, and we change the column names using the `columns` attribute. These new labels can be used with `loc[]` for subsetting. (And, as mentioned above, if a range is passed to `loc[]`, it's treated as inclusive on both ends.)
```{python}
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data, index = ["a", "b"], columns = ["column 1", "column 2", "column 3", "column 4"])
print(df.loc[["a", "b"], "column 1"])
```
One thing to note is that **NumPy** arrays can actually have an arbitrary number of dimensions, whereas **pandas** DataFrames can only have two.
#### R {-}
Two-dimensional data structures in R include the _matrix_ and _data frame_. A matrix can contain only one data type. A data frame can contain multiple vectors, each of which can consist of different data types.
Create a matrix with the `matrix()` function. Create a data frame with the `data.frame()` function. Most imported data comes into R as a data frame.
```{r}
# Matrix; populated down by column by default
m <- matrix(data = c(1, 3, 5, 7), nrow = 2, ncol = 2)
m
# Data frame
d <- data.frame(name = c("Rob", "Cindy"),
age = c(35, 37))
d
```
Values in a matrix and data frame can be accessed by position using indexing brackets. The first number(s) refers to rows; the second number(s) refers to columns. Leaving row or column numbers empty selects all rows or columns.
```{r}
# Extract value in row 1, column 2
m[1, 2]
# Extract values in row 2
d[2, ]
```
## Three-dimensional and higher data
Three-dimensional and higher data can be visualized as multiple rectangular structures stratified by extra variables. These are sometimes referred to as _arrays_. Analysts usually prefer two-dimensional data frames to arrays. Data frames can accommodate multidimensional data by including the additional dimensions as variables.
#### Python {-}
To create a three-dimensional and higher data structure in Python, we again use a **NumPy** array. We can think of the three-dimensional array as a stack of two-dimensional arrays. We construct this in the same way as the one- and two-dimensional arrays.
```{python}
import numpy as np
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d
```
We can also construct a three-dimensional **NumPy** array using the `reshape()` function on an existing array. The argument of `reshape()` is where you input your desired dimensions: strata, rows, and then columns. Here, the `arange()` function is used to create a **NumPy** array containing the numbers 1 through 12 (to recreate the same array shown above).
```{python}
arr3d_2 = np.arange(1, 13).reshape(2, 2, 3)
arr3d_2
```
Indexing the three-dimensional array follows the same format as the two-dimensional arrays. Since we can think of the three-dimensional array as a stack of two-dimensional arrays, we can extract each "stacked" two-dimensional array. Here we extract the first of the "stacked" two-dimensional arrays:
```{python}
# Extract first strata (first "stacked" 2-D array)
arr3d[0]
```
We can also extract entire rows and columns, or individual array elements:
```{python}
# Extract first row of second strata (second "stacked" 2-D array)
arr3d[1, 0]
```
```{python}
# Extract first column of second strata
arr3d[1, :, 0]
```
```{python}
# Extract the number 6 (first strata, second row, third column)
arr3d[0, 1, 2]
```
The three-dimensional arrays can be converted to two-dimensional arrays again using the `reshape` function:
```{python}
arr3d_2d = arr3d.reshape(4, 3)
arr3d_2d
```
#### R {-}
The `array()` function in R can create three-dimensional and higher data structures. Arrays are like vectors and matrices in that they can only contain one data type. In fact matrices and arrays are sometimes described as vectors with instructions on how to layout the data.
We can specify the dimension number and size using the `dim` argument. Below we specify 2 rows, 3 columns, and 2 strata using a vector: `c(2,3,2)`. This creates a three-dimensional data structure. The data in the example are simply the numbers 1 through 12.
```{r}
a1 <- array(data = 1:12, dim = c(2, 3, 2))
a1
```
Values in arrays can be accessed by position using indexing brackets.
```{r}
# Extract value in row 1, column 2, strata 1
a1[1, 2, 1]
# Extract column 2 in both strata
# Result is returned as matrix
a1[, 2, ]
```
The dimensions can be named using the `dimnames()` function. Notice the names must be a _list_.
```{r}
dimnames(a1) <- list("X" = c("x1", "x2"),
"Y" = c("y1", "y2", "y3"),
"Z" = c("z1", "z2"))
a1
```
The `as.data.frame.table()` function can collapse an array into a two-dimensional structure that may be easier to use with standard statistical and graphical routines. The `responseName` argument allows you to provide a suitable column name for the values in the array.
```{r}
as.data.frame.table(a1, responseName = "value")
```
## General data structures
Both R and Python provide general "catch-all" data structures that can contain data of any shape, type, and amount.
#### Python {-}
The most general data structures in Python are the _list_ and the _tuple_. Both lists and tuples are ordered collections of objects called _elements_. The elements can be other lists/tuples, arrays, integers, objects, etc.
Lists are mutable objects; elements can be reordered or deleted, and new elements can be added after the list has been created. Tuples, on the other hand, are immutable; once a tuple is created it cannot be changed.
Lists are created using square brackets. Here, we create a list and add an element to the list after it is created using the `append` function.
```{python}
lst = [1, 2, 'a', 'b', [3, 4, 5]]
lst
lst.append('c')
lst
```
Tuples are created using parenthesis. Here we create a tuple.
```{python}
tuple = (1, 2, 'a','b', [3, 4, 5])
tuple
```
Let's try to use the append function to explore the immutability of the tuple. We expect to get an error.
```{python, error = TRUE}
tuple.append('c')
```
We can refer to specific list/tuple elements by using square brackets. In the square brackets we put the index number of the element. The element in the first position is at index 0.
```{python}
# Extract the first element of the list and the tuple
lst[0]
tuple[0]
```
```{python}
# Extract the last element of each
lst[-1]
tuple[-1]
```
#### R {-}
The most general data structure in R is the _list_. A list is an ordered collection of objects, which are referred to as the _components_. The components can be vectors, matrices, arrays, data frames, and other lists. The components are always numbered but can also have names. The results of statistical functions are often returned as lists.
We can create lists with the `list()` function. The list below contains three components: a vector named "x", a matrix named "y", and a data frame named "z". Notice the `m` and `d` objects were created in the two-dimensional data section earlier in this chapter.
```{r}
l <- list(x = c(1, 2, 3),
y = m,
z = d)
l
```
We can refer to list components by their order number or name (if present). To use order number, use indexing brackets. Single brackets returns a list. Double brackets return the component itself.
```{r}
# Second element returned as list
l[2]
# Second element returned as itself (matrix)
l[[2]]
```
Use the `$` operator to refer to components by name. This returns the component itself.
```{r}
l$y
```
Finally, it is worth noting that a data frame is a special case of a list consisting of components with the same length. The `is.list()` function returns `TRUE` if an object is a list and `FALSE` otherwise.
```{r}
# Object d is data frame
d
str(d)
# ...And a data frame is a list
is.list(d)
```