Commit d427f2f

Merge branch 'allocate2'
2 parents 12fe2b2 + 0ae6944 commit d427f2f

5 files changed: 82 additions and 3 deletions

docs/_quarto.yml

Lines changed: 2 additions & 1 deletion
@@ -3,11 +3,12 @@ project:
 
 website:
   title: pysps
+  site-url: https://marberts.github.io/pysps
   description: Sequential Poisson sampling in Python
   navbar:
     logo: logo.png
     left:
-      - file: index.qmd
+      - file: get-started.qmd
         text: Getting started
       - href: reference/index.qmd
         text: References

docs/get-started.qmd

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+---
+title: Drawing a sequential Poisson sample
+---
+
+Sequential Poisson sampling is a variation of Poisson sampling for drawing probability-proportional-to-size samples with a given number of units. It’s a fast, simple, and flexible method for sampling units proportional to their size, and is often used for drawing a sample of businesses. The purpose of this vignette is to give an example of how the functions in this package can be used to easily draw a sample using the sequential Poisson method.
+
+## Drawing a sample of businesses
+
+Consider the problem of drawing a sample of businesses in order to measure the value of sales for the current quarter. The frame is a business register that gives an enumeration of all businesses in operation, along with the revenue of each business from the previous year and the region in which they are headquartered.
+
+```{python}
+import pandas as pd
+import numpy as np
+import pysps
+
+rng = np.random.default_rng(1234)
+
+frame = pd.DataFrame({
+    "revenue": np.round(rng.uniform(size=300) * 1000),
+    "region": np.repeat(["a", "b", "c"], 100)
+})
+
+frame.head()
+```
+
+Associated with each business is a value for its sales in the current quarter, although these values are not observable for all businesses. The purpose of drawing a sample is to observe sales for a subset of businesses, and extrapolate the value of sales from the sample of businesses to the entire population. Sales are positively correlated with last year’s revenue, and this is the basis for sampling businesses proportional to revenue.
+
+```{python}
+sales = np.round(frame.revenue * rng.uniform(size=len(frame)))
+```
+
+Budget constraints mean that it’s feasible to draw a sample of 30 businesses. Businesses operate in different regions, and so the sample will be stratified by region. This requires determining how the total sample size of 30 is allocated across regions. A common approach is to do this allocation proportional to the total revenue in each region.
+
+```{python}
+allocation = pysps.prop_allocation(
+    frame.groupby("region")["revenue"].agg("sum"),
+    n=30
+)
+
+allocation
+```
+
+With the sample size for each region in hand, it’s now time to draw a sample and observe the value of sales for these businesses. In practice this is usually the result of a survey that’s administered to the sampled units.
+
+```{python}
+res = {}
+for g, df in frame.groupby("region"):
+    pi = pysps.InclusionProb(df.revenue, allocation[g])
+    sample = pysps.OrderSample(pi)
+    res[g] = df.iloc[sample.units].assign(weight=sample.weights)
+
+survey = pd.concat(res.values())
+
+survey["sales"] = sales
+survey.head()
+```
+
+An important piece of information from the sampling process is the design weights, as these enable estimating the value of sales in the population with the usual Horvitz-Thompson estimator.
+
+```{python}
+ht = np.sum(survey.sales * survey.weight)
+ht
+```
+
+The Horvitz-Thompson estimator is (asymptotically) unbiased under sequential Poisson sampling, so it should be no surprise that the estimate is fairly close to the true (but unknown) value of sales among all businesses.
+
+```{python}
+ht / np.sum(sales) - 1
+```
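
The vignette above relies on `pysps.InclusionProb` and `pysps.OrderSample` to do the drawing. As a rough illustration of the underlying idea, here is a minimal standard-library sketch of sequential Poisson sampling followed by a Horvitz-Thompson estimate. It is not the package's implementation: the function names and the toy data are hypothetical, and the sketch assumes all sizes are positive and that no unit is large enough to be selected with certainty.

```python
import random


def sequential_poisson_sample(sizes, n, seed=None):
    """Draw n units with probability proportional to size (hypothetical helper).

    Assumes every size is positive and that n * size / total <= 1 for all
    units, so there are no take-all units to handle.
    """
    if any(x <= 0 for x in sizes):
        raise ValueError("sketch assumes strictly positive sizes")
    total = sum(sizes)
    pi = [n * x / total for x in sizes]
    if max(pi) > 1:
        raise ValueError("sketch does not handle take-all units")

    rng = random.Random(seed)
    # Rank units by a uniform random number divided by the target inclusion
    # probability; large units tend to get small ranking values.
    xi = [rng.random() / p for p in pi]
    units = sorted(range(len(sizes)), key=xi.__getitem__)[:n]
    weights = [1 / pi[i] for i in units]
    return units, weights


def horvitz_thompson(values, weights):
    """Estimate a population total as the weighted sum of sampled values."""
    return sum(y * w for y, w in zip(values, weights))


revenue = [120, 450, 80, 300, 610, 200, 90, 150]  # made-up frame
sales = [30, 110, 25, 70, 160, 40, 20, 45]        # made-up survey values
units, weights = sequential_poisson_sample(revenue, n=3, seed=1234)
estimate = horvitz_thompson([sales[i] for i in units], weights)
```

Because the weights are the reciprocals of the target inclusion probabilities, the estimate reproduces the population total exactly for any variable that is exactly proportional to size. For other variables the estimate varies from draw to draw, and the realized inclusion probabilities differ slightly from the targets, which is why the unbiasedness is only asymptotic.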

pyproject.toml

Lines changed: 8 additions & 0 deletions
@@ -21,6 +21,14 @@ dependencies = [
 ]
 requires-python = ">=3.10"
 
+[project.optional-dependencies]
+dev = [
+    "pandas",
+    "quartodoc>=0.11.0",
+    "pytest>=3",
+    "pytest-cov"
+]
+
 [project.urls]
 documentation = 'https://marberts.github.io/pysps'
 repository = 'https://github.com/marberts/pysps'

pysps/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
     "divisor_method",
 ]
 
-__version__ = "0.1.1.9002"
+__version__ = "0.1.1.9003"

pysps/allocate.py

Lines changed: 2 additions & 1 deletion
@@ -97,6 +97,7 @@ def prop_allocation(
     )
     ```
     """
+    x = dict(x)
     if not set(x.keys()).issuperset(initial.keys()):
         raise ValueError("all keys in 'initial' must also be in 'x'")
     if not set(x.keys()).issuperset(available.keys()):
@@ -114,7 +115,7 @@ def prop_allocation(
         raise ValueError("'n' must be greater than or equal to 0")
 
     upper = dict.fromkeys(x.keys(), n)
-    upper.update(available | {k: 0 for k, v in x.items() if v == 0})
+    upper.update(available | {k: 0 for k, v in x.items() if v == 0.0})
 
     if n < sum(res.values()):
         raise ValueError("'n' is smaller than initial allocation")
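
The added `x = dict(x)` line normalizes the input to a plain dictionary before the key checks, which presumably lets mapping-like inputs (such as the grouped pandas Series passed in the vignette) be handled the same way as a `dict`. The sketch below shows that normalization in a simple proportional allocation. It uses the largest-remainder method, which may well differ from what `prop_allocation` actually does (the package exports a `divisor_method`); the helper name and the stratum totals are hypothetical.

```python
import math


def largest_remainder_allocation(x, n):
    """Split n sample units across strata in proportion to x (hypothetical helper)."""
    x = dict(x)  # accept a mapping or an iterable of (key, value) pairs
    total = sum(x.values())
    if n < 0 or total <= 0 or any(v < 0 for v in x.values()):
        raise ValueError("need n >= 0, non-negative sizes, and a positive total")

    # Exact proportional share for each stratum, then round down.
    quotas = {k: n * v / total for k, v in x.items()}
    res = {k: math.floor(q) for k, q in quotas.items()}

    # Hand out the units lost to rounding, largest fractional part first.
    leftover = n - sum(res.values())
    by_remainder = sorted(x, key=lambda k: quotas[k] - res[k], reverse=True)
    for k in by_remainder[:leftover]:
        res[k] += 1
    return res


# Made-up stratum totals; stratum "c" has no size and so receives no units.
allocation = largest_remainder_allocation(
    [("a", 48000.0), ("b", 52500.0), ("c", 0.0)], n=30
)
```

In this sketch a stratum with zero size gets a zero quota and can never win a leftover unit ahead of a stratum with a positive remainder, which plays a role similar to the zero upper bound set in the second hunk.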
