Skip to content

Conversation

@shahronak47
Copy link
Contributor

@shahronak47 shahronak47 commented May 23, 2024

Hi Andres,

This is the first draft of vectorize povline. I just want to confirm the expected output. To test this install vectorize-povline branch in wbpip using -

metapip::install_branch("wbpip", "vectorize-povline")

You can then test -

devtools::load_all()

out1 <- pip(country = "AGO",
     year = 2000,
     povline = 1.9,
     lkup = lkups$versions_paths$`20220810_2017_01_02_TEST`)

out2 <- pip(country = "AGO",
            year = 2000,
            povline = 1.675,
            lkup = lkups$versions_paths$`20220810_2017_01_02_TEST`)         

out3 <- pip(country = "AGO",
            year = 2000,
            povline = c(1.675, 1.9),
            lkup = lkups$versions_paths$`20220810_2017_01_02_TEST`)

identical(rbind(out2, out1), out3)

@randrescastaneda
Copy link
Member

Hi @shahronak47 ,

I tested the output and it is looking very nice. Yes, this is what I had in mind. Thank you.

So, These would be the next steps

  1. Vectorize properly prod_md_compute_pip_stats() and prod_gd_compute_pip_stats() in prod_compute_pip_stats to avoid the lapply() function.
  2. make sure it is also implemented in fillgaps, (fg_pip())
  3. Make sure it is implemented in aggregate data, in (pip_grp_logic)
  4. Add tests.

Copy link
Member

@randrescastaneda randrescastaneda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shahronak47 ,

There seems to be a problem when more than one poverty line is requested for the case of CHN. See code below

  out <- pip(country = "CHN",
              year = c(2010, 2018),
              povline = c(1.675, 1.9),
              lkup = lkup)

Thanks.

@shahronak47
Copy link
Contributor Author

Hi @randrescastaneda , I have pushed the fixes and now all the test cases from #413 pass. You can do some more round of testing if you have time or else we will wait till mid-September before merging this PR. Thank you for providing with the test file, it was very helpful.

@randrescastaneda
Copy link
Member

hi @shahronak47 ,
Thank you so much, the tests at the country level are passing. However, I added a few tests at the aggregate level (using pip_grp_logic()) and they are not passing yet. I added the tests in #421. could you please make sure those tests pass too? Thank you so much!
Best,

@shahronak47
Copy link
Contributor Author

Thanks @randrescastaneda for catching that and adding the test. I have done the necessary changes and now the tests should pass.

@shahronak47
Copy link
Contributor Author

Switch to collapse::join as it is much faster. Tested on this example pip(country = c("AGO", "USA"), year = 2000, lkup = lkup)


microbenchmark::microbenchmark(
  duckplyr = dplyr::inner_join(
  x = master_file,
  y = a |>
    collapse::fselect(country_code, reporting_year, is_interpolated, welfare_type),
  by = c("country_code", "reporting_year", "is_interpolated", "welfare_type")), 
  collapse = collapse::join(
  master_file,
  collapse::fselect(a, country_code, reporting_year, is_interpolated, welfare_type),
  on = c("country_code", "reporting_year", "is_interpolated", "welfare_type"),
  how = "inner", 
  overid = 2
)
)

Unit: microseconds
     expr      min        lq       mean    median        uq      max neval
 duckplyr 211168.8 223155.20 271348.656 247165.40 307077.15 494288.0   100
 collapse    420.7    664.25    887.706    874.85   1090.55   1473.5   100

@randrescastaneda
Copy link
Member

randrescastaneda commented May 5, 2025

Hi @shahronak47 ,

I tested this version and it is faling on adding new lines to the master when old lines are present in the request. See the examples below.

Thanks.

# SET UP
devtools::load_all(".")
library(fastverse)
force <- FALSE
if (!"lkups" %in% ls() || isTRUE(force)) {
  data_dir <- Sys.getenv("PIPAPI_DATA_ROOT_FOLDER_LOCAL") |>
    fs::path()
  fs::dir_ls(data_dir, recurse = FALSE)
}


latest_version <-
  available_versions(data_dir) |>
  max()

lkups <- create_versioned_lkups(data_dir,
                                vintage_pattern = "2025")

lkup <- lkups$versions_paths[[lkups$latest_release]]



# PROBLEM

# this works because all numbers are new
pls <- runif(20, min = 5, max = 25)
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)

waldo::compare(unique(col$poverty_line),
          pls,
          tolerance = 1e-10)

# thus, it  saves the data correctly and retunrs from cache
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)


# however, when new pov lines and old povlines are included in the request, 
# only the old ones are returned and the new ones are NOT saved
pls <- seq(10, 30, .1)
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)

waldo::compare(unique(col$poverty_line),
               pls,
               tolerance = 1e-10)

# So this is processed again and still does not work
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)
unique(col$poverty_line)

Copy link
Member

@randrescastaneda randrescastaneda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shahronak47 ,

Thank you for the changes. Unfortunately, it is still not working. Pleas take a look at the code below. You can see that both calls take almost the same time. it seems as if the duckdb file is not updated, the same call is processed again. also, note that I modified a little de DESCRIPTION file, but it is completely independent from this. THANKS.

# remotes::install_github("PIP-Technical-Team/pipapi@vectorize-povline")

# SET UP
devtools::load_all(".")
library(fastverse)
force <- FALSE
if (!"lkups" %in% ls() || isTRUE(force)) {
  data_dir <- Sys.getenv("PIPAPI_DATA_ROOT_FOLDER_LOCAL") |>
    fs::path()
  fs::dir_ls(data_dir, recurse = FALSE)
}


latest_version <-
  available_versions(data_dir) |>
  max()

lkups <- create_versioned_lkups(data_dir,
                                vintage_pattern = "2025")

lkup <- lkups$versions_paths[[lkups$latest_release]]



# PROBLEM

# this works because all numbers are new
pls <- runif(20, min = 5, max = 25)
tictoc::tic()
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)
t1 <- tictoc::toc()
waldo::compare(unique(col$poverty_line),
               pls,
               tolerance = 1e-10)

# It takes the same time to process everything again. 
tictoc::tic()
col <- pip(country = "COL",
           year = 2020:2025,
           povline = pls,
           lkup = lkup)
t2 <- tictoc::toc()


t1$callback_msg
t2$callback_msg

@randrescastaneda randrescastaneda merged commit 5ac9b8d into DEV May 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants