Skip to content

Conversation

@shahronak47
Copy link
Contributor

@shahronak47 shahronak47 commented Jan 13, 2025

Hi Andres, I have added the code in implement-duckdb branch. First you need to set the path of duckdb file that I have in E drive.

devtools::load_all()
Sys.setenv('PIP_CACHE_FILE' = fs::path('e:/PIP/pipapi_data/demo.duckdb'))

There is a vignette called duckdb-caching.Rmd which explains the process. There are two master files rg_master_file and fg_master_file in duckdb. They are selected based on the fill_gaps argument. I have kept both of them as empty for you to test and verify the examples.

Here are examples for you to try out. Please create the lkup object to try the following code.

## 1.

pip(country = c("AGO", "USA"), year = 2000, lkup = lkup)
# Since rg_master_file was empty initially, now it should have 2 rows
DBI::dbGetQuery(con, "select * from rg_master_file")

## 2. 
pip(country = c("AGO"), year = 2000, lkup = lkup)
# This has already been calculated above so the number of rows should still remain as 2
DBI::dbGetQuery(con, "select * from rg_master_file")

## 3.
pip(country = c("AGO", "COL"), year = 2000, lkup = lkup)
# We now have a new country here so the number of rows will now be 3
DBI::dbGetQuery(con, "select * from rg_master_file")

## 4. 

pip(country = "all", year = 2000, lkup = lkup)
DBI::dbGetQuery(con, "select * from rg_master_file")

## 5. 

pip(country = "AGO", year = "all", lkup = lkup)
DBI::dbGetQuery(con, "select * from rg_master_file")

## 6. 

pip(country = "AGO", year = "all", lkup = lkup, fill_gaps = TRUE)
DBI::dbGetQuery(con, "select * from rg_master_file")
DBI::dbGetQuery(con, "select * from fg_master_file")

dbDisconnect(con)

After you have got hang of it, you can try out different pip calls and verify the caching algorithm and the result.

Copy link
Member

@randrescastaneda randrescastaneda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shahronak47,

Thank you for your PR. This is looking great. I reviewed the code and have a few points to discuss:

  1. The lines DBI::dbGetQuery(con, "select * ") cannot be executed because the con object is not available in the global environment. However, this should not be an issue if subsequent calls to pip() use the con object created in the previous call.
  2. Running pip(country = "AGO", year = "all", lkup = lkup, fill_gaps = TRUE) produces the following warning:
    Warning message:
    Connection is garbage-collected, use dbDisconnect() to avoid this.
    
  3. The idea of displaying whether the data is loaded from cache is great. However, this message should only appear in interactive mode, right?
  4. We need a separate DuckDB file for each release, and it should not be placed at the root of the folder. The cache must be release-dependent.
  5. The con object should be created in the .Rprofile when the lookups are created.

Let's discuss this further in a phone call.

Thank you.

@shahronak47
Copy link
Contributor Author

Hi Andres,

Based on our discussions here are the changes implemented -

  1. Duckdb file name changed from demo.duckdb to cache.duckdb
  2. cache.duckdb now sits in root of the release folder.
  3. A function reset_cache has been introduced to erase the data for a specific release. It requires two environment variables PIP_CACHE_LOCAL_KEY and PIP_CACHE_SERVER_KEY to be of same value for it to erase the cache. PIP_CACHE_SERVER_KEY is set at the server and PIP_CACHE_LOCAL_KEY is passed by the user.
  4. Read and write connections are separately created in pip call and they are closed immediately after their use.
  5. Set options(pipapi.query_live_data = TRUE) to bypass cache.
  6. A new endpoint (/duckdb-reset) has been introduced as well to reset the cache however, I would need to talk to you about how it will be used.

@shahronak47
Copy link
Contributor Author

Hi @randrescastaneda , one thing to note here is the latest duckplyr ( > 1.0.0) depends on dplyr so we need to import dplyr in pipapi.

@randrescastaneda
Copy link
Member

randrescastaneda commented Mar 10, 2025

Hi @randrescastaneda , one thing to note here is the latest duckplyr ( > 1.0.0) depends on dplyr so we need to import dplyr in pipapi.

OOHHH. Ok. Let's do it. It is not ideal, but it is what it is. Thanks for letting me Know.

Best,
Andres

@randrescastaneda randrescastaneda merged commit aa15bc9 into DEV Mar 19, 2025
@randrescastaneda randrescastaneda deleted the implement-duckdb branch March 19, 2025 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants