Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-931] [Bug] Values in seeds that should convert to null aren't working for session connection method #901

Open
2 tasks done
vakarisbk opened this issue Oct 4, 2023 · 3 comments
Labels
bug Something isn't working help_wanted Extra attention is needed

Comments

@vakarisbk
Copy link

vakarisbk commented Oct 4, 2023

Is this a new bug in dbt-spark?

  • I believe this is a new bug in dbt-spark
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Null values in csv seeds appear as string "None" or empty string in the materialized table when using Spark Session connection or the Spark Connect method in development on #899

Expected Behavior

Null values in seeds should be materialized as null values.

Steps To Reproduce

  1. Create a seed with null values

seed_null.csv:

type,value1,value2
empty,,
doublequotes,""," "
ticks,'',''
spaces, ,
stringnone,None,None
stringnull,Null,Null
nothing
  1. Create a model for testing null values:

test_null.sql:

SELECT
  type,
  value1,
  CASE
    WHEN value1 IS NULL THEN 'Actual NULL'
    WHEN value1 = 'None' THEN 'String "None"'
    WHEN value1 = '' THEN 'Empty String'
    ELSE 'Other'
  END AS value1_type,
  value2,
  CASE
    WHEN value2 IS NULL THEN 'Actual NULL'
    WHEN value2 = 'None' THEN 'String "None"'
    WHEN value2 = '' THEN 'Empty String'
    ELSE 'Other'
  END as value2_type
FROM {{ ref('seed_null') }}
  1. Run the model:
  2. Query the created table:
spark.sql("SELECT * from test_null").show()

+------------+------+-------------+------+-------------+
|        type|value1|  value1_type|value2|  value2_type|
+------------+------+-------------+------+-------------+
|     nothing|  None|String "None"|  None|String "None"|
|doublequotes|  None|String "None"|  None|String "None"|
|       empty|  None|String "None"|  None|String "None"|
|  stringnone|  None|String "None"|  None|String "None"|
|  stringnull|  None|String "None"|  None|String "None"|
|      spaces|  None|String "None"|  None|String "None"|
|       ticks|      | Empty String|      | Empty String|
+------------+------+-------------+------+-------------+

Relevant log output

No response

Environment

- OS:
- Python: 3.11.5
- dbt-core: 1.6.5
- dbt-spark: 1.6.0

Additional Context

No response

@vakarisbk vakarisbk added bug Something isn't working triage labels Oct 4, 2023
@github-actions github-actions bot changed the title [Bug] [PySpark Session] Null values in seeds appear as string "None" or empty strings in the resulting table [ADAP-931] [Bug] [PySpark Session] Null values in seeds appear as string "None" or empty strings in the resulting table Oct 4, 2023
@dbeatty10
Copy link
Contributor

Thanks for reaching out @vakarisbk ! And thanks for providing such a well-constructed example 🏆

Could you try this?

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

This worked for me, and here's the output that I got:

16:35:50  Running with dbt=1.6.5
16:35:50  Registered adapter: spark=1.6.2
16:35:50  Unable to do partial parsing because a project config has changed
16:35:51  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:35:51  
16:35:55  Concurrency: 1 threads (target='http')
16:35:55  
16:35:55  1 of 2 START seed file dbt_dbeatty.seed_null ................................... [RUN]
16:36:03  1 of 2 OK loaded seed file dbt_dbeatty.seed_null ............................... [CREATE 7 in 7.76s]
16:36:03  2 of 2 START sql view model dbt_dbeatty.test_null .............................. [RUN]
16:36:05  2 of 2 OK created sql view model dbt_dbeatty.test_null ......................... [OK in 2.42s]
16:36:06  
16:36:06  Finished running 1 seed, 1 view model in 0 hours 0 minutes and 15.45 seconds (15.45s).
16:36:06  
16:36:06  Completed successfully
16:36:06  
16:36:06  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
16:36:08  Running with dbt=1.6.5
16:36:08  Registered adapter: spark=1.6.2
16:36:08  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:36:08  
16:36:11  Concurrency: 1 threads (target='http')
16:36:11  
16:36:13  Previewing inline node:
| type         | value1 | value1_type   | value2 | value2_type   |
| ------------ | ------ | ------------- | ------ | ------------- |
| doublequotes |        | Actual NULL   |        | Actual NULL   |
| empty        |        | Actual NULL   |        | Actual NULL   |
| nothing      |        | Actual NULL   |        | Actual NULL   |
| spaces       |        | Actual NULL   |        | Actual NULL   |
| stringnone   | None   | String "None" | None   | String "None" |
| stringnull   |        | Actual NULL   |        | Actual NULL   |
| ticks        | ''     | Other         | ''     | Other         |

Since this looks like it's working, I'm going to close this as "can't reproduce". But please let me know if you still think this is an issue and we can consider re-opening this.

@vakarisbk
Copy link
Author

@dbeatty10 thanks for your response.
I can see from your logs that you're using target=http which I assume may be connection to a Databricks cluster. This issue only appears in the session connection (local pyspark).

I just re-ran the tests and got my originally reported results back.

@dbeatty10
Copy link
Contributor

Thank you for that insight @vakarisbk -- will re-open this!

Nice eyes on http vs. session 🦅

Reprex

Using the example here, the following commands do give the expected output when using the ✅ http connection method, but give unexpected output when using the ❌ session method:

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

Acceptance criteria

  • There are tests within dbt-adapters for various seed values that are expected to translate to null within the database.
  • These tests are inherited by the dbt-spark adapter.
  • These tests pass when the session connection method is used.

Summary

We are unable to prioritize this ourselves, so I'm going to lable this as "help wanted".

@dbeatty10 dbeatty10 reopened this Feb 9, 2024
@dbeatty10 dbeatty10 added help_wanted Extra attention is needed and removed wontfix This will not be worked on labels Feb 9, 2024
@dbeatty10 dbeatty10 changed the title [ADAP-931] [Bug] [PySpark Session] Null values in seeds appear as string "None" or empty strings in the resulting table [ADAP-931] [Bug] Values in seeds that should convert to null aren't working for session connection method Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help_wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants