
In-place upgrade fails for BigQuery ExternalTables #19177

rshamunov commented Aug 19, 2024

Terraform Version & Provider Version(s)

Terraform v1.9.4
on linux_amd64

  • provider registry.terraform.io/hashicorp/google v5.40.0

Affected Resource(s)

google_bigquery_table

Terraform Configuration

resource "google_bigquery_table" "test-hived-table" {
  dataset_id          = "$DATASET_NAME"
  project             = "$PROJECT_ID"
  table_id            = "test-hived-table-tf"
  deletion_protection = false
  labels = {
    l1 = "hello"
  }
  external_data_configuration {
    autodetect    = false
    source_format = "CSV"
    source_uris = [
      "$BUCKET_NAME/example/*"
    ]
    max_bad_records       = 0
    ignore_unknown_values = false
    csv_options {
      field_delimiter   = ","
      skip_leading_rows = 1
      quote             = "\""
    }
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "$BUCKET_NAME/example/{day:DATE}"
      require_partition_filter = false
    }
    schema = <<EOF
[
    {
        "name": "Id",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "Weight",
        "type": "INTEGER",
        "mode": "REQUIRED"
    }
]
EOF
}
}
provider "google" {
  default_labels = {
    l1 = "hello"
  }
}

Debug Output

https://gist.github.com/rshamunov/8814cff7f2c6733061d92ca45ab36979#file-output-log

Expected Behavior

terraform apply completes an in-place update of the external table.

Actual Behavior

Error: googleapi: Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration. If the schema is present on both, the schemas must be the same., invalid

Steps to reproduce

script to reproduce


b/362528330

@rshamunov rshamunov added the bug label Aug 19, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/bigquery labels Aug 19, 2024
@ggtisc ggtisc self-assigned this Aug 20, 2024
ggtisc (Collaborator) commented Aug 21, 2024

Hi @rshamunov!

Looking at your code, I noticed that you are not using the format the Terraform Registry describes.

Try something like this (note the source_uri_prefix and source_uris arguments: both use the bucket name, but follow the Terraform Registry format):

resource "google_bigquery_dataset" "bq_ds_19177" {
  dataset_id                  = "bq_ds_19177"
  friendly_name               = "BQ DS 19177"
  description                 = "something"
  location                    = "EU"
  default_table_expiration_ms = 3600000

  labels = {
    env = "something"
  }
}

resource "google_storage_bucket" "bucket_19177" {
  name     = "bucket-19177"
  location = "US"
}

resource "google_bigquery_table" "bq_table_19177" {
  dataset_id          = google_bigquery_dataset.bq_ds_19177.dataset_id
  project             = "my-project"
  table_id            = "bq-table-19177"
  deletion_protection = false

  labels = {
    l1 = "hello"
  }

  external_data_configuration {
    autodetect    = false
    source_format = "CSV"

    source_uris = [
      "gs://${google_storage_bucket.bucket_19177.name}/*"
    ]

    max_bad_records       = 0
    ignore_unknown_values = false

    csv_options {
      field_delimiter   = ","
      skip_leading_rows = 1
      quote             = "\""
    }

    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.bucket_19177.name}/{key1:STRING}"
      require_partition_filter = false
    }

    schema = <<EOF
[
    {
        "name": "Id",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "Weight",
        "type": "INTEGER",
        "mode": "REQUIRED"
    }
]
EOF
}
}

rshamunov (Author) commented:

Hello @ggtisc.
Here is the TF configuration with the variable names replaced by the actual values from my test project:

resource "google_bigquery_table" "test-hived-table" {
  dataset_id          = "example_dataset"
  project             = "sh-unmanaged-tests"
  table_id            = "test-hived-table-tf"
  deletion_protection = false
  labels = {
    l1 = "hello_again"
  }
  external_data_configuration {
    autodetect    = false
    source_format = "CSV"
    source_uris = [
      "gs://sh-unmanaged-tests-bucket/example/*"
    ]
    max_bad_records       = 0
    ignore_unknown_values = false
    csv_options {
      field_delimiter   = ","
      skip_leading_rows = 1
      quote             = "\""
    }
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://sh-unmanaged-tests-bucket/example/{day:DATE}"
      require_partition_filter = false
    }
    schema = <<EOF
[
    {
        "name": "Id",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "Weight",
        "type": "INTEGER",
        "mode": "REQUIRED"
    }
]
EOF
}

}

provider "google" {
  default_labels = {
    l1 = "hello_again"
  }
}

Both source_uris and source_uri_prefix seem correct. I don't create the table with Terraform in the script, but that shouldn't matter: the table is created successfully, and I can query the underlying CSV data. The issue only happens when the table labels are updated (an in-place update).

ggtisc (Collaborator) commented Aug 21, 2024

Are you trying to use an example folder inside your bucket (sh-unmanaged-tests-bucket)?

rshamunov (Author) commented:

@ggtisc yes. I create a bucket sh-unmanaged-tests-bucket, a folder example/day=2024-01-30, and a CSV file inside it. Please see my script to reproduce: if you run it in GCP Cloud Shell, you end up with the mentioned error.
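For reference, the same fixture can be sketched in Terraform (a minimal sketch only: the object name and CSV content are illustrative assumptions, since the original script creates these objects outside of Terraform):

# Hypothetical equivalent of the manual setup; file name and CSV content are illustrative.
resource "google_storage_bucket" "repro" {
  name     = "sh-unmanaged-tests-bucket"
  location = "US"
}

resource "google_storage_bucket_object" "partitioned_csv" {
  # One hive-style partition folder: example/day=2024-01-30/
  name    = "example/day=2024-01-30/data.csv"
  bucket  = google_storage_bucket.repro.name
  content = "Id,Name,Weight\n1,Alice,70\n2,Bob,80\n"
}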

ggtisc (Collaborator) commented Aug 22, 2024

@rshamunov I saw it, but it seems more like a naming error.

  • Have you already tried the format from the link I shared with you?
  • I tried again following the Terraform Registry format rules, and everything succeeded without errors.

Example:

source_uris = [
      "gs://${google_storage_bucket.bucket_19177.name}/example/*"
]

source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/example/{key1:STRING}"

rshamunov (Author) commented:

@ggtisc, could you explain what exactly the "terraform registry format" is in this case? Do you want me to replace the actual bucket name with a Terraform reference? Why should it make a difference?

source_uris:
gs://${google_storage_bucket.bucket_19177.name}/example/*
gs://sh-unmanaged-tests-bucket/example/*

source_uri_prefix:
source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/example/{key1:STRING}"
source_uri_prefix = "gs://sh-unmanaged-tests-bucket/example/{day:DATE}"

Did you try creating the table, changing a table label, and applying the configuration again?

ggtisc (Collaborator) commented Aug 23, 2024

Yes @rshamunov, and after testing many times everything is okay, with no errors.

The Terraform Registry is the official documentation site for Terraform resources. Did you open the link to check the correct format?

You need to open the link and follow the correct format to avoid this kind of error.

I also gave you an example of how these resources work correctly.

rshamunov (Author) commented Aug 26, 2024

I've applied the configuration almost exactly as you provided it (changing only the bucket name and project name):

resource "google_bigquery_dataset" "bq_ds_19177" {
  dataset_id                  = "bq_ds_19177"
  friendly_name               = "BQ DS 19177"
  description                 = "something"
  location                    = "EU"
  default_table_expiration_ms = 3600000
  project                     = "sh-unmanaged-tests"

  labels = {
    env = "something"
  }
}

resource "google_storage_bucket" "bucket_19177" {
  name     = "bucket-19177-777"
  location = "US"
}

resource "google_bigquery_table" "bq_table_19177" {
  dataset_id          = google_bigquery_dataset.bq_ds_19177.dataset_id
  project             = "sh-unmanaged-tests"
  table_id            = "bq-table-19177"
  deletion_protection = false

  labels = {
    l1 = "hello_again"
  }

  external_data_configuration {
    autodetect    = false
    source_format = "CSV"

    source_uris = [
      "gs://${google_storage_bucket.bucket_19177.name}/*"
    ]

    max_bad_records       = 0
    ignore_unknown_values = false

    csv_options {
      field_delimiter   = ","
      skip_leading_rows = 1
      quote             = "\""
    }

    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.bucket_19177.name}/{key1:STRING}"
      require_partition_filter = false
    }

    schema = <<EOF
[
    {
        "name": "Id",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "Weight",
        "type": "INTEGER",
        "mode": "REQUIRED"
    }
]
EOF
}
}

As expected, it applied successfully. I then changed the label l1 ("hello" to "hello_again") and applied it again. It failed with the same error:

google_bigquery_table.bq_table_19177: Modifying... [id=projects/sh-unmanaged-tests/datasets/bq_ds_19177/tables/bq-table-19177]
╷
│ Error: googleapi: Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration. If the schema is present on both, the schemas must be the same., invalid
│ 
│   with google_bigquery_table.bq_table_19177,
│   on bq.tf line 19, in resource "google_bigquery_table" "bq_table_19177":
│   19: resource "google_bigquery_table" "bq_table_19177" {
│ 

ggtisc (Collaborator) commented Aug 27, 2024

Confirmed issue!

When google_bigquery_table.labels changes and terraform apply is run, it returns the shared error.
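In other words, with the schema kept inside external_data_configuration, the failing sequence is nothing more than a label edit (a sketch; the values are the ones used in this thread):

labels = {
  l1 = "hello"         # first terraform apply: table is created successfully
  # l1 = "hello_again" # changing the value and re-applying returns the Error 400
}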

@ggtisc ggtisc removed their assignment Aug 27, 2024
@ggtisc ggtisc removed the forward/review In review; remove label to forward label Aug 27, 2024
wj-chen commented Aug 28, 2024

Thanks a lot for the script. I was able to repro. I saw that the schema was present on both Table and ExternalDataConfiguration in the Update API request, but the values differed, which failed the validation. I've forwarded the internal version of this issue to the API team to get more insight.

wj-chen commented Sep 4, 2024

Hi @rshamunov, here is the guidance from our API team:

So what happened here is that the table creation was done with schema specified on ExternalDataConfiguration, but as part of the table creation, BQ will store the schema in the main Table message. This causes the future update to fail.

We recommend just using the schema in Table instead; this is the recommended path and avoids the previous problem.

Could you try specifying google_bigquery_table.schema instead of google_bigquery_table.external_data_configuration.schema and see if it resolves your issue?
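Concretely, that means moving the schema heredoc out of the external_data_configuration block to the resource's top level, along these lines (a sketch abridged from the config above):

resource "google_bigquery_table" "test-hived-table" {
  # ... dataset_id, project, table_id, labels as above ...

  # Schema moved to the top-level field, per the API team's guidance.
  schema = <<EOF
[
    { "name": "Id",     "type": "INTEGER", "mode": "REQUIRED" },
    { "name": "Name",   "type": "STRING",  "mode": "NULLABLE" },
    { "name": "Weight", "type": "INTEGER", "mode": "REQUIRED" }
]
EOF

  external_data_configuration {
    # ... autodetect, source_format, source_uris, csv_options, and
    # hive_partitioning_options unchanged; no schema here ...
  }
}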

rshamunov (Author) commented:

I tried defining the schema outside of external_data_configuration, in google_bigquery_table.schema, but this brought another issue: every terraform apply re-creates the table, even though there are no actual changes:

terraform apply
google_bigquery_table.test-hived-table: Refreshing state... [id=projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # google_bigquery_table.test-hived-table must be replaced
-/+ resource "google_bigquery_table" "test-hived-table" {
      ~ creation_time                   = 1725528080087 -> (known after apply)
      ~ etag                            = "JVoggigE1FbxTTSsLimbJw==" -> (known after apply)
      ~ expiration_time                 = 0 -> (known after apply)
      ~ id                              = "projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf" -> (known after apply)
      ~ last_modified_time              = 1725528080152 -> (known after apply)
      ~ location                        = "US" -> (known after apply)
      ~ num_bytes                       = 0 -> (known after apply)
      ~ num_long_term_bytes             = 0 -> (known after apply)
      ~ num_rows                        = 0 -> (known after apply)
      - require_partition_filter        = false -> null
      - resource_tags                   = {} -> null
      ~ schema                          = jsonencode(
          ~ [
                # (2 unchanged elements hidden)
                {
                    mode = "REQUIRED"
                    name = "Weight"
                    type = "INTEGER"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "day"
                  - type = "DATE"
                },
            ] # forces replacement
        )
      ~ self_link                       = "https://bigquery.googleapis.com/bigquery/v2/projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf" -> (known after apply)
      ~ type                            = "EXTERNAL" -> (known after apply)
        # (8 unchanged attributes hidden)

      ~ external_data_configuration {
          + schema                = (known after apply)
            # (6 unchanged attributes hidden)

            # (2 unchanged blocks hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Moreover, the documentation for google_bigquery_table suggests using google_bigquery_table.external_data_configuration.schema:

If you use external_data_configuration (documented at https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_table#nested_external_data_configuration) and do not set external_data_configuration.connection_id, schemas must be specified with external_data_configuration.schema. Otherwise, schemas must be specified with this top-level field.

wj-chen commented Sep 7, 2024

Thanks for the detailed follow-up. I believe the permadiff issue is the same as in #12465. The provider-side fix is under discussion but unfortunately won't be available soon. Meanwhile, the other issue has some workarounds that may unblock you for the time being.

For the resource documentation, I will check with our API team to see if that's no longer accurate.
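One generic Terraform-level mitigation for a permadiff of this shape (an editorial assumption, not necessarily one of the workarounds referenced in #12465) is to tell Terraform to ignore the server-rewritten attribute:

resource "google_bigquery_table" "test-hived-table" {
  # ... configuration as above, with the schema at the top level ...

  lifecycle {
    # Assumption: ignoring schema drift suppresses the spurious replacement
    # caused by BigQuery appending the hive-partitioning "day" column
    # server-side, at the cost of also ignoring genuine schema edits.
    ignore_changes = [schema]
  }
}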

wj-chen commented Sep 12, 2024

I received some guidance from the API team. The documentation is correct. In terms of the two issues seen here:

  1. "Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration...." when defining the schema at the external data config level, and
  2. terraform apply attempting to re-create the table when defining the schema at the table level,

I will look into potential fixes for both on the Terraform resource implementation side.

wj-chen commented Sep 20, 2024

The PR that closed this GitHub issue should fix the following problem once released in a new google provider version (the one after next, estimated to be v6.5.0):

"Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration...." when defining the schema in the external data config level

Once you upgrade the provider to that version, please continue to specify the schema inside external_data_configuration as before. Feel free to update this issue if the problem persists or you see other issues.


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2024