SVV_COLUMNS table integration test #16
base: dev
Conversation
dmorokov commented on May 26, 2022
- Added JdbcUtil and its methods
- Added svv_columns.sql for retrieving table data
- Added SvvColumnsTest
- Added SvvColumnsRow POJO
- Added CsvUtil and its methods
- Added Gradle reporting plugin
Looks good, I added a few comments and asked a few questions.
implementation 'commons-io:commons-io:2.11.0'
implementation 'org.apache.commons:commons-collections4:4.4'
implementation 'com.google.guava:guava:31.0.1-jre'
implementation 'com.opencsv:opencsv:5.6'
Dumper uses commons-csv; using the same library might rule out some issues due to library incompatibility in special/corner cases?
Quite the opposite: if lib A has any bug or unexpected behavior, then using lib B in the tests allows us to catch that behavior, doesn't it?
It will tell us where A and B differ in behavior, and both libraries might just be using different, although valid, conventions.
Discussed in a 1-to-1 chat. No alterations required here; consider leaving it as is.
Can we please share the answer here for other reviewers?
Sure.
Paolo Morandini, Fri 11:22 AM:
I'm going to suggest sticking with opencsv and reconsidering only if we get test failures or problems due to incompatibility with commons-csv.
The assumption is that this would favor coherence with, and re-use of, the already existing test suite for the extraction tool.
It's valid for a server to return a different column set at a different time, for example, on a different server release.
This PR binds the columns into a hard-written POJO, so when that (valid) change happens, this test case will fail. So the test case is enforcing something which is not within the contract of the dumper. That concerns me a little.
If the use of OpenCSV has to do with the bean binding and POJOs, then we may need to raise this discussion by one level, to work out what contract we are actually enforcing. I would have thought that the contract is that:
- If we create a dump, and
- Run the loader over it, then
- We obtain a particular set of semantic information.
I would avoid enforcing the presence of columns which may or may not be present, and which we may or may not care about.
I think that a test should do exactly what you are saying: enforce the contract.
When a test is 'flexible' and 'adjusts on the fly', it is very likely to miss a corner case.
Obviously, a test has many traits, but I think a good test should be simple, specific, and understandable. So here we are talking about a test being specific about its data, structure, etc.
I read this as: no changes are required here.
Please confirm or resolve the conversation.
The contract we aim to enforce is that the loader ends up with a particular set of semantic information, not that a test which selects from the same table obtains the same rows. I think this is the key issue about this entire pull-request chain, and I think this thread is not resolved.
Force-pushed from e49fd30 to a7987ee
Force-pushed from 8d6aeaa to 38729b7
Can you please update the description to explain the purpose of this code?
* @param dbMultiset List of items extracted from the DB
* @param csvMultiset List of items uploaded from Avro
*/
public static void assertMultisetsEqual(
Can we use Maps.difference() on a Map<T, Integer>?
Sure we can, but what's the reason behind it?
We used this approach for testing the Teradata dwh-assessment-extraction-tool and faced zero issues with it, even on large amounts of data.
Every line of code we have under maintenance costs us dollars and hours. Using pre-existing, tested code saves us money both now and in the long run. The objective of programming is to write code. The objective of good engineering is to achieve results, preferably while not writing code. So the less code we have, the better.
I think you want something like
[Immutable]Map<K, Integer> m = Maps.toMap(multiset, multiset::getCount)
And then make assertions on Maps.difference(m0, m1).
The code here, in addition to having maintenance cost, is very expensive in terms of remove() calls, and the error messages it finally produces are not as clear as those which can easily be generated from a MapDifference object.
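A minimal sketch of that suggestion, assuming Guava is on the classpath (note the actual Multiset method is count(Object), not getCount, and the helper name below is hypothetical):

import com.google.common.collect.ImmutableMap;
import com.google.common.collect.MapDifference;
import com.google.common.collect.Maps;
import com.google.common.collect.Multiset;

final class MultisetDiffSketch {

  // Turn each multiset into a map from distinct element to its count,
  // then let Guava compute the difference between the two maps.
  static <T> MapDifference<T, Integer> diff(Multiset<T> db, Multiset<T> csv) {
    ImmutableMap<T, Integer> dbCounts = Maps.toMap(db.elementSet(), db::count);
    ImmutableMap<T, Integer> csvCounts = Maps.toMap(csv.elementSet(), csv::count);
    return Maps.difference(dbCounts, csvCounts);
  }
}

Assertions can then inspect entriesOnlyOnLeft(), entriesOnlyOnRight(), and entriesDiffering() on the returned MapDifference.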
Conclusion
I've tried this approach with Maps.difference() and it doesn't give me the desirable result.
Because it keeps custom objects as keys and instance counters as values, this algorithm misses pretty obvious test cases, so the test runs and indicates a false-positive result.
I had lists A and B; I removed A[0], A[1], and B[last], and the map-based algorithm said that the diff was in only 1 element, when in fact there were 3 mutually exclusive elements.
The original code isn't a piece of art, but it's fine-tuned and it does its job. Given that we are not going to do any sort of performance testing with these test scenarios, I don't see any potential risk in this piece of code, even though it's not perfect.
Point
I think that 1 custom method, effectively 8 lines of code, 4 of which are pretty much logging and reporting to the CLI, would not require too much attention, and the costs wouldn't be so dramatically high.
Force-pushed from 38729b7 to 6be870a
@shevek-google I don't understand what I should update in the commit message. I think that all the main points of this PR are already there. Please let me know if any specific point should be included.
Force-pushed from 775e185 to 3889805
Hi, this is definitely moving in the right direction, I feel like we're into second-layer comments now, which is great, thank you.
implementation 'com.google.protobuf:protobuf-gradle-plugin:0.8.18'
implementation 'com.google.protobuf:protobuf-java:3.20.1'
annotationProcessor 'com.google.auto.value:auto-value:1.9'
compileOnly 'com.google.auto.value:auto-value-annotations:1.9'
We have typically set dependency versions globally using a Gradle build convention. Can we please follow that convention for these submodule(s)?
As of right now, this test automation framework isn't included as a submodule in the main project.
There are several reasons for that:
- We don't want to overlap in dependencies or cross-dependencies
- We use different libs (OpenCSV vs. Commons-CSV)
- We may have different core test frameworks: TestNG and JUnit
- We may have different reporting and logging implementations
- In accordance with industry best practice, integration tests are considered a separate project, and they don't necessarily have to be integrated with the code of the product. If these were component or unit tests, then yes, submoduling and bundling would be a valid scenario.
(2) and (3) are review comments elsewhere; our convention is to use commons-csv and junit.
(4) should be a review comment; we would like this code to follow the conventions of the repository.
(5) Gradle would normally represent an integration test as a TestSet, and this practice is sufficiently standard that it is now part of Gradle core.
Force-pushed from 5661f9c to d5e6e7e
Force-pushed from d5e6e7e to 32e7f6e
Force-pushed from 8738593 to 4e7e799
Force-pushed from 4e7e799 to 7b499ed
assertThat(dbColumnHeaders.size()).isEqualTo(csvColumnHeaders.size());
assertThat(dbColumnHeaders).containsExactlyElementsIn(csvColumnHeaders);
assertDbCsvDataEqual(dbMultiset, csvMultiset);
I was wondering if this is equivalent to
Truth.assertThat(Multisets.difference(dbMultiset, csvMultiset))
    .containsExactlyElementsIn(Multisets.difference(csvMultiset, dbMultiset));
OK, let's run an experiment.
There are 4 possible cases for this assert:
- A and B are equal
- A has plus or minus some entries
- B has plus or minus some entries
- A and B have mutually exclusive entries
Now, with this output, can you clearly understand what happened in the test?
value of: multiset.onlyElement()
expected: {numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}
but was : {numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}
And can you with this one?
DB view and CSV file have mutually exclusive row(s)
DB view has 1 different row(s):
{numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}
CSV file has 1 different row(s):
{numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}
Conclusion
Does the usage of this built-in comparison method provide any benefit for tests?
I don't think so.
We save 20 lines of code, sure, but we lose a ton of output detail and visibility.
I really cannot understand why we are fighting so hard over a 20-line code fragment and are willing to sacrifice output readability for the number of code lines.
So, yes, the method does work.
However, this is a push-back from my side. As a QA engineer, dealing with unreadable test output is worse than having a 20-line code block.
The following considerations apply:
- widely used libraries are less likely to have bugs
- the same applies for performance
- it's easier for new contributors to understand what's going on (under the assumption that they know the libraries in use)
- the output format expected: {} \n but was: {} is very common in test failure reports (some tools might even look for this pattern)
If we want to keep the custom failure reporting, we could at least use Multisets.difference instead of doing it manually by iterate-and-remove.
- libs are more likely to have bugs compared to Java native methods
- I tried both methods, the custom one and the one you provided above; no performance difference on 20k entries
- agreed, it should be easier. The problem in this case is that easier doesn't mean better. As Shevek said, we develop code that does what we need; it solves computations or problems, and here is an example of solving a problem with the understandability of long log outputs.
- again, I agree that it's very common, but we are talking about different code properties: common vs. convenient-and-readable. A test must do what it's supposed to do: verify, and provide a report that's easy to read.
@paolomorandini Still, from the first log output above, can you recognize and distinguish whether the DB or the CSV has more or fewer entries, or both?
So Truth.assertThat(dbMultiset).containsExactlyElementsIn(csvMultiset) has been excluded because of too much output being produced (it would have been possible to solve this with an override if Multiset.toString weren't final).
Should assertDbCsvDataEqual be refactored to use Multisets.difference(a, b) instead of manually removing elements with b.forEach(a::remove)?
Yes, the reduction itself can be refactored.
Although, keep in mind that Multisets.difference() is marked as a Beta method:
@Beta
public static <E extends @Nullable Object> Multiset<E> difference(
    Multiset<E> multiset1, Multiset<?> multiset2) {
which stands for:
Signifies that a public API (public class, method or field) is subject to incompatible changes, or even removal, in a future release. An API bearing this annotation is exempt from any compatibility guarantees made by its containing library. Note that the presence of this annotation implies nothing about the quality or performance of the API in question, only the fact that it is not "API-frozen."
It is generally safe for applications to depend on beta APIs, at the cost of some extra work during upgrades. However it is generally inadvisable for libraries (which get included on users' CLASSPATHs, outside the library developers' control) to do so.
@paolomorandini, I can't understand why we would implement the beta functionality of a library, which is subject to at least some alterations. It doesn't look like a step that adds stability to our framework. Quite the opposite, it may require some maintenance.
However, I see that there is a strong opinion on using libraries, so I'll implement the corresponding changes, but I've already highlighted the possible risks as well.
Done.
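For reference, a minimal sketch of what that refactoring could look like, assuming JUnit and Guava; the report wording is illustrative rather than the PR's actual code:

import static org.junit.Assert.fail;

import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;

public static <T> void assertDbCsvDataEqual(Multiset<T> dbMultiset, Multiset<T> csvMultiset) {
  // Multisets.difference() returns a view of the entries present in the
  // first argument but not in the second, so both directions are checked.
  Multiset<T> dbOnly = Multisets.difference(dbMultiset, csvMultiset);
  Multiset<T> csvOnly = Multisets.difference(csvMultiset, dbMultiset);
  if (dbOnly.isEmpty() && csvOnly.isEmpty()) {
    return;
  }
  StringBuilder report = new StringBuilder("DB view and CSV file differ.\n");
  if (!dbOnly.isEmpty()) {
    report.append("DB view has ").append(dbOnly.size()).append(" different row(s):\n");
    dbOnly.forEach(row -> report.append(row).append('\n'));
  }
  if (!csvOnly.isEmpty()) {
    report.append("CSV file has ").append(csvOnly.size()).append(" different row(s):\n");
    csvOnly.forEach(row -> report.append(row).append('\n'));
  }
  fail(report.toString());
}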
Force-pushed from 42a0b6e to 663d9db
Can we please include some instructions on how to run these tests and what contracts they aim to enforce?
My major concern now is that these tests are enforcing an unstable contract, and that we should in fact be enforcing a contract on the output of the loader, given a dump.
The Kokoro build script invokes the Dumper, and the Dumper extracts metadata to a local folder on a build worker; the path to that folder is stored in the EXPORT_PATH env var. After that, the integration tests are run and pointed at EXPORT_PATH to read the CSV files from there. The logic of the integration tests is to make an SQL request to the DB, extract the data and hold it in memory, then read the CSV files, deserialize them, and compare them with the DB extraction to verify that the two data sets are field-to-field equal.
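As an illustration only, the flow might look roughly like the sketch below; the helper names (queryDbRows, readCsvRows) and the Row stub are hypothetical stand-ins for the PR's JdbcUtil, CsvUtil, and SvvColumnsRow:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

public class SvvColumnsFlowSketch {

  // Hypothetical stand-in for the SvvColumnsRow POJO.
  static class Row {}

  public void compareDbAgainstDumperCsv() {
    // Kokoro has already run the Dumper; its CSV output lives under EXPORT_PATH.
    String exportPath = System.getenv("EXPORT_PATH");

    Multiset<Row> dbRows = queryDbRows("svv_columns.sql"); // SQL request to the DB
    Multiset<Row> csvRows = readCsvRows(exportPath);       // deserialize the CSV files

    assertDbCsvDataEqual(dbRows, csvRows);                 // field-to-field comparison
  }

  // Stubs only; the real logic would live in the JDBC and CSV utilities.
  private Multiset<Row> queryDbRows(String sqlFile) { return HashMultiset.create(); }
  private Multiset<Row> readCsvRows(String path) { return HashMultiset.create(); }
  private void assertDbCsvDataEqual(Multiset<Row> db, Multiset<Row> csv) {}
}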
Force-pushed from 663d9db to 03e12df