Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PNPM: Handle projects with nested subprojects #9863

Merged
merged 2 commits into from
Jan 31, 2025

Conversation

oheger-bosch
Copy link
Member

This PR changes the parsing of the PNPM list output to handle the special case of nested projects; here the command does not produce syntactically correct JSON. This is related to #9784.

I ran a scan with these changes on an internal project that was failing before, and the generated reports looked at least reasonable. However, I am no expert for PNPM and thus cannot be sure whether the nested project is actually handled correctly. The successful fun tests should prove that no unrelated functionality was broken.

@oheger-bosch oheger-bosch requested a review from a team as a code owner January 29, 2025 15:40
Copy link

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.21%. Comparing base (26381f0) to head (54167a6).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...package-managers/node/src/main/kotlin/pnpm/Pnpm.kt 62.50% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9863      +/-   ##
============================================
+ Coverage     68.12%   68.21%   +0.08%     
- Complexity     1292     1293       +1     
============================================
  Files           250      250              
  Lines          8840     8849       +9     
  Branches        917      920       +3     
============================================
+ Hits           6022     6036      +14     
+ Misses         2431     2424       -7     
- Partials        387      389       +2     
Flag Coverage Δ
funTest-docker 65.14% <75.00%> (+<0.01%) ⬆️
funTest-non-docker 33.37% <ø> (ø)
test-ubuntu-24.04 35.98% <33.33%> (+0.22%) ⬆️
test-windows-2022 35.96% <33.33%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

JSON.decodeToSequence<List<ModuleInfo>>(
ByteArrayInputStream(json.toByteArray()),
DecodeSequenceMode.WHITESPACE_SEPARATED
).toList().flatten()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, looking at [1] it seems that "nested project" is just a project which is in a descendant dir of another project.
If ORT analyzes the parent dir or any ancestor, IIUC it would find both definition files / projects and analyze them separately.

Calling flatten, as above, I'd expect the dependencies of the nested project to become dependencies of the outer one. Why is this intended?

(I wonder if instead of flatten(), first() has to be called. (So that the inner project gets ignored, as it will be analyzed separately anyway)

[1] https://github.com/oss-review-toolkit/ort/tree/fbb3a47bb65f25c0b67c4d312cd0adb3e29960c9/plugins/package-managers/node/src/funTest/assets/projects/synthetic/pnpm/nested-project

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, looking at [1] it seems that "nested project" is just a project which is in a descendant dir of another project.
If ORT analyzes the parent dir or any ancestor, IIUC it would find both definition files / projects and analyze them separately.

BTW, some analyzers already have code to filter out definition files from nested vendor directories, which is a similar case, also see #9864.

So maybe also for PNPM, nested definition file should be filtered out, as the pnpm CLI takes care by itself to regard them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, some analyzers already have code to filter out definition files from nested vendor directories, which is a similar case, also see #9864.

Can you give a hint why you think this is a similar case? (I wonder whether these projects are actually independent besides being in subdirs)

So maybe also for PNPM, nested definition file should be filtered out, as the pnpm CLI takes care by itself to regard them?

There is also the option of disregarding the nested subproject wrt to a definition file resolveDependencies().

Note: For workspaces, it is necessary to skip the analysis of the individual definition files and analyze them from the workspace root in some cases, because the workspace root is where the lockfile resides. How does the lockfile situation look here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have run tests with both approaches, using flatten() and first(), against the real project that was failing before. The results are almost identical; the only difference is that in case of flatten() the nested project shows up in the list of projects in the ORT result, while this is not the case with first().

So, what do you prefer? For me both options would be fine as long as the Analyzer run is successful.

On a side note, is it safe to use first()? Can we assume that there is a defined order in the output of PNPM?

Copy link
Member

@fviernau fviernau Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what do you prefer?

I prefer to analyze the nested project separately, if possible (consider lockfile location), and thus use first / single approach.

On a side note, is it safe to use first()? Can we assume that there is a defined order in the output of PNPM?

Note sure, but it certainly should be possible to find the single item which corresponds to the project in the current working dir. I expect this to be simpler and more correct?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preferences of the reviewers are somewhat contradictory with regard to the flatten() vs first() approach.
Would the following be an approach everybody can agree with:

  • parsePnpmList() just returns a Sequence<List<ModuleInfo>>. Then no information is lost at this point.
  • PNPM filters the sequence for the list of paths of the top-level project and processes only those.

@fviernau, it is not clear to me based on which criteria this filtering can be done. How can the list that corresponds to the definition file be identified? If there is only a single path involved, I would go for the shortest path, but if there are multiple, I am not sure how to handle this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following be an approach everybody can agree with:

That would work for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PNPM filters the sequence for the list of paths of the top-level project and processes only those.

As long as this filtering is done, so that the nested projects deps are not considered for the out project, it's fine for me.

nit: Regarding the sequence return type. Does it safe any JSON parsing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the sequence return type. Does it safe any JSON parsing?

The docs of decodeToSequence indicate so: "Transforms the given stream into lazily deserialized sequence of elements of type T".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Regarding the sequence return type. Does it safe any JSON parsing?

I guess this depends on how the filtering logic can be implemented. If all elements of the sequence have to be processed and compared to each other to find the correct one, then probably not.

Copy link
Member

@sschuberth sschuberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this seemingly is a regression, did you verify that new results for nested projects match the ones from before the regression?

result.projects.map { it.id } shouldContainExactlyInAnyOrder listOf(
Identifier("PNPM::pnpm-nested-sub-project:1.0.0"),
Identifier("PNPM::pnpm-package-with-nested-project:1.0.0")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the test similar to the Resolve workspace dependencies correctly in YarnFunTest, so that it uses an entire expected result file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If the approach is to ignore the entries for the nested project, I'm ok to not add this test)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test has been changed.

@oheger-bosch
Copy link
Member Author

Since this seemingly is a regression, did you verify that new results for nested projects match the ones from before the regression?

Yes, I compared the new results against the ones from the last successful run of the affected project (scanning the same revision). In fact, the new results now contain significantly more NPM packages (472 vs 194). Since the processing logic of PNPM has been fully rewritten, I cannot easily tell what is causing this difference. And given the size of the dependency graph, it is also not trivial to further analyse the changes.

@sschuberth
Copy link
Member

In fact, the new results now contain significantly more NPM packages (472 vs 194).

But IIUC, the first() vs. flatten() discussion from above does not have an impact on this discrepancy, correct?

@oheger-bosch
Copy link
Member Author

In fact, the new results now contain significantly more NPM packages (472 vs 194).

But IIUC, the first() vs. flatten() discussion from above does not have an impact on this discrepancy, correct?

Yes, this is correct. Both approaches return the larger set of dependencies.

@oheger-bosch
Copy link
Member Author

Test failures seem to be unrelated.

@fviernau fviernau dismissed their stale review January 30, 2025 14:56

addressed,


val moduleInfos = parsePnpmList(input)

moduleInfos.toList() shouldContainExactly expectedResults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...you can omit the .toList() conversion here and import io.kotest.matchers.sequences.shouldContainExactly instead of the collection version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Both test cases now operate on sequences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, looks like this now leads to "This sequence can be consumed only once"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, switched back to returning a List.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably lost track a bit, but I'm surprised that the test did not need to change again. Anyway, I'll approve 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's because parsePnpmList() is tested, not listModules().

val moduleInfoIterator = iterator()
val first = moduleInfoIterator.nextOrNull() ?: return emptyList()

fun List<ModuleInfo>.matchesWorkingDir() = any { File(it.path).absoluteFile == workingDir }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following do the same?

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  first { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This will not handle the corner case that no element in the sequence satisfies the condition. The current code then returns the first element as fallback, while the proposed one will fail, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, and this one?

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  }.orEmpty()

Copy link
Member

@fviernau fviernau Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another tries:

If we want to assume that there is it least one item:

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  } ?: first()

Otherwise:

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  } ?: firstOrNull().orEmpty()

I hope it's equivalent now?!

On the other hand: I cannot really follow the fear why the condition File(it.path).absoluteFile == workingDir would not work in some unknown cases. I mean, the items do have a path property. And thus the condition should work. I would actually prefer to make the assumption that there is a match, but would also like to hear @sschuberth if he agrees too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually prefer to make the assumption that there is a match, but would also like to hear @sschuberth if he agrees too.

I currently don't have the time to think about this filter logic in detail.

* analyzed. This function tries to detect the corresponding [ModuleInfo]s based on the [workingDir]. If this is not
* possible, as a fallback the first list of [ModuleInfo] objects is returned.
*/
private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider naming this findModulesFor

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function was renamed.

val moduleInfoIterator = iterator()
val first = moduleInfoIterator.nextOrNull() ?: return emptyList()

fun List<ModuleInfo>.matchesWorkingDir() = any { File(it.path).absoluteFile == workingDir }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, and this one?

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  }.orEmpty()

@oheger-bosch
Copy link
Member Author

@fviernau: Well, this is still not equivalent to the existing code.

My line of thinking is as follows: I am not 100 percent sure whether the condition File(it.path).absoluteFile == workingDir to find the outer project will work in all constellations. Therefore, if there is no match, I want to at least return some paths that can be analysed. Obviously, in the vast majority of cases, PNPM will return only a single list of paths - otherwise, this regression had been detected earlier. So, in this case, it seems reasonable to do the analysis based on this list. For special cases, like nested projects, returning some list seems to be better than nothing. Or would you want to skip the analysis then completely?

val moduleInfoIterator = iterator()
val first = moduleInfoIterator.nextOrNull() ?: return emptyList()

fun List<ModuleInfo>.matchesWorkingDir() = any { File(it.path).absoluteFile == workingDir }
Copy link
Member

@fviernau fviernau Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another tries:

If we want to assume that there is it least one item:

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  } ?: first()

Otherwise:

private fun Sequence<List<ModuleInfo>>.findModules(workingDir: File): List<ModuleInfo> = 
  firstOrNull { list ->
      list.any { File(it.path).absoluteFile == workingDir }
  } ?: firstOrNull().orEmpty()

I hope it's equivalent now?!

On the other hand: I cannot really follow the fear why the condition File(it.path).absoluteFile == workingDir would not work in some unknown cases. I mean, the items do have a path property. And thus the condition should work. I would actually prefer to make the assumption that there is a match, but would also like to hear @sschuberth if he agrees too.

@oheger-bosch
Copy link
Member Author

@fviernau, it is not guaranteed that the Sequence can be iterated multiple times; therefore, calling first() as fallback can fail. This is the reason why I am using an Iterator; I could not find a better solution that handles all cases.

Please, this issue is bugging us. Could we defer the "find the most elegant Kotlin solution contest" to a later time?

@fviernau
Copy link
Member

Could we defer the "find the most elegant Kotlin solution contest" to a later time?

You will do it?

fviernau
fviernau previously approved these changes Jan 31, 2025
@sschuberth
Copy link
Member

The unrelated test failure will get addressed by #9879.

sschuberth
sschuberth previously approved these changes Jan 31, 2025
@oheger-bosch
Copy link
Member Author

Rebased on main to get the fixes for the failing tests.

@sschuberth sschuberth enabled auto-merge (rebase) January 31, 2025 10:07
If PNPM encounters nested projects, the output of the `list` command
is not well-formed JSON, but consists of multiple arrays. Change the
`parsePnpmList()` function to handle this format correctly.

In `Pnpm`, only analyze the top-level project, since nested projects
will be handled by the `pnpm` command transparently.

Fixes oss-review-toolkit#9784.

Signed-off-by: Oliver Heger <[email protected]>
The test verifies that nested projects are now handled correctly.

Signed-off-by: Oliver Heger <[email protected]>
@oheger-bosch oheger-bosch dismissed stale reviews from sschuberth and fviernau via 54167a6 January 31, 2025 12:00
@sschuberth sschuberth merged commit fe56fb4 into oss-review-toolkit:main Jan 31, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants