Improve analysis of maven based projects #7965
Comments
Looking forward to your PR, thanks a lot for looking into improving things! As a semi-related aside, I've long been aware that the Maven code is probably more complex than it needs to be by now, also see #4720.
My idea so far, which has worked, is to let Maven resolve all dependencies by itself, which it seems to do best (e.g. a normal build is also fast, even when the artifacts are not cached), and afterwards to get the repository that Maven used to download each dependency. While debugging, it seems that the DefaultArtifactResolver does set the repository correctly, but this information is lost later: the artifacts returned from the project building request do not have the repository field set, although Maven knows that information internally. This feels like a bug in Maven. I could work around it by registering a repositoryListener that gets called whenever an artifact is resolved; at that point the repository information is still present.
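To make the listener idea concrete, here is a minimal sketch in plain Java with no Maven dependency. In Maven Resolver the hook would presumably be `RepositoryListener.artifactResolved(RepositoryEvent)`; the `RepositoryRecorder` class, its method signatures, and the coordinates used are hypothetical stand-ins, not code from the fork or from ORT.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers which repository served an artifact at the
// moment it is resolved, because that information is missing later on.
class RepositoryRecorder {
    private final Map<String, String> repositoryByArtifact = new ConcurrentHashMap<>();

    // Would be called from the listener callback for every resolved artifact.
    // repositoryId may be null, e.g. when nothing was downloaded.
    void artifactResolved(String coordinates, String repositoryId) {
        if (repositoryId != null) {
            // Keep the first repository seen for an artifact.
            repositoryByArtifact.putIfAbsent(coordinates, repositoryId);
        }
    }

    // Looks up the recorded repository after the build request has finished.
    Optional<String> repositoryOf(String coordinates) {
        return Optional.ofNullable(repositoryByArtifact.get(coordinates));
    }

    public static void main(String[] args) {
        RepositoryRecorder recorder = new RepositoryRecorder();
        recorder.artifactResolved("org.example:demo:1.0", "central");
        System.out.println(recorder.repositoryOf("org.example:demo:1.0").orElse("unknown"));
    }
}
```

The map is concurrent because resolution may happen on several threads; whether that is needed depends on how the build request is executed.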
Such a slow resolution for Maven projects is unusual and often an indication of an issue with one of the used repositories. For example, sometimes repositories are configured which are not available anymore which leads to lots of time consuming network timeouts. I had a short look at the logs and it seems that the majority of the time is spent waiting for https://repository.jboss.org/. Each request takes several seconds, sometimes more than 10. As an example:
Requests to other repositories take only a few milliseconds. I wonder if the JBoss repository is always so slow or if this is a temporary issue. What I would investigate is:
ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See: ort/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt Lines 862 to 893 in a464678
This could be an issue with setups where the local Maven repository is kept for ORT scans of different Maven projects. If those projects use different repositories for the same artifacts (like a mirror of Maven Central), the information from the local cache might be wrong, because Maven did not attempt a new resolution of artifacts that were already cached.
The project only has one additional repository set up: repo.eclipse.org, for one dependency. The JBoss repository probably comes from another dependency's POM and is not used generally. One negative side effect of the current approach is that ORT tries to download from the repositories defined for the project in the order in which it retrieved them from the Maven API. In this project, repo.eclipse.org is first in the list and is therefore always tried first to resolve an artifact, which is a waste of course. I understand that the information in the local Maven cache might not reflect the real situation, but using ORT the way it currently works does not seem to work as expected in a GitHub Action. The existing ORT action already does some caching, but it only affects the ORT job, so I would not expect any negative side effects from it. Maybe a special option could be added to enable this optimization when you are sure that the local Maven cache is not affected by other projects?
I don't think that this is true, looking at ort/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt Lines 722 to 738 in a464678
This is exactly what Maven does during dependency resolution. If the POM file of a dependency defines repositories, they are appended to the list of repositories and used to resolve the transitive dependencies of that dependency (but not for other branches of the dependency tree).
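A small Java sketch of the rule described here, assuming the simplest possible model: each branch gets the inherited list followed by the repositories declared in the dependency's own POM. The class name, method name, and repository ids are hypothetical and only illustrate the branch-local behaviour; this is not Maven's or ORT's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BranchRepositories {
    // Repositories used for the transitive dependencies of one dependency:
    // the inherited list, followed by the ones declared in that dependency's POM.
    // A new list is returned, so other branches keep seeing only the inherited list.
    static List<String> forChildren(List<String> inherited, List<String> declaredInPom) {
        Set<String> merged = new LinkedHashSet<>(inherited); // keeps order, drops duplicates
        merged.addAll(declaredInPom);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> project = List.of("repo.eclipse.org", "central");
        // Branch A: the dependency's POM declares an extra repository.
        System.out.println(forChildren(project, List.of("jboss")));
        // Branch B: no extra repositories, so nothing from branch A leaks in.
        System.out.println(forChildren(project, List.of()));
    }
}
```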
Yes, making this optimization optional would be a requirement for some ORT users, but I agree that for the GitHub action it could be a useful option.
The ort/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt Lines 635 to 638 in a951533
It also does not actually download the artifact but only does an existence check, see: ort/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt Lines 583 to 584 in a951533
The order in which these repos are tried when resolving an artifact matters. If every artifact is first tried on a repo that only hosts a single dependency and Maven Central is tried last, you waste a lot of connections while resolving your artifacts, which on GitHub can have all kinds of side effects (the runner might get throttled). It looks like the Maven API that ORT uses returns the repos in LIFO order: Maven Central is defined in the super POM, but it is always last in the list of repos, so ORT tries Maven Central as a last resort. That is counter-intuitive, as Maven Central should be the first option. But that's just what I have seen from debugging this for 1-2 days; I will certainly need more time to understand what is going on.
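The cost argument can be sketched with a toy model in Java. It assumes one existence check per repository until the hosting one is reached, and the artifact counts (100 on Maven Central, 1 on repo.eclipse.org) are made-up numbers for illustration, not measurements from the project. It only shows how the number of requests depends on the order; it says nothing about which order is correct.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProbeCost {
    // Total existence checks when every artifact is probed against the
    // repositories in the given order until its hosting repository is hit.
    static int totalChecks(List<String> order, Map<String, String> hostByArtifact) {
        int checks = 0;
        for (String host : hostByArtifact.values()) {
            int position = order.indexOf(host);
            // Not hosted anywhere in the list: every repository was probed.
            checks += position < 0 ? order.size() : position + 1;
        }
        return checks;
    }

    public static void main(String[] args) {
        Map<String, String> hostByArtifact = new LinkedHashMap<>();
        for (int i = 0; i < 100; i++) {
            hostByArtifact.put("central-artifact-" + i, "central");
        }
        hostByArtifact.put("eclipse-artifact", "repo.eclipse.org");

        System.out.println(totalChecks(List.of("repo.eclipse.org", "central"), hostByArtifact)); // prints 201
        System.out.println(totalChecks(List.of("central", "repo.eclipse.org"), hostByArtifact)); // prints 102
    }
}
```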
I will need to think about it, but my idea is that Maven does a very good job of resolving artifacts and has proven this over many years. Any custom resolution mechanism on top of it will most likely be worse in terms of performance, especially in constrained environments like CI systems.
As @mnonnenmacher pointed out, ORT does the same thing here as Maven itself would do: repositories that are defined "closer" to the project / dependency in question are tried first. Otherwise, projects would have no chance of overriding an artifact that also lives in Maven Central.
OK, thanks. I did some more digging into the ORT and Maven logs for resolving the dependencies, and the way the artifacts are resolved is pretty much the same. Maven seems to do it much faster though; I will need to understand what is different. Connections are dropped / released quite often and have to be re-established, which could trigger protection mechanisms on some repos.
I tried to use the ORT GH action to analyse a Maven project, but the analysis alone took more than 30 min inside a GitHub runner (https://github.com/netomi/macos-notarization-service/actions/runs/7006736431).
The project has some additional repository setup, and ORT seems to try to find out which repository a dependency comes from by trying to download the dependency from each configured repository. There also seems to be some network throttling in place when run as a GH action: when running the same analysis locally, it completed in a couple of minutes.
However, there should be a way to speed this up, so I worked on improving the resolution of dependencies in Maven projects. In my fork at https://github.com/netomi/ort/tree/disable-remote-verification I did some experiments to get the resolved repository from Maven itself (it stores that information in the _remote.repositories file in the local cache).
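For illustration, a minimal Java sketch of reading such a file. It assumes entries of the form `<file name>><repository id>=` in Java properties syntax, with an empty repository id for locally installed files; this is an internal Maven format that may change, and the class name and sample content are hypothetical rather than taken from the fork.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class RemoteRepositoriesFile {
    // Maps each file name to the id of the repository it was downloaded from.
    // Entries without a repository id are skipped.
    static Map<String, String> parse(String content) {
        Properties entries = new Properties();
        try {
            entries.load(new StringReader(content)); // '#' lines are treated as comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> repositoryByFile = new TreeMap<>();
        for (String key : entries.stringPropertyNames()) {
            int separator = key.lastIndexOf('>');
            if (separator <= 0 || separator == key.length() - 1) {
                continue; // malformed entry or empty repository id
            }
            repositoryByFile.put(key.substring(0, separator), key.substring(separator + 1));
        }
        return repositoryByFile;
    }

    public static void main(String[] args) {
        // Assumed sample content, not copied from a real cache.
        String sample = "#illustrative sample\n"
                + "demo-1.0.pom>central=\n"
                + "demo-1.0.jar>central=\n"
                + "local-1.0.jar>=\n";
        System.out.println(parse(sample)); // prints {demo-1.0.jar=central, demo-1.0.pom=central}
    }
}
```

As noted elsewhere in this thread, a cache shared between projects with different repository setups could make this information misleading, so a lookup like this would only be safe behind an option.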
With these changes, the analysis run on GitHub completed in around 2 min, see https://github.com/netomi/macos-notarization-service/actions/runs/7021936345.
My approach so far is quick and dirty and more of a PoC, but I will be working on a PR to make this as clean as possible.