CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195

Open
wants to merge 56 commits into
base: main
from

data-douser
Collaborator

@data-douser data-douser commented May 18, 2025

This PR implements the planned "Phase 2" of the full rewrite of the CDS extractor, focusing on improving performance and precision. It introduces significant changes to the CodeQL CDS extractor, including a major refactor of the extraction process, updates to scripts, and improvements to configuration and debugging. Throughout this multi-phase rewrite process, the approach has been documented in the extractors/cds/tools/autobuild.md file.

The changes in this PR do not fully implement the "autobuild" run mode for the CDS extractor, but they get reasonably close. New "run modes" were added to the renamed cds-extractor.ts script (formerly index-files.ts), and the script's arguments have been updated to allow for run modes such as index-files (legacy), debug-parser (new), and autobuild (planned / WIP).

While staying within the limitations of the index-files approach, the changes in this PR attempt to integrate parsing and compiling of .cds files in a manner that is "project aware", meaning that we try to compile only the top-level .cds files in an effort to avoid duplicating both compilation work and indexed .cds.json files.
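
To illustrate what "top-level" could mean here, the following is a minimal sketch and not the extractor's actual code. It assumes a hypothetical `ImportGraph` that maps each .cds file to the .cds files it imports, and treats a file as top-level when no other file in the project imports it. The type name, function name, and sample paths are all illustrative.

```typescript
// Hypothetical shape: maps each .cds file to the .cds files it imports.
type ImportGraph = Map<string, string[]>;

// Returns the files that no other file in the project imports.
// Assumption: these are the only files that need to be compiled directly.
function findRootCdsFiles(graph: ImportGraph): string[] {
  const imported = new Set<string>();
  graph.forEach((targets) => {
    targets.forEach((target) => imported.add(target));
  });
  return Array.from(graph.keys())
    .filter((file) => !imported.has(file))
    .sort();
}

const graph: ImportGraph = new Map([
  ["srv/service.cds", ["db/schema.cds"]],
  ["db/schema.cds", []],
  ["app/index.cds", ["srv/service.cds"]],
]);

const roots = findRootCdsFiles(graph);
console.log(roots); // only app/index.cds is not imported by another file
```

One limitation of this sketch: if every file takes part in an import cycle, it returns no roots, so a real implementation would need a fallback for that case.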

Key Changes:

New Features and Functionality:

  • Introduction of cds-extractor.ts: Added a new script to handle CDS file processing, including project dependency graph building, environment setup, and integration with CodeQL tools. This script replaces the previous index-files.js script.
  • Run Modes: Adds support for running the cds-extractor.ts script with different "run mode" values, including autobuild, debug-parser, and index-files.
  • New Project-Aware Dependency Handling: The cds-extractor.ts script has been rewritten to include features like project dependency graph building, project-aware compilation, and diagnostic handling for CDS files. This enables more efficient and context-aware processing of CDS files.
  • Debugging: Enhanced debugging of the project dependency graph (post-parsing, pre-compilation) is now supported via the debug-parser run mode of the cds-extractor (node) script.
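
As a rough illustration of the run modes listed above, here is a minimal sketch of how a run-mode argument might be validated. Only the three mode names come from this PR; `RUN_MODES`, `RunMode`, and `parseRunMode` are hypothetical names and the error message is illustrative.

```typescript
// Run mode names come from the PR description; everything else is illustrative.
const RUN_MODES = ["autobuild", "debug-parser", "index-files"] as const;
type RunMode = (typeof RUN_MODES)[number];

// Hypothetical helper: validates a raw CLI argument against the known modes.
function parseRunMode(value: string | undefined): RunMode {
  const match = RUN_MODES.find((mode) => mode === value);
  if (match === undefined) {
    throw new Error(
      `Unknown run mode "${value}"; expected one of: ${RUN_MODES.join(", ")}`,
    );
  }
  return match;
}

const mode = parseRunMode("debug-parser");
console.log(mode);
```

Deriving the `RunMode` type from the array keeps the list of accepted values in one place, so the usage message and the validation cannot drift apart.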

Script Updates:

  • Batch Script (index-files.cmd) Updates: Updated references from index-files.js to cds-extractor.js, added _run_mode parameter, and adjusted logging and execution commands to align with the new script.
  • Shell Script (index-files.sh) Updates: Similar updates as the batch script, including parameter additions and script name changes for consistency.

Configuration Improvements:

  • ESLint Configuration (eslint.config.mjs): Refactored imports for better readability, updated rules and plugin configurations, and added comments to clarify TypeScript and JavaScript-specific settings.

Miscellaneous:

  • .gitignore Update: Added an entry to ignore debug/ files created during debugging of the CDS extractor.

Documentation Updates:

  • Updated Workflow Diagrams: The README.md now reflects the new cds-extractor workflow, replacing outdated references to index-files with the new process and steps for project-aware compilation.

This commit:

- Implements an initial version of a project-aware CDS parser.
- Creates a dedicated "cds" package at "extractors/cds/tools/src/cds".
- Converts existing unit tests to use the new path for functions
  related to parsing and/or compiling .cds files.
This commit:

- fixes a typo in a comment, as identified in a previous PR (advanced-security#188);

- updates the logic of the CDS extractor's `findPackageJsonDirs` function;

- fixes a regression in the CDS extractor where a "project directory"
  was not properly recognized when its path was the same as the
  "source root" directory for the CDS extractor scan;

- adds unit tests to cover edge cases identified for the
  `findPackageJsonDirs` function.
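
The source-root edge case described above can be sketched as follows. This is not the actual `findPackageJsonDirs` implementation; `findProjectDirs`, `parentDir`, and the `hasPackageJson` callback are hypothetical, and paths are assumed to be absolute and POSIX-style. The point is only that the walk checks the source root itself before stopping.

```typescript
// Hypothetical helper: returns the parent of a POSIX-style absolute path.
function parentDir(p: string): string {
  const idx = p.lastIndexOf("/");
  return idx <= 0 ? "/" : p.slice(0, idx);
}

// Hypothetical sketch, not the extractor's findPackageJsonDirs.
// For each file, walks upward until a directory with a package.json is found.
function findProjectDirs(
  filePaths: string[],
  sourceRoot: string,
  hasPackageJson: (dir: string) => boolean,
): string[] {
  const found = new Set<string>();
  filePaths.forEach((filePath) => {
    let dir = parentDir(filePath);
    while (true) {
      if (hasPackageJson(dir)) {
        found.add(dir);
        break;
      }
      // Stop only after the source root itself has been checked.
      if (dir === sourceRoot || dir === "/") break;
      dir = parentDir(dir);
    }
  });
  return Array.from(found).sort();
}

// The only package.json lives in the source root itself.
const packageDirs = new Set(["/repo"]);
const projectDirs = findProjectDirs(
  ["/repo/srv/service.cds", "/repo/db/schema.cds"],
  "/repo",
  (dir) => packageDirs.has(dir),
);
console.log(projectDirs);
```

Passing the package.json check in as a callback keeps the sketch free of file-system access, which also makes the edge case easy to cover in a unit test.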
Renames the entrypoint to the CDS extractor script and refactors its
arguments in order to support using different "run modes" for the
extractor, including:

  - "autobuild" : work-in-progress, just a stub right now;
  - "debug-parser" : used for debugging CDS project & file parsing;
  - "index-files" : legacy mode, useful for backwards compatibility;

Updates the usage (help) message for the script to represent the
required arguments for each of the currently planned run modes.

Adds support for the "debug-parser" run mode, which writes debug output
to a file under the `extractors/cds/tools/out/debug/` directory. Useful
for the in-progress rewrite of the CDS extractor, which aims to be more
performant when running and to yield a CodeQL database that allows for
high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS
extractor, using the CDS projects nested under this repository's
`javascript/frameworks/cap/test/queries` directory as testing
targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the
use of the `cds` compiler.

Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the
codeql CLI can be reliably found and to avoid duplicate runs of
the CDS parser's graph building process for "debug-parser" versus
other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging
in order to provide more useful indications of the CDS compiler
version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that
allows for de-duplication of individual `.cds` files that are already
included by another `.cds` file in the project.
…/codeql-sap-js into data-douser/cds-ts-rewrite-2
Updates the mermaid flowchart for the CDS extractor in order to
reflect recent changes to how the CDS extractor actually works.
Copilot

This comment was marked as outdated.

data-douser and others added 7 commits June 7, 2025 18:26
Fixes detection of .cds files in CDS projects by ensuring that
"node_modules" subdirectories are explicitly ignored and "srv" and "db"
subdirectories are explicitly included.
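
A minimal sketch of the `node_modules` exclusion described above, assuming file paths are relative to the project directory and use `/` separators. `selectCdsFiles` is a hypothetical name, and this sketch does not model how the extractor explicitly includes the "srv" and "db" subdirectories; it only shows that files under those directories survive the filter.

```typescript
// Hypothetical filter; the extractor's real file discovery may differ.
function selectCdsFiles(relativePaths: string[]): string[] {
  return relativePaths.filter((p) => {
    const segments = p.split("/");
    // Skip anything installed under a node_modules directory at any depth.
    if (segments.indexOf("node_modules") !== -1) return false;
    // Keep only files with the .cds extension.
    return p.slice(-4) === ".cds";
  });
}

const candidates = [
  "srv/cat-service.cds",
  "db/schema.cds",
  "node_modules/@sap/cds/common.cds",
  "app/node_modules/dep/index.cds",
  "package.json",
];
const selected = selectCdsFiles(candidates);
console.log(selected);
```

Matching on whole path segments, rather than on a substring, avoids accidentally excluding a directory whose name merely contains "node_modules".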

Migrates some logic from cds-extractor.ts (entrypoint) script to
testable functions under extractors/cds/tools/src/ directory.

Adds and improves unit tests related to code changes from this commit.
Removes an unintended change in CDS compile (to .cds.json) behavior
due to the (mis)use of the "--parse" option.

Fixes a regression in the expected query results in at least one case:

`javascript/frameworks/cap/src/sensitive-exposure/SensitiveExposure.ql`
Refactors cds extractor `src/cds/compiler` and `src/cds/parser`
packages for improved maintainability.

Simplifies the main logic of the CDS extractor such that we always
build a graph that maps CDS projects to their imports / dependencies,
which is part of the longer process of deprecating the "index-files"
run mode of the CDS extractor (in favor of autobuild, eventually).

Attempts to fix CDS file and project parsing for test projects such as:
`javascript/frameworks/cap/test/queries/loginjection/log-injection-without-protocol-none`
Fixes a regression where the project base directory was being used
to set the `cwd` of the process spawned for running the CDS compiler
for "project-aware" compilation. Adds unit tests to ensure the `cwd`
is always set to the value of the `sourceRoot` directory.
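
The `cwd` fix above can be sketched as a small pure function that builds the compiler invocation. This is an illustration under assumptions, not the extractor's code: `CompileInvocation` and `buildCompileInvocation` are hypothetical names, the model argument is assumed to be a path relative to `sourceRoot`, and only the basic `cds compile <model> --to json` form is shown.

```typescript
// Hypothetical shape for a compiler invocation; not part of the extractor.
interface CompileInvocation {
  command: string;
  args: string[];
  options: { cwd: string };
}

// Assumption: relativeModelPath is given relative to sourceRoot.
function buildCompileInvocation(
  sourceRoot: string,
  relativeModelPath: string,
): CompileInvocation {
  return {
    command: "cds",
    args: ["compile", relativeModelPath, "--to", "json"],
    // Always run from the source root, never from the project directory,
    // so that relative paths are resolved against the same base every time.
    options: { cwd: sourceRoot },
  };
}

const invocation = buildCompileInvocation("/repo", "projects/bookshop/srv");
console.log(invocation.options.cwd, invocation.args.join(" "));
```

The result could then be handed to Node's `child_process.spawnSync(command, args, options)`. Keeping the construction separate from the spawn call makes it straightforward to assert on `cwd` in a unit test without running the compiler.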

Further refactoring of the `cds/compiler` and `cds/parser` packages
within the source code of the CDS extractor.

This commit is expected to actually cause more problems with existing
queries, despite fixing the relative-file-path problem / regression.
Some changes to existing CodeQL queries and/or expected results may be
required as, at this point, the JSON data generated by the CDS compiler
(via the CDS extractor) seems valid.
@data-douser data-douser requested a review from Copilot June 11, 2025 15:57
Addresses a code scanning alert related to the recent introduction
of the `Math.random()` function when creating a session ID for CDS
extractor logging. Replaces `Math.random()` with `Date.now()` in an
effort to remove any perception that the "session ID" is in any way
used in a "security context".
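
A minimal sketch of a timestamp-based session ID, assuming the ID is only used to correlate log lines. `createSessionId` and the `cds-extractor-` prefix are hypothetical; the actual ID format used by the extractor is not shown in this PR text.

```typescript
// Hypothetical helper: derives a logging session ID from the current time.
// The optional parameter exists only to make the function easy to test.
function createSessionId(now: number = Date.now()): string {
  return `cds-extractor-${now}`;
}

const sessionId = createSessionId();
console.log(sessionId);
```

Two calls within the same millisecond would produce the same ID, which is acceptable for a logging label but is one more reason not to treat the value as a security token.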
Removes "run modes" in an attempt to simplify the overall CDS extractor.
Fixes a regression in the CDS extractor's processing of monorepos.
Updates the `extractors/cds/tools/.gitignore` file to explicitly include
the `dist/` and `node_modules/` subdirectories in order to support
pre-build of the CDS extractor JS (compiled from TS) code.
Improves the efficiency and logging of package dependency installation
tasks in the "packageManager" package of the CDS extractor source code.

Responds to peer feedback on related PR advanced-security#195.

Extends unit test coverage for the `src/packageManager` code.
@data-douser data-douser requested a review from lcartey July 1, 2025 03:12
@data-douser
Collaborator Author

@lcartey I believe that this version of the CDS extractor accounts for the case where an annotation is written in a separate .cds file. Because the latest approach uses a single cds compile command to compile all of the .cds files for a given CAP project, the js/cap-entity-exposed-without-authentication query should no longer return FP results related to the TestService in the cloud-cap-samples/bookshop project.

Removes the `dist/` and `node_modules/` directories from git tracking
for the CDS extractor. This might be temporary, but for now is needed
for PR advanced-security#195 readability.
Cleans up remaining known problems with the CDS extractor rewrite
for PR advanced-security#195, including:

- Fixes a bug that was introduced to the `index-files.sh` script by the
  previous commit;
- Removes dead `src/**/*.ts` code, where found;
- Replaces hardcoded, system-local paths in CDS project files;
- Improves the organization, logic, and testing of `src/cds/compiler`
  code.
Contributor

@lcartey lcartey left a comment

I noticed some dead code and areas which could be simplified.

This commit:

- removes unused functions from CDS extractor "src/cds/**/*.ts";
- removes logging of CDS extractor memory usage;
- renames some CDS extractor logging functions for consistency;
- addresses PR comments for the `src/cds/parser/functions.ts` file.
Updates the `compileProjectLevel` function of the CDS extractor to just
use the project directory as the first argument to the `cds compile`
command, which simplifies the code while accounting for a wide variety
of project (directory and file) structures.
data-douser and others added 6 commits July 8, 2025 13:24
Improves unit testing for the TypeScript code in the CDS
extractor's "compiler" module. Aligns the unit testing
coverage for the "compiler" module to be comparable to other
modules (libraries) in the source code for the CDS extractor.
@data-douser data-douser requested a review from lcartey July 10, 2025 23:47
Labels
enhancement (New feature or request), javascript (Pull requests that update javascript code)
2 participants