CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195

Open
wants to merge 56 commits into
base: main
56 commits
5bafe3d
Refactor CDS extractor for dedicated "cds" package
data-douser May 15, 2025
e4c1ff0
Fix CDS extractor findPackageJsonDirs
data-douser May 15, 2025
7e58207
Rename CDS extractor entrypoint and refactor args
data-douser May 16, 2025
af80a68
Add self-parser.test.ts for CDS extractor
data-douser May 16, 2025
ea19649
CDS extractor tests for compiler & packageManager
data-douser May 17, 2025
f9e41aa
Fix CDS extractor environment setup
data-douser May 18, 2025
6412400
Improve CDS extractor logging
data-douser May 18, 2025
009fe42
First attempt at project-aware CDS compilation
data-douser May 18, 2025
2c68c8a
Refactor CDS extractor for dedicated "cds" package
data-douser May 15, 2025
8e09758
Rename CDS extractor entrypoint and refactor args
data-douser May 16, 2025
1dd464b
Add self-parser.test.ts for CDS extractor
data-douser May 16, 2025
729dd2e
CDS extractor tests for compiler & packageManager
data-douser May 17, 2025
0c75133
Fix CDS extractor environment setup
data-douser May 18, 2025
09fa955
Improve CDS extractor logging
data-douser May 18, 2025
c865d94
First attempt at project-aware CDS compilation
data-douser May 18, 2025
d629d1e
Merge branch 'data-douser/cds-ts-rewrite-2' of github.com:data-douser…
data-douser May 18, 2025
5aa2d54
Update node dependencies for CDS extractor
data-douser May 18, 2025
86b5572
Update CDS extractor flowchart diagram
data-douser May 18, 2025
d6a99da
Merge branch 'main' into data-douser/cds-ts-rewrite-2
data-douser Jun 8, 2025
bc82815
Fixes CDS extractor project-aware file detection
data-douser Jun 9, 2025
6315a49
Remove "--parse" from CDS compile command
data-douser Jun 10, 2025
bc4a2cd
Merge branch 'advanced-security:main' into data-douser/cds-ts-rewrite-2
data-douser Jun 10, 2025
27743ba
Simplify CDS extractor logic and refactor
data-douser Jun 10, 2025
7a05f80
Fix project-aware CDS compile file paths
data-douser Jun 11, 2025
af066d7
Merge branch 'advanced-security:main' into data-douser/cds-ts-rewrite-2
data-douser Jun 11, 2025
0ba67d5
Fix code-scanning alerts for insecure tmp files
data-douser Jun 11, 2025
0f1ac9e
Improve testing of CDS extractor graph
data-douser Jun 11, 2025
aaab73b
Update CDS extractor node dependencies
data-douser Jun 11, 2025
e4bbc1f
Implement cdsExtractorLog for consistent logging
data-douser Jun 11, 2025
99923e0
Common project graph for CDS parse and compile
data-douser Jun 25, 2025
efc989f
Fix CdlService getImplementation file location
data-douser Jun 25, 2025
9dfbe3b
Use shell-quote.quote in testCdsCommand()
data-douser Jun 25, 2025
fcac0f1
Fix unit test for CDS extractor
data-douser Jun 25, 2025
3573995
Fix CDS extractor args validation
data-douser Jun 26, 2025
1c5d011
Improve cdsExtractorLog for debugging performance
data-douser Jun 26, 2025
359290e
Replace Math.random() in CDS extractor
data-douser Jun 26, 2025
d4dd37a
Refactor CDS extractor run modes
data-douser Jun 27, 2025
90453cc
Fix CDS extractor monorepo support
data-douser Jun 27, 2025
aa7dac4
Replace CDS extractor autobuild.md with README.md
data-douser Jun 27, 2025
cc29a54
Add CDS extractor JS dist and node_modules
data-douser Jun 27, 2025
9050fe1
Improve and cleanup CDS extractor packageManager
data-douser Jun 30, 2025
cf46788
Update CDS node package deps to latest
data-douser Jun 30, 2025
ee10ad6
Cleanup CDS extractor src/parser/functions.ts
data-douser Jun 30, 2025
6aa4032
Remove dist/ and node_modules/ CDS extractor dirs
data-douser Jul 2, 2025
a1d1c5d
CDS extractor rewrite project cleanup
data-douser Jul 2, 2025
7355f73
Fix failing cdsExtractorLog.test.ts unit test
data-douser Jul 2, 2025
759e118
Remove out/ directory for CDS extractor
data-douser Jul 3, 2025
a6c89e6
More CDS extractor code cleanup
data-douser Jul 6, 2025
119aa29
Simplify CDS project compile command
data-douser Jul 7, 2025
257fac2
Revert "Simplify CDS project compile command"
data-douser Jul 7, 2025
c41597c
CDS extractor cleanup checkpoint 1
data-douser Jul 8, 2025
995fd55
CDS extractor cleanup checkpoint 2
data-douser Jul 8, 2025
581c29b
Merge branch 'main' into data-douser/cds-ts-rewrite-2
data-douser Jul 10, 2025
0842f31
CDS extractor cleanup checkpoint 3
data-douser Jul 10, 2025
cb8518b
Improve unit testing of CDS "compiler" module
data-douser Jul 10, 2025
3901267
Update CDS extractor README.md with TODOs
data-douser Jul 10, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -71,3 +71,5 @@ tmp/
**.testproj
dbs
*.cds.json
.cds-extractor-cache

50 changes: 32 additions & 18 deletions extractors/README.md
@@ -30,16 +30,20 @@ pre-finalize.sh`"]
JSE[[javascript extractor]]
DTRAC[codeql database<br>trace-command]
SPF[[pre-finalize.sh]]
DIDX[codeql database index-files<br> --language=cds<br>--include-extension=.cds]
SIF[[index-files.sh]]
SIT[[index-files.ts/js]]
NPM[[npm install & build]]
DETS[[Determine CDS command]]
FIND[[Find package.json dirs]]
INST[[Install dependencies]]
CC[[cds compiler]]
ABCMD[[autobuild.sh/cmd]]
ABT[[cds-extractor.ts/js]]
ENV[[setup & validate<br>environment]]
PDG[[build project<br>dependency graph]]
INSTC[[install dependencies<br>with caching]]
PROC[[process CDS files<br>to JSON]]
PMAP[[project-aware<br>dependency resolution]]
FIND[[find project for<br>CDS file]]
CDCMD[[determine CDS<br>command for project]]
COMP[[compile CDS<br>to JSON]]
CDJ([.cds.json files])
FILT[[configure LGTM<br>index filters]]
JSA[[javascript extractor<br>autobuild script]]
DIAG[[add compilation<br>diagnostics]]
TF([CodeQL TRAP files])
DBF[codeql database finalize<br> -- /path/to/database]
@@ -54,20 +58,30 @@ pre-finalize.sh`"]
JSE ==> |run autobuild within<br>the javascript extractor| DTRAC
DTRAC ==> |run the build --command| SPF
SPF ==> |run codeql index-files<br>for CDS files| DIDX
DIDX ==> |invoke script via<br>--search-path| SIF
SIF ==> |runs TypeScript version<br>after npm install| NPM
NPM ==> |executes compiled<br>index-files.js| SIT
SPF ==> |run autobuilder<br>for CDS files| ABCMD
ABCMD ==> |runs TypeScript version<br>of CDS extractor| ABT
SIT ==> |finds project directories<br>with package.json| FIND
FIND ==> |install CDS dependencies<br>in project directories| INST
SIT ==> |determines which<br>cds command to use| DETS
DETS ==> |processes each CDS file| CC
ABT ==> |setup and validate<br>environment first| ENV
ABT ==> |build project dependency<br>graph for source root| PDG
PDG ==> |analyze CDS projects<br>structure & relationships| PMAP
ABT ==> |efficiently install<br>required dependencies| INSTC
INSTC ==> |use cached approach for<br>dependency installation| PMAP
ABT ==> |process each CDS file<br>to generate JSON files| PROC
PROC ==> |find which project<br>contains this CDS file| FIND
FIND ==> |uses project-aware<br>dependency resolution| PMAP
FIND ==> |determine appropriate<br>CDS command for project| CDCMD
CDCMD ==> |compile CDS file to JSON<br>with project context| COMP
COMP ==> |generate JSON representation<br>with project awareness| CDJ
COMP --x |if compilation fails,<br>report diagnostics| DIAG
DIAG -.-> |diagnostics stored<br>in database| DB
CC ==> |compile .cds files to<br>create .cds.json files| CDJ
CDJ -.-> |stored in same location<br>as original .cds files| DB
SIT ==> |configures extraction<br>filters for JSON files| JSA
ABT ==> |configure extraction<br>filters for JSON files| FILT
ABT ==> |run JavaScript extractor<br>to process JSON files| JSA
JSA ==> |processes .cds.json files<br>via javascript extractor| CDJ
CDJ ==> |javascript extractor<br>generates TRAP files| TF
12 changes: 0 additions & 12 deletions extractors/cds/tools/.gitignore

This file was deleted.

282 changes: 282 additions & 0 deletions extractors/cds/tools/README.md
@@ -0,0 +1,282 @@
# CodeQL CDS Extractor

A robust CodeQL extractor for [Core Data Services (CDS)][CDS] files used in [SAP Cloud Application Programming (CAP)][CAP] model projects. This extractor processes `.cds` files and compiles them into `.cds.json` files for CodeQL analysis while maintaining project-aware parsing and dependency resolution.

## Overview

The CodeQL CDS extractor is designed to process CDS projects efficiently through:

- **Project-Aware Processing**: Analyzes CDS files as related project configurations rather than independent definitions
- **Optimized Dependency Management**: Caches and reuses `@sap/cds` and `@sap/cds-dk` dependencies across projects
- **Enhanced Precision**: Reduces false positives in CodeQL queries by understanding cross-file relationships
- **Performance Optimization**: Avoids duplicate processing and unnecessary dependency installations

## Architecture

The extractor uses an `autobuild` approach with the following key components:

### Core Components

- **`cds-extractor.ts`**: Main entry point that orchestrates the extraction process
- **`src/cds/parser/`**: CDS project discovery and dependency graph building
- **`src/cds/compiler/`**: Compilation orchestration and `.cds.json` generation
- **`src/packageManager/`**: Dependency installation and caching
- **`src/logging/`**: Unified logging and performance tracking
- **`src/environment.ts`**: Environment setup and validation
- **`src/codeql.ts`**: CodeQL JavaScript extractor integration

### Extraction Process

1. **Environment Setup**: Validates CodeQL tools and system requirements
2. **Project Discovery**: Recursively scans for CDS projects and builds dependency graph
3. **Dependency Management**: Installs and caches required CDS compiler dependencies
4. **CDS Compilation**: Compiles `.cds` files to `.cds.json` using project-aware compilation
5. **JavaScript Extraction**: Runs CodeQL's JavaScript extractor on source and compiled files
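
Steps 2 and 4 both depend on knowing which discovered project a given `.cds` file belongs to. The following is a minimal sketch of that lookup, assuming the nearest enclosing project directory wins in a monorepo. The function name `findProjectForFile` and its signature are hypothetical and are not taken from the extractor's source.

```typescript
// Hypothetical helper: the name and signature are illustrative only.
// Returns the deepest project directory that contains the given CDS file.
function findProjectForFile(cdsFile: string, projectDirs: string[]): string | undefined {
  // Normalize separators and drop trailing slashes so prefix checks are consistent.
  const normalize = (p: string): string => p.replace(/\\/g, "/").replace(/\/+$/, "");
  const file = normalize(cdsFile);
  let best: string | undefined;
  for (const dir of projectDirs.map(normalize)) {
    // Require a path separator after the directory so "/repo" does not match "/repository".
    if (file.startsWith(dir + "/") && (best === undefined || dir.length > best.length)) {
      best = dir;
    }
  }
  return best;
}

const exampleProjects = ["/repo", "/repo/services/orders"];
console.log(findProjectForFile("/repo/services/orders/srv/orders.cds", exampleProjects)); // "/repo/services/orders"
console.log(findProjectForFile("/repo/db/schema.cds", exampleProjects)); // "/repo"
```

Choosing the longest matching directory is one reasonable way to keep nested projects from being attributed to the repository root; the actual extractor may resolve this differently.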

## Usage

### Prerequisites

- Node.js (accessible via `node` command)
- CodeQL CLI tools
- SAP CDS projects with `.cds` files

### Running the Extractor

The extractor is typically invoked by CodeQL during database creation:

```bash
codeql database create --language=cds --source-root=/path/to/project my-database
```

### Manual Execution

For development and testing purposes:

```bash
# Build the extractor
npm run build

# Run directly (from project source root)
node dist/cds-extractor.js /path/to/source/root
```

## Development

### Project Structure

```text
extractors/cds/tools/
├── cds-extractor.ts # Main entry point
├── src/ # Source code modules
│ ├── cds/ # CDS-specific functionality
│ │ ├── compiler/ # Compilation orchestration
│ │ └── parser/ # Project discovery and parsing
│ ├── logging/ # Logging and performance tracking
│ ├── packageManager/ # Dependency management
│ ├── codeql.ts # CodeQL integration
│ ├── diagnostics.ts # Error reporting
│ ├── environment.ts # Environment setup
│ ├── filesystem.ts # File system utilities
│ └── utils.ts # General utilities
├── test/ # Test suites
├── dist/ # Compiled JavaScript output
└── package.json # Project configuration
```

### Building

```bash
# Install dependencies
npm install

# Build TypeScript to JavaScript
npm run build

# Run all checks and build
npm run build:all
```

### Testing

```bash
# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run tests in watch mode
npm run test:watch
```

### Code Quality

```bash
# Lint TypeScript files
npm run lint

# Auto-fix linting issues
npm run lint:fix

# Format code
npm run format
```

## Configuration

### Environment Variables

The extractor respects several CodeQL environment variables:

- `CODEQL_DIST`: Path to CodeQL distribution
- `CODEQL_EXTRACTOR_CDS_WIP_DATABASE`: Target database path
- `LGTM_INDEX_FILTERS`: File filtering configuration
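
As an illustration, these variables could be read and validated in one place before extraction starts. This is a sketch under assumptions: the `ExtractorEnv` shape and `readExtractorEnv` name are hypothetical, treating the first two variables as required is an assumption, and the parsing assumes `LGTM_INDEX_FILTERS` holds one filter per line.

```typescript
// Hypothetical shape; the extractor's real environment module may differ.
interface ExtractorEnv {
  codeqlDist: string;
  wipDatabase: string;
  indexFilters: string[];
}

function readExtractorEnv(env: Record<string, string | undefined>): ExtractorEnv {
  // Assumption: a missing or blank value is treated as a fatal configuration error.
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };
  // Assumption: LGTM_INDEX_FILTERS contains one filter per line; blank lines are ignored.
  const indexFilters = (env["LGTM_INDEX_FILTERS"] ?? "")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return {
    codeqlDist: required("CODEQL_DIST"),
    wipDatabase: required("CODEQL_EXTRACTOR_CDS_WIP_DATABASE"),
    indexFilters,
  };
}

const exampleEnv = readExtractorEnv({
  CODEQL_DIST: "/opt/codeql",
  CODEQL_EXTRACTOR_CDS_WIP_DATABASE: "/tmp/my-database",
  LGTM_INDEX_FILTERS: "exclude:**/*\ninclude:**/*.cds.json\n",
});
console.log(exampleEnv.indexFilters);
```

In a Node.js process the same function would be called with `process.env`; taking the environment as a parameter keeps the sketch easy to unit test.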

### CDS Project Detection

Projects are detected based on:

- Presence of `package.json` files
- CDS files (`.cds`) in the project directory tree
- Valid CDS dependencies (`@sap/cds`, `@sap/cds-dk`) in package.json
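
The criteria above can be sketched as a small predicate. This is illustrative only: the names `declaresCdsDependency` and `looksLikeCdsProject` are hypothetical, and requiring all three criteria at once is an assumption, since the text does not say how the extractor combines them.

```typescript
// Minimal view of package.json covering only the fields this sketch needs.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

const CDS_PACKAGES = ["@sap/cds", "@sap/cds-dk"];

// True when package.json declares at least one of the CDS packages.
function declaresCdsDependency(pkg: PackageJson): boolean {
  const declared: Record<string, string> = { ...pkg.dependencies, ...pkg.devDependencies };
  return CDS_PACKAGES.some((name) => name in declared);
}

// Assumption: a directory counts as a CDS project only if it has a package.json,
// at least one .cds file in its tree, and a declared CDS dependency.
function looksLikeCdsProject(pkg: PackageJson | undefined, filesInTree: string[]): boolean {
  if (pkg === undefined) {
    return false;
  }
  const hasCdsFiles = filesInTree.some((file) => file.endsWith(".cds"));
  return hasCdsFiles && declaresCdsDependency(pkg);
}

console.log(looksLikeCdsProject({ dependencies: { "@sap/cds": "^7.0.0" } }, ["srv/service.cds"])); // true
console.log(looksLikeCdsProject({ dependencies: { express: "^4.0.0" } }, ["srv/service.cds"])); // false
```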

### Compilation Strategy

The extractor's compilation approach combines:

1. **Dependency Graph Building**: Maps relationships between CDS projects
2. **Smart Caching**: Reuses compiled outputs and dependency installations
3. **Error Recovery**: Handles compilation failures gracefully
4. **Performance Tracking**: Monitors compilation times and resource usage

## Performance Features

### Optimized Dependency Management

- **Shared Dependency Cache**: Single installation per unique dependency combination
- **Isolated Environments**: Dependencies installed in temporary cache directories
- **No Source Modification**: Original project files remain unchanged

### Efficient Processing

- **Project-Level Compilation**: Compiles related CDS files together
- **Duplicate Avoidance**: Prevents redundant processing of imported files
- **Memory Tracking**: Monitors and reports memory usage throughout extraction

### Scalability

- **Large Codebase Support**: Optimized for enterprise-scale CDS projects
- **Parallel Processing**: Where possible, processes independent projects concurrently
- **Resource Management**: Cleans up temporary files and cached dependencies

## Integration with `cds` CLI

### Installation of CDS (Node) Dependencies

#### Installation of `@sap/cds` and `@sap/cds-dk`

The CDS extractor attempts to optimize performance for most projects by caching the installation of the unique combinations of resolved CDS dependencies across all projects under a given source root.

The "unique combinations of resolved CDS dependencies" means that we resolve the **latest** available version **within the semantic version range** for each `@sap/cds` and `@sap/cds-dk` dependency specified in the `package.json` file for a given CAP project.

In practice, this means that if "project-a" requires `@sap/cds@^6.0.0` and "project-b" requires `@sap/cds@^7.0.0` while the latest available version is `@sap/cds@<latest>` (as a trivial example), the extractor will install `@sap/cds@<latest>` once and reuse it for both projects.
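
One way to picture the "install once per unique combination" idea is to group projects by a key derived from their resolved versions. The sketch below assumes version resolution has already happened elsewhere; `ResolvedCdsDeps`, `dependencyCacheKey`, and `groupProjectsByResolvedDeps` are hypothetical names, and the version strings are placeholders rather than real release numbers.

```typescript
// Hypothetical record of the versions resolved for one project.
interface ResolvedCdsDeps {
  cds: string;
  cdsDk: string;
}

// Projects with the same resolved versions map to the same key and so share one install.
function dependencyCacheKey(deps: ResolvedCdsDeps): string {
  return `@sap/cds@${deps.cds}|@sap/cds-dk@${deps.cdsDk}`;
}

function groupProjectsByResolvedDeps(
  projects: Array<{ projectDir: string; deps: ResolvedCdsDeps }>,
): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const { projectDir, deps } of projects) {
    const key = dependencyCacheKey(deps);
    const members = groups[key] ?? [];
    members.push(projectDir);
    groups[key] = members;
  }
  return groups;
}

// Placeholder versions for illustration only.
const exampleGroups = groupProjectsByResolvedDeps([
  { projectDir: "project-a", deps: { cds: "1.2.3", cdsDk: "1.2.0" } },
  { projectDir: "project-b", deps: { cds: "1.2.3", cdsDk: "1.2.0" } },
  { projectDir: "project-c", deps: { cds: "1.0.0", cdsDk: "1.0.0" } },
]);
console.log(Object.keys(exampleGroups).length); // 2 installs for 3 projects
```

Each key would then correspond to a single cache directory that every project in the group compiles against.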

This is much faster than installing all dependencies for every project individually, especially for large projects with many CDS files. However, this approach has some limitations and trade-offs:

- This latest-first approach is more likely to choose the same version for multiple projects, which can reduce analysis time and can improve consistency in analysis between projects.
- This approach does not read (or respect) the `package-lock.json` file, which means that we are more likely to use a `cds` version that is different from the one most recently tested/used by the project developers.
- We are more likely to encounter incompatibility issues where a particular project hasn't been tested with the latest version of `@sap/cds` or `@sap/cds-dk`.

We can mitigate some of these issues through a (to be implemented) compilation retry mechanism for projects where some CDS compilation task(s) fail to produce the expected `.cds.json` output file(s).
The proposed retry mechanism would install the full set of dependencies for the affected project(s) while respecting the `package-lock.json` file, and then re-run the compilation for the affected project(s).

```text
TODO: retry mechanism expected before next release of the CDS extractor
```
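
Since the retry mechanism is not implemented yet, the following is only a sketch of the control flow described above: try the shared cached dependencies first, and fall back to a full, lockfile-respecting install for the affected project if compilation fails. The types `CompileFn` and `InstallFn` and the function `compileWithRetry` are hypothetical, and the compile and install steps are passed in as callbacks rather than implemented.

```typescript
interface CompileOutcome {
  ok: boolean;
  message: string;
}

// Hypothetical callbacks standing in for the real compile and install steps.
type CompileFn = (projectDir: string, depsDir: string) => CompileOutcome;
type InstallFn = (projectDir: string) => string; // returns the directory of the full install

function compileWithRetry(
  projectDir: string,
  cachedDepsDir: string,
  compile: CompileFn,
  installFullDeps: InstallFn,
): CompileOutcome & { retried: boolean } {
  // First attempt: reuse the shared dependency cache.
  const first = compile(projectDir, cachedDepsDir);
  if (first.ok) {
    return { ...first, retried: false };
  }
  // Fallback: install the project's own dependencies, then compile once more.
  const fullDepsDir = installFullDeps(projectDir);
  const second = compile(projectDir, fullDepsDir);
  return { ...second, retried: true };
}

// Stubbed example: compilation succeeds only against the full install.
const exampleOutcome = compileWithRetry(
  "project-a",
  "/cache/shared",
  (_project, depsDir) => ({ ok: depsDir === "/cache/full", message: `compiled with ${depsDir}` }),
  () => "/cache/full",
);
console.log(exampleOutcome.retried, exampleOutcome.ok);
```

Retrying at most once keeps the worst-case cost bounded at one extra install per failing project.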

#### Installation of Additional Project-Specific Dependencies

```text
TODO: implement installation of dependencies required for compilation to succeed for a given project
```

### Integration with `cds compile` command

The CDS extractor uses the `cds compile` command to compile `.cds` files into `.cds.json` files, which are then processed by CodeQL's JavaScript extractor.

Where possible, a single `model.cds.json` file is generated for each project, containing all the compiled definitions from the project's `.cds` files. This results in a faster extraction process overall with minimal duplication of CDS code elements (e.g., annotations, entities, and services) within the CodeQL database created from the extraction process.

Where project-level compilation is not possible (e.g., due to project structure), the extractor generates individual `.cds.json` files for each `.cds` file in the project. The main downside to this approach is that if one `.cds` file imports another `.cds` file, the imported definitions will be duplicated in the CodeQL database, which can lead to false positives in queries that expect unique definitions.
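
The difference between the two modes can be sketched as a function that plans compilation tasks. The `CompileTask` type and `planCompileTasks` function are hypothetical, and placing `model.cds.json` at the project root is an assumption made for the example.

```typescript
// One planned compilation: which sources go in and which JSON file comes out.
interface CompileTask {
  sources: string[];
  output: string;
}

function planCompileTasks(projectDir: string, cdsFiles: string[], projectLevel: boolean): CompileTask[] {
  if (cdsFiles.length === 0) {
    return [];
  }
  if (projectLevel) {
    // Project-level mode: one task, one combined output for the whole project.
    // Assumption: the combined file is written to the project root.
    return [{ sources: cdsFiles.slice(), output: `${projectDir}/model.cds.json` }];
  }
  // File-level mode: each .cds file gets a sibling .cds.json file.
  return cdsFiles.map((file) => ({ sources: [file], output: `${file}.json` }));
}

const exampleFiles = ["/repo/db/schema.cds", "/repo/srv/service.cds"];
console.log(planCompileTasks("/repo", exampleFiles, true).length); // 1
console.log(planCompileTasks("/repo", exampleFiles, false).map((task) => task.output));
```

In file-level mode, a file imported by several others would be compiled into each of their outputs, which is where the duplication described above comes from.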

```text
TODO: use the unique (session) ID of the CDS extractor run as the `<session>` part of `<basename>.<session>.cds.json` and set JS extractor env vars to only extract `.<session>.cds.json` files
```

### Integration with `cds env` command

The current version of the CDS extractor expects CAP projects to follow the [default project structure][CAP-project-structure], particularly regarding the names of the (`app`, `db`, & `srv`) subdirectories in which the extractor will look for `.cds` files to process (in addition to the root directory of the project).

The proposed solution will use the `cds env` command to discover configurations that affect the structure of the project and/or the expected "compilation tasks" for the project, including user customizations of settings such as:

- `cds.folders.app`
- `cds.folders.db`
- `cds.folders.srv`

```text
TODO: add support for integration with `cds env` CLI command as a means of consistently getting configurations for CAP projects
```
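
Until that integration exists, the intended behavior can be sketched as merging any discovered folder overrides onto the default structure. This is an illustration only: `resolveCdsFolders` and `directoriesToScan` are hypothetical names, the overrides are supplied by the caller rather than read from `cds env`, and the default names follow the `app`, `db`, and `srv` subdirectories mentioned above.

```typescript
interface CdsFolders {
  app: string;
  db: string;
  srv: string;
}

// Defaults matching the project structure the extractor currently expects.
const DEFAULT_FOLDERS: CdsFolders = { app: "app", db: "db", srv: "srv" };

// Overrides would come from the project's effective configuration (assumption).
function resolveCdsFolders(overrides?: Partial<CdsFolders>): CdsFolders {
  const folders: CdsFolders = { ...DEFAULT_FOLDERS };
  for (const key of ["app", "db", "srv"] as const) {
    const value = overrides?.[key];
    // Ignore missing or empty overrides so the default is kept.
    if (value !== undefined && value !== "") {
      folders[key] = value;
    }
  }
  return folders;
}

// The project root is always scanned in addition to the three subdirectories.
function directoriesToScan(projectDir: string, overrides?: Partial<CdsFolders>): string[] {
  const { app, db, srv } = resolveCdsFolders(overrides);
  return [projectDir, ...[app, db, srv].map((sub) => `${projectDir}/${sub}`)];
}

console.log(directoriesToScan("/repo"));
console.log(directoriesToScan("/repo", { srv: "services" }));
```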

## Integration with `codeql` CLI

### File Processing

The extractor processes both:

- **Source Files**: Original `.cds` files for source code analysis
- **Compiled Files**: Generated `.cds.json` files for semantic analysis

### Database Population

- Integrates with CodeQL's JavaScript extractor for final database population
- Maintains proper file relationships and source locations
- Supports CodeQL's standard indexing and filtering mechanisms

## Troubleshooting

### Common Issues

1. **Missing Node.js**: Ensure `node` command is available in PATH
2. **CDS Dependencies**: Verify projects have valid `@sap/cds` dependencies
3. **Compilation Failures**: Check CDS syntax and cross-file references
4. **Memory Issues**: Monitor memory usage for very large projects

### Debugging

The extractor provides comprehensive logging:

- **Performance Tracking**: Times for each extraction phase
- **Memory Usage**: Memory consumption at key milestones
- **Error Reporting**: Detailed error messages with context
- **Project Discovery**: Information about detected CDS projects

### Log Levels

- `info`: General progress and milestone information
- `warn`: Non-critical issues that don't prevent extraction
- `error`: Critical failures that may affect extraction quality

## References

- [SAP Cloud Application Programming Model][CAP]
- [Default Structure of a CAP Project][CAP-project-structure]
- [Core Data Services (CDS)][CDS]
- [Project-Specific Configurations][CDS-ENV-project-configs]
- [Conceptual Definition Language (CDL)][CDL]
- [CodeQL Documentation](https://codeql.github.com/docs/)

[CAP]: https://cap.cloud.sap/docs/about/
[CAP-project-structure]: https://cap.cloud.sap/docs/get-started/#project-structure
[CDS]: https://cap.cloud.sap/docs/cds/
[CDS-ENV-project-configs]: https://cap.cloud.sap/docs/node.js/cds-env#project-specific-configurations
[CDL]: https://cap.cloud.sap/docs/cds/cdl