feat: complete c parser #13

Open, wants to merge 8 commits into base: main

Hoblovski
Collaborator

What type of PR is this?

Implement a C parser that produces the AST.
Since C has no notion of modules, there is only a single module.

Check the PR title.

  • This PR title matches the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) Translate the PR title into Chinese.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:
zh(optional):

(Optional) Which issue(s) this PR fixes:

(optional) The PR that updates user documentation:

@github-actions github-actions bot left a comment

license-eye has totally checked 53 files.

Valid: 51, Invalid: 2, Ignored: 0, Fixed: 0
Invalid files:
  • src/lang/cxx/lib.go
  • src/lang/cxx/spec.go

@@ -0,0 +1,27 @@
package cxx

Suggested change (replace the bare `package cxx` line with):
// Copyright 2025 CloudWeGo Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cxx

@@ -0,0 +1,190 @@
package cxx

Suggested change (replace the bare `package cxx` line with):
// Copyright 2025 CloudWeGo Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cxx

@AsterDY
Collaborator

AsterDY commented Apr 22, 2025

resolve the conflicts first @Hoblovski

@CLAassistant

CLAassistant commented Apr 26, 2025

CLA assistant check
All committers have signed the CLA.

@HeyJavaBean
Member

@AsterDY

1. Since clangd does not support the semanticTokens/range method, emulate it
  with semanticTokens/full plus filtering.
2. Since the concepts of package and module do not apply to C/C++,
  treat the whole repo as a single package/module.
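
The emulation in item 1 could look roughly like the sketch below. It is not the PR's `filterSemanticTokensInRange`; the `Position` and `Range` types and the function name `filterTokensInRange` are assumptions made for illustration. It relies only on the LSP encoding of semantic tokens: five integers per token (deltaLine, deltaStart, length, tokenType, tokenModifiers), with positions relative to the previous token. Because the encoding is relative, the kept tokens have to be re-encoded against each other.

```go
package main

// Position and Range are hypothetical stand-ins for the project's LSP types.
// Lines and characters are zero-based, as in LSP.
type Position struct{ Line, Character uint32 }
type Range struct{ Start, End Position }

// before reports whether a comes strictly before b.
func before(a, b Position) bool {
	return a.Line < b.Line || (a.Line == b.Line && a.Character < b.Character)
}

// filterTokensInRange takes the flat data array of a semanticTokens/full
// response and returns a re-encoded array that contains only the tokens
// starting inside the half-open range [r.Start, r.End).
func filterTokensInRange(data []uint32, r Range) []uint32 {
	out := make([]uint32, 0, len(data))
	var line, char uint32         // absolute start of the current token
	var prevLine, prevChar uint32 // absolute start of the last kept token
	for i := 0; i+4 < len(data); i += 5 {
		// Decode the relative position into an absolute one.
		if data[i] != 0 {
			line += data[i]
			char = data[i+1]
		} else {
			char += data[i+1]
		}
		pos := Position{line, char}
		if before(pos, r.Start) || !before(pos, r.End) {
			continue
		}
		// Re-encode relative to the previously kept token.
		dl := line - prevLine
		dc := char
		if dl == 0 {
			dc = char - prevChar
		}
		out = append(out, dl, dc, data[i+2], data[i+3], data[i+4])
		prevLine, prevChar = line, char
	}
	return out
}
```

A caller would pass the `data` field of the full-document response together with the requested range; whether a token that merely overlaps the range boundary should also be kept is a design choice this sketch does not address.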
lang/parse.go Outdated
@@ -94,6 +95,9 @@ func checkRepoPath(repoPath string, language uniast.Language) (openfile string,
case uniast.Rust:
// NOTICE: open the Cargo.toml file is required for Rust projects
openfile, wait = rust.CheckRepo(repoPath)
case uniast.Cxx:
// NOTICE: open the Cargo.toml file is required for Rust projects
Collaborator

Please fix this comment.

Collaborator Author

ok

lang/lsp/lsp.go Outdated
}

func filterSemanticTokensInRange(resp *SemanticTokens, r Range) {
// LSP starts from 0:0 but the project seems to use 1:1 (see collect PositionOffset)
Collaborator

Shouldn't this have been unified to 0-based by now?

Collaborator Author

Sorry... I have removed the comment.

}

// returns: mod, path, error
func (c *CxxSpec) NameSpace(path string) (string, string, error) {
Collaborator

C also supports using multiple directories as different namespaces. This implementation doesn't seem able to support that?

Collaborator

Specifically, symbols with the same name may live in different directories (with no includes between them). How can the AST tell them apart?

Collaborator Author

In C, directories do not affect compilation; they are only storage locations. For example, main.c:foobar and driver/name/lib.c:foobar are both just foobar to C; there is no main.foobar or driver.name.lib.foobar. Directories could still serve as heuristic information for the LLM, though. Mark it as a TODO for now?

Collaborator Author

> Specifically, symbols with the same name may live in different directories (with no includes between them). How can the AST tell them apart?

C reports a build error in that case.

$ for i in 1 2; do    mkdir d$i && echo "int add(int a){return a+1;}" > d$i/add.c    ; done

$ echo "extern int add(int); int main(int argc,char**argv){return add(argc);}" > main.c

$ gcc **/*.c
/usr/bin/ld: /tmp/ccilArbh.o: in function `add':
add.c:(.text+0x0): multiple definition of `add'; /tmp/ccorwzKC.o:add.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
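
The clash reported above could be anticipated before linking by grouping definitions by name. The following is only a sketch: the `Def` record and `duplicateDefs` are hypothetical names, not part of this PR, and the input is assumed to contain only definitions with external linkage (so `static` functions are excluded).

```go
package main

import "sort"

// Def is a hypothetical record of one external-linkage definition.
type Def struct {
	Name string // symbol name, e.g. "add"
	File string // defining file relative to the repo root, e.g. "d1/add.c"
}

// duplicateDefs returns, for every name that has more than one definition,
// the sorted list of files that define it. Names defined once are omitted.
func duplicateDefs(defs []Def) map[string][]string {
	byName := make(map[string][]string)
	for _, d := range defs {
		byName[d.Name] = append(byName[d.Name], d.File)
	}
	for name, files := range byName {
		if len(files) < 2 {
			delete(byName, name) // deleting while ranging over a map is safe in Go
			continue
		}
		sort.Strings(files)
	}
	return byName
}
```

For the session above, feeding in `add` from `d1/add.c` and `d2/add.c` would report `add` as defined twice, which is exactly the pair the linker rejects.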

Collaborator Author

In practice this can still occur, though, for example when different platforms need to be supported... For now it can be worked around via build_commands.json (that is, by ignoring some .c files?)

Collaborator

Different modules may live in the same repository. We need to think through how to handle this case: either the actual build module must be specified at parse time, or all build modules must be enumerated.

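Collaborator Author

OK. I think specifying the actual build module at parse time is the better option. The Linux kernel has many such cases (for example, the x86 and arm directories contain files with the same name, but they are never compiled into the same kernel image). Working on the Linux kernel in VS Code also relies on compile_commands.json to specify which modules are actually included [1] [2] [3]. I can give a more complete description once I finish running the tests for the paper.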

[1] https://zhuanlan.zhihu.com/p/558286384
[2] https://gist.github.com/itewqq/4b4ee89ba420d585efb472116879b1ee
[3] https://github.com/amezin/vscode-linux-kernel

Hoblovski and others added 4 commits May 9, 2025 18:14
C allows symbols with the same name in a single module, provided either:
* One is a weak symbol (decl) and one is a strong symbol (def)
* They are both strong symbols, but never linked together.

The first one works fine, but more changes are needed for the second
one.

testdata/cxxsimple illustrates the first scenario. Two instances of
`myself` are present, one (weak) in `pair.h` and one (strong) in `pair.c`.
The dependency is well defined in this scenario:
1. `pair.c:myself` depends on `pair.h:myself`
2. any other function using `myself` depends on both.
To verify, run `./abcoder parse cxx testdata/cxxsimple > cxxsimple.json`.

testdata/cxxduplicate is the second scenario. Two strong instances of
`add` are present, each used in a different executable. clangd handles
this with compile_commands.json. If clangd is invoked as below, the
`main->add` dependency shall point to the `add` in `d1/add.c`.

	mkdir build && cd build && cmake ..
	bear -- make prog1 # generate compile_commands.json
	cd testdata/cxxduplicate && clangd-18

While clangd does the right thing, the current scanning step during
collection does not take into account which files belong to a
compilation (as specified in compile_commands.json). So
`Collector.Collect` will incorrectly include `d2/add.c` even if it is
not used, and mess up the dependencies. That is to say, even for the
compilation `prog1 <- main.c, d1/add.c`, a dependency
`main->d2/add.c:add` will be present.
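
One possible direction, sketched here under assumptions rather than taken from this PR, is to restrict the scanned files to those listed in compile_commands.json. The helper names `compiledFiles` and `keepCompiled` are hypothetical; the only facts relied on are that compile_commands.json is a JSON array whose entries carry `directory` and `file` fields, and that a relative `file` is resolved against `directory`. Paths are assumed to be Unix-style.

```go
package main

import (
	"encoding/json"
	"path/filepath"
)

// compileCommand mirrors the two compile_commands.json fields used here.
type compileCommand struct {
	Directory string `json:"directory"`
	File      string `json:"file"`
}

// compiledFiles parses the content of compile_commands.json and returns the
// set of cleaned source paths that take part in the compilation.
func compiledFiles(raw []byte) (map[string]bool, error) {
	var cmds []compileCommand
	if err := json.Unmarshal(raw, &cmds); err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(cmds))
	for _, c := range cmds {
		p := c.File
		if !filepath.IsAbs(p) {
			// Relative files are resolved against the entry's directory.
			p = filepath.Join(c.Directory, p)
		}
		set[filepath.Clean(p)] = true
	}
	return set, nil
}

// keepCompiled returns the candidate paths that appear in the compiled set,
// preserving their original order.
func keepCompiled(candidates []string, compiled map[string]bool) []string {
	var kept []string
	for _, f := range candidates {
		if compiled[filepath.Clean(f)] {
			kept = append(kept, f)
		}
	}
	return kept
}
```

With the `prog1` compilation described above, a scan that finds `main.c`, `d1/add.c` and `d2/add.c` would keep only the first two, so no `main->d2/add.c:add` edge could be created. How such a filter would plug into `Collector.Collect` is left open.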

4 participants