
His dataset has a big problem #22

Open
YouNotWalkAlone opened this issue Apr 22, 2024 · 10 comments
Comments

@YouNotWalkAlone

If the dataset is replaced, his model can reach normal binary classification performance. Firstly, on his dataset, the loss gets stuck at 0.69 because of the sigmoid function; adding a BN layer solves that problem. Secondly, once you fix the loss-stuck-at-0.69 problem, you will find that the model performs well on the training set but very poorly on the test and validation sets. I solved that by replacing the dataset.
So I think the authors may have deliberately provided an erroneous dataset that prevents us from reproducing their results.
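
For context, 0.69 is essentially ln(2) ≈ 0.693, the binary cross-entropy you get when the model always outputs a probability of 0.5, i.e. it has collapsed to guessing. A minimal sketch (plain Python, not the repository's code) showing where that number comes from:

```python
import math

# Binary cross-entropy for one example with label y and predicted probability p:
#   BCE = -(y * ln(p) + (1 - y) * ln(1 - p))
# If the model always outputs p = 0.5, the loss is ln(2) no matter what the label is.
p = 0.5
for y in (0.0, 1.0):
    bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    print(f"label={y}: BCE={bce:.4f}")  # 0.6931 for both labels
```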

@MilkteaBoy-code

How did you solve this problem #19? I am stuck on it.

@YouNotWalkAlone
Author

I gave up on solving that problem and have been generating in minibatches instead.
The new problem I'm running into now is that when the model is trained on other datasets, the accuracy on its validation and test sets cannot rise to around 0.7.

[screenshots: QQ图片20240426205455, QQ图片20240426205500]

I have no idea how to solve this problem.

@MilkteaBoy-code

Hi, I saw your reply in #19 and guessed you are probably also Chinese, so I'll just write to you in Chinese here.
I see that in his paper the accuracy on the FFmpeg and QEMU datasets is also only around 0.7, so I think you have actually reproduced the results successfully.
My question is: with the datasets he provides (FFmpeg and QEMU), when I lift the restriction in the select function and use the whole dataset for reproduction, the process freezes on the fourth slice. Have you run into this problem, and did you solve it?

@MilkteaBoy-code

Hello? Did you solve this problem?

@YouNotWalkAlone
Author

Sorry, I didn't see this reply earlier. I can't see your images, but my approach to the Linux freeze problem is to generate the pkl files in small batches, because I also couldn't solve the freezes caused by batches that are too large.

@YouNotWalkAlone
Author

My overall approach is about the same as the other person's: limit each batch to roughly 200-400 to avoid the freeze. That is what the Linux VM can handle; anything beyond that freezes. I hope this helps.
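
For anyone else hitting the freeze, here is a minimal sketch of what generating in minibatches can look like: instead of building one huge .pkl file, write chunks of roughly 200-400 samples at a time so memory stays bounded. The function and file names here are made up for illustration; this is not the repository's actual pipeline.

```python
import pickle
from pathlib import Path

CHUNK_SIZE = 300  # roughly the 200-400 range the VM can handle

def process_sample(raw):
    # placeholder for the real per-sample work (e.g. graph extraction / embedding)
    return raw

def generate_in_chunks(samples, out_dir="input_chunks"):
    """Process `samples` and dump them as many small .pkl files instead of one big one."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    chunk, idx = [], 0
    for raw in samples:
        chunk.append(process_sample(raw))
        if len(chunk) >= CHUNK_SIZE:
            with open(out / f"chunk_{idx:04d}.pkl", "wb") as f:
                pickle.dump(chunk, f)
            chunk, idx = [], idx + 1  # free the batch before starting the next slice
    if chunk:  # flush the remainder
        with open(out / f"chunk_{idx:04d}.pkl", "wb") as f:
            pickle.dump(chunk, f)
```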

@MilkteaBoy-code

I see that the other person solved it by modifying the select function. This is my reply: https://github.com/epicosy/devign/issues/19#issuecomment-2106168341. Could you take a look and check whether my understanding is correct?

@epbugaev

Hello! Am I correct in assuming that by a BN layer you mean applying 1-d BatchNorm before the final linear head? I ran into the same problem (loss stuck at 0.68-0.69), and applying 1-d BatchNorm seems to solve it.

Could you please explain why the sigmoid function causes this? Is it because it disturbs the gradient flow?

@YouNotWalkAlone
Author

Because the output of the sigmoid function ends up too close to 0.5. The problem can be solved either with a BN layer or by expanding the variance some other way, but the BN layer is simpler.
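
For readers who want to try this, here is a minimal PyTorch sketch of the idea being discussed (a hypothetical classification head, not the model's actual code): apply BatchNorm1d to the pooled features right before the final linear layer, so the pre-sigmoid logits are spread out instead of all landing near 0.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hypothetical readout head: BatchNorm1d before the final linear layer + sigmoid."""
    def __init__(self, in_features: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_features)  # re-centres and re-scales the features
        self.fc = nn.Linear(in_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)  # without this, the sigmoid outputs hover around 0.5 (loss ~0.693)
        return torch.sigmoid(self.fc(x)).squeeze(-1)

# quick smoke test
head = ClassificationHead(in_features=200)
probs = head(torch.randn(8, 200))
print(probs.shape)  # torch.Size([8])
```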

@epbugaev

Thank you for the answer, it helped me a lot. I'm going to create a pull request for this issue, since it seems this BatchNorm layer should be included by default; if you can look at it later I would be grateful.
