His dataset has a big problem #22
Comments
How did you solve this problem #19? I am stuck on it.
I gave up on solving this problem and have been using minibatches for generation.
Hello, I saw your reply in #19. I'm guessing you are also Chinese, so I'll just reply to you directly in Chinese here.
Hello? Did you solve this problem?
Sorry, I didn't see this reply earlier. I can't see your image, but for the Linux hang issue my approach is to generate the pkl files in small batches, because I couldn't solve the hang caused by overly large files either.
My overall approach is roughly the same as the other person's: limit each batch to about 200-400 samples to avoid the hang. That is what the Linux virtual machine can handle; anything more and it freezes. I'm not sure whether this helps you.
I see the other person solved it by modifying the select function. Here is my reply: https://github.com/epicosy/devign/issues/19#issuecomment-2106168341; please take a look and tell me whether my understanding is correct.
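For illustration, here is a minimal sketch of the small-batch pkl generation approach described above. This is an assumption of what the workaround could look like, not the repo's actual code: `CHUNK_SIZE`, `dataset.csv`, the `cpg_pkl` directory, and `process_sample` are all hypothetical placeholders.

```python
import pickle
from pathlib import Path

import pandas as pd

CHUNK_SIZE = 300             # stay inside the 200-400 range the VM can handle
INPUT_CSV = "dataset.csv"    # hypothetical path to the raw dataset
OUTPUT_DIR = Path("cpg_pkl") # hypothetical output directory
OUTPUT_DIR.mkdir(exist_ok=True)

def process_sample(row):
    """Placeholder for the per-function processing step."""
    return {"target": row["target"], "func": row["func"]}

df = pd.read_csv(INPUT_CSV)

# Write one small pkl file per chunk instead of a single huge file,
# so the generation step never holds too many samples in memory at once.
for start in range(0, len(df), CHUNK_SIZE):
    chunk = df.iloc[start:start + CHUNK_SIZE]
    processed = [process_sample(row) for _, row in chunk.iterrows()]
    with open(OUTPUT_DIR / f"chunk_{start // CHUNK_SIZE}.pkl", "wb") as f:
        pickle.dump(processed, f)
```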
Hello! Am I correct in assuming that by a bn layer you mean applying 1-d BatchNorm before the final linear head? I encountered the same problem (loss stuck at 0.68-0.69), and applying 1-d BatchNorm seems to solve it. Could you please explain why the sigmoid function causes this? Is it because it disturbs gradient flow?
Because the output of the sigmoid function stays too close to 0.5. The problem can be solved either by adding a bn layer or by expanding the variance of the inputs, but the bn layer is the simpler fix.
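For reference, a minimal sketch of what placing 1-d BatchNorm before the final linear head might look like in PyTorch. The module name, `hidden_dim`, and the surrounding architecture are assumptions for illustration, not the repo's exact model:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Binary classification head with BatchNorm before the final linear layer."""

    def __init__(self, hidden_dim: int = 200):  # hidden_dim is a hypothetical size
        super().__init__()
        # BatchNorm1d re-centers and re-scales the features, so the sigmoid
        # input is no longer squashed into a tiny range around 0, which is
        # what keeps the BCE loss pinned near 0.69 (= ln 2).
        self.bn = nn.BatchNorm1d(hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(self.bn(x)))

# quick check with random features
head = ClassificationHead()
out = head(torch.randn(8, 200))
print(out.shape)  # torch.Size([8, 1])
```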
Thank you for the answer, it helped me a lot. I'm going to create a pull request for this issue, since it seems like this BatchNorm layer should be included by default. If you can look at it later I would be grateful.
His model, if the dataset is replaced, can achieve normal binary classification performance. First, on his dataset the loss gets stuck at 0.69 because of the sigmoid function; adding a bn layer solves that. Second, once you fix the loss-stuck-at-0.69 problem, you will find that the model performs well on the training set but very poorly on the test and validation sets. I solved that by replacing the dataset.
So I think the authors may have deliberately released an erroneous dataset that prevents us from reproducing the results.