cocoeval GT-DT matching implementation is wrong #564
Comments
@bertsky I do not think this repo is maintained by the authors anymore; anyway, can you give an implementation of your solution to the problem? I think your point is right.
@andreaceruti, no, I did not bother with pycocotools for that purpose (but it would not be difficult to do based on the above analysis, I believe). I went for my own algorithm here ff. to get truly n:m matches (and FP and FN and pixel-wise measures and over-/undersegmentation measures).
@bertsky Wow, really nice work! If I wanted to evaluate my custom dataset on the instance segmentation and object detection tasks using your implementation, could I just use the evaluate_coco function, or do you think I should make other changes? Anyway, I will go deeper into your code in the coming days, since it is documented better than this repo :)
@andreaceruti Yes,
@bertsky I am almost there. Yes, I have 2 JSON files representing coco_gt and coco_dt. I have applied your method and then looked at the dict it constructs: 'by-category': {'grape bunch': {'IoDT': 0.8433501965488758, … (this is for one sample image). If I understand correctly, the idea behind these metrics is taken from the "Rethinking Semantic Segmentation Evaluation" paper, but could you explain to me how I could obtain AP, TPs, FPs and FNs for the instance segmentation task?
@bertsky this is the image I have used as an example. On the left you can see the ground truth, and on the right the detections.
@andreaceruti this is off topic here and too verbose – let's discuss in ocrd_segment |
@bertsky Hi! I don't think it is an arbitrary choice in the cocoeval.evaluate() function (as you say, the arbitrary choice means "of whichever DT came first to 'reserve' the GT"). Before a DT finds its correct GT, the list of DTs has already been sorted by class confidence score, so if a DT can "come first", that means this DT has a higher class confidence score. In conclusion, if a GT meets more than one DT, the DT with the highest class confidence score matches this GT.
@volcanolee4 I did not say the choice was random, though. (In fact, I did allude to the fact that the candidates are sorted by confidence with my formulation "GT is already reserved by some earlier – i.e. higher-scoring – DT".) But given the actual problem to solve here – how to adequately compare a given prediction with the true segmentation – I insist that choice is arbitrary: any DT candidate with just slightly higher confidence but much lower relative or absolute overlap can replace the natural best fit for a GT region. (And that assignment then not only becomes exclusive for the GT, it can also prevent the DT from matching at all.) It is one thing to factor in the confidence of a predictor, but another to properly assess its accuracy. The current implementation not only conflates the two, it makes both impossible.
I don't know how this could go undetected for so long (I cannot find any mention in the issues), but I am certain there is a bug in the algorithm that is used to match DT (detection) and GT (ground-truth) objects. It gets more severe when you use a lower iouThr, but you can always have pathological cases, even at high degrees of overlap. Both the Python and the Matlab implementations are affected.
The basic idea of the evaluateImg(imgId, catId) algorithm is to (for each iouThr value independently):
- iterate over all DT objects in descending order of confidence score (outer loop),
- for each DT, iterate over all GT objects (inner loop) and select the best-overlapping GT above the threshold,
- record that pair in dtMatches and gtMatches.
Now, the outer-inner loop structure already dictates asymmetry: if there are multiple potential matches, then we obviously cannot get the best DT match per GT object – only the best GT match per DT object. That is already a problem in itself (e.g. 2 DTs could "share" 1 GT, but the GT would have to "choose"; ideally the data structure and the algorithm for filling it should be rich enough to map multiple matches completely and symmetrically). But one could live with that asymmetry (and if need be for the other preference, just call COCOeval in the reverse direction).
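To make that loop structure concrete, here is a minimal self-contained sketch of this greedy, DT-first matching. It is not the actual pycocotools code: greedy_match is a hypothetical helper, and the IoU-matrix interface and threshold handling are simplifications of what cocoeval does per image and category.

```python
import numpy as np

def greedy_match(ious, iou_thr):
    """Sketch of cocoeval-style greedy matching (simplified).

    ious[d, g] is the IoU between detection d and ground truth g;
    detections are assumed pre-sorted by descending confidence.
    Returns (dt_match, gt_match): per-DT matched GT index (or -1)
    and per-GT matched DT index (or -1).
    """
    n_dt, n_gt = ious.shape
    gt_match = -np.ones(n_gt, dtype=int)
    dt_match = -np.ones(n_dt, dtype=int)
    for d in range(n_dt):              # outer loop: DTs by confidence
        best_iou, best_g = iou_thr, -1
        for g in range(n_gt):          # inner loop: best *free* GT
            if gt_match[g] >= 0:       # GT already "reserved": skip
                continue
            if ious[d, g] > best_iou:
                best_iou, best_g = ious[d, g], g
        if best_g >= 0:                # exclusive 1:1 assignment
            gt_match[best_g] = d
            dt_match[d] = best_g
    return dt_match, gt_match
```

With an IoU matrix like [[0.6, 0.2], [0.9, 0.1]] (two DTs sorted by score, two GTs) and iou_thr = 0.5, the first DT takes GT 0 and the second DT, despite its 0.9 overlap with GT 0, matches nothing – which is exactly the asymmetry described above.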
However, it gets worse with an additional criterion that seemingly addresses this – but in the wrong way:
cocoapi/PythonAPI/pycocotools/cocoeval.py
Lines 279 to 281 in 8c9bcc3
cocoapi/MatlabAPI/CocoEval.m
Lines 383 to 384 in 8c9bcc3
(These are the lines that skip any GT which has already been matched, unless it is a crowd region.)
That effectively ruins even the other criterion (of selecting the best GT match per DT object), replacing it with an arbitrary choice (of whichever DT came first to "reserve" the GT). We end up with up to 1 GT per DT (the "best free") and up to 1 DT per GT (the "first best"). It is only "up to" 1, because obviously we can now miss the only match for a pair (even if it is actually better IoU-wise) when its GT is already reserved by some earlier (i.e. higher-scoring) DT.
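A tiny worked example of that "first best" reservation (all numbers hypothetical): the higher-scoring DT0 reserves the only GT despite a mediocre fit, and the nearly perfect DT1 ends up unmatched.

```python
# Toy scenario: two detections, one ground-truth object.
ious = {("DT0", "GT0"): 0.55,   # higher confidence, mediocre overlap
        ("DT1", "GT0"): 0.95}   # slightly lower confidence, near-perfect overlap
scores = {"DT0": 0.91, "DT1": 0.90}
iou_thr = 0.5

gt_taken = {}   # GT id -> DT id that "reserved" it
dt_match = {}   # DT id -> matched GT id
for d in sorted(scores, key=scores.get, reverse=True):  # DTs by confidence
    for g in ("GT0",):
        if g in gt_taken:                 # GT already reserved: skip it
            continue
        if ious.get((d, g), 0.0) >= iou_thr:
            gt_taken[g] = d
            dt_match[d] = g

# DT0 reserves GT0 despite the worse fit (IoU 0.55);
# DT1 (IoU 0.95) cannot match at all, so it counts as FP and GT0's
# best match is lost to the evaluation.
```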
I wonder what motivation the authors had for that additional criterion. Could it be speed (i.e. shortcutting the combinatorial explosion of comparisons)? Or was it consistency (i.e. ensuring choices in gtMatches always mirror those in dtMatches)?
IMO one should modify that criterion to at least compare the IoU of the GT object's previous DT match with the current pair's IoU: iff the former is actually larger than the latter, then keep it; otherwise drop it (i.e. reassign the GT) or add it (i.e. allow dtMatches to contain the same GT id multiple times, while gtMatches contains only the best-matching DT id).
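A sketch of how such a reassignment rule could look – my own hypothetical helper, not a drop-in patch for cocoeval. It implements the "drop it / reassign the GT" variant; note that it does not attempt to re-match the displaced DT, which a full fix would also want to handle.

```python
import numpy as np

def match_with_reassignment(ious, iou_thr):
    """Greedy matching where a later DT may steal a GT from an earlier
    DT if its IoU with that GT is strictly larger (ties go to the
    earlier, i.e. higher-scoring, DT).

    ious[d, g]: IoU matrix, detections pre-sorted by descending score.
    Returns (dt_match, gt_match) as index arrays with -1 for unmatched.
    """
    n_dt, n_gt = ious.shape
    gt_match = -np.ones(n_gt, dtype=int)   # GT -> DT index (or -1)
    dt_match = -np.ones(n_dt, dtype=int)   # DT -> GT index (or -1)
    for d in range(n_dt):
        best_iou, best_g = iou_thr, -1
        for g in range(n_gt):
            prev = gt_match[g]
            # keep the previous assignment only if it really fits better
            if prev >= 0 and ious[prev, g] >= ious[d, g]:
                continue
            if ious[d, g] > best_iou:
                best_iou, best_g = ious[d, g], g
        if best_g >= 0:
            prev = gt_match[best_g]
            if prev >= 0:
                dt_match[prev] = -1        # earlier DT loses the GT
            gt_match[best_g] = d
            dt_match[d] = best_g
    return dt_match, gt_match
```

On the pathological case above (IoUs [[0.55], [0.95]], threshold 0.5) this assigns the GT to the better-fitting second DT instead of the higher-scoring first one. A globally optimal alternative would be a one-to-one assignment that maximizes total IoU (e.g. via the Hungarian algorithm), at the cost of losing the confidence-ordered greedy semantics.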
Please correct me if I am getting it all wrong! (Also, if you have better ideas, or different perspectives.)