Error threshold is too low? #12

psyhtest · 2018-07-14T14:07:35Z

While running directconv-armcl-opencl and conv-armcl-opencl experiments with ArmCL v18.05, I noticed some validation failures:

anton@diviniti:~/CK_REPOS/local/experiment$ grep -c 4836425781 nntest-conv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/*.0001.json | grep -v :0
nntest-conv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/ckp-a0b5c6f13d81b8db.0001.json:2
anton@diviniti:~/CK_REPOS/local/experiment$ grep -c 4836425781 nntest-directconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/*.0001.json | grep -v :0
nntest-directconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/ckp-6a52a60e334d3b88.0001.json:2

on the same tensor shape:

anton@diviniti:~/CK_REPOS/local/experiment$ grep dataset_file nntest-conv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/ckp-a0b5c6f13d81b8db.0001.json
    "dataset_file": "shape-256-13-13-3-384-1-1", 
anton@diviniti:~/CK_REPOS/local/experiment$ grep dataset_file nntest-directconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-debug-mate10pro-767mhz-debug-conv-0001/ckp-6a52a60e334d3b88.0001.json
    "dataset_file": "shape-256-13-13-3-384-1-1",

psyhtest · 2018-07-14T14:15:35Z

The cause of failure is the same (repeated many times over):

"fail_reason": "Numerical outputs differ:\n46) 24.4836425781 vs 24.4825687408\n215) ...

As 0.0010 < | 24.4836425781 - 24.4825687408 | < 0.0011, this looks like a simple case of the threshold being too low for this shape.
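
For reference, the arithmetic behind that inequality can be checked with a few lines of Python. This is only a sketch: the helper `within_threshold` is a hypothetical name, and the rule "pass when the absolute difference is at most the threshold" is an assumption about how the check compares values, not something confirmed by the CK source.

```python
# Values copied from the "fail_reason" message above.
a = 24.4836425781
b = 24.4825687408
abs_diff = abs(a - b)


def within_threshold(x, y, threshold):
    """Assumed acceptance rule: pass when |x - y| <= threshold."""
    return abs(x - y) <= threshold


print(f"abs diff = {abs_diff:.10f}")
print(within_threshold(a, b, 0.0010))  # False: the default threshold rejects the pair
print(within_threshold(a, b, 0.0011))  # True: a slightly larger threshold accepts it
```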

psyhtest · 2018-07-14T16:41:12Z

Indeed, this fails:

$ ck benchmark program:conv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1 \
--env.CK_ABS_DIFF_THRESHOLD=0.0010

while this doesn't:

$ ck benchmark program:conv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1 \
--env.CK_ABS_DIFF_THRESHOLD=0.0011

psyhtest · 2018-07-14T16:41:54Z

I smell something fishy, however:

       - check failed on "tmp-ck-output.json" (Numerical outputs differ:
46) 24.4836425781 vs 24.4825687408
215) 24.4836425781 vs 24.4825687408
384) 24.4836425781 vs 24.4825687408
553) 24.4836425781 vs 24.4825687408
722) 24.4836425781 vs 24.4825687408
891) 24.4836425781 vs 24.4825687408
1060) 24.4836425781 vs 24.4825687408
1229) 24.4836425781 vs 24.4825687408
1398) 24.4836425781 vs 24.4825687408
1567) 24.4836425781 vs 24.4825687408
1736) 24.4836425781 vs 24.4825687408
1905) 24.4836425781 vs 24.4825687408
2074) 24.4836425781 vs 24.4825687408
2243) 24.4836425781 vs 24.4825687408
2412) 24.4836425781 vs 24.4825687408
2581) 24.4836425781 vs 24.4825687408
2750) 24.4836425781 vs 24.4825687408
2919) 24.4836425781 vs 24.4825687408
3088) 24.4836425781 vs 24.4825687408
3257) 24.4836425781 vs 24.4825687408
3426) 24.4836425781 vs 24.4825687408
3595) 24.4836425781 vs 24.4825687408
3764) 24.4836425781 vs 24.4825687408
3933) 24.4836425781 vs 24.4825687408
4102) 24.4836425781 vs 24.4825687408
4271) 24.4836425781 vs 24.4825687408
4440) 24.4836425781 vs 24.4825687408
4609) 24.4836425781 vs 24.4825687408
4778) 24.4836425781 vs 24.4825687408
4947) 24.4836425781 vs 24.4825687408
5116) 24.4836425781 vs 24.4825687408
5285) 24.4836425781 vs 24.4825687408
5454) 24.4836425781 vs 24.4825687408

Do you see any pattern?

psyhtest · 2018-07-14T16:44:28Z

The values mismatch at indices 46 + 169n. Suspiciously, the tensor is shape-256-13-13-3-384-1-1, and 169 = 13*13 = H*W.
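
A quick way to test the pattern is to decode the flat indices into coordinates. The sketch below assumes the output is dumped channel by channel with each 13x13 plane stored row by row; that layout is a guess, and `decode` is a hypothetical helper, not part of the test harness.

```python
# First few mismatching indices from the log above.
reported = [46, 215, 384, 553, 722, 891, 1060, 1229]
H, W = 13, 13  # spatial size taken from the shape name

# Distance between consecutive mismatches.
strides = {b - a for a, b in zip(reported, reported[1:])}


def decode(flat_index, height, width):
    """Split a flat index into (channel, row, col), assuming channel-major layout."""
    channel, offset = divmod(flat_index, height * width)
    row, col = divmod(offset, width)
    return channel, row, col


# Spatial position of every mismatch, ignoring the channel.
positions = {decode(i, H, W)[1:] for i in reported}

print(strides)    # {169}
print(positions)  # {(3, 7)}
```

Under that assumed layout, every mismatch lands on the same spatial position (row 3, column 7) in consecutive output channels.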

psyhtest · 2018-07-15T17:58:03Z

On HiKey960, the default threshold is good enough with ArmCL v18.05:

$ ck benchmark program:conv-armcl-opencl --cmd_key=default \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1
...
Some statistics:

* Failed: no
...

psyhtest · 2018-07-15T22:50:33Z

On Mediatek X20, I could not get any results until I rebuilt the library as follows:

$ ck install package:lib-armcl-opencl-18.05 --target_os=android24-arm64 \
--env.USE_GRAPH=ON --env.USE_NEON=ON --env.USE_EMBEDDED_KERNELS=ON \
--env.DEBUG=ON --extra_version=-debug

Here, the default threshold also didn't cause any problems:

$ ck benchmark program:conv-armcl-opencl --target_os=android24-arm64  --cmd_key=default \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1
...
Some statistics:

* Failed: no
...

I'm beginning to think that we should not change the threshold....

psyhtest · 2018-07-17T22:49:51Z

A similar problem on another dataset file?

anton@diviniti:~/CK_REPOS/local/experiment$ grep -c 740882873 nntest-winogradconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-mate10pro-767mhz-conv3x3-inception-v3/*.0001.json | grep -v :0
nntest-winogradconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-mate10pro-767mhz-conv3x3-inception-v3/ckp-f0d94e1992c17fc9.0001.json:2
anton@diviniti:~/CK_REPOS/local/experiment$ grep dataset_file nntest-winogradconv-armcl-opencl-arm-compute-library-opencl-18.05-b3a371b-mate10pro-767mhz-conv3x3-inception-v3/ckp-f0d94e1992c17fc9.0001.json
    "dataset_file": "shape-448-8-8-3-384-1-1",

psyhtest · 2018-07-17T22:52:33Z

The values mismatch at indices 59 + 64n. Again, 64=8*8=H*W.

       - check failed on "tmp-ck-output.json" (Numerical outputs differ:
59) -65.7422866821 vs -65.7408828735
123) -65.7422866821 vs -65.7408828735
187) -65.7422866821 vs -65.7408828735
251) -65.7422866821 vs -65.7408828735
315) -65.7422866821 vs -65.7408828735
379) -65.7422866821 vs -65.7408828735
443) -65.7422866821 vs -65.7408828735
507) -65.7422866821 vs -65.7408828735
571) -65.7422866821 vs -65.7408828735
635) -65.7422866821 vs -65.7408828735
699) -65.7422866821 vs -65.7408828735
763) -65.7422866821 vs -65.7408828735
827) -65.7422866821 vs -65.7408828735
891) -65.7422866821 vs -65.7408828735
955) -65.7422866821 vs -65.7408828735
1019) -65.7422866821 vs -65.7408828735
1083) -65.7422866821 vs -65.7408828735
1147) -65.7422866821 vs -65.7408828735
1211) -65.7422866821 vs -65.7408828735
...

psyhtest · 2018-07-17T23:05:45Z

As 0.0010 < | -65.7422866821 + 65.7408828735 | < 0.0015, updating the threshold to 0.0015 does stop the failure:

$ ck benchmark program:winogradconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv3x3-inception-v3 --dataset_file=shape-448-8-8-3-384-1-1 \
--env.CK_ABS_DIFF_THRESHOLD=0.0015 --repetitions=1
...
Some statistics:

* Failed: no
...

However, conv and directconv do not fail even with the default threshold:

$ ck benchmark program:conv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv3x3-inception-v3 --dataset_file=shape-448-8-8-3-384-1-1 \
--repetitions=1
...
Some statistics:

* Failed: no
...
$ ck benchmark program:directconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv3x3-inception-v3 --dataset_file=shape-448-8-8-3-384-1-1 \
--repetitions=1
...
Some statistics:

* Failed: no
...

psyhtest · 2018-07-17T23:16:32Z

Changing the seed stops all the above failures:

$ ck benchmark program:winogradconv-armcl-opencl \
--cmd_key=default --target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv3x3-inception-v3 --dataset_file=shape-448-8-8-3-384-1-1 \
--repetitions=1 --env.CK_SEED=1
...
Some statistics:

* Failed: no
...
$ ck benchmark program:conv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1 \
--repetitions=1 --env.CK_SEED=1
...
Some statistics:

* Failed: no
...
$ ck benchmark program:directconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-13-13-3-384-1-1 \
--repetitions=1 --env.CK_SEED=1
...
Some statistics:

* Failed: no
...

So perhaps the numerical error accumulates for certain tensor coordinates depending on the seed.

Maybe we should just raise the threshold to the smallest value for which no failures happen with the default seed (42). Based on the failures observed so far, that is 0.0015.

psyhtest · 2018-07-20T01:35:03Z

Mismatches ("Numerical outputs differ") on tensors in tensor-conv-0001:

conv:
- shape-256-63-63-3-512-1-1: -34.0275344849 vs -34.026222229 (abs diff 0.0013122559000038336 < 0.0014)
- shape-256-13-13-3-384-1-1: 24.4836425781 vs 24.4825687408 (abs diff 0.0010738372999981038 < 0.0011)
- shape-384-13-13-3-256-1-1: 32.0275421143 vs 32.0293502808 (abs diff 0.0018081665000053704 < 0.0019)
- shape-384-13-13-3-384-1-1: 32.0275421143 vs 32.0293502808 (abs diff 0.0018081665000053704 < 0.0019)
directconv:
- shape-256-13-13-3-384-1-1: 24.4836425781 vs 24.4825687408 (abs diff 0.0010738372999981038 < 0.0011)
- shape-384-13-13-3-256-1-1: 32.0275421143 vs 32.0293502808 (abs diff 0.0018081665000053704 < 0.0019)
- shape-384-13-13-3-384-1-1: 32.0275421143 vs 32.0293502808 (abs diff 0.0018081665000053704 < 0.0019)

Based on these failures, the threshold should be raised to 0.0019. But as it depends on the tensor, why don't we include it with the tensor metadata?
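
As a rough illustration of what per-tensor thresholds would look like, the snippet below derives the smallest four-decimal threshold for each failing shape listed above and the single global value that would cover all of them. The helper `min_threshold` and the choice of rounding up to four decimals are assumptions made for this sketch.

```python
import math

# Mismatching value pairs per shape, copied from the list above.
mismatches = {
    "shape-256-63-63-3-512-1-1": (-34.0275344849, -34.026222229),
    "shape-256-13-13-3-384-1-1": (24.4836425781, 24.4825687408),
    "shape-384-13-13-3-256-1-1": (32.0275421143, 32.0293502808),
    "shape-384-13-13-3-384-1-1": (32.0275421143, 32.0293502808),
}


def min_threshold(a, b, digits=4):
    """Round |a - b| up to the given number of decimal places."""
    scale = 10 ** digits
    return math.ceil(abs(a - b) * scale) / scale


per_shape = {shape: min_threshold(a, b) for shape, (a, b) in mismatches.items()}
global_threshold = max(per_shape.values())

for shape, threshold in per_shape.items():
    print(shape, threshold)
print("global:", global_threshold)  # global: 0.0019
```

Only the two 384-input-channel shapes need 0.0019; the others would pass with 0.0011 or 0.0014, which is the argument for storing the value next to the tensor.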

Note that while it also seems to depend on the operator implementation (conv fails on shape-256-63-63-3-512-1-1, while directconv doesn't), the directconv data is currently incomplete (with 17 tensors out of 24).

psyhtest · 2018-07-20T08:33:39Z

Yes, directconv also fails on shape-256-63-63-3-512-1-1:

$ ck benchmark program:directconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-63-63-3-512-1-1 \
--repetitions=1 --deps.compiler=f4947b23287580ee
...
       - check failed on "tmp-ck-output.json" (Numerical outputs differ:
720) -34.0275344849 vs -34.026222229
...

and passes with the same acceptance threshold of 0.0014 as conv:

$ ck benchmark program:directconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-63-63-3-512-1-1 \
--repetitions=1 --deps.compiler=f4947b23287580ee \
--env.CK_ABS_DIFF_THRESHOLD=0.0014
...
Some statistics:

* Failed: no
...

psyhtest · 2018-07-20T11:23:12Z

Raising the threshold for all tensor shapes in the same way (e.g. from 0.001 to 0.002) does not sound like a good idea, if only a handful of shapes require this and only for certain device / driver / library combinations.

I tried to make the following change locally for shape-256-63-63-3-512-1-1:

diff --git a/dataset/tensor-conv-0001/shape-256-63-63-3-512-1-1.json b/dataset/tensor-conv-0001/shape-256-63-63-3-512-1-1.json
index 38fb5a6..b5e888a 100644
--- a/dataset/tensor-conv-0001/shape-256-63-63-3-512-1-1.json
+++ b/dataset/tensor-conv-0001/shape-256-63-63-3-512-1-1.json
@@ -5,5 +5,6 @@
   "CK_OUT_SHAPE_C": 512, 
   "CK_CONV_KERNEL": 3, 
   "CK_CONV_STRIDE": 1, 
-  "CK_CONV_PAD": 1
-}
\ No newline at end of file
+  "CK_CONV_PAD": 1, 
+  "CK_ABS_DIFF_THRESHOLD": 0.0015 
+}

This actually solved the issue so that:

$ ck benchmark program:directconv-armcl-opencl --cmd_key=default \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-256-63-63-3-512-1-1 \
--repetitions=1 --deps.compiler=f4947b23287580ee

would not fail!

However, I noticed that CK still showed the CK_ABS_DIFF_THRESHOLD variable set to 0.001, as per the metadata of e.g. program:conv-armcl-opencl:

  "run_vars": {
    "CK_ABS_DIFF_THRESHOLD": 0.001,
    "CK_IN_SHAPE_N": 1,
    "CK_OUT_RAW_DATA": "tmp-ck-output.bin",
    "CK_SEED": 42
  },

Moreover, when I tried to set --env.CK_ABS_DIFF_THRESHOLD=0.0001 (i.e. a value even smaller than the default, which would be likely to cause an error), the test still passed.

I will try to reproduce this on another shape shortly, but for now I think the behaviour is:

- CK_ABS_DIFF_THRESHOLD set in a shape file overrides everything (the default environment of the operator and the environment set via the command line).
- CK misleadingly still prints the default value of CK_ABS_DIFF_THRESHOLD.

I would prefer the following behavior:

- CK_ABS_DIFF_THRESHOLD set in a shape file overrides the default environment of the operator.
- CK_ABS_DIFF_THRESHOLD set via the command line overrides everything (the default environment of the operator and the default environment of the shape).
- Print the actual value of CK_ABS_DIFF_THRESHOLD.
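
To make the preferred precedence concrete, here is a minimal sketch of the merge order. `resolve_env` and its argument names are hypothetical and do not correspond to any CK function; the dictionaries simply mirror the three sources discussed above.

```python
KEY = "CK_ABS_DIFF_THRESHOLD"


def resolve_env(program_defaults, shape_meta, cli_env):
    """Merge the three sources; later updates win: defaults < shape file < command line."""
    env = dict(program_defaults)
    env.update(shape_meta)
    env.update(cli_env)
    return env


program_defaults = {KEY: 0.001, "CK_SEED": 42}
shape_meta = {"CK_CONV_PAD": 1, KEY: 0.0015}

# No command-line override: the shape file value applies.
print(resolve_env(program_defaults, shape_meta, {})[KEY])             # 0.0015
# Command-line override: it beats both the shape file and the default.
print(resolve_env(program_defaults, shape_meta, {KEY: 0.0001})[KEY])  # 0.0001
```

Printing the value taken from the merged dictionary, rather than from the program defaults, would also address the third point.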

What do you think?

Chunosov · 2018-07-20T12:11:34Z

A program only reads the env var CK_ABS_DIFF_THRESHOLD once, in postprocessing. We should ask @gfursin how ck initializes it. But it seems that it reads the program meta, then overrides the values with those passed via the command line and prints them, and then overrides them with the values from the dataset.
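
If that guess about the order is right, the value that is printed and the value that is actually used can diverge. The toy model below only simulates the sequence described in this comment; `run_as_described` is a hypothetical function and has not been checked against the CK code.

```python
KEY = "CK_ABS_DIFF_THRESHOLD"


def run_as_described(program_meta, cli_env, dataset_meta):
    """Simulate the suspected order: meta, then command line, print, then dataset."""
    env = dict(program_meta)
    env.update(cli_env)
    printed = env[KEY]        # value shown to the user at this point
    env.update(dataset_meta)  # dataset values are applied after printing
    used = env[KEY]           # value seen later by postprocessing
    return printed, used


printed, used = run_as_described({KEY: 0.001}, {KEY: 0.0001}, {KEY: 0.0015})
print(printed, used)  # 0.0001 0.0015
```

In this model the test passes with the dataset value of 0.0015 even though a stricter value was requested on the command line, which matches the symptom reported above.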

psyhtest · 2018-07-23T11:39:58Z

An interesting case of non-uniform periodic output difference requiring different thresholds:

$ ck benchmark program:winogradconv-armcl-opencl --cmd_key=default --repetitions=1 \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-96-27-27-5-256-1-2 \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO
...
       - check failed on "tmp-ck-output.json" (Numerical outputs differ:
219) 22.9101810455 vs 22.9088726044
354) -27.9759273529 vs -27.9771194458
746) 29.9037017822 vs 29.9026870728
1000) 22.9101810455 vs 22.9088726044
1135) -27.9759273529 vs -27.9771194458
1137) 29.9037017822 vs 29.9026870728

22.9101810455 vs 22.9088726044 (abs diff 0.0013084411000008345 < 0.0014)
-27.9759273529 vs -27.9771194458 (abs diff 0.0011920928999984426 < 0.0012)
29.9037017822 vs 29.9026870728 (abs diff 0.0010147093999997026 < 0.0011)

$ ck benchmark program:winogradconv-armcl-opencl --cmd_key=default --repetitions=1 \
--dataset_uoa=tensor-conv-0001 --dataset_file=shape-96-27-27-5-256-1-2 \
--target_os=android24-arm64 --env.CK_PUSH_LIBS_TO_REMOTE=NO \
--env.CK_ABS_DIFF_THRESHOLD=0.0014
...
Some statistics:

* Failed: no
...

psyhtest · 2018-07-23T11:42:16Z

It seems that we need a higher threshold for high values of $W$ and $H$ (8, 13, 27, 63).

Chunosov · 2018-07-26T08:59:04Z

higher threshold for high values of $W$ and $H$ (8, 13, 27, 63).

maybe some normalization is needed?
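
One possible reading of "normalization" is to compare the difference relative to the magnitude of the values instead of using a fixed absolute bound. The sketch below applies that idea to the mismatching pairs reported in this thread; the relative tolerance of 1e-4 is an arbitrary illustrative choice, and `rel_diff` is a hypothetical helper.

```python
import math

# Mismatching pairs reported earlier in this thread.
pairs = [
    (24.4836425781, 24.4825687408),
    (32.0275421143, 32.0293502808),
    (-65.7422866821, -65.7408828735),
    (-34.0275344849, -34.026222229),
]


def rel_diff(a, b):
    """Absolute difference divided by the larger magnitude of the two values."""
    scale = max(abs(a), abs(b))
    return abs(a - b) / scale if scale else 0.0


for a, b in pairs:
    print(f"{a} vs {b}: rel diff {rel_diff(a, b):.2e}")

# math.isclose applies the same kind of relative test.
all_close = all(math.isclose(a, b, rel_tol=1e-4) for a, b in pairs)
print(all_close)  # True
```

All four pairs exceed the absolute threshold of 0.001 but differ by less than 1e-4 in relative terms. Whether a relative check is appropriate for outputs close to zero would need separate consideration.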
