Script to check links on .md files #690

Closed
wants to merge 3 commits into from

dashohoxha
Contributor

@dashohoxha dashohoxha commented Oct 10, 2019

Fix #652

@shcheklein shcheklein temporarily deployed to dvc-org-pr-690 October 10, 2019 13:39 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-690 October 10, 2019 14:22 Inactive
Member

@shcheklein shcheklein left a comment

I can accept and merge this, but I feel it won't be used or useful without a proper CI setup. That could look like an extra CircleCI step that runs the Node app and triggers this script on all .md files that have changed compared to master. Also, the docs don't even mention the script.
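
A minimal sketch of the file-selection part of such a CI step, assuming the job has the repository checked out with a `master` ref available. The function name `changed_md_files` is hypothetical and not part of this PR. It only lists the Markdown files that differ from the base branch, and a CI step could then hand that list to the link checker.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): list .md files that were added
# or modified on the current branch relative to a base branch.
changed_md_files() {
  local base="${1:-master}"
  # "base..." diffs against the merge base, so commits that landed on the
  # base branch after the branch point are ignored.
  # --diff-filter=d drops deleted files, which have no links left to check.
  git diff --name-only --diff-filter=d "$base"... -- '*.md'
}
```

Calling `changed_md_files master` prints one path per line, so an empty result means no Markdown file changed and the check could be skipped.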

  • It does not fix #652 (we'll need to reopen the ticket), so please remove the reference from the description and change the title
  • It's a good practice to have set -euxo pipefail
  • It's a good practice to keep it clean from debug prints and other commented code
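
For reference, a small sketch of what the flags in `set -euxo pipefail` change. The `demo_*` function names are made up for illustration. Each one runs a throwaway `bash -c` process so the options do not affect the calling shell. `-x` only traces commands to stderr and is not demonstrated.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names). Every check runs in a fresh
# `bash -c` process, so none of the options leak into the caller.

# -e: abort at the first failing command, so the echo is never reached.
demo_errexit() {
  bash -c 'set -e; false; echo "still running"' ||
    echo "aborted with status $?"
}

# -u: expanding an unset variable is an error instead of an empty string.
demo_nounset() {
  bash -c 'set -u; echo "value: $LINKCHECK_UNSET_DEMO"' 2>/dev/null ||
    echo "unset variable rejected"
}

# pipefail: a pipeline fails if any stage fails, not only the last one.
demo_pipefail() {
  bash -c 'set -o pipefail; false | true' ||
    echo "pipeline failed with status $?"
}
```

Without `pipefail`, a pipeline reports only the exit status of its last command, so a failure in an earlier stage goes unnoticed.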

@dashohoxha dashohoxha changed the title Fix #652: Script to check links on .md files Script to check links on .md files Oct 12, 2019
Member

@shcheklein shcheklein left a comment

How about set -euox pipefail to make it more robust?

@dashohoxha
Contributor Author

How about set -euox pipefail to make it more robust?

I don't think it will make this script more robust.

By the way, maybe we don't have to merge this PR if @algomaster99 is going to integrate it somewhere else.

Contributor

@jorgeorpinel jorgeorpinel left a comment

I like the script 🙂 but I also don't see the point (nor a problem) in merging it until it gets integrated into the CI tests.

UPDATE: Please see my actual #690 (review) below...

@algomaster99
Contributor

@jorgeorpinel @shcheklein @dashohoxha
[Screenshot from 2019-10-15 02-42-14]

I altered one URL in version.md to /doc/use-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache, but it still gives a 200 response. I am currently figuring out how we can differentiate between actual pages and Not Found pages.

Just for testing, I disabled -q.

Contributor

@jorgeorpinel jorgeorpinel left a comment

Actually, I found one thing to review (see below)
And also, this doesn't seem to run on Mac:

$ ./scripts/check-links.sh 
./scripts/check-links.sh: line 29: shopt: globstar: invalid shell option name
grep: static/**/*.md: No such file or directory
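
The error above is consistent with an older bash: `globstar` was added in bash 4.0, while macOS ships bash 3.2 as `/bin/bash`. One possible workaround, sketched here under the assumption that the script only needs the list of `.md` files under `static/`, is to let `find` do the recursion. `list_md_files` is a hypothetical name, not something from the PR.

```shell
#!/bin/bash
# Hypothetical replacement for the static/**/*.md glob: list Markdown files
# recursively without `shopt -s globstar`, so it also runs on bash 3.2.
list_md_files() {
  local root="${1:-static}"
  # LC_ALL=C keeps the ordering byte-wise and identical across platforms.
  find "$root" -type f -name '*.md' | LC_ALL=C sort
}
```

The output is one path per line, so it can be consumed with a `while read -r file` loop.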

@@ -0,0 +1,41 @@
#!/bin/bash
Contributor

Should this be #!/bin/sh?

Contributor

@jorgeorpinel Can you try and let me know?

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 14, 2019

(https://dvc.org/doc/use-guide/large-dataset-optimization) still gives a 200 response...

I think you found an engine bug. It should return a 404 header.

  • Please open an issue for this 🙂

p.s. no need to wait for that bug to be fixed before finalizing this. You can test with some other 404 returning URL like http://connectivitycheck.gstatic.com/

@shcheklein
Member

@jorgeorpinel @algomaster99 it might not be an easy thing to fix though. It's the way the engine is written/works (or at least it might turn out to be easier to fix in the script, e.g. using Selenium or whatnot).

@algomaster99
Contributor

algomaster99 commented Oct 14, 2019

@jorgeorpinel @shcheklein One way is to grep the output and check for "404 Oops! Not Found!". But, I think, it is a very inefficient method. I am looking for a way to modify the engine only.

Also, I think the script is written fine. We might need to grep its response codes and fail the test whenever we get 404. So we can focus on sending 404 response code when such a page is requested.
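
A rough sketch of the two checks mentioned here (status code first, then a search of the body), with several assumptions. `classify_response` and `check_url` are hypothetical names, and `curl` is used instead of the `wget` call in the script only because it can print the status code directly. The body search only sees what the server returns, so it cannot catch a Not Found message that is rendered later by JavaScript.

```shell
#!/bin/bash
# Hypothetical helper: classify a response from its HTTP status code and a
# file holding the response body. "Not Found" is part of the marker text
# quoted in this thread ("404 Oops! Not Found!").
classify_response() {
  local status="$1" body_file="$2"
  case "$status" in
    2??|3??)
      if grep -q "Not Found" "$body_file"; then
        echo "broken (soft 404: HTTP $status)"
      else
        echo "ok"
      fi
      ;;
    *)
      # Includes 404 and curl's "000", which it prints when no response arrived.
      echo "broken (HTTP $status)"
      ;;
  esac
}

# Hypothetical wrapper that needs network access: fetch a URL, following
# redirects, and classify the final response.
check_url() {
  local url="$1" body status
  body="$(mktemp)"
  status="$(curl -s -L -o "$body" -w '%{http_code}' "$url")" || true
  classify_response "$status" "$body"
  rm -f "$body"
}
```

Keeping the classification separate from the download makes it possible to test the logic with saved responses and no network.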

@shcheklein
Member

@algomaster99 there is no easy way to grep the response. You need to execute JS. It's not static content.

@algomaster99
Contributor

@shcheklein My workaround was getting the content using wget and then searching over it.

@shcheklein
Member

@algomaster99 does the wget-generated response contain 404 for those links?

@algomaster99
Contributor

@shcheklein Not for those but it works for URLs like http://dvc.org/oc/.
[Screenshot from 2019-10-15 03-54-45]

@shcheklein
Member

@algomaster99 so, the point is to be able to test any docs-related URL. For the majority of them, the wget method won't work because of the dynamic nature of the engine.

@jorgeorpinel
Contributor

I think we should definitely open a separate issue for the false 404 pages rendered by React, which doesn't really block this PR. If that's not fixable, maybe we can use a different strategy to check these internal links, such as checking that the file exists (since the URL paths are also file paths).
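
A minimal sketch of that file-based strategy. The mapping from `/doc/<path>` to `static/docs/<path>.md` (or `<path>/index.md`) is an assumption inferred from the file paths that appear in this thread, and `internal_link_exists` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an internal docs link resolves to a
# Markdown file on disk. Assumes /doc/<path> maps to <root>/<path>.md or
# <root>/<path>/index.md, which is inferred rather than confirmed.
internal_link_exists() {
  local link="$1" root="${2:-static/docs}"
  local path="${link%%#*}"   # drop the #anchor, if any
  case "$path" in
    /doc/*) path="${path#/doc/}" ;;
    *) return 1 ;;           # not a docs link (for example /docs/... or /chat)
  esac
  path="${path%/}"           # tolerate a trailing slash
  [ -f "$root/$path.md" ] || [ -f "$root/$path/index.md" ]
}
```

This needs no running server and no JavaScript, but it validates only the page and not the `#anchor` part of the link.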

@algomaster99
Contributor

@jorgeorpinel Okay, I will open one :)

@shcheklein
Member

@jorgeorpinel I'm not sure I understand why you consider them "false" 404s, though?

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 14, 2019

Because the engine renders a huge "404" on the page, even though it has already responded with HTTP 200:

[image]

https://dvc.org/doc/command-reference/bad

I guess they're not "false"; They are actual 404s, but with an incorrect response header.

@shcheklein
Member

@jorgeorpinel yep, but this is because it's not even reloading the page, right? no way it can redo a response if you click on a broken link on an already opened page in docs.

@shcheklein
Member

@jorgeorpinel so, this is the way the engine works - it preloads a simple "empty" page with basic elements. This request returns 200. Then it executes some JS to download the md file to render it on the page.

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 14, 2019

Yes, I understand it's dynamic but in theory it could check for the path validity before responding 200 and preloading a simple layout. And I do think it reloads when you click on a link (but it responds with 304 Not Modified in that case, apparently). Let's move this to #697?

@shcheklein
Member

Let's move this to #697?

Sure.

@dashohoxha
Contributor Author

The script currently reports these issues to me:

static/docs/changelog/0.18.md: 'discuss.dvc.org'
static/docs/changelog/0.35.md: 'https://plugins.jetbrains.com/plugin/11368-data-version-control-dvc-support'
static/docs/command-reference/add.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/checkout.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/config.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/config.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/destroy.md: '/doc/user-guide/dvc-files-and-directories'
static/docs/command-reference/get-url.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/add.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/remote/add.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/add.md: 'https://minio.io/'
static/docs/command-reference/remote/add.md: 'https://docs.microsoft.com/en-us/azure/storage/common/storage-create-storage-account'
static/docs/command-reference/remote/index.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/remote/modify.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/modify.md: 'https://minio.io/'
static/docs/command-reference/update.md: 'https://github.com/iterative/example-get-started'
static/docs/get-started/add-files.md: '/docs/user-guide/large-dataset-optimization'
static/docs/get-started/experiments.md: '/docs/user-guide/large-dataset-optimization'
static/docs/get-started/index.md: '/chat'
static/docs/get-started/pipeline.md: '/doc/tutorial'
static/docs/tutorials/deep/define-ml-pipeline.md: 'https://data.dvc.org/tutorial/ver/data.zip'
static/docs/tutorials/deep/preparation.md: 'https://code.dvc.org/tutorial/nlp/code.zip'
static/docs/tutorials/pipelines.md: '/doc/tutorial'
static/docs/tutorials/versioning.md: '/chat'
static/docs/understanding-dvc/collaboration-issues.md: '<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning'
static/docs/understanding-dvc/related-technologies.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/understanding-dvc/resources.md: 'https://www.kaggle.com/rtatman/kerneld4769833fe'
static/docs/use-cases/share-data-and-model-files.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/use-cases/share-data-and-model-files.md: 'https://docs.aws.amazon.com/cli/latest/reference/s3/mb.html'
static/docs/user-guide/contributing-docs.md: 'https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json'
static/docs/user-guide/contributing-docs.md: 'https://github.com/iterative/dvc.org.git'
static/docs/user-guide/contributing-docs.md: 'https://nodejs.org/'
static/docs/user-guide/contributing-docs.md: 'https://marketplace.visualstudio.com/items?itemName=stkb.rewrap'
static/docs/user-guide/contributing-docs.md: 'https://raw.githubusercontent.com/iterative/dvc.org/master/static/docs/user-guide/contributing-doc.md'
static/docs/user-guide/contributing.md: 'https://github.com/iterative/dvc.git'
static/docs/user-guide/contributing.md: '/chat'
static/docs/user-guide/contributing.md: 'https://docs.aws.amazon.com/en_us/cli/latest/userguide/cli-chap-install.html'
static/docs/user-guide/contributing.md: 'https://cloud.google.com/sdk/docs/quickstarts'
static/docs/user-guide/contributing.md: 'https://github.com/ambv/black'
static/docs/user-guide/dvc-files-and-directories.md: '/docs/user-guide/large-dataset-optimization'
static/docs/user-guide/large-dataset-optimization.md: '/docs/user-guide/update-tracked-files'
static/docs/user-guide/plugins.md: 'https://plugins.jetbrains.com/plugin/11368-dvc-support-poc'
static/docs/user-guide/running-dvc-on-windows.md: '<https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2003/cc778996(v=ws.10'
static/docs/user-guide/update-tracked-files.md: '/docs/user-guide/large-dataset-optimization'

Maybe not all of them need to be fixed, but some of them should be fixed.
@algomaster99, @jorgeorpinel: Can you create an issue for checking and fixing them?
And then let's close this PR.
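
Since several targets in the list above are reported from more than one file, a short sketch that groups the report by target may help when triaging. It assumes each report line has the form `path/to/file.md: 'target'`, as shown above, and that targets contain no spaces. `summarize_report` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read report lines of the form
#   path/to/file.md: 'target'
# on stdin and print "<count> <target>" for each distinct target,
# most frequently reported first.
summarize_report() {
  sed -n "s/^[^:]*: '\(.*\)'\$/\1/p" |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $1, $2 }' |
    LC_ALL=C sort -k1,1nr -k2,2
}
```

Lines that do not match the expected form are silently skipped.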

@dashohoxha
Contributor Author

By the way, I see two reasons why this script is not 100% reliable:

  1. Examples (that are surrounded with backticks or blockquotes) should not be checked, but it tries to check them. For example something like this:
    `[example](https://example.url)`
    
  2. It cannot really check the anchors inside a page, something like this:
    [section](/doc/page#anchor)
    
    If #anchor is wrong, it does not detect it.

Maybe there are other problems too.
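
A sketch of how both limitations could be approached, under clearly stated assumptions. `extract_links`, `heading_slugs` and `anchor_exists` are hypothetical names, not functions from the script. The slug rule (lowercase, drop punctuation, spaces to hyphens) is a common convention and may not match what the docs engine actually generates.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from the script in this PR.

# Print link targets from a Markdown file, skipping fenced code blocks and
# inline code spans so that example links are not checked.
extract_links() {
  local fence
  fence="$(printf '\140\140\140')"   # three backticks, the fence marker
  awk -v fence="$fence" '
    { line = $0; sub(/^[ \t]+/, "", line) }
    index(line, fence) == 1 { in_fence = !in_fence; next }
    !in_fence
  ' "$1" |
    sed 's/`[^`]*`//g' |
    grep -o ']([^)[:space:]]*)' |
    sed 's/^](//; s/)$//'
}

# Print one slug per heading. ASSUMPTION: lowercase, punctuation removed,
# spaces replaced by hyphens. Headings inside code fences are not excluded.
heading_slugs() {
  grep -E '^#{1,6} ' "$1" |
    sed -E 's/^#+ +//; s/ +$//' |
    tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9 -]//g; s/ +/-/g'
}

# Succeed if the anchor (without the leading #) matches a heading slug.
anchor_exists() {
  grep -qxF -- "$1" <<< "$(heading_slugs "$2")"
}
```

The link pattern stops at the first `)`, so a URL that itself contains parentheses is cut short. Blockquoted examples are not filtered out either.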

@jorgeorpinel
Contributor

@dashohoxha thanks for the list in #690 (comment). Please open a separate issue so we can address it when possible.

I believe @algomaster99 is taking over this branch and PR so no need to close it. Aman, please note Dashamir's notes in #690 (comment) – things to fix in the script (if not now, feel free to open separate issue to improve it after CI integration).

Thanks

@shcheklein
Member

closing this until we find a way to run it via CI and check the links reliably

@shcheklein shcheklein closed this Oct 25, 2019
@dashohoxha dashohoxha deleted the check-links branch October 29, 2019 20:15
@casperdcl casperdcl restored the check-links branch January 29, 2020 20:39
@casperdcl casperdcl self-assigned this Jan 29, 2020
@casperdcl casperdcl added type: enhancement Something is not clear, small updates, improvement suggestions A: docs Area: user documentation (gatsby-theme-iterative) labels Jan 29, 2020
@casperdcl casperdcl mentioned this pull request Jan 29, 2020
20 tasks
Labels
A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions
Development

Successfully merging this pull request may close these issues.

ci: test to check all links
5 participants