Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

docstring format for api, possibly across code base? #3176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Jan 17, 2020 · 34 comments
Closed

docstring format for api, possibly across code base? #3176

ghost opened this issue Jan 17, 2020 · 34 comments
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint

Comments

@ghost
Copy link

ghost commented Jan 17, 2020

Extracted from #3130 (review)

We currently don't have a desired style to write docstrings.

Should we come up with one? If so, which one?

@ghost ghost added triage Needs to be triaged discussion requires active participation to reach a conclusion enhancement Enhances DVC and removed triage Needs to be triaged labels Jan 17, 2020
@efiop efiop added refactoring Factoring and re-factoring p3-nice-to-have It should be done this or next sprint labels Jan 17, 2020
@bhavaniravi

This comment has been minimized.

@jorgeorpinel

This comment has been minimized.

@Suor
Copy link
Contributor

Suor commented Jan 18, 2020

I don't think we should spend time formalizing this, we have very limited amount of user facing functions really. And we may find better way to spend our time than argue about docstring formatting.

@shcheklein
Copy link
Member

@Suor I think you brought this topic, right? :) Are we good to close it (discussion around formatting) then and let Jorge decide this, for example?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 18, 2020

I agree it's only important at the moment to have good full code doc for the API functions and I'm happy to just pick one format but would prefer to get some feedback from the @iterative/engineering team. In fact I wanted to summarize the options here (extracted from #3130 (comment) +):

is there a standard way of doing markup that is understood by the tools that generate a view for this (like F1 in PyCharm, quick help in VS Code, docs generators, etc) ?

Among doc generators, Sphinx is by far the most popular, with pdoc a remote 2nd place, it seems. Sphinx uses reStructuredText format. pdoc is interesting because it supports Markdown and other nicer-looking formats worth checking out actually (from Numpy and Google).

As for IDEs, VSCode intellisense doesn't interpret any format in the docstrings it seems. It mostly focuses on building and inspecting the Python code by itself. From what I can see here, PyCharm auto completion works similarly.

So easy to read docstrings e.g. numpydoc or Google style may be best if we expect IDEs code completion will be the main way users reach our docstrings. And especially since we're writing the API documentation manually anyway (not generating it).


p.s. An interesting VSCode extension to help write docstrings in any of the above formats is https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring, BTW.

@jorgeorpinel jorgeorpinel removed research refactoring Factoring and re-factoring labels Jan 18, 2020
@jorgeorpinel jorgeorpinel changed the title inconsistent docstring formatting across the code base docstring formatting for api, possibly across code base Jan 18, 2020
@jorgeorpinel jorgeorpinel added the question I have a question? label Jan 18, 2020
@jorgeorpinel jorgeorpinel changed the title docstring formatting for api, possibly across code base docstring formatting for api, possibly across code base? Jan 18, 2020
@jorgeorpinel jorgeorpinel changed the title docstring formatting for api, possibly across code base? docstring format for api, possibly across code base? Jan 18, 2020
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 21, 2020

OK... I'm going to suggest we adopt the Google style for modules (files) and functions. Reasons:

  • We don't need to produce automatic docs (i.e.reST/Sphinx) since we're manually writing them in the dvc.org repo – still, pdoc does support generating docs from Google style docstrings just in case.
  • Seems like the easiest one to read – uses plain language, not many symbols. This is good for IDE completion which I think will be the main way people interact with our docstrings.
    Its reference is the shortest and simplest one, but covers all the basics: module license; fn descriptions, arguments, return/yield, exceptions raised.
  • Doesn't use back quotes `, so @Suor will be happy 🙂

All in favor? ✋

@Suor
Copy link
Contributor

Suor commented Jan 21, 2020

I am in favor of no style, it will only slow down reviews with no real gain.

@casperdcl
Copy link
Contributor

I'm in favour of an easy-to-use style so it doesn't slow down reviews :)

@ghost
Copy link
Author

ghost commented Jan 21, 2020

Also in favor of not slowing down reviews for doc strings.

@shcheklein
Copy link
Member

I'm in favor of some simple consistent style for the external API functions (better to be more or less consistent internally, but would never block any review for this). If we have a ticket like this #2316 and we deal with this, it's better to agree what are some basic expectations and how this should looks like. I'm fine with Google style.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 23, 2020

I am in favor of no style

I think it's less about exact styles and more about completeness of docstings. I.e. Google format specifies which sections to include in the docstring (basically description, args, result/yield, throws I think). Yes, a style is attached to that but it's pretty simple... And consistency is good!

not slowing down reviews for doc strings ... some simple consistent style for the external API functions (better to be more or less consistent internally, but would never block any review for this

Yes guys, this issue is mostly about the API code. I think it's totally up to the core engineering team to decide about docstrings in the rest of the code.


I'm confused about the voting results so far: 1 for no format, 2 for Google, 2 invalid haha – no consensus yet :(

@ghost
Copy link
Author

ghost commented Jan 23, 2020

What I don't like from Google style is that it "annotates" the arguments with types.
I would prefer to have some proper type hints instead.

I'm afraid that if we start documenting everything, we will end up including useless comments just for the sake of, like:

def do_stuff(x, y):
    """A method that do_stuff given x and y
    Args: 
         x (str): This is an X
         y (str): This is a Y.
     """

Ideally, the method name should be clear enough to understand what it does, so no need to put that on the description, same with argument names.

Things that I think are important to document:

Side effects

dvc/dvc/analytics.py

Lines 67 to 81 in ec2c8ad

def send(report):
"""
Side effect: Removes the report after sending it.
The report is generated and stored in a temporary file, see:
`collect_and_send_report`. Sending happens on another process,
thus, the need of removing such file afterwards.
"""
url = "https://analytics.dvc.org"
headers = {"content-type": "application/json"}
with open(report, "rb") as fobj:
requests.post(url, data=fobj, headers=headers, timeout=5)
os.remove(report)

Explain convoluted implementations / What & Why?

dvc/dvc/analytics.py

Lines 26 to 49 in ec2c8ad

def collect_and_send_report(args=None, return_code=None):
"""
Collect information from the runtime/environment and the command
being executed into a report and send it over the network.
To prevent analytics from blocking the execution of the main thread,
sending the report is done in a separate process.
The inter-process communication happens through a file containing the
report as a JSON, where the _collector_ generates it and the _sender_
removes it after sending it.
"""
report = _runtime_info()
# Include command execution information on the report only when available.
if args and hasattr(args, "func"):
report.update({"cmd_class": args.func.__name__})
if return_code is not None:
report.update({"cmd_return_code": return_code})
with tempfile.NamedTemporaryFile(delete=False, mode="w") as fobj:
json.dump(report, fobj)
daemon(["analytics", fobj.name])

Give examples when code is hard to read

dvc/dvc/repo/__init__.py

Lines 271 to 298 in ec2c8ad

def _collect_graph(self, stages=None):
"""Generate a graph by using the given stages on the given directory
The nodes of the graph are the stage's path relative to the root.
Edges are created when the output of one stage is used as a
dependency in other stage.
The direction of the edges goes from the stage to its dependency:
For example, running the following:
$ dvc run -o A "echo A > A"
$ dvc run -d A -o B "echo B > B"
$ dvc run -d B -o C "echo C > C"
Will create the following graph:
ancestors <--
|
C.dvc -> B.dvc -> A.dvc
| |
| --> descendants
|
------- pipeline ------>
|
v
(weakly connected components)

DISCLAIMER: I wrote all of those docstrings. I'm completely bias. 🙈

@casperdcl
Copy link
Contributor

I'm afraid that if we start documenting everything, we will end up including useless comments just for the sake of

I fully agree, except if we start doing intelligent parsing of docstrings in order to auto-gen docs in such a way that requires having to (annoyingly) document everything. I prefer to not document the obvious. And any intelligent parsing should look at the func signature as well as docstrings to pick up args.

@shcheklein
Copy link
Member

@MrOutis @casperdcl 💯 about the "code must be good enough to read without comments in the first place", no doubts about this! But for this discussion, let's focus on external APIs only. (dvc.api.* for now).

My bias and experience here. I really like when I can press a hotkey, get to the source code of the API and see some documentation there (~ short version of what you can get online, but enough in most cases to understand everything, probably no examples, but definitely edge cases, what do we expect from the params, etc). It tremendously simplifies the experience in my case.

So, back to the question. Do we want to have a bit more then just an obvious "This is a get method, it gets something" kind of docstrings at all (again, for external, user-facing API)? If not, should we just delete existing one-sentence comments (that do not provide any value whatsoever)?

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

@casperdcl
Copy link
Contributor

enough in most cases to understand everything, probably no examples, but definitely edge cases, what do we expect from the params, etc)

Yes, definitely

Do we want to have a bit more then just an obvious "This is a get method, it gets something"

Only if needed to describe a complex operation

should we just delete existing one-sentence comments (that do not provide any value whatsoever)?

Yes if no genuine value provided

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

Yes, though I'd recommend adding a travis job which checks for dead links.


Overall in all cases I'd always recommend the original dev does the sensible thing, whether it's comment in prose, in bullet points, in code examples, in links - whatever they think will be most useful for people who may read it. There may not be a one-style-fits-all-functions approach.

@shcheklein
Copy link
Member

Only if needed to describe a complex operation

💯 ! In our case all operations (dvc.api) are not simple, neither self-descriptive imo.

There may not be a one-style-fits-all-functions approach.

If we though have some external APIs we want to describe args, exceptions, some edge cases for - it's better to use the same style, right? (And that's all this discussion is about.) And I believe we have those (I would find docstrings def useful for myself - again, I have a bias - I use IDEs and navigate to libraries all the time and prefer it over online docs).

@casperdcl
Copy link
Contributor

I also prefer inline to online. I grudgingly accept online if well done. I don't like any duplication.

I feel like online should perhaps have prosaic expansion on examples, while inline should be as succinct as possible.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 28, 2020

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

I don't see this being of much value, you can just open he website and find the function easily on your own.

I'd recommend adding a travis job which checks for dead links.

Unrelated @casperdcl but if you have any experience on this, please participate in iterative/dvc.org/issues/652


OK so still no decision yet on which docstring format 😢 I may have to just start using Google style for now to get the inline doc going but we can revisit later...

@casperdcl

This comment has been minimized.

@jorgeorpinel

This comment has been minimized.

@Suor
Copy link
Contributor

Suor commented Jan 29, 2020

I think it's less about exact styles and more about completeness of docstings. I.e. Google format specifies which sections to include in the docstring (basically description, args, result/yield, throws I think). Yes, a style is attached to that but it's pretty simple... And consistency is good!

I am against going for completeness too. Google format is a good example of how bloat is created by doing so. The same reason to avoid consistency if that causes bloat. Python stdlib lives happily without that.

@shcheklein the doc-string should describe what function is doing, so that people wouldn't need to guess by name. Params should only be described if that is non-obvious, formal descriptions is bloat and will only stop people from reading it or at least slow them down. Any edge cases should never make it to doc-strings, unless they represent some common use case at the same time.

@shcheklein
Copy link
Member

Google format is a good example of how bloat is created by doing so.

don't see how does it create bloat. Any docs are valuable for me (list of exceptions to expect if it can be provided, types, brief param descriptions, etc). I do use them extensively if they exist. Agreed that we can omit parameters if they are obvious. On the other hand how is it different from CLI - why do we include brief description to seemingly obvious options?

Python stdlib lives happily without that.

Not sure I understand why should we rely on this. We can check other very popular libraries? Also, they have extensive docs for a lot of stuff even inside. Have you seen any guidelines they use by chance? Do they have any?

Any edge cases should never make it to doc-strings, unless they represent some common use case at the same time.

check stdlib, I think they have tons of notes, long descriptions like os.walk, etc. Hard to judge if they are rare edge cases or common situations. I would expect that everything that starts with Note is kind-a edge case?


To reiterate. I don't care that much about philosophy, style guides, stdlib, google, etc. As I said my bias is very simple - I prefer offline to online and I do read and use public APIs docs via IDEs extensively. Plus it's better to be consistent. That's why I voted the way I voted.

@Suor
Copy link
Contributor

Suor commented Jan 29, 2020

os.walk() has complicated interface, thus a long doc string. It also shows the value of examples and no value of formal style guide.

Simple functions, like ones from os.path have very simple doc-strings, usually one-liners. Any non-obvious things are mentioned briefly, see os.path.exists() for example.

@shcheklein
Copy link
Member

Okay, to summarize.

Looks no one wants to introduce a heavy formats like google's, etc (with sections, and whatnot) and more in favor of stdlib's approach, namely: single or multi-line block of text which might (and probably should) include param names, exceptions, etc, and describes what's happening in some reasonable way.

I'm totally fine with this as well. If we all agree on this, @jorgeorpinel @efiop let's close this?

For the issue of improving (non-) existing docstrings we should create a separate issue? which I think is way more important to be honest than discussing a specific format in the first place.

@casperdcl summarized it well enough as well:

Overall in all cases I'd always recommend the original dev does the sensible thing, whether it's comment in prose, in bullet points, in code examples, in links - whatever they think will be most useful for people who may read it. There may not be a one-style-fits-all-functions approach.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 29, 2020

in favor of stdlib's approach, namely: single or multi-line block of text which might (and probably should) include param names, exceptions, etc, and describes what's happening in some reasonable way.

It sounds pretty vague to me. Is the policy "please write some sort of docstrings when you think its reasonable"? We would be back at square one but I can't really have a strong opinion here so up to you guys 🙂

p.s. just a question: does this mean it's not allowed to use a full format? Or is it totally fee for everyone to chose their favorite docstring flavor? Bc I've already been using Google's a little bit 😋

@shcheklein
Copy link
Member

@jorgeorpinel I would avoid mixing different styles - e.g. Google and stdlib. The way I understand stdlib - you just write a "description" that is valuable to the end-user of what's going on, using param names as terms in it.

please write some sort of docstrings when you think its reasonable

we go from a format discussion to the usefulness of them, let's move it to another topic please?

@efiop
Copy link
Contributor

efiop commented Feb 26, 2020

I think we should write docstrings in some sane format. Google's style is a good choice too, but maybe there is something better. It will allow editors to pick them up easily, improving the dev experience.

Sloppy docstrings like in stdlib are as good as nothing. If we start bothering with these, we might as well stick to some good format that will be actually useful.

I don't like docstrings myself, but python is too high level to live without them. Even the exceptions introduce some non-obvious flow control, that you can't really tell from looking at one function's body. So I think we need to be good developers and use docstrings to document what needs to be documented.

@jorgeorpinel
Copy link
Contributor

I just want to add a link to this discussion about exceptions in stdlib docstrings: #3352 (comment)

The summary is that they seem to just be mentioned (not even with specific type) some times (when it seems relevant). So, a pretty flexible common-sense approach as well.

@efiop
Copy link
Contributor

efiop commented Feb 28, 2020

@jorgeorpinel stdlib is quite old, so I would totally believe that they don't use proper docstrings because back in a day those didn't exist for python and nowadays no one wants to go through the whole codebase to consistently use new format everywhere.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 28, 2020

Yeah, I didn't pick that style. I voted for Google style originally, but others weren't convinced and decided to use a super simple "no style" docstrings inspired on stdlib (at least for now), which is why I'm focusing on stdlib for docstring related matters. (Please refer to discussion above.)

Recently I've noticed we kind of already use Google style docstrings throughout the cod base, so I'm a bit confused again... But will continue to assume the resolution so far is in #3176 (comment) ^ Totally up to you guys.

@jorgeorpinel jorgeorpinel removed the question I have a question? label Feb 28, 2020
@skshetry
Copy link
Collaborator

skshetry commented Feb 28, 2020

I don't really see a point of the discussions here, feels like bikeshedding.
We could just start with documenting dvc.api first without any formats. At least, it's a good start, and helpful for people doing:

>> help(dvc.api.open)

Then, we could start docstring format for api later on if it provides more benefits (and, it's not hard to change formats for just dvc.api). I don't think it's important to discuss about formats for internals, yet (the benefits of having a docstring vs docstring in a format for internals is small).

@Suor
Copy link
Contributor

Suor commented Mar 5, 2020

I am still against any format, including googles. It only adds bloat. You should only mention things important for a specific function, not a predefined list in a predefined format, half of which people will need to skim through.

I don't care if some IDEs would be happy parsing it, I prefer it to be human readable instead.

@jorgeorpinel
Copy link
Contributor

The one thing I've seen that would actually be pretty useful is to use back quotes around param names when explaining them in the docstring, since we're avoiding a special Args section. This was discussed here early on and it was mostly Alex against it. We ended up agreeing not to use them but now that I've been writing them I kind of miss them. This is because param names or other literals tend to be common words like path, so using backquotes around them makes it obvious that they're literals and not just a word in a sentence.

def func(path, x, y)
```Calculates something with x and y argument,
   based on a file found in the given path arg.```

vs.

def func(path, x, y)
```Calculates something with `x` and `y`, based on the file in `path`.```

It's just a bit faster and more obvious using `. Maybe we can even allow this partially? (Not required but allowed when needed)

@efiop
Copy link
Contributor

efiop commented Mar 9, 2020

So we agree that we need docstrings in one shape or another, right? If so, please take a good look at the discussions in this ticket and notice how we all have different preferences. Most of us remember the constant fights we've had about the coding style a year ago, before we've started using black. We all come from different backgrounds and we all have different preferences, which will be a constant source of fights unless we agree on some style and just stick to it. "no-style" is a bad idea, because we will drown in fights as we did with code style and then we will still choose some style to put the end to our fights. So why not just skip ahead and pick something instead of wasting our energy on constant fights? That is why I'm admittedly pushing towards picking some style and sticking to it right away.

So from my research, there is 1 defacto standard tool for checking docstrings: pydocstyle and it supports three styles pep257, numpy and google. And I think we should choose one of those.

IMPO, numpy is aight but I haven't run into that docstring style that much and I suppose that is probably the reality. google is everywhere, so probably the most popular one in my experience. pep257 looks alright to me and, well, it is an active pep :) I would be fine with google or pep257, but I would pick pep257 because it doesn't have google-bias and is a de-jure active standard, that things like black mention so looks like the wisest choice.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint
Projects
None yet
Development

No branches or pull requests

7 participants