DOC: update cheatsheet #39806
@OliEfr
Thanks for your efforts, but I'm not OK with some of these changes, so please do the following:
- On first page, please put back the top left part with "Data Wrangling with pandas Cheat Sheet". You can make the font smaller if you want. You can include the references to the pandas user guide and API reference, but do not include the references to Seaborn and matplotlib. We're not in a position to pick any particular visualization tool.
- I'm OK with changing the title of the left panel from "Syntax - Creating DataFrames" to "Creating DataFrames", but please don't include the reference to IO Tools and read_csv there (see below for a suggestion)
- Under Method Chaining, put back the comment "This improves readability of code"
- At top of first page, please put back the Tidy Data section. Move the "Display and Visualize Data" section back to page 2 with just the original examples (but see below for a suggestion)
- In Subset Observations, please put back the pictures that illustrate what subsetting is all about. Try to fit in a `df.query` example. Separate out the ones for rows (which are about subsetting observations) and columns (which are about subsetting the variables), as they were before. I'm fine with moving the last 3 examples on the current cheat sheet that subset variables above the `df.filter` example, which would move the `regex` table down. Note that those 3 examples are about filtering columns (with the last example being about filtering rows and columns)
- On page 2, I'm OK with adding `df.shape`, but if you reformat that part to put `df.shape` to the right of `df.len()`, then things will fit better.
- Remove the comment in "Group Data" about "Possibly use `reset_index()` after!"
- Remove the "apply Functions" example, and put the simple plotting example back. `df.assign` is also described above, so no need to include it again. (But see suggestion below).
- For the attribution on the bottom of the 2 pages, please use "Cheatsheet for pandas (http://pandas.pydata.org/) originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet". Keep it flush right on both pages (that was a nice change)
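For readers following along, the row/column distinction requested for Subset Observations can be sketched as below. This is only an illustration: the DataFrame and its column names are invented here and are not taken from the cheat sheet.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame(
    {
        "width": [1.0, 2.5, 4.0],
        "length": [10, 20, 30],
        "species": ["a", "b", "a"],
    }
)

# Subset observations (rows): keep rows that satisfy a boolean expression.
rows = df.query("width > 2 and species == 'a'")

# Subset variables (columns): keep columns whose name matches a regex.
cols = df.filter(regex="th$")

# Row count via the built-in len(), and (rows, columns) via .shape.
n_rows = len(df)
shape = df.shape
```

Note that the row count comes from the built-in `len(df)`; a DataFrame has no `len()` method of its own.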
Now for the suggestion. I've long wanted to add a third page that would have additional information that people would find useful, and just never had the time to do it, so it would be great if you could start this. This could include:
- More visualization examples (but do NOT use references to other libraries, e.g. your `seaborn` example)
- The options you included, and more (also `display_max_columns`). See the list of frequently used ones here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#frequently-used-options
- I/O: A whole section showing a variety of popular IO usage (CSV, Excel, SQL, HTML), and also output file formats (feather, parquet, HDF)
- The `Apply Functions` section - although I'm not sure that is the right name for the section you created.
- The new Extension Types (`String`, `Integer`, `Float`) and how `pd.NA` works
- Anything else that could fill up the space (if needed)
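As a rough idea of what a few of these third-page items might look like in code, here is a minimal sketch. The variable names and sample values are assumptions made for illustration, and it covers only the nullable extension types, one display option, and a CSV round trip through an in-memory buffer.

```python
import io

import pandas as pd

# Nullable extension dtypes: missing values are represented by pd.NA.
ints = pd.Series([1, None, 3], dtype="Int64")
strs = pd.Series(["a", None], dtype="string")

# pd.NA propagates through arithmetic: the missing element stays missing.
plus_one = ints + 1
n_missing = int(ints.isna().sum())

# Temporarily change a display option; the old value is restored on exit.
with pd.option_context("display.max_columns", 50):
    inside = pd.get_option("display.max_columns")

# I/O: write a frame to CSV text and read it back (illustrative data).
frame = pd.DataFrame({"x": [1, 2], "y": ["u", "v"]})
csv_text = frame.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))
```

The option is spelled `display.max_columns` when passed to `pd.option_context` or `pd.get_option`.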
I haven't had time yet, but I agree with that third page. Will work on it later.
Idea: when we have a third page, there will be enough space. So why don't we create a section, on whichever page, called "Design Principles" or "Best Practice"? Within this section we can have the comment about tidy data, vectorization and also things such as "First column is 0".
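A small sketch of two of the principles mentioned, zero-based positions and vectorization, could look like this. The data is invented for illustration and nothing here is taken from the cheat sheet itself.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Positions are zero-based: iloc[0] is the first row,
# and iloc[:, 0] is the first column.
first_row = df.iloc[0]
first_col = df.iloc[:, 0]

# Vectorized arithmetic works on whole columns at once, with no explicit loop.
df["c"] = df["a"] * df["b"]
```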
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
First, sorry for the delay on reviewing this. It slipped through the proverbial cracks.
Some people start using pandas by using the cheat sheet, so that's why I like putting the stuff at the top. Also, it maps the words observation to row and variable to column, which corresponds to wording in the statistics community.
We're getting close. See below.
No. Now, for other comments:
Alright, I adapted everything, apart from:
For me, it was not obvious at first that the first index is I removed
@OliEfr We're getting even closer. Additional changes to make:
Check
Should have caught a few more issues (some of which existed prior to your PR):
Check - and corrected a few more comma and spacing issues.
Thanks @OliEfr
If you want, you can create an issue with your ideas for a third page/further ideas. I'll probably keep working on it.
Good idea. Created #40680. If you do that work, I'll try to be more responsive!
* Added links to official docs in cheat sheets
* DOC: added links to official docs in cheat sheet (update)
* DOC: update cheatsheet
* DOC: minor changes cheatsheet; update honors
* DOC: rework according to requested changes
* Update Cheatsheet
* Update Cheatsheet
* update cheatsheet
* update cheatsheets
* update cheatsheet!
Changes for update cheatsheet issue #39806:
1. Remake as simple learning: Data Frames, how to read the data from the csv file. For example: `data = load_iris()`
2. Subset variables / observations: insert some Python code useful for pandas data frames and updates.
   - Select the last element in the list (the slice starts at the last element, and ends at the end of the list): `surveys_df[-1:]`
   - Using the `copy()` method: `true_copy_surveys_df = surveys_df.copy()`
   - Using the `=` operator: `ref_surveys_df = surveys_df`
3. Summarizing Data and Handling the missing Data:
   - Summarize the dataset: `print(dataset.describe())`
   - Count the number of missing values for each column: `num_missing = (dataset[[1,2,3,4,5]] == 0).sum()`
   - Report the result
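The copy-versus-reference point in this comment can be illustrated with a short sketch. The `surveys_df` contents below are invented for the example. The comment counts zeros as missing values, whereas this sketch assumes missing values are stored as NaN and counts them with `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, used only for illustration.
surveys_df = pd.DataFrame({"weight": [40.0, np.nan, 48.0], "plot": [2, 3, 2]})

# '=' binds a second name to the same object; copy() makes an independent frame.
ref_surveys_df = surveys_df
true_copy_surveys_df = surveys_df.copy()

# Changing the original is visible through the reference, not through the copy.
surveys_df.loc[0, "plot"] = 99

# A slice starting at the last position returns a one-row DataFrame.
last_row = surveys_df[-1:]

# Count missing (NaN) values per column.
num_missing = surveys_df.isna().sum()
```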
@maripisravankumar This is a closed PR on the cheat sheet. If you would like to propose other changes to the cheat sheet, you can create your own PR and do edits, and I will either accept or reject them, or suggest changes.
Major changes
Let me know what you think about it and make suggestions.
Oli