Commit f768197

rozne (Polish for "various")
1 parent 50c0ed8 commit f768197

18 files changed, +647 -0 lines changed

__scraping__/bit.do/README.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
This seems to be the minimal working code.

It needs the header `'X-Requested-With'` because it is an AJAX/XHR request.

It needs `permasession`, but the first `GET` response doesn't send it, so it is probably generated on the page with JavaScript. It works for me with the same `permasession` every time.

Maybe later it will need a new/fresh `permasession`.

There are spaces in `" site2 "`.

```python
import requests

headers = {
    'X-Requested-With': 'XMLHttpRequest',  # need it
}

data = {
    'action': 'shorten',
    'url': 'https://onet.pl',
    'url2': ' site2 ',  # need spaces
    'url_hash': None,
    'url_stats_is_private': 0,
    'permasession': '1555801674|ole2ky65f9',  # need it
}

r = requests.post('http://bit.do/mod_perl/url-shortener.pl', headers=headers, data=data)

print(r.status_code)
print(r.json())
```

It didn't need `requests.Session()`, a `User-Agent`, or a `GET` request at the start.

---

**EDIT:** the value `1555801674` in `'permasession': '1555801674|ole2ky65f9'` is a timestamp with the current date and time.

```python
import datetime

datetime.datetime.fromtimestamp(1555801674)

# result (local time):
# datetime.datetime(2019, 4, 21, 1, 7, 54)
```

Maybe `ole2ky65f9` is also a timestamp, but in a shortened form.
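
The timestamp part of `permasession` can be checked without any network access. Below is a minimal sketch, assuming the value always has the form `<unix timestamp>|<token>` (only one sample was observed, so this layout is a guess). `split_permasession` is a hypothetical helper name, not part of the site or of `requests`. It converts to UTC explicitly, because `fromtimestamp()` without a time zone returns local time, which is why the README shows 01:07:54 on 21 April instead of 23:07:54 UTC on 20 April.

```python
import datetime


def split_permasession(value):
    """Split 'timestamp|token' into an aware UTC datetime and the token.

    Assumption: the part before '|' is a Unix timestamp in seconds.
    """
    timestamp_text, token = value.split('|', 1)
    created = datetime.datetime.fromtimestamp(
        int(timestamp_text), tz=datetime.timezone.utc
    )
    return created, token


created, token = split_permasession('1555801674|ole2ky65f9')
print(created.isoformat())  # 2019-04-20T23:07:54+00:00
print(token)                # ole2ky65f9
```

Whether the second part encodes anything is still unknown; the sketch only separates it from the timestamp.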

__scraping__/bit.do/main.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@

# date: 2019.04.21
# https://stackoverflow.com/a/55778640/1832058

import datetime

import requests

# a Session is not needed; it is kept only to show what was tried
s = requests.Session()
s.headers.update({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'pl,en-US;q=0.7,en;q=0.3',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
})

#r = s.get('http://bit.do/')
#print(r.status_code)
#print(r.cookies)


# ------------------------------------

headers = {
    'X-Requested-With': 'XMLHttpRequest',  # need it
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    #'Cookie': 'permasession=1555801674|ole2ky65f9',  #
}

data = {
    'action': 'shorten',
    'url': 'https://onet.pl',
    'url2': ' site2 ',  # need spaces
    'url_hash': None,
    'url_stats_is_private': 0,
    'permasession': '1555801674|ole2ky65f9',  # need it
}

r = requests.post('http://bit.do/mod_perl/url-shortener.pl', headers=headers, data=data)
print(r.status_code)
print(r.json())


# the first part of `permasession` is a Unix timestamp (printed in local time)
print(datetime.datetime.fromtimestamp(1555801674))


__scraping__/spotifychart.com/main.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# date: 2019.04.16
# https://stackoverflow.com/questions/55699472/web-scraping-python-indexing-issue-for-dataframe/55700180#55700180

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://spotifycharts.com/regional/global/daily/'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

chart = soup.find('table', {'class': 'chart-table'})
tbody = chart.find('tbody')

all_rows = []

for tr in tbody.find_all('tr'):

    rank_text = tr.find('td', {'class': 'chart-table-position'}).text

    artist_text = tr.find('td', {'class': 'chart-table-track'}).find('span').text
    artist_text = artist_text.replace('by ', '').strip()

    title_text = tr.find('td', {'class': 'chart-table-track'}).find('strong').text

    streams_text = tr.find('td', {'class': 'chart-table-streams'}).text

    all_rows.append([rank_text, artist_text, title_text, streams_text])

# after `for` loop

df = pd.DataFrame(all_rows, columns=['Rank', 'Artist', 'Title', 'Streams'])
print(df)  # .head(15))

csv/incorrectly-save-csv/README.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
`sample.csv` is an incorrectly saved CSV file.

Probably someone built one string with all the items of a row and used `csv` to save it.
But `csv` saved it as a single column holding one long string, not as many columns.

The example uses `csv` to read the file again and writes it back as a normal file.
This removes the `"` at both ends of the long string
and converts every doubled `""` to a single `"`.

Now it is a correct CSV file and `pandas.read_csv()` reads it without problems.
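
The effect can be reproduced in memory with the standard `csv` module only. This is a minimal sketch under the README's own guess about how the file was produced (a ready-made line written as a one-item row); the sample line is shortened from `sample.csv`, and the variable names are made up for the illustration.

```python
import csv
import io

# One row that already contains commas and quotes, joined into a single string.
joined_row = 'Nora,"Ora","Sgo, Mp, 2000",,111'

# Writing the string as a one-item row makes `csv` quote the whole line
# and double every inner quote: this is the broken layout.
broken = io.StringIO()
csv.writer(broken).writerow([joined_row])
print(broken.getvalue().strip())

# Reading it back yields a single column; that column is the original line.
broken.seek(0)
single_column = next(csv.reader(broken))
repaired_line = single_column[0]

# Parsing the repaired line once more finally gives separate fields.
fields = next(csv.reader([repaired_line]))
print(fields)
```

Reading with `csv.reader` and writing `row[0]` back untouched is enough because the reader already undoes the outer quoting and the doubled quotes.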

csv/incorrectly-save-csv/main.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
import csv
import pandas as pd


# every row of `sample.csv` is one quoted string; `csv.reader` unquotes it,
# so writing `row[0]` back gives a normal CSV line
with open('sample.csv', newline='') as f1, open('temp.csv', 'w') as f2:
    reader = csv.reader(f1)
    for row in reader:
        f2.write(row[0] + '\n')


df = pd.read_csv('temp.csv')

print(len(df.columns))
print(df)

csv/incorrectly-save-csv/sample.csv

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
"Store code,""Biz"",""Add"",""Labels"",""TotalSe"",""DirectSe"",""DSe"",""TotalVe"",""SeVe"",""MaVe"",""Totalac"",""Webact"",""Dions"",""Ps"""
",,,,""Numsearching"",""Numsearchingbusiness"",""Numcatprod"",""Numview"",""Numviewed"",""Numviewed2"",""Numaction"",""Numwebsite"",""Numreques"",""Numcall"""
"Nora,""Ora"",""Sgo, Mp, 2000"",,111,44,33,121,1232,53411,4,5,3,3"
"mc11,""21 old"",""tjis that place, somewher, Netherlands, 2434"",,3245,325,52454,3432,243,4353,343,23,23,18"

csv/incorrectly-save-csv/temp.csv

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
Store code,"Biz","Add","Labels","TotalSe","DirectSe","DSe","TotalVe","SeVe","MaVe","Totalac","Webact","Dions","Ps"
,,,,"Numsearching","Numsearchingbusiness","Numcatprod","Numview","Numviewed","Numviewed2","Numaction","Numwebsite","Numreques","Numcall"
Nora,"Ora","Sgo, Mp, 2000",,111,44,33,121,1232,53411,4,5,3,3
mc11,"21 old","tjis that place, somewher, Netherlands, 2434",,3245,325,52454,3432,243,4353,343,23,23,18
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@

# date: 2019.04.18
# Bartlomiej 'furas' Burek
#
# https://stackoverflow.com/questions/16467479/normalizing-unicode
#
#

import os
import zipfile
import unicodedata
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('filename', help='zip file with MAC OS X names')
args = parser.parse_args()

def convert(name):
    # undo the cp437 decoding that was applied to the UTF-8 bytes of the name
    name = name.encode('cp437').decode('utf-8')
    # compose "letter + combining mark" into a single character
    name = unicodedata.normalize('NFC', name)
    return name

if args.filename:
    z = zipfile.ZipFile(args.filename)
    for item in z.filelist:
        #if not item.filename.startswith('__MACOSX'):
        new_name = convert(item.filename)
        print(new_name)
        if item.is_dir():
            os.makedirs(new_name, exist_ok=True)
        else:
            with open(new_name, 'wb') as f:
                f.write(z.read(item))

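
Why `convert()` needs both steps can be shown on a single letter. This is a minimal sketch, assuming the archive stores names as decomposed UTF-8 (as macOS commonly does) and that the names were decoded as cp437 when the archive was read; the variable names are made up for the illustration.

```python
import unicodedata

# 'ą' in decomposed form: 'a' followed by U+0328 COMBINING OGONEK.
decomposed = 'a\u0328'

# Reading the UTF-8 bytes as cp437 produces the garbled name.
mojibake = decomposed.encode('utf-8').decode('cp437')
print(mojibake)

# Step 1: reverse the wrong decoding (still two code points).
restored = mojibake.encode('cp437').decode('utf-8')

# Step 2: compose letter and mark into the single code point U+0105.
composed = unicodedata.normalize('NFC', restored)
print(len(restored), len(composed))
```

Without the `NFC` step the name looks correct on screen but does not compare equal to the same name typed on a system that uses composed characters.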

decode-encode/macosx-linux/main.py

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@

# date: 2019.04.18
# Bartlomiej 'furas' Burek
#
# https://stackoverflow.com/questions/16467479/normalizing-unicode
# https://www.pythonsheets.com/notes/python-unicode.html
#

import os
import zipfile
import unicodedata
from unidecode import unidecode
import ftfy


def test(data):
    text, expected = data

    # undo the wrong cp437 decoding
    text2 = text.encode('cp437').decode('utf-8')

    text3 = unidecode(text2)
    # compose "letter + combining mark" into one character
    text4 = unicodedata.normalize('NFC', text2)

    text5 = unidecode(text4)

    print(' text:', text, '| len:', len(text))
    print(' expected:', expected, ' | len:', len(expected))
    print(' text == expected:', text == expected)
    print('-------------------------------------')
    print('text.encode("cp437").decode("utf-8"):', text2, ' | len:', len(text2), '| expected:', text2 == expected)
    print(' unidecode(text2):', text3, ' | len:', len(text3), '| expected:', text3 == expected)
    print('-------------------------------------')
    print(' unicodedata.normalize("NFC", text2):', text4, ' | len:', len(text4), '| expected:', text4 == expected)
    print(' unidecode(text4):', text5, ' | len:', len(text5), '| expected:', text5 == expected)
    print('-------------------------------------')
    print(' ftfy.fix_text(text):', ftfy.fix_text(text))
    print('-------------------------------------')

# garbled form of 'ą'; a plain 'ą' cannot be encoded as cp437 and would raise UnicodeEncodeError
a1 = 'a╠¿'

a2 = a1.encode('cp437').decode('utf-8')
a4 = unidecode(a2)
a3 = unicodedata.normalize('NFC', a2)

a5 = unidecode(a3)
print(a1, a2, len(a2), a3, len(a3), a4, a5)


examples = [
    ('a╠¿', 'ą'),
    ('e╠¿', 'ę'),
    ('z╠ü', 'ź'),  # bytes CC 81 are the combining acute accent, so this is 'ź', not 'ż'
    ('┼é', 'ł'),
    # 'źle'
]

for data in examples:
    test(data)
    print('----------------------------------------------------------------')

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@

# date: 2019.04.16
# https://stackoverflow.com/questions/55699046/filling-specific-missing-value-in-python?noredirect=1#comment98080392_55699046

job_title = '''ANALYST, BRAND DEVELOPMENT
ANESTHESIOLOGIST
ANESTHESIOLOGIST
BUSINESS INTELLIGENCE ANALYSTS
CIVIL ENGINEER
CIVIL ENGINEER
COMPUTER PROGRAMMER
COMPUTER PROGRAMMER ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEMS ANAGLYST
COMPUTER SYSTEMS ANALYST
CONSULTANT
CORPORATE COMMUNICATIONS SPECIALIST
COUNSELOR
DESIGN
ELEMENTARY CO-TEACHER
FASHION MODEL
FIELD ENGINEER
FINANCIAL ANALYST
FINANCIAL SENIOR ANALYST
FINANCIAL SPECIALIST'''.split('\n')

job_title = list(set(job_title))

# --- create random data with some NaN
import random

data = []

# one row without salary for every job title
for _ in range(1):
    for item in job_title:
        data.append((item, None))

# two rows with a random salary for every job title
for _ in range(2):
    for item in job_title:
        data.append((item, random.randint(10000, 100000)))

random.shuffle(data)

# --- get mean salary for JOB_TITLE ---

import pandas as pd

df = pd.DataFrame(data, columns=['JOB_TITLE', 'SALARY'])

rows_with_na = df['SALARY'].isna()

print('\n--- before ---\n')
print(df[rows_with_na])

print('\n--- mean ---\n')
groups = df.groupby(['JOB_TITLE'])


# it doesn't work as I expected - it doesn't change data in original `df`,
# because `fillna()` returns a new Series for every group
# (or rather, I expected it would not work, but I still hoped it would :)

for idx, grp in groups:
    mean = grp['SALARY'].mean()
    print('mean:', mean, idx)
    print(grp['SALARY'].fillna(mean))
    print('---')

# this works
#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))
#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))['SALARY']
df['SALARY'] = groups['SALARY'].transform(lambda x: x.fillna(x.mean()))

print('\n--- after ---\n')
print(df[rows_with_na])

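
What the working `transform` line computes can be spelled out without pandas. This is a pure-Python sketch of the same idea, not the pandas implementation; the sample rows and the names `known`, `group_mean`, and `filled` are made up for the illustration.

```python
from collections import defaultdict
from statistics import mean

# Tiny made-up sample: (job title, salary or None).
rows = [
    ('CIVIL ENGINEER', 40000),
    ('CIVIL ENGINEER', None),
    ('CIVIL ENGINEER', 60000),
    ('CONSULTANT', 30000),
    ('CONSULTANT', None),
]

# Step 1: collect the known salaries of every group.
known = defaultdict(list)
for title, salary in rows:
    if salary is not None:
        known[title].append(salary)

# Step 2: compute one mean per group.
group_mean = {title: mean(values) for title, values in known.items()}

# Step 3: build a new column in the original row order. The aligned result
# is what makes assigning it back to the table possible; the loop over
# groups only printed filled copies and never stored them.
filled = [
    salary if salary is not None else group_mean[title]
    for title, salary in rows
]
print(filled)
```

The key point is step 3: the result has one value per original row, in the original order, so it can replace the whole column in a single assignment.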
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@

Image:

![#1](images/tkinter-duck-hunt.png?raw=true)