chapters/io.tex
+141-35
@@ -57,11 +57,13 @@ \section{Properly using bytes(), str() and unicode()}
57
57
58
58
Again, this does nothing in Python 3, which is what we want.
59
59
60
-
But careful, we don't want to turn \textbf{all} \lstinline{str()} to \lstinline{unicode()}!
60
+
One side effect, though, is that text strings in Python 3 can require between two and four times as much memory to store as in Python 2, and big text blobs take more time to copy. There is no fix for that.
61
+
62
+
In any case, be careful, as we don't want to turn \textbf{all} \lstinline{str()} to \lstinline{unicode()}!
61
63
62
64
We do want to convert all strings meant to be readable by a human: help text, user messages, labels, etc.
63
65
64
-
But we do \textbf{not} want to convert bytes (mistakenly stored in a \lstinline{str()} in Python 2): network packets, pickled objects, dumps from the \lstinline{struct} module, file paths, anything to be written in a file in "b" mode, etc.
66
+
But we do \textbf{not} want to convert data meant to be raw bytes but mistakenly stored in a \lstinline{str()} in Python 2: network packets, pickled objects, dumps from the \lstinline{struct} module, anything to be written in a file in "b" mode, etc.
65
67
66
68
So you need to go through your files, and find those. Then mark them as \lstinline{bytes()}, by adding a \textquote{b} prefix.
67
69
@@ -73,7 +75,25 @@ \section{Properly using bytes(), str() and unicode()}
73
75
74
76
In Python 2, this will keep it as a \lstinline{str()}. In Python 3, it will make it a \lstinline{bytes()}. Again, this is what we want.
75
77
76
-
Be careful though, indexing or iterating through a \lstinline{str()} in Python 2 gives you \lstinline{str()}:
78
+
Sometimes you may need to check whether something is not just text but potential text, that is, either text or bytes. In Python 2, we could do:
79
+
80
+
\begin{py2}
81
+
isinstance(data, basestring)
82
+
\end{py2}
83
+
84
+
While there is tooling to help with that, a possible manual fix that works with Python 2 and 3 is:
85
+
86
+
\begin{py2and3}
87
+
try:
88
+
    basestring
89
+
except NameError:
90
+
    basestring = (str, bytes)
91
+
isinstance(data, basestring)
92
+
\end{py2and3}
93
+
94
+
Be careful though, indexing or iterating through a \lstinline{str()} in Python 2 gives you \lstinline{str()}:
77
97
78
98
\begin{py2}
79
99
>>> list(b'qwerty')
@@ -132,13 +152,13 @@ \section{Properly using bytes(), str() and unicode()}
132
152
iterbytes = bytes
133
153
\end{py2and3}
134
154
135
-
Or use tooling to provide a standardized \lstinline{bytes()} for you.
155
+
Or use the tools we'll introduce later to provide a standardized \lstinline{bytes()} for you.
136
156
137
157
\end{warning}
138
158
139
159
\section{Opening files}
140
160
141
-
When you use \lstinline{open()} to read a file, it has two modes: binary mode, and text mode. It's very misleading, in fact, all files are binaries. Some binaries actually contains text, although most file don't. Still, not only \lstinline{open()} maintain this false dichotomy, it actually opens files in text mode by default.
161
+
When you use \lstinline{open()} to read a file, it has two modes: binary mode and text mode. This is very misleading: in fact, all files are binary. Some binary files actually contain text, although most files don't. Still, not only does \lstinline{open()} maintain this false dichotomy, it actually opens files in text mode by default.
142
162
143
163
If you open a so-called binary file (zip, avi, mp3, odt, doc, etc.), you should use the \textquote{b} flag:
144
164
@@ -147,7 +167,7 @@ \section{Opening files}
147
167
data = f.read()
148
168
\end{py2and3}
149
169
150
-
There is no notion of lines, so you must use \lstinline{.read()}. You will get \lstinline{str()} in Python 2, and \lstinline{bytes()} in Python 3. There is not much to do here, just be careful of what you do with the \lstinline{bytes()} after, since we have in the previous section there are small differences.
170
+
There is no notion of lines in this mode, so you must use \lstinline{.read()}. You will get \lstinline{str()} in Python 2, and \lstinline{bytes()} in Python 3. There is not much to do here, just be careful about what you do with the \lstinline{bytes()} afterwards since, as we have seen in the previous section, there are small differences.
151
171
152
172
However, if you open a so-called text file (json, csv, ini, xml, etc.), you should use the \lstinline{encoding} parameter:
153
173
@@ -159,22 +179,21 @@ \section{Opening files}
159
179
160
180
This will give you \lstinline{unicode()} in Python 2 and \lstinline{str()} in Python 3, and let you iterate line by line. This is where the previous chapter about encoding is useful. Follow its advice to choose the proper encoding.
161
181
162
-
Now, Python 2 doesn't have this parameter, but you can use \lstinline{codecs.open}:
182
+
Now, Python 2 doesn't have this parameter, but you can use \lstinline{io.open()}:
163
183
164
184
\begin{py2and3}
165
-
if sys.version_info.major < 3:
166
-
from codecs import open
185
+
from io import open
167
186
\end{py2and3}
168
187
169
-
Please note, however, that it is much, much slower than the original Python 2 \lstinline{open()}.
188
+
This does nothing in Python 3 and provides you with the \lstinline{open()} from Python 3 in Python 2.
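To make this concrete, here is a minimal sketch of a write and read round trip with \lstinline{io.open()}. The file name and its content are made up for the illustration, and the temporary directory is only there so the example does not touch your real files:

```python
import io
import os
import tempfile

# Hypothetical file, created in a throwaway directory for this illustration.
path = os.path.join(tempfile.mkdtemp(), "menu.txt")

# io.open() accepts the encoding parameter on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf8") as f:
    f.write(u"caf\xe9\ncr\xe8me br\xfbl\xe9e\n")

# Text mode lets us iterate line by line and gives us decoded text.
with io.open(path, encoding="utf8") as f:
    lines = [line.rstrip(u"\n") for line in f]

# unicode() in Python 2, str() in Python 3.
text_type = type(u"")
```

Note that in Python 2, \lstinline{io.open()} in text mode refuses to write a plain \lstinline{str()}, hence the \textquote{u} prefix on the literal.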
170
189
171
190
\section{Fun with file paths}
172
191
173
192
File paths are one of those features that just work 99\% of the time, until they don't. One reason is that Python is a cross-platform language, but different operating systems may treat paths differently.
174
193
175
194
We usually picture file paths as strings, but unfortunately, and as stated by the Python documentation itself: \textquote{some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.}
176
195
177
-
So Python accept both \lstinline{str()} (or \lstinline{unicode()} in Python2) and \lstinline{\bytes()} when you interract with the file system. And it will return you the same type you used as input:
196
+
So, Python accepts both \lstinline{str()} (or \lstinline{unicode()} in Python 2) and \lstinline{bytes()} when you interact with the file system. And it will return the same type you used as input:
178
197
179
198
\begin{py3}
180
199
>>> import os
@@ -184,47 +203,134 @@ \section{Fun with file paths}
184
203
<class 'str'>
185
204
\end{py3}
186
205
187
-
And of course most Python 2 programs just use the \lstinline{str()} type to deal with path, using it like a string, while it really behaves like a \lstinline{bytes()} under the hood. Also remember, any \lstinline{str()} in Python 2 is \textbf{in the encoding of the code file}. So people create file names with implicit encoding without knowing it. Luckily most of the time, this is ASCII, as developpers all around the world have been bitten with file names enought to be very careful to choose the most basic ones when they can.
206
+
It's not a problem with Python in itself. It's the reality of computing, and all languages have to deal with it in some way. However, we do get an extra issue on our hands while porting code from Python 2 to Python 3, since the new version changes the semantics of text handling. Our problem has a problem now.
207
+
208
+
Because of course most Python 2 programs just use the \lstinline{str()} type to deal with paths, using it like a string, while it really behaves like \lstinline{bytes()} under the hood. Also remember, any \lstinline{str()} in Python 2 is \textbf{in the encoding of the code file}. So people write file names with an implicit encoding without knowing it. Luckily, most of the time this is ASCII, as developers all around the world have been bitten by file names often enough to be very careful to choose the most basic ones when they can. Still, this is a source of bugs in the transition.
209
+
210
+
It also means that you can easily create a file in Python 2 you can't open the same way in Python 3.
211
+
212
+
Let's say you have a CP850 encoded Python 2 script, and you do:
213
+
214
+
\begin{py2}
215
+
>>> with open("chevron_Å.txt","w") as f:
216
+
...     f.write("Locked")
217
+
\end{py2}
218
+
219
+
Or even if you have an ASCII encoded script and you do:
220
+
221
+
\begin{py2}
222
+
>>> with open("chevron_\x8f.txt","w") as f:
223
+
...     f.write("Locked")
224
+
\end{py2}
225
+
226
+
Or a user entered that. Or a database fed you that. Or it's stored in some config file somewhere.
227
+
228
+
When you try to open it with Python 3, the stdlib will read \lstinline{sys.getfilesystemencoding()}, which will be something other than CP850 (probably UTF8), use it to encode the file name, then pass the result to the OS:
229
+
230
+
\begin{py3}
231
+
>>> with open("chevron_Å.txt") as f:
232
+
...     f.read()
233
+
Traceback (most recent call last):
234
+
  File "<stdin>", line 1, in <module>
235
+
FileNotFoundError: [Errno 2] No such file or directory: 'chevron_Å.txt'
236
+
\end{py3}
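To see why the lookup fails, it helps to compare the bytes each encoding produces for the same file name. This is only a sketch for Python 3. The variable names are mine, and it assumes the file system encoding is UTF8, which is common but not guaranteed:

```python
# The same file name, as text.
name = "chevron_\u00c5.txt"  # "\u00c5" is the character Å

# What the CP850 encoded Python 2 script actually wrote to disk.
legacy_bytes = name.encode("cp850")

# What Python 3 sends to the OS if the file system encoding is UTF8.
utf8_bytes = name.encode("utf8")

# Same text, different bytes: the OS sees two unrelated file names.
same_file = legacy_bytes == utf8_bytes
```

Here \lstinline{legacy_bytes} ends up being \lstinline{b'chevron_\x8f.txt'} while \lstinline{utf8_bytes} is \lstinline{b'chevron_\xc3\x85.txt'}, so the OS has no way to match them.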
188
237
189
238
What does this mean for you?
190
239
191
-
First, look at all the hard coded file path in your code. Decide if they should be text or arbitrary bytes and mark them accordingly. Hint: unless you are doing something very specific, and on Unix, it should be text. Check that the return value is of the type you expect (\lstinline{str()} or \lstinline{bytes()}, depending of what you passed), and that the rest of the code using this value is made to handle this type.
240
+
You need to assess your situation. Most of the time, there is not much to do: your application probably needs only basic path support. The 99\% that work will do just fine. Use only \lstinline{str()} on Python 3 and \lstinline{unicode()} on Python 2 for the paths you hardcode. For your config file, choose an encoding (prefer UTF8), decode, and use the text result. Treat it like any other text.
192
241
193
-
Then, got through all code using the \lstinline{os}, \lstinline{shutil} and \lstinline{glob} modules
242
+
One of the rare scenarios that requires work is the case where you have hardcoded non-ASCII file paths in a code file whose encoding is different from the file system encoding. E.g.: you have \lstinline{os.listdir("./Téléchargements")} (which is \textquote{Downloads} in French) hardcoded in a \textquote{latin-1} Python file on an Ubuntu server using UTF8 for its file system. If you change your path from bytes to text, Python will use \lstinline{sys.getfilesystemencoding()} to encode it and it will fail:
194
243
195
-
File path coming from somewhere else (database, config files, etc).
244
+
\begin{py2}
245
+
>>> os.listdir('./Téléchargements/')
246
+
Traceback (most recent call last):
247
+
File "<stdin>", line 1, in <module>
248
+
OSError: [Errno 2] No such file or directory: './Téléchargements/'
249
+
\end{py2}
196
250
197
-
Surogate escape
251
+
That's an edge case, and it's unlikely to happen to you. But just in case it does, know that you should encode the path manually to the legacy encoding once, check if it exists, and rename it using \lstinline{sys.getfilesystemencoding()}:
198
252
199
-
\section{Formatting}
253
+
\begin{py2}
254
+
path = u'./Téléchargements/'
255
+
legacy_encoded_path = path.encode('latin-1')
256
+
if os.path.isdir(legacy_encoded_path):
257
+
    os.rename(legacy_encoded_path, path)
258
+
os.listdir(path)
259
+
\end{py2}
200
260
261
+
Another possible, albeit very specific, source of problems is if you are scanning a collection of files or getting your paths from an external source (DB, socket) you don't have control over, but still need to do some path manipulation with them. Indeed, you may, although rarely, encounter paths that are badly encoded, or that come with no encoding metadata to decode them, and are provided to you as bytes. In that case you need to use \textquote{surrogateescape}: it's a special, non-destructive error handler that lets you decode any bytes as UTF8 and encode the result back to the original bytes. It won't look pretty, but you'll keep the data intact, and it will result in a \lstinline{str()}:
So when you get an input with file paths that are bytes, decode them using \lstinline{.decode('utf8', errors='surrogateescape')}. Make all operations you need on the path. Then when you want to pass it to a file-related function, pass it through \lstinline{.encode('utf8', errors='surrogateescape')} to get the original bytes back.
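The round trip can be sketched as follows, in Python 3. The file name is made up for the example, and the \lstinline{replace()} call just stands in for whatever path manipulation you need:

```python
# Bytes received from the outside world; 0x8f is not valid UTF8 here.
raw_name = b"chevron_\x8f.txt"

# Decoding never fails: the undecodable byte becomes a lone surrogate.
as_text = raw_name.decode("utf8", errors="surrogateescape")

# Manipulate the path as regular text.
new_text = as_text.replace("chevron", "locked")

# Encoding with the same error handler restores the original byte.
round_trip = new_text.encode("utf8", errors="surrogateescape")
```

After this, \lstinline{round_trip} still contains the original \lstinline{\x8f} byte, untouched. On Unix, \lstinline{os.fsdecode()} and \lstinline{os.fsencode()} in Python 3 apply the same error handler with the file system encoding, so they may save you from hardcoding \textquote{utf8}.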
%All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
313
+
Finally, if you share paths with another system, encode them as UTF8 with surrogateescape to get the original bytes back and share those 'as-is', since you don't know what strategy the other system is going to use to deal with weird file names.
222
314
223
-
% has been removed and reinstroduced
315
+
Again, I'd like to insist that those are not common use cases. Python programs that manage a lot of files they have no control over (e.g. Dropbox syncing your files or beets loading up your music library) have to deal with this, but you may very well not.
224
316
225
-
pathlib
317
+
I also know it is tempting to use the excellent \lstinline{pathlib} at this point, especially since there is a backport on PyPI. But in my opinion, it would add complexity to the migration. Better keep it for new Python-3-only projects.
318
+
319
+
\section{Formatting}
320
+
321
+
In Python 2, you could call \lstinline{.format()} on both \lstinline{str()} and \lstinline{unicode()}. This ability has been removed in Python 3: only text can be formatted this way, or using the newest f-strings.
322
+
323
+
For a while, Python 3 also removed the possibility to format bytes using \lstinline{\%}, leaving low-level devs to deal manually with byte formatting for sockets or images. After pushback from the community, it was added back in Python 3.5.
324
+
325
+
If your bytes are text, you should decode anyway, so you'll be able to format all you want. If your bytes need to be manipulated as-is, then either you must target Python 3.5, or change all your byte formatting code for something you do manually. Given the work the latter option represents, I would advise you to just target 3.5 if you have a lot of bytes to format.
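As an illustration, here is what both options may look like. It is only a sketch with a made-up HTTP-like payload: the first half needs Python 3.5 or later, and the second half is one possible manual equivalent, not the only one:

```python
# Option 1: % formatting on bytes, available again since Python 3.5.
request_line = b"%s %s HTTP/1.1\r\n" % (b"GET", b"/index.html")
length_line = b"Content-Length: %d\r\n" % 42

# Option 2: doing it manually, which also works on Python 3.0 to 3.4.
manual_request_line = b" ".join([b"GET", b"/index.html", b"HTTP/1.1"]) + b"\r\n"
manual_length_line = b"Content-Length: " + str(42).encode("ascii") + b"\r\n"

# Mixing text into a bytes format is still refused in Python 3.
try:
    b"%s" % "text"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

Both options produce the same bytes. The manual version gets tedious quickly, which is why targeting 3.5 is the more comfortable choice when you have a lot of it.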
326
+
327
+
\section{Wait, there is I/O}
328
+
329
+
330
+
331
+
332
+
%All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
226
333
227
-
%# coding: utf8
228
334
229
335
\section{file()}
230
336
@@ -234,9 +340,9 @@ \section{file()}
234
340
if isinstance(someobj, IOBase):
235
341
\end{py2}
236
342
237
-
\section{from buffer() tp memoryview()}
343
+
\section{from buffer() to memoryview()}
238
344
239
-
Those two functions respectively create objects of the same name - a \lstinline{buffer} and a \lstinline{memoryview}. Both are a way to get a subset of something without copying it:
345
+
Those two functions (in fact, they are more like classes) respectively create objects of the same name - a \lstinline{buffer} and a \lstinline{memoryview}. Both are a way to get a subset of something without copying it:
It works on \lstinline{bytes()}, \lstinline{bytearray}, \lstinline{array.array}...Everything that implements the so-called \textquote{buffer protocol}. This is a very nice optimization that can save quite a lot of memory/CPU if you manipulate huge chunks of bytes and pass around subset of them, e.g: to files or sockets.
377
+
It works on \lstinline{bytes()}, \lstinline{bytearray}, \lstinline{array.array}... everything that implements an \gls{API} called the \textquote{buffer protocol}. This is a very nice optimization that can save quite a lot of memory/CPU if you manipulate huge chunks of bytes and pass around subsets of them, e.g. to files or sockets.
272
378
273
-
With the rise of performant c libs wrapped in Python (numpy, GUI toolkits, database drivers, etc.), the need for more information about the underlying data than what \lstinline{buffer()} was offering became important. \lstinline{memoryview} provides an answer to that, being able to return the shape, dimension or type or the object behind it.
379
+
With the rise of performant C libs wrapped in Python (numpy, GUI toolkits, database drivers, etc.), the need for more information about the underlying data than what \lstinline{buffer()} was offering became important. \lstinline{memoryview} provides an answer to that, being able to return the shape, dimension or type of the object behind it.
274
380
275
-
\lstinline{buffer()} is not more in Python 3, just replace it with \lstinline{memoryview()}. The later exists in Python 2.7, plus it does the same thing, just better. The only difficulty will be that \lstinline{buffer()} accepts \lstinline{unicode()} objects but \lstinline{memoryview()} only accept bytes, and so you'll need to encode them.
381
+
\lstinline{buffer()} is no more in Python 3, so just replace it with \lstinline{memoryview()}. The latter exists in Python 2.7, and it does the same thing, just better. The only difficulty will be that \lstinline{buffer()} accepts \lstinline{unicode()} objects but \lstinline{memoryview()} only accepts bytes, so you'll need to encode them.
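Here is a small sketch of the replacement in Python 3. The data and variable names are invented for the example. It shows that slicing a \lstinline{memoryview()} does not copy anything, and that text must be encoded first:

```python
data = bytearray(b"header:payload")

# Slicing the view gives a window on the same memory, not a copy.
view = memoryview(data)
payload = view[7:]

# Writing through the view changes the underlying bytearray.
payload[0:1] = b"P"

# Text is refused, unlike with the old buffer(): encode it first.
text = "h\u00e9llo"
try:
    memoryview(text)
    text_accepted = True
except TypeError:
    text_accepted = False
encoded_view = memoryview(text.encode("utf8"))
```

Since \lstinline{payload} shares its memory with \lstinline{data}, the write is visible from both sides. That is exactly what makes it cheap to pass around.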
glossaries.tex
+7
@@ -5,6 +5,13 @@
5
5
% makeglossaries main
6
6
% pdflatex main
7
7
8
+
\newglossaryentry{API}
9
+
{
10
+
name=API,
11
+
description={An API, for Application Programming Interface, is the part of a program accessible from outside of said program that can be used to interact with it programmatically. It is often simply the sum of the interfaces of the classes, methods, functions and data structures that you can import and use, but it can also be a communication protocol, like for a Web API. API is a broad term that can be used to talk about many things. E.g.: a single function (its API would be the signature, return values and possible exceptions), a class (its API would be a set containing its methods' APIs, its attributes and its parents), a collection of all those for an entire library, or even the JSON format and URL to use for a Web API. We call \textquote{public API} an API that is officially supported and documented with some stability policy, and \textquote{private API} the mechanisms that are for internal use only and may change from one version to the next. }