Commit 5300631

Section on file path done
1 parent 9f19ac7 commit 5300631

4 files changed

+167
-35
lines changed

chapters/io.tex

+141-35
@@ -57,11 +57,13 @@ \section{Properly using bytes(), str() and unicode()}
5757

5858
Again, this does nothing in Python 3, which is what we want.
5959

60-
But careful, we don't want to turn \textbf{all} \lstinline{str()} to \lstinline{unicode()}!
60+
One side effect, though, is that text strings in Python 3 can require between two and four times as much memory to store as in Python 2, and big text blobs take more time to copy. There is no fix for that.
61+
62+
In any case, be careful, as we don't want to turn \textbf{all} \lstinline{str()} into \lstinline{unicode()}!
6163

6264
We do want to convert all strings meant to be readable by a human: help text, user messages, labels, etc.
6365

64-
But we do \textbf{not} want to convert bytes (mistakenly stored in a \lstinline{str()} in Python 2): network packets, pickled objects, dumps from the \lstinline{struct} module, file paths, anything to be written in a file in "b" mode, etc.
66+
But we do \textbf{not} want to convert data meant to be raw bytes but mistakenly stored in a \lstinline{str()} in Python 2: network packets, pickled objects, dumps from the \lstinline{struct} module, anything to be written in a file in "b" mode, etc.
6567

6668
So you need to go through your files, and find those. Then mark them as \lstinline{bytes()}, by adding a \textquote{b} prefix.
6769

@@ -73,7 +75,25 @@ \section{Properly using bytes(), str() and unicode()}
7375

7476
In Python 2, this will keep it as a \lstinline{str()}. In Python 3, it will make it a \lstinline{bytes()}. Again, this is what we want.
7577

76-
Be careful though, indexing or iterating through a \lstinline{str()} in Python 2 gives you \lstinline{str()}:
78+
Sometimes you may need to check whether something is not just text, but potential text, meaning either text or bytes. In Python 2, we could do:
79+
80+
\begin{py2}
81+
isinstance(data, basestring)
82+
\end{py2}
83+
84+
While there is tooling to help with that, a possible manual fix that works with Python 2 and 3 is:
85+
86+
\begin{py2and3}
87+
try:
88+
    basestring
89+
except NameError:
90+
    basestring = (str, bytes)  # Python 3: accept both text and bytes
91+
isinstance(data, basestring)
92+
\end{py2and3}
93+
94+
Be careful though, as indexing or iterating through a \lstinline{str()} in Python 2 gives you \lstinline{str()}:
7797

7898
\begin{py2}
7999
>>> list(b'qwerty')
@@ -132,13 +152,13 @@ \section{Properly using bytes(), str() and unicode()}
132152
iterbytes = bytes
133153
\end{py2and3}
134154

135-
Or use tooling to provide a standardized \lstinline{bytes()} for you.
155+
Or use the tools we'll introduce later to provide a standardized \lstinline{bytes()} for you.
136156

137157
\end{warning}
138158

139159
\section{Opening files}
140160

141-
When you use \lstinline{open()} to read a file, it has two modes: binary mode, and text mode. It's very misleading, in fact, all files are binaries. Some binaries actually contains text, although most file don't. Still, not only \lstinline{open()} maintain this false dichotomy, it actually opens files in text mode by default.
161+
When you use \lstinline{open()} to read a file, it has two modes: binary mode and text mode. This is very misleading: in fact, all files are binary. Some of them actually contain text, although most files don't. Still, not only does \lstinline{open()} maintain this false dichotomy, it actually opens files in text mode by default.
142162

143163
If you open a so-called binary file (zip, avi, mp3, odt, doc, etc.), you should use the \textquote{b} flag:
144164

@@ -147,7 +167,7 @@ \section{Opening files}
147167
data = f.read()
148168
\end{py2and3}
149169

150-
There is no notion of lines, so you must use \lstinline{.read()}. You will get \lstinline{str()} in Python 2, and \lstinline{bytes()} in Python 3. There is not much to do here, just be careful of what you do with the \lstinline{bytes()} after, since we have in the previous section there are small differences.
170+
There is no notion of lines in this mode, so you must use \lstinline{.read()}. You will get \lstinline{str()} in Python 2, and \lstinline{bytes()} in Python 3. There is not much to do here; just be careful about what you do with the \lstinline{bytes()} afterwards since, as we have seen in the previous section, there are small differences.
151171

152172
However, if you open a so-called text file (json, csv, ini, xml, etc.), you should use the \lstinline{encoding} parameter:
153173

@@ -159,22 +179,21 @@ \section{Opening files}
159179

160180
This will give you \lstinline{unicode()} in Python 2 and \lstinline{str()} in Python 3, and let you iterate line by line. This is where the previous chapter about encoding is useful. Follow its advice to choose the proper encoding.
161181

162-
Now, Python 2 doesn't have this parameter, but you can use \lstinline{codecs.open}:
182+
Now, Python 2 doesn't have this parameter, but you can use \lstinline{io.open()}:
163183

164184
\begin{py2and3}
165-
if sys.version_info.major < 3:
166-
from codecs import open
185+
from io import open
167186
\end{py2and3}
168187

169-
Please note, however, that it is much, much slower than the original Python 2 \lstinline{open()}.
188+
This does nothing in Python 3 and provides you with the \lstinline{open()} from Python 3 in Python 2.
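As a rough illustration of the paragraph above, here is a minimal sketch that writes and reads back a small text file with \lstinline{io.open()}. The file name and its content are made up for the example; the only assumption is that the same call is available in Python 2.6+ and in Python 3.

```python
import io
import os
import tempfile

# Hypothetical file, created in a temporary directory for this illustration.
path = os.path.join(tempfile.mkdtemp(), "settings.ini")

# io.open() accepts the `encoding` parameter in Python 2.6+ and Python 3.
with io.open(path, "w", encoding="utf8") as f:
    f.write(u"name = Zo\xeb\n")  # text goes in, encoded bytes land on disk

# Text mode: we get decoded text back and can iterate line by line.
with io.open(path, encoding="utf8") as f:
    lines = [line.rstrip(u"\r\n") for line in f]

# Binary mode: we get the raw bytes, with no decoding at all.
with io.open(path, "rb") as f:
    raw = f.read()
```

The text read back is \lstinline{unicode()} in Python 2 and \lstinline{str()} in Python 3, while \lstinline{raw} holds the UTF-8 encoded bytes in both versions.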
170189

171190
\section{Fun with file paths}
172191

173192
File paths are one of those features that just work 99\% of the time, until they don't. One reason is that Python is a cross-platform language, but different operating systems may treat paths differently.
174193

175194
We usually picture file paths as strings, but unfortunately, as stated by the Python documentation itself: \textquote{some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.}
176195

177-
So Python accept both \lstinline{str()} (or \lstinline{unicode()} in Python2) and \lstinline{\bytes()} when you interract with the file system. And it will return you the same type you used as input:
196+
So Python accepts both \lstinline{str()} (or \lstinline{unicode()} in Python 2) and \lstinline{bytes()} when you interact with the file system. And it returns the same type you used as input:
178197

179198
\begin{py3}
180199
>>> import os
@@ -184,47 +203,134 @@ \section{Fun with file paths}
184203
<class 'str'>
185204
\end{py3}
186205

187-
And of course most Python 2 programs just use the \lstinline{str()} type to deal with path, using it like a string, while it really behaves like a \lstinline{bytes()} under the hood. Also remember, any \lstinline{str()} in Python 2 is \textbf{in the encoding of the code file}. So people create file names with implicit encoding without knowing it. Luckily most of the time, this is ASCII, as developpers all around the world have been bitten with file names enought to be very careful to choose the most basic ones when they can.
206+
It's not a problem with Python in itself. It's the reality of computing, and all languages have to deal with it in some way. However, we do get an extra issue on our hands while porting code from Python 2 to Python 3, since the new version changes the semantics of text handling. Our problem has a problem now.
207+
208+
Because of course most Python 2 programs just use the \lstinline{str()} type to deal with paths, using it like a string, while it really behaves like a \lstinline{bytes()} under the hood. Also remember, any \lstinline{str()} in Python 2 is \textbf{in the encoding of the code file}. So people write file names with an implicit encoding without knowing it. Luckily, most of the time this is ASCII, as developers all around the world have been bitten by file names often enough to be very careful to choose the most basic ones when they can. Still, this is a source of bugs in the transition.
209+
210+
It also means that you can easily create a file in Python 2 that you can't open the same way in Python 3.
211+
212+
Let's say you have a CP850 encoded Python 2 script, and you do:
213+
214+
\begin{py2}
215+
>>> with open("chevron_Å.txt","w") as f:
216+
...     f.write("Locked")
217+
\end{py2}
218+
219+
Or even if you have an ASCII encoded script and you do:
220+
221+
\begin{py2}
222+
>>> with open("chevron_\x8f.txt","w") as f:
223+
...     f.write("Locked")
224+
\end{py2}
225+
226+
Or a user entered that. Or a database fed you that. Or it's stored in some config file somewhere.
227+
228+
When you try to open it with Python 3, the stdlib will read \lstinline{sys.getfilesystemencoding()}, which will be something other than CP850 (probably UTF-8), use it to encode the file name, then pass the result to the OS:
229+
230+
\begin{py3}
231+
>>> with open("chevron_Å.txt") as f:
232+
...     f.read()
233+
Traceback (most recent call last):
234+
  File "<stdin>", line 1, in <module>
235+
FileNotFoundError: [Errno 2] No such file or directory: 'chevron_Å.txt'
236+
\end{py3}
188237
189238
What does this mean for you?
190239
191-
First, look at all the hard coded file path in your code. Decide if they should be text or arbitrary bytes and mark them accordingly. Hint: unless you are doing something very specific, and on Unix, it should be text. Check that the return value is of the type you expect (\lstinline{str()} or \lstinline{bytes()}, depending of what you passed), and that the rest of the code using this value is made to handle this type.
240+
You need to assess your situation. Most of the time, there is not much to do: your application probably needs only basic path support, and the 99\% that works will do just fine. For the paths you hardcode, use only \lstinline{str()} on Python 3 and \lstinline{unicode()} on Python 2. For paths coming from a config file, choose an encoding (prefer UTF-8), decode the content, and use the resulting text. Treat paths like any other text.
192241
193-
Then, got through all code using the \lstinline{os}, \lstinline{shutil} and \lstinline{glob} modules
242+
One of the rare scenarios that require work is when you have hardcoded non-ASCII file paths in a code file whose encoding differs from the file system encoding. E.g.: you have \lstinline{os.listdir("./Téléchargements")} (which is \textquote{Downloads} in French) hardcoded in a \textquote{latin-1} Python file on an Ubuntu server using UTF-8 for its file system. If you change your path from bytes to text, Python will use \lstinline{sys.getfilesystemencoding()} to encode it and it will fail:
194243
195-
File path coming from somewhere else (database, config files, etc).
244+
\begin{py2}
245+
>>> os.listdir('./Téléchargements/')
246+
Traceback (most recent call last):
247+
File "<stdin>", line 1, in <module>
248+
OSError: [Errno 2] No such file or directory: './Téléchargements/'
249+
\end{py2}
196250
197-
Surogate escape
251+
That's an edge case, and it's unlikely to happen to you. But just in case it does, know that you should encode the path manually to the legacy encoding once, check if it exists, and rename it so that its name is encoded using \lstinline{sys.getfilesystemencoding()}:
198252
199-
\section{Formatting}
253+
\begin{py2}
254+
path = u'./Téléchargements/'
255+
legacy_encoded_path = path.encode('latin-1')
256+
if os.path.isdir(legacy_encoded_path):
257+
    os.rename(legacy_encoded_path, path)
258+
os.listdir(path)
259+
\end{py2}
200260
261+
Another possible, albeit very specific, source of problems is scanning a collection of files, or getting your paths from an external source (DB, socket) you don't have control over, while still needing to do some path manipulation on them. Indeed, you may, although rarely, encounter paths that are badly encoded, or that come with no encoding metadata to decode them, and are provided to you as bytes. In that case you need to use \textquote{surrogateescape}: it's a special, non-destructive error handler that lets you decode any bytes as UTF-8, and encode the result back to the original bytes. It won't look pretty, but you'll keep the data intact, and it will result in a \lstinline{str()}:
201262
202-
formatting bytes
263+
\begin{py3}
264+
>>> # getting some cp850 encoded text as bytes
265+
>>> data = "chevron_Å.txt".encode('cp850')
266+
>>> # decode it as utf8.
267+
>>> decoded = data.decode('utf8', errors='surrogateescape')
268+
>>> decoded
269+
'chevron_\udc8f.txt'
270+
>>> type(decoded)
271+
<class 'str'>
272+
>>> # encode, STILL USING UTF8, to get back the original bytes
273+
>>> encoded = decoded.encode('utf8', errors='surrogateescape')
274+
>>> type(encoded)
275+
<class 'bytes'>
276+
>>> encoded
277+
b'chevron_\x8f.txt'
278+
>>> # data is intact
279+
>>> encoded.decode('cp850')
280+
'chevron_Å.txt'
281+
\end{py3}
203282
204-
\section{Wait, there is I/O}
283+
So when you get an input with file paths that are bytes, decode them using \lstinline{.decode('utf8', errors='surrogateescape')}. Perform all the operations you need on the path. Then, when you want to pass it to a file-related function, run it through \lstinline{.encode('utf8', errors='surrogateescape')} to get the original bytes back.
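The decode, manipulate, encode cycle described above can be sketched as follows for Python 3. The helper names \lstinline{to_text()} and \lstinline{to_bytes()} and the sample path are hypothetical, chosen only for this illustration:

```python
import os.path


def to_text(raw_path):
    """Decode path bytes to str without losing undecodable bytes."""
    return raw_path.decode("utf8", errors="surrogateescape")


def to_bytes(text_path):
    """Encode a str produced by to_text() back to the original bytes."""
    return text_path.encode("utf8", errors="surrogateescape")


# A made-up CP850 encoded file name, received as bytes.
raw = b"music/chevron_\x8f.txt"

# Work on text: here we only swap the extension.
text = to_text(raw)
root, ext = os.path.splitext(text)
backup_text = root + ".bak"

# Back to bytes before handing the path to a file-related function.
backup = to_bytes(backup_text)
```

The odd byte travels through the manipulation untouched, so \lstinline{backup} still designates a name the file system can relate to the original file.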
205284
206-
% http://www.dabeaz.com/python3io_2010/MasteringIO.pdf
285+
Remember two things:
207286
208-
Text strings in Python 3 require either 2x as much memory to store as Python 2
287+
\begin{itemize}
288+
\item File-related functions return text if you pass them text; otherwise they return bytes.
289+
\item File-related functions may return some paths as bytes in the middle of all the text results, e.g. if they can't decode a path.
290+
\end{itemize}
209291
292+
If you want a truly robust program, you should check for this.
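One possible way to perform such a check is sketched below. The function name \lstinline{listdir_checked()} is made up for this example, and the demo directory is a temporary one; the idea is simply to separate the names that came back as text from those that did not.

```python
import os
import sys
import tempfile

# The text type is str on Python 3 and unicode on Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def listdir_checked(directory):
    """Split os.listdir() results into text names and non-text names."""
    good, suspicious = [], []
    for name in os.listdir(directory):
        if isinstance(name, text_type):
            good.append(name)
        else:
            # The name could not be decoded: keep it aside as raw bytes.
            suspicious.append(name)
    return good, suspicious


# Demo on a temporary directory holding one well-behaved file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "notes.txt"), "w").close()
good, suspicious = listdir_checked(text_type(demo_dir))
```

What to do with the suspicious names (log them, skip them, decode them with a fallback) depends on your application.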
210293
211-
bad filenames were easy to create in python 2
294+
When you want a bulletproof way of writing messed up paths to the terminal, it can get a bit hairy:
212295
213-
If you ever see a \lstinline{\udcxx} character, it means that a non-decodable byte was passed in from a system interface
296+
\begin{py2and3}
297+
import sys
214298
215-
s.decode('utf-8','surrogateescape')
299+
# print will not use surrogateescape so it may fail, so we need to do it
300+
# manually and write the result to sys.stdout.
301+
# Except sys.stdout doesn't support writing bytes directly in Python 3
302+
# so we write to stdout directly in Python 2, or to its buffer in Python 3
303+
stdout_fd = getattr(sys.stdout, 'buffer', sys.stdout)
216304
217-
TextIOWrapper 10 times faster than codecs.open
305+
# We encode to bytes using surrogateescape before writing.
306+
# This assumes 'text' has been created using surrogateescape.
307+
def print_dirty_bytes(text, end=b'\n'):
308+
    stdout_fd.write(text.encode("utf8", errors="surrogateescape") + end)
218309
219-
basestring
310+
print_dirty_bytes(the_path_less_travelled)
311+
\end{py2and3}
220312
221-
%All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
313+
Finally, if you share paths with another system, encode them as UTF-8 with surrogateescape to get the original bytes back and share those as-is, since you don't know what strategy the other system is going to use to deal with weird file names.
222314
223-
% has been removed and reinstroduced
315+
Again, I'd like to stress that these are not common use cases. Python programs that manage a lot of files they have no control over (e.g. Dropbox syncing your files or Beet loading up your music library) have to deal with this, but you may very well not.
224316
225-
pathlib
317+
I also know it is tempting to use the excellent \lstinline{pathlib} at this point, especially since there is a backport on PyPI. But in my opinion, it would add complexity to the migration. Better keep it for new Python-3-only projects.
318+
319+
\section{Formatting}
320+
321+
In Python 2, you could call \lstinline{.format()} on both \lstinline{str()} and \lstinline{unicode()}. This ability has been removed for bytes in Python 3: only text can be formatted this way, or using the newer f-strings.
322+
323+
For a while, Python 3 also removed the possibility to format bytes using \lstinline{\%}, leaving low-level devs to deal manually with byte formatting for sockets or images. After pushback from the community, it was added back in Python 3.5.
324+
325+
If your bytes are text, you should decode anyway, so you'll be able to format all you want. If your bytes need to be manipulated as-is, then either you must target Python 3.5, or change all your byte formatting code for something you do manually. Given the work the latter option represents, I would advise you to just target 3.5 if you have a lot of bytes to format.
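To make the two options concrete, here is a small sketch comparing \lstinline{\%} formatting on bytes (Python 2, and Python 3.5 or later) with a manual equivalent that also runs on Python 3.0 to 3.4. The \textquote{LEN} header is an invented wire format, used only for illustration:

```python
payload = b"Locked"

# Option 1: %-formatting directly on bytes.
# Works in Python 2 and in Python 3.5+, but not in Python 3.0 to 3.4.
header = b"LEN %d\r\n" % len(payload)

# Option 2: manual formatting, which works on every version.
# Each piece is converted to bytes explicitly, then joined.
manual_header = b"".join([
    b"LEN ",
    str(len(payload)).encode("ascii"),
    b"\r\n",
])

packet = header + payload
```

Both headers are identical; the manual version is just more verbose, which is why it gets tedious when a code base formats a lot of bytes.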
326+
327+
\section{Wait, there is I/O}
328+
329+
330+
331+
332+
%All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
226333
227-
%# coding: utf8
228334
229335
\section{file()}
230336
@@ -234,9 +340,9 @@ \section{file()}
234340
if isinstance(someobj, IOBase):
235341
\end{py2}
236342
237-
\section{from buffer() tp memoryview()}
343+
\section{from buffer() to memoryview()}
238344
239-
Those two functions respectively create objects of the same name - a \lstinline{buffer} and a \lstinline{memoryview}. Both are a way to get a subset of something without copying it:
345+
Those two functions (in fact, they are more like classes) respectively create objects of the same name - a \lstinline{buffer} and a \lstinline{memoryview}. Both are a way to get a subset of something without copying it:
240346
241347
\begin{py2}
242348
>>> donkey_lines = "Are we there yet ?\n" * 10000
@@ -268,11 +374,11 @@ \section{from buffer() tp memoryview()}
268374
e
269375
\end{py2}
270376
271-
It works on \lstinline{bytes()}, \lstinline{bytearray}, \lstinline{array.array}...Everything that implements the so-called \textquote{buffer protocol}. This is a very nice optimization that can save quite a lot of memory/CPU if you manipulate huge chunks of bytes and pass around subset of them, e.g: to files or sockets.
377+
It works on \lstinline{bytes()}, \lstinline{bytearray}, \lstinline{array.array}... Everything that implements an \gls{API} called the \textquote{buffer protocol}. This is a very nice optimization that can save quite a lot of memory/CPU if you manipulate huge chunks of bytes and pass around subsets of them, e.g. to files or sockets.
272378
273-
With the rise of performant c libs wrapped in Python (numpy, GUI toolkits, database drivers, etc.), the need for more information about the underlying data than what \lstinline{buffer()} was offering became important. \lstinline{memoryview} provides an answer to that, being able to return the shape, dimension or type or the object behind it.
379+
With the rise of performant C libs wrapped in Python (numpy, GUI toolkits, database drivers, etc.), the need for more information about the underlying data than what \lstinline{buffer()} was offering became important. \lstinline{memoryview} provides an answer to that, being able to return the shape, dimension or type of the object behind it.
274380
275-
\lstinline{buffer()} is not more in Python 3, just replace it with \lstinline{memoryview()}. The later exists in Python 2.7, plus it does the same thing, just better. The only difficulty will be that \lstinline{buffer()} accepts \lstinline{unicode()} objects but \lstinline{memoryview()} only accept bytes, and so you'll need to encode them.
381+
\lstinline{buffer()} is no more in Python 3; just replace it with \lstinline{memoryview()}. The latter exists in Python 2.7, and it does the same thing, just better. The only difficulty is that \lstinline{buffer()} accepts \lstinline{unicode()} objects but \lstinline{memoryview()} only accepts bytes, so you'll need to encode them.
276382
277383
So:
278384

chapters/references.tex

+3
@@ -0,0 +1,3 @@
1+
2+
3+
% http://www.dabeaz.com/python3io_2010/MasteringIO.pdf

glossaries.tex

+7
@@ -5,6 +5,13 @@
55
% makeglossaries main
66
% pdflatex main
77

8+
\newglossaryentry{API}
9+
{
10+
name=API,
11+
description={An API, for Application Programming Interface, is the part of a program accessible from outside of said program that can be used to interact with it programmatically. It is often simply the sum of the interfaces of the classes, methods, functions and data structures that you can import and use, but it can also be a communication protocol, like for a Web API. API is a broad term that can be used to talk about many things, e.g. a single function (its API would be the signature, return values and possible exceptions), a class (its API would be a set containing the API of its methods, its attributes and its parents), a collection of all those for an entire library, or even the JSON format and URL to use for a Web API. We call \textquote{public API} an API that is officially supported and documented with some stability policy, and \textquote{private API} the mechanisms that are for internal use only and may change from one version to the next.}
12+
}
13+
14+
815
\newglossaryentry{builtin}
916
{
1017
name=builtin,
