Strings and Regexes
Strings are immutable sequences of Unicode code points. You can index into them and slice them.
len('hi') # => 2
'hello'[1] # => 'e'
"""This is
a multi-line
string."""
"""Ignore the newline\
at the end of that line."""
# ^^ No space between "newline" and "at".
Embedding unicode:
"this → that"
# or
"this \u2192 that"
# or
"this \N{RIGHTWARDS ARROW} that"
Some methods:
capitalize
lower
upper
title
replace
splitlines
strip, lstrip, rstrip Can take arg specifying what to strip!
count how many non-overlapping times a given substring is present
find returns -1 on failure
index raises exception on failure
format
format_map
startswith can take a tuple
endswith same as above
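A quick sketch of a few of the methods above. The sample strings and variable names are made up for illustration:

```python
# Sample values are made up for illustration.
padded = 'xxhixx'
stripped = padded.strip('x')        # strip() accepts the set of characters to remove

overlaps = 'aaaa'.count('aa')       # counts non-overlapping occurrences only

missing = 'hello'.find('z')         # find() signals failure with -1
try:
    'hello'.index('z')              # index() raises ValueError instead
    index_failed = False
except ValueError:
    index_failed = True

# A tuple argument means "ends with any of these".
is_image = 'photo.png'.endswith(('.png', '.jpg'))
```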
Easy way to get a list of words:
'foo bar baz moo'.split()
Converting between hex strings and ints:
hex(255) # => '0xff'
import sys
hex(sys.maxunicode) # => '0x10ffff'
int('ff', 16) # => 255
# Base 0 means to look at the 2-char prefix to determine the base.
int('0xff', 0) # => 255
# And, of course:
str(0xff) # => '255'
By default, str.replace replaces every occurrence, but you can pass a count argument to limit how many replacements it performs.
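For instance, with a made-up sample string, the optional third argument caps the number of replacements:

```python
text = 'a-b-c-d'                        # made-up sample
all_replaced = text.replace('-', '+')   # every occurrence is replaced
first_two = text.replace('-', '+', 2)   # only the first two occurrences
```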
Use repr to get a string representation of a Python object, that is, how the object would be written in Python code (to be eval'd).
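A small sketch of the difference between str and repr on a string containing a newline. It uses ast.literal_eval rather than eval for the round trip, which is a choice made here because it only accepts literals:

```python
import ast

s = 'hi\nthere'
as_str = str(s)      # the text itself, containing a real newline
as_repr = repr(s)    # a Python literal: quotes included, newline escaped

# For simple values like this one, the repr evaluates back to the original.
round_trip = ast.literal_eval(as_repr) == s
```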
format
The str.format method. There is also a built-in format function, which formats a single value (it just calls its first arg's __format__ method).
'a {} b'.format('XX') # => 'a XX b'
'a {foo} b {bar} c'.format(foo='XX', bar='YY') # => 'a XX b YY c'
d = {'a': 1, 'b': 2}
'{a} and {b}'.format(**d) # => '1 and 2'
x = 12.348
format(x, '0.2f') # => '12.35'
Previously, the '...' % ... operator was used instead of the format function.
Both format and str.format take format specs that closely resemble the sprintf formatting codes.
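As a rough side-by-side, here is the same output produced three ways. The values and field widths are arbitrary choices for illustration:

```python
value = 3.14159
name = 'pi'

# str.format: alignment, width, and precision go after the colon.
new_style = '{:>8.2f}|{:<4}|'.format(value, name)

# The older % operator with the equivalent sprintf-style codes.
old_style = '%8.2f|%-4s|' % (value, name)

# The built-in format() applies one format spec to one value.
single = format(255, '#x')
```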
For more, see the docs at library/string.html. See also https://mkaz.github.io/2012/10/10/python-string-format/.
Regular Expressions
import re
s = 'foo123bar456baz'
re.split(r'\d+', s, flags=re.M|re.S)
#=> ['foo', 'bar', 'baz']
re.findall(r'\d+', s)
#=> ['123', '456']
re.findall(r'xxx', s)
#=> []
\A is beginning of string.
\Z is end of string.
Use (?:...) for a non-capturing group.
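A short sketch of how these behave, using made-up sample text. The comparison with ^ under re.M is added here only to show why \A is useful:

```python
import re

text = 'one\ntwo'   # made-up two-line sample

# With re.M, ^ matches at the start of every line, but \A still matches
# only at the very start of the string.
line_starts = re.findall(r'^\w+', text, flags=re.M)
string_start = re.findall(r'\A\w+', text, flags=re.M)

# (?:...) groups without capturing, so findall returns only the (\d+) part.
captured = re.findall(r'(?:foo|bar)-(\d+)', 'foo-1 bar-22 baz-3')
```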
re.match and re.search return None if no match.
some_group = re.search(r'some-regex', line)
some_group.group(0) # Gets you the entire matched text.

# This is the one you usually want.
re.findall(r'...', s) # Global search. Returns a possibly-empty list.

re.findall(r'\{\{(.+?)}}', 'foo 12 {{bar}} 123{{baz}}45moo {{oof}}')
#=> ['bar', 'baz', 'oof']
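Because re.match and re.search can return None, it is worth checking the result before calling .group(). A minimal sketch with a made-up input line:

```python
import re

line = 'id=42 name=foo'   # made-up sample

at_start = re.match(r'\d+', line)    # match() only tries at the start of the string
anywhere = re.search(r'\d+', line)   # search() scans the whole string

# Guard against None before asking for the matched text.
number = anywhere.group(0) if anywhere else None
```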
To search/replace: re.sub. Does a global search/replace. Use \1, \2, etc. to use groups in the replace-text.
# re.sub(regex, replacement, text)
re.sub(r'...(\d)-(\d)...', r'...\2-\1...', some_text)
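A concrete version of the pattern above, with a made-up input string. The count argument shown at the end is optional and limits the number of replacements:

```python
import re

sample = 'a 1-2 b 3-4'   # made-up sample

# Swap the two digits around each hyphen using backreferences.
swapped = re.sub(r'(\d)-(\d)', r'\2-\1', sample)

# Limit the work to the first match only.
first_only = re.sub(r'(\d)-(\d)', r'\2-\1', sample, count=1)
```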
TODO:
- replace unbreakable whitespace with space character.
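One possible approach to the TODO item, assuming "unbreakable whitespace" means U+00A0 NO-BREAK SPACE (other non-breaking characters would need their own handling):

```python
text = 'foo\u00a0bar'                    # made-up sample containing U+00A0
cleaned = text.replace('\u00a0', ' ')    # swap in an ordinary space
```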
Unicode
Read http://nedbatchelder.com/text/unipain.html.
Unicode code points are written as 4, 5, or 6 hex digits prefixed with “U+”. Every character has an unambiguous full name in uppercase ASCII (for example, “CHECK MARK”).
Code points map to bytes via an encoding. Use UTF-8.
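To make the code point, name, and encoding relationship concrete, here is a small sketch using the arrow character from earlier in these notes:

```python
import unicodedata

arrow = '\u2192'

name = unicodedata.name(arrow)                 # the character's full Unicode name
code_point = 'U+{:04X}'.format(ord(arrow))     # the conventional U+ notation
utf8_bytes = arrow.encode('utf-8')             # one code point, three bytes in UTF-8
```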
Legacy: Back in Python 2, "this" gave you a str, a sequence of bytes. u"this" gave you a unicode, a sequence of code points. You could then do unicode_s.encode('utf-8') to get a str (bytes), and s.decode('utf-8') to get a unicode (u"one of these"). Concatenating u"this" + "that" gets you u"thisthat" (a unicode). Python 2 tries to be helpful by doing implicit conversions, but this can result in pain.
In Python 3: "this" is a str, which is a sequence of code points. b"this" is a bytes, a sequence of bytes.
Python 3 does not try to implicitly convert for you; 'this' + b'that' fails with a TypeError. b'this' != 'this'.
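A minimal sketch of that behaviour, with an explicit decode as the fix:

```python
try:
    'this' + b'that'          # mixing str and bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Convert explicitly, then concatenate.
explicit = 'this' + b'that'.decode('utf-8')

# A bytes object never compares equal to a str.
equal = (b'this' == 'this')
```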
open('foo.txt', 'r').read() gets you unicode/str (using the default encoding on this machine as reported by locale.getpreferredencoding()). open('foo.txt', 'rb').read() gets you bytes.
import locale
locale.getpreferredencoding() # => 'UTF-8'
Careful: on Windows, the default encoding may be cp1252 (also known as Windows-1252).
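One way to sidestep the platform default is to pass encoding explicitly to open. A sketch, writing to a throwaway temp directory (the path and file contents are made up for illustration):

```python
import os
import tempfile

# Throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('this \u2192 that')

with open(path, 'r', encoding='utf-8') as f:
    text = f.read()     # str, decoded as UTF-8 regardless of the locale

with open(path, 'rb') as f:
    raw = f.read()      # bytes, exactly as stored on disk
```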
Data coming into or going out of your program is all bytes. Decode incoming bytes into unicode, and encode outgoing unicode into bytes:
'hi there'.encode() # => b'hi there'
b'hey'.decode() # => 'hey'