reading-notes

Automation

RegEx

Source: Python Regular Expressions Tutorial

Basic matching

  1. import re
  2. re.match(pattern, sequence)
    • pattern is written as a raw string literal (r"...")
    • backslashes in it reach the regex engine exactly as typed
  3. Returns a match object, or None if the start of the sequence does not match
import re

pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")
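
re.match only tries the pattern at the start of the string, while re.search scans the whole string. A minimal sketch of the difference, with sample strings chosen only for illustration:

```python
import re

pattern = r"cookie"
sequence = "Cake and cookie"

# re.match anchors at position 0, so it finds nothing here
at_start = re.match(pattern, sequence)

# re.search scans forward until the pattern matches somewhere
anywhere = re.search(pattern, sequence)

# re.fullmatch requires the entire string to match the pattern
whole = re.fullmatch(pattern, "cookie")

print(at_start)           # None
print(anywhere.group())   # cookie
print(anywhere.span())    # (9, 15)
print(whole is not None)  # True
```

Because a failed match returns None, check the result before calling .group() on it.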

Wildcards

Special characters (metacharacters) that are not matched literally but change how the pattern matches

# '.' matches any character
re.search(r'Co.k.e', 'Cookie').group()
Character(s) What it does
. A period. Matches any single character except the newline character.
^ A caret. Matches a pattern at the start of the string.
\A Uppercase A. Matches only at the start of the string.
$ Dollar sign. Matches at the end of the string, or just before a newline at the end of the string.
\Z Uppercase Z. Matches only at the end of the string.
[ ] Matches the set of characters you specify within it.
\ ∙ If the character following the backslash is a recognized escape character, then the special meaning of the term is taken. ∙ Else the backslash (\) is treated like any other character and passed through. ∙ It can be used in front of all the metacharacters to remove their special meaning.
\w Lowercase w. Matches any single letter, digit, or underscore.
\W Uppercase W. Matches any character not part of \w (lowercase w).
\s Lowercase s. Matches a single whitespace character like: space, newline, tab, return.
\S Uppercase S. Matches any character not part of \s (lowercase s).
\d Lowercase d. Matches decimal digit 0-9.
\D Uppercase D. Matches any character that is not a decimal digit.
\t Lowercase t. Matches tab.
\n Lowercase n. Matches newline.
\r Lowercase r. Matches carriage return.
\b Lowercase b. Matches only at a word boundary (the beginning or end of a word).
+ Matches the preceding character or group one or more times.
* Matches the preceding character or group zero or more times.
? ∙ Matches the preceding character or group zero or one time. ∙ Placed after + or *, makes that quantifier non-greedy (+?, *?).
{ } Matches the preceding character or group an explicit number of times, e.g. {3} or {2,4}.
( ) Creates a group when performing matches.
(?P<name> ) Creates a named group when performing matches.
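
A few of the characters above in action. This is a small sketch; the sample strings are made up for illustration.

```python
import re

text = "Order 66 shipped on 2021-03-04"  # made-up sample text

# \d with + and {n}: runs of digits, or an exact digit count
numbers = re.findall(r"\d+", text)
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

# ^ anchors at the start of the string; [ ] defines a character set
starts_with_order = bool(re.search(r"^Order", text))
vowels = re.findall(r"[aeiou]", "Cookie")

# \b is a word boundary, so "cat" inside "concatenate" is skipped
whole_word = re.findall(r"\bcat\b", "cat concatenate cat")

# A backslash removes the special meaning of a metacharacter
literal_dots = re.findall(r"\.", "a.b.c")

print(numbers)            # ['66', '2021', '03', '04']
print(date)               # 2021-03-04
print(starts_with_order)  # True
print(vowels)             # ['o', 'o', 'i', 'e']
print(whole_word)         # ['cat', 'cat']
print(literal_dots)       # ['.', '.']
```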

Grouping

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)
if match:  # re.search returns None when nothing matches
  print("Email address:", match.group()) # The whole matched text
  print("Username:", match.group(1)) # The username (group 1)
  print("Host:", match.group(2)) # The host (group 2)

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)
if match:  # re.search returns None when nothing matches
  print("Email address:", match.group('email'))
  print("Username:", match.group('username'))
  print("Host:", match.group('host'))
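
All groups can also be pulled out at once instead of one at a time. A minimal sketch using the same sample statement, with groups() for positional access and groupdict() for named access:

```python
import re

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<username>[\w.-]+)@(?P<host>[\w.-]+)', statement)

if match:
    # groups() returns every captured group as a tuple, in order
    all_groups = match.groups()
    # groupdict() maps each group name to its captured text
    named = match.groupdict()
    print(all_groups)  # ('support', 'datacamp.com')
    print(named)       # {'username': 'support', 'host': 'datacamp.com'}
```

Inside a character set the dot is already literal, so [\w.-] behaves the same as [\w\.-].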

Greedy / Non-Greedy

pattern = "cookie"
sequence = "Cake and cookie"
heading  = r'<h1>TITLE</h1>'

# greedy
re.match(r'<.*>', heading).group()
# '<h1>TITLE</h1>'

# non-greedy
re.match(r'<.*?>', heading).group()
# '<h1>'
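
The difference is easier to see with more than one tag in the string. A short sketch using re.findall; the html sample is made up for illustration:

```python
import re

html = '<h1>TITLE</h1><p>Body</p>'  # made-up sample

# Greedy: .* runs to the last '>' in the string, giving one big match
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r'<.*?>', html)

# Alternative without a lazy quantifier: anything that is not '>'
negated = re.findall(r'<[^>]+>', html)

print(greedy)   # ['<h1>TITLE</h1><p>Body</p>']
print(lazy)     # ['<h1>', '</h1>', '<p>', '</p>']
print(negated)  # ['<h1>', '</h1>', '<p>', '</p>']
```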

Helpful functions
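
Some functions of the re module that tend to come up often, as a rough sketch. The selection and the sample text are my own assumptions, not a list taken from the tutorial.

```python
import re

text = "Cake, cookie;  candy"  # made-up sample text

# findall: every non-overlapping match, as a list of strings
words = re.findall(r"\w+", text)

# finditer: match objects, useful when positions are needed
starts = [m.start() for m in re.finditer(r"\w+", text)]

# split: break the string on a pattern instead of a fixed separator
parts = re.split(r"[,;]\s*", text)

# sub: replace every match with a replacement string
masked = re.sub(r"c\w+", "X", text, flags=re.IGNORECASE)

# compile: build the pattern once and reuse it
word_re = re.compile(r"\w+")
first_word = word_re.search(text).group()

print(words)       # ['Cake', 'cookie', 'candy']
print(starts)      # [0, 6, 15]
print(parts)       # ['Cake', 'cookie', 'candy']
print(masked)      # X, X;  X
print(first_word)  # Cake
```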

Flags

Additional expression behavior to specify
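
Flags are passed through the flags argument and can be combined with |. A minimal sketch of three common ones (re.IGNORECASE, re.MULTILINE, re.DOTALL); the sample text is made up for illustration:

```python
import re

text = "Cookie\ncookie\nCOOKIE"  # made-up sample text

# re.IGNORECASE (re.I): letters match regardless of case
all_cases = re.findall(r"cookie", text, flags=re.IGNORECASE)

# re.MULTILINE (re.M): ^ and $ also match at the start/end of each line
line_starts = re.findall(r"^c\w+", text, flags=re.MULTILINE | re.IGNORECASE)

# Without re.MULTILINE, ^ matches only at the start of the whole string
first_only = re.findall(r"^c\w+", text, flags=re.IGNORECASE)

# re.DOTALL (re.S): '.' also matches the newline character
spans_lines = re.search(r"Cookie.cookie", text, flags=re.DOTALL)

print(all_cases)    # ['Cookie', 'cookie', 'COOKIE']
print(line_starts)  # ['Cookie', 'cookie', 'COOKIE']
print(first_only)   # ['Cookie']
print(spans_lines is not None)  # True
```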

shutil

Source: shutil

High-level file operations

Copying files
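
A minimal sketch of copying a file with shutil.copy and shutil.copy2. The file names are hypothetical, and everything runs in a temporary directory so nothing real is touched.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "notes.txt"  # hypothetical file name
    src.write_text("hello")

    # copy: copies the file contents and permission bits, returns the destination
    copied = shutil.copy(src, Path(tmp) / "notes_copy.txt")

    # copy2: like copy, but also tries to preserve metadata such as timestamps
    copied2 = shutil.copy2(src, Path(tmp) / "notes_copy2.txt")

    copied_text = Path(copied).read_text()
    copied2_text = Path(copied2).read_text()

print(copied_text)   # hello
print(copied2_text)  # hello
```

For whole directory trees, shutil.copytree(src_dir, dst_dir) copies recursively.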

Finding Files
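
Assuming this heading refers to shutil.which, which looks up an executable on the search path: a short sketch. The command names are only examples, and the first result depends on the machine.

```python
import shutil

# which: full path of an executable found on PATH, or None if there is none
python_path = shutil.which("python3")        # machine dependent, may be None
missing = shutil.which("no-such-command-xyz")  # hypothetical name, not expected to exist

print(python_path)
print(missing)  # None
```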

Archives
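
A minimal sketch of creating and unpacking an archive with shutil.make_archive and shutil.unpack_archive. The directory and file names are hypothetical, and everything runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "data"  # hypothetical directory to archive
    data.mkdir()
    (data / "a.txt").write_text("alpha")

    # make_archive: base name without extension, format, directory to pack;
    # returns the path of the archive it created
    archive_path = shutil.make_archive(str(root / "backup"), "zip", root_dir=data)

    # unpack_archive: format is inferred from the file extension
    out = root / "restored"
    shutil.unpack_archive(archive_path, out)

    archive_name = Path(archive_path).name
    restored = (out / "a.txt").read_text()

print(archive_name)  # backup.zip
print(restored)      # alpha
```

shutil.get_archive_formats() lists the formats available on the current installation.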