Regular Expressions

A regular expression is a sequence of characters, called the pattern, that is used to find substrings in a given string.

For example, suppose you have a dataset of customer reviews about your restaurant. Say, you want to extract the emojis from the reviews because they are a good predictor of the sentiment of the review.

As another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a question or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times. Having found these, the assistant can automatically make a booking or suggest that you call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They help you clean and handle your text much more effectively.

Online tools where you can test your Regular Expressions

https://regex101.com/

Learning Resource

https://regexone.com/

https://pycon2016.regex.training/cheat-sheet - Cheat Sheet

Let's import the regular expression library in Python.

In [1]:
import re

Let's do a quick search using a pattern.

In [5]:
re.search('Ravi', 'Ravi is an exceptional student! Ravi is Brilliant')
Out[5]:
<_sre.SRE_Match object; span=(0, 4), match='Ravi'>
In [13]:
# Print the text matched by re.search()
match = re.search('Ravi', 'Ravi is an exceptional student! Ravi is brilliant')
print(match.group())  # search() finds only the first occurrence of the pattern
Ravi
In [15]:
### Get starting position of the word Ravi
print("Starting position of the word Ravi",match.start())
print("Ending position of the word Ravi",match.end())
Starting position of the word Ravi 0
Ending position of the word Ravi 4
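One case the cells above do not cover: `re.search()` returns `None` when the pattern is absent, so calling `.group()` on the result would raise an `AttributeError`. A minimal sketch of a guard follows; `describe_match` is a hypothetical helper name used only for this illustration.

```python
import re

def describe_match(pattern, text):
    """Return (matched_text, start, end), or None when nothing matches."""
    match = re.search(pattern, text)
    if match is None:  # re.search() returns None when there is no match
        return None
    return match.group(), match.start(), match.end()

sentence = 'Ravi is an exceptional student! Ravi is brilliant'
print(describe_match('Ravi', sentence))  # ('Ravi', 0, 4)
print(describe_match('Sita', sentence))  # None
```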

Let's define a function to match regular expression patterns

In [72]:
def find_pattern(text, patterns, flags=None):
    # Pass flags='I' to ignore case; any other value searches case-sensitively
    re_flags = re.I if flags == 'I' else 0
    if re.search(patterns, text, flags=re_flags):
        return 'Found a match!'
    return 'Not Found!'

Quantifiers

In [73]:
# '*': Zero or more 
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abbc", "ab*"))
print(find_pattern("home--brew","home-*brew")) 
print(find_pattern("home brew","home-*brew"))
print(find_pattern("abedbc", "ab*bc"))  # abc, abbbbbc are all valid strings here
Found a match!
Found a match!
Found a match!
Found a match!
Not Found!
Not Found!
Match a binary number that starts with 101 followed by zero or more zeroes.
In [38]:
pattern='1010*'
print(find_pattern("10", pattern))
print(find_pattern("10100", pattern))
print(find_pattern("101000", pattern))
print(find_pattern("101", pattern))
print(find_pattern("100", pattern))
print(find_pattern("1", pattern))

Not Found!
Found a match!
Found a match!
Found a match!
Not Found!
Not Found!
Write a pattern that matches a string that starts with 1 and ends with 0, with an arbitrary number of 1s (zero or more) in between.
In [42]:
pattern = '11*0' #can also be written as 1+0
print(find_pattern("11111110",pattern))
print(find_pattern("11",pattern))
Found a match!
Not Found!
In [26]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abbc", "ab?"))
print(find_pattern("home--brew","home-?brew")) # 'home-?brew' matches home-brew or homebrew, but not home--brew
print(find_pattern("home brew","home-?brew"))
print(find_pattern("abedbc", "ab?bc"))

Found a match!
Found a match!
Found a match!
Not Found!
Not Found!
Not Found!
In [31]:
## Check if the word car or cars is present in a string. In 'cars?', the 's' may be present or absent
print(find_pattern("I love my car","cars?"))
print(find_pattern("I love  cars","cars?"))
print(find_pattern("I love  cabs","cars?"))

Found a match!
Found a match!
Not Found!

Write a regular expression that matches the following words:

xyz xy xz x

Make sure that the regular expression doesn’t match the following words: Xyyz Xyzz Xyy Xzz Yz

In [34]:
pattern='xy?z?'
print(find_pattern("Xyyz",pattern))
Not Found!
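Because `re.search()` looks for the pattern anywhere in the string, `xy?z?` on its own would also accept a lowercase word such as `xyyz` (it finds `xy` inside it). The sketch below assumes the exercise words are meant to be checked as whole lowercase words, and uses `re.fullmatch()` so that the entire word has to fit the pattern.

```python
import re

pattern = r'xy?z?'
should_match = ['xyz', 'xy', 'xz', 'x']
should_not_match = ['xyyz', 'xyzz', 'xyy', 'xzz', 'yz']  # lowercase variants, an assumption

# fullmatch() succeeds only when the pattern covers the whole word
accepted = [w for w in should_match + should_not_match if re.fullmatch(pattern, w)]
print(accepted)  # ['xyz', 'xy', 'xz', 'x']
```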
In [47]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("abc", "ab+"))
print(find_pattern("abbc", "ab+"))
Not Found!
Found a match!
Found a match!
Match pattern for multiples of 10
In [49]:
pattern ='[1-9]*0+' 
print(find_pattern("100",pattern))
print(find_pattern("20",pattern))

Found a match!
Found a match!
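The pattern above is unanchored, so `find_pattern` reports a match for any string that contains a 0, including `105`. A stricter sketch is shown below, assuming the input is a positive decimal integer without leading zeros; the name `multiple_of_10` is chosen only for this example.

```python
import re

# A non-zero leading digit, then any digits, then a final 0
multiple_of_10 = r'[1-9][0-9]*0'

candidates = ['100', '20', '1010', '105', '7', '0']
results = {n: bool(re.fullmatch(multiple_of_10, n)) for n in candidates}
for number, ok in results.items():
    print(number, ok)
```

With `re.fullmatch()` the whole string must fit the pattern, so `105` and `7` are rejected while `1010` is accepted.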
In [8]:
# {n}: Matches if a character is present exactly n number of times
print(find_pattern("abbc", "ab{2}"))

Found a match!
In [50]:
# {m,n}: Matches if a character is present from m to n number of times
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # 'a' followed by 3 to 5 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # 'a' followed by 7 to 10 'b's
print(find_pattern("aabbbbbbc", "ab{,10}"))   # 'a' followed by at most 10 'b's
print(find_pattern("aabbbbbbc", "ab{10,}"))   # 'a' followed by at least 10 'b's
Found a match!
Not Found!
Found a match!
Not Found!
Match the word hurray, where r occurs a minimum of two times and a maximum of five.
In [53]:
pattern='hur{2,5}ay'
print(find_pattern("hurrrrray",pattern))
Found a match!

Anchors

In [10]:
# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # return true if string starts with 'J' 
print(find_pattern("Pramod", "^J"))  # return true if string starts with 'J' 
print(find_pattern("India", "a$"))   # return true if string ends with 'a'
print(find_pattern("Japan", "a$"))   # return true if string ends with 'a'

Found a match!
Not Found!
Found a match!
Not Found!

Wildcard

In [11]:
# '.': Matches any character except a newline (unless the re.DOTALL flag is set)
print(find_pattern("a", "."))
print(find_pattern("#", "."))

Found a match!
Found a match!
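One detail worth checking for the wildcard: by default `.` does not match a newline. A small sketch of the difference, using the standard `re.DOTALL` flag:

```python
import re

default_match = re.search('a.b', 'a\nb')                  # '.' skips '\n' by default
dotall_match = re.search('a.b', 'a\nb', flags=re.DOTALL)  # DOTALL lets '.' match '\n'

print(default_match)               # None
print(dotall_match is not None)    # True
```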

Character sets

In [12]:
# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above
Found a match!
Found a match!
In [13]:
# '^' is used inside character set to indicate complementary set
print(find_pattern("a", "[^abc]"))  # [^abc] matches any one character other than a, b or c; "a" contains none
Not Found!

Character sets

Pattern     Matches
[abc]       Either an a, b or c character
[abcABC]    Either an a, A, b, B, c or C character
[a-z]       Any character between a and z, including a and z
[A-Z]       Any character between A and Z, including A and Z
[a-zA-Z]    Any letter between a and z, in either lower or upper case
[0-9]       Any digit between 0 and 9

Meta sequences

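Pattern     Equivalent to
\s          [ \t\n\r\f\v]
\S          [^ \t\n\r\f\v]
\d          [0-9]
\D          [^0-9]
\w          [a-zA-Z0-9_]
\W          [^a-zA-Z0-9_]

In Python 3 these equivalences are exact only with the re.ASCII flag (or for bytes patterns); for ordinary str patterns, \d, \w and \s also match Unicode digits, letters and whitespace.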
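To see the meta sequences in action, here is a small sketch on a made-up sample string (the string is an assumption for illustration only). The last two lines show how the `re.ASCII` flag narrows `\w` to `[a-zA-Z0-9_]`.

```python
import re

sample = 'Order #42 shipped on 2018-05-18\tto user_7'

digits = re.findall(r'\d+', sample)   # runs of digits
words = re.findall(r'\w+', sample)    # runs of word characters
spaces = re.findall(r'\s', sample)    # individual whitespace characters

print(digits)   # ['42', '2018', '05', '18', '7']
print(words)    # ['Order', '42', 'shipped', 'on', '2018', '05', '18', 'to', 'user_7']
print(spaces)   # [' ', ' ', ' ', ' ', '\t', ' ']

unicode_words = re.findall(r'\w+', 'naïve')                # ['naïve']
ascii_words = re.findall(r'\w+', 'naïve', flags=re.ASCII)  # ['na', 've']
print(unicode_words, ascii_words)
```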

Greedy vs non-greedy regex

In [110]:
print(find_pattern("aabbbbbb", "ab{3,5}")) # return if a is followed by b 3-5 times GREEDY
Found a match!
In [109]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # return if a is followed by b 3-5 times NON-GREEDY (stops at the shortest match)
Found a match!
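Both calls report a match because `find_pattern` only checks whether a match exists. The difference between the two quantifiers shows up in the text that is actually matched, as this short sketch illustrates:

```python
import re

text = 'aabbbbbb'
greedy = re.search(r'ab{3,5}', text).group()   # takes as many 'b's as allowed
lazy = re.search(r'ab{3,5}?', text).group()    # takes as few 'b's as allowed

print(greedy)  # abbbbb
print(lazy)    # abbb
```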
In [113]:
# Example of HTML code - this gives entire length of string. But, we want to match each html tag 
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))
<_sre.SRE_Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>
In [114]:
# Example of HTML code
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))
<_sre.SRE_Match object; span=(0, 6), match='<HTML>'>
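`re.search()` returns only the first tag. To collect every tag, the same non-greedy pattern can be passed to `re.findall()`, as in this sketch:

```python
import re

html = '<HTML><TITLE>My Page</TITLE></HTML>'
tags = re.findall(r'<.*?>', html)  # each match stops at the nearest '>'
print(tags)  # ['<HTML>', '<TITLE>', '</TITLE>', '</HTML>']
```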

The five re functions you will use most often are:

match() Determine if the RE matches at the beginning of the string

search() Scan through a string, looking for any location where this RE matches

findall() Find all the substrings where the RE matches, and return them as a list

finditer() Find all substrings where the RE matches, and return them as an iterator

sub() Find all substrings where the RE matches and substitute them with the given string

In [18]:
# This function uses re.match(); let's see how it differs from re.search()
def match_pattern(text, patterns):
    match = re.match(patterns, text)  # match() only tries at the start of the string
    if match:
        return match
    return 'Not found!'
In [19]:
print(find_pattern("abbc", "b+"))
Found a match!
In [20]:
print(match_pattern("abbc", "b+"))
Not found!
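The difference comes down to where each function is allowed to start and stop. A compact sketch comparing `re.search()`, `re.match()` and `re.fullmatch()` on the same string (the helper `matched_text` is a hypothetical name for this example):

```python
import re

def matched_text(match):
    """Return the matched text, or None when there was no match."""
    return match.group() if match else None

text = 'abbc'
search_result = matched_text(re.search(r'b+', text))       # scans the whole string
match_result = matched_text(re.match(r'b+', text))         # must start at index 0
match_from_start = matched_text(re.match(r'ab+', text))    # starts at index 0
full_result = matched_text(re.fullmatch(r'ab+', text))     # must cover the whole string

print(search_result)     # bb
print(match_result)      # None
print(match_from_start)  # abb
print(full_result)       # None
```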
In [21]:
## Example usage of the sub() function. Replace Road with Rd.

street = '21 Ramakrishna Road'
print(re.sub('Road', 'Rd', street))
21 Ramakrishna Rd
In [22]:
print(re.sub(r'R\w+', 'Rd', street))  # raw string; replaces every word starting with 'R'
21 Rd Rd
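The second call replaces `Ramakrishna` as well, which is probably not what an address cleaner wants. Two possible ways to narrow the substitution are sketched below: the `count` argument of `re.sub()` and the `$` anchor.

```python
import re

street = '21 Ramakrishna Road'

# count=1 stops after the first replacement
first_only = re.sub(r'R\w+', 'Rd', street, count=1)
# '$' ties the pattern to the end of the string, so only a trailing 'Road' changes
suffix_only = re.sub(r'Road$', 'Rd', street)

print(first_only)   # 21 Rd Road
print(suffix_only)  # 21 Ramakrishna Rd
```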
In [123]:
## Example usage of finditer(). Find all occurrences of the word festival in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), end=" ")
    print('END -', match.end())
START - 12 END - 20
START - 42 END - 50
In [146]:
# Example usage of findall(). In the given URL find all dates
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12/"
date_regex = r'/\d{4}/\d{1,2}/\d{1,2}/'
print(re.findall(date_regex, url))
['/2017/10/28/', '/2017/05/12/']
Sometimes we have to extract sub-patterns out of a larger pattern. This can be done by using grouping. Suppose you have textual data with dates in it and you want to extract only the year from the dates. You can use a regular expression pattern with grouping to match dates, and then extract component elements such as the day, month or year from each date.

Grouping is achieved using the parenthesis operators, and the group() function of the match object returns each captured group. Grouping is a very useful technique when you want to extract substrings from an entire match.

In [149]:
## Exploring Groups
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
m1 = re.search(date_regex, url)
print(m1.group())  ## print the entire match
/2017/10/28/
In [150]:
print(m1.group(1)) # - Print first group
2017
In [151]:
print(m1.group(2)) # - Print second group
10
In [152]:
print(m1.group(3)) # - Print third group
28
In [153]:
print(m1.group(0)) # - Print zero or the default group
/2017/10/28/
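Besides numbered groups, the match object can return all groups at once with `groups()`, and groups can be given names with the `(?P<name>...)` syntax. A minimal sketch, using a shortened hypothetical URL rather than the one from the cells above:

```python
import re

# Hypothetical URL, used only for this illustration
url = 'http://example.com/formula-1/2017/10/28/mexican-grand-prix/'
date_regex = r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/'

m = re.search(date_regex, url)
print(m.groups())        # ('2017', '10', '28')
print(m.group('year'))   # 2017
print(m.groupdict())     # {'year': '2017', 'month': '10', 'day': '28'}
```

Named groups make the extraction code easier to read than positional indices when a pattern has several groups.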
Write a regular expression which matches a string where '23' occurs one or more times, followed by '78' occurring one or more times.
In [55]:
pattern='(23){1,}(78){1,}'
print(find_pattern('2378',pattern))
print(find_pattern('235678',pattern))
print(find_pattern('237878',pattern))
Found a match!
Not Found!
Found a match!
Write a regular expression that matches the following strings:

Basketball Baseball Volleyball Softball Football

In [59]:
pattern='(Basket|Base|Volley|Soft|Foot)ball'  # exactly one prefix; adding {1,} would also accept 'BasketBaseball'
print(find_pattern('Basketball',pattern))
print(find_pattern('Softball',pattern))
print(find_pattern('ball',pattern))
Found a match!
Found a match!
Not Found!
Write a regular expression that returns True when passed a multiplication equation. For any other equation, it should return False. In other words, it should return True if there is an asterisk (‘*’) present in the equation.
In [64]:
pattern=r'[a-z0-9]+\*[a-z0-9]+'
print(find_pattern('3%4',pattern))

Not Found!
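The cell above only tests a negative case. The sketch below runs a few more inputs, chosen here purely for illustration, through an equivalent pattern (operand, a literal `*`, operand) and reports a boolean for each, as the exercise asks.

```python
import re

pattern = r'[a-z0-9]+\*[a-z0-9]+'  # operand, literal '*', operand

equations = ['3*4', 'a*b', '3%4', '3+4', '*4']
results = {eq: bool(re.search(pattern, eq)) for eq in equations}
for eq, is_multiplication in results.items():
    print(eq, is_multiplication)
```

Note that `*4` is rejected because the pattern requires an operand on both sides of the asterisk.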
Write a pattern that matches all the dictionary words that start with ‘A’
In [71]:
pattern='^A'
print(find_pattern('All',pattern))
print(find_pattern('all',pattern))

Found a match!
Not Found!
Write a pattern that matches all the dictionary words that start with ‘A’ or 'a'
In [74]:
pattern='^A'
print(find_pattern('all',pattern,flags='I'))
Found a match!
Write a pattern which matches a word that ends with ‘ing’. Words such as ‘playing’, ‘growing’, ‘raining’, etc. should match while words that don’t have ‘ing’ at the end shouldn’t match (irrespective of case)
In [77]:
pattern='(ing)$'
print(find_pattern('growing',pattern,flags='I'))
print(find_pattern('GrowIng',pattern,flags='I'))
print(find_pattern('Grow',pattern,flags='I'))
Found a match!
Found a match!
Not Found!
Write a regular expression that matches any string that starts with one or more ‘1’s, followed by three or more ‘0’s, followed by any number of ones (zero or more), followed by ‘0’s (from one to seven), and then ends with either two or three ‘1’s.
In [81]:
#11000011000111
pattern='^1{1,}0{3,}1*0{1,7}1{2,3}$'
print(find_pattern('1000011000111',pattern))
print(find_pattern('110000110001',pattern))
print(find_pattern('0000110001',pattern))
Found a match!
Not Found!
Not Found!
Write a regex pattern that should match a string that starts with four characters, followed by three 0s and two 1s, followed by any two characters
In [86]:
pattern='.{4}0{3}1{2}.{2}'
print(find_pattern('a00011as',pattern))
print(find_pattern('000000011as',pattern))
print(find_pattern('012300011as',pattern))
print(find_pattern('01ab00011as',pattern))
Not Found!
Found a match!
Found a match!
Found a match!
Write a regular expression to match first names (consider only first names, i.e. there are no spaces in a name) that have length between three and fifteen characters.
In [96]:
pattern='^[a-z]{3,15}$'
print(find_pattern('01ab00011as',pattern,flags='I'))
print(find_pattern('Balasubrahmanyam',pattern,flags='I'))
print(find_pattern('Aiswarya',pattern,flags='I'))
print(find_pattern('Aiswarya Ramachandran',pattern,flags='I'))
Not Found!
Not Found!
Found a match!
Not Found!
Write a regular expression with the help of meta-sequences that matches usernames of the users of a database. A username starts with one to ten letters, followed by a four-digit number.
In [105]:
pattern=r'^[a-zA-Z]{1,10}\d{4}$'
print(find_pattern('sam2340',pattern))
print(find_pattern('irfann2590',pattern))
print(find_pattern('8730',pattern))
print(find_pattern('bobby8903834',pattern))


Found a match!
Found a match!
Not Found!
Not Found!
In [118]:
### Match only first tag - Non Greedy
pattern = "<.*?>"
string="<html> <head> <title> My amazing webpage </title> </head> <body> Welcome to my webpage! </body> </html>"
re.search(pattern,string)
Out[118]:
<_sre.SRE_Match object; span=(0, 6), match='<html>'>
In [119]:
string = "0101"
pattern='(01+){2,}'
re.match(pattern,string)
Out[119]:
<_sre.SRE_Match object; span=(0, 4), match='0101'>
Substitute all the 11-digit phone numbers present in the below string with “####”.

“You can reach us at 07400029954 or 02261562153 ”

In [121]:
string="You can reach us at 07400029954 or 02261562153"
pattern=r"\d{11}"
replacement="####"
re.sub(pattern,replacement,string)
Out[121]:
'You can reach us at #### or ####'
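Grouping and `re.sub()` can also be combined: a backreference such as `\1` in the replacement re-inserts a captured group. As a sketch of one possible variation on the exercise (keeping the last four digits visible is an assumption, not part of the original task):

```python
import re

string = "You can reach us at 07400029954 or 02261562153"

# Capture the last 4 of the 11 digits and put them back after seven '#'
masked = re.sub(r'\b\d{7}(\d{4})\b', r'#######\1', string)
print(masked)  # You can reach us at #######9954 or #######2153
```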
Write a regular expression such that it replaces the first letter of any given string with ‘$’.
In [122]:
pattern="^[A-Za-z]"  # [A-z] would also match the characters [ \ ] ^ _ ` that sit between 'Z' and 'a'
replacement="$"
string="Building careers of tomorrow"
re.sub(pattern,replacement,string)
Out[122]:
'$uilding careers of tomorrow'
Write a regular expression to extract all the words from a given sentence. Then use the re.finditer() function and store all the matched words that are of length more than or equal to 5 letters in a separate list called result.
In [138]:
word_regex=r'\w+'  # '+' avoids the empty matches that '\w*' produces between words
string='Do not compare apples with oranges. Compare apples with apples'
result=[]
for val in re.finditer(word_regex, string):
    if(val.end() - val.start())>=5:
        result.append(val.group())
print(result)
['compare', 'apples', 'oranges', 'Compare', 'apples', 'apples']
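The length check can also be moved into the pattern itself. A sketch of that alternative, using `re.findall()` with a `{5,}` quantifier and word boundaries instead of filtering inside the loop:

```python
import re

string = 'Do not compare apples with oranges. Compare apples with apples'

# \b marks word boundaries; \w{5,} requires at least five word characters
result = re.findall(r'\b\w{5,}\b', string)
print(result)  # ['compare', 'apples', 'oranges', 'Compare', 'apples', 'apples']
```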
Write a regular expression to extract all the words that have the suffix ‘ing’ using the re.findall() function. Store the matches in the variable ‘result’ and print its length.
In [144]:
word_regex=r'\w+ing\b'  # \b makes sure 'ing' is at the end of the word
string="Playing outdoor games when its raining outside is always fun!"
result=re.findall(word_regex,string)
print(len(result))
result
2
Out[144]:
['Playing', 'raining']
You have a string which contains a date in the format DD-MM-YYYY. Write a regular expression to extract the date from the string.
In [154]:
string='Today’s date is 18-05-2018.'
date_regex=r'\d{2}-\d{2}-\d{4}'
m1 = re.search(date_regex, string)
m1.group()
Out[154]:
'18-05-2018'
Write the same regular expression, but this time use grouping to extract the month from the date. The expected date format is DD-MM-YYYY only.
In [156]:
string='Today’s date is 18-05-2018.'
date_regex=r'(\d{2})-(\d{2})-(\d{4})'
m1 = re.search(date_regex, string)
m1.group(2)
Out[156]:
'05'
Write a regular expression to extract the domain name from an email address. The format of the email is simple - the part before the ‘@’ symbol contains alphabets, numbers and underscores. The part after the ‘@’ symbol contains only alphabets followed by a dot followed by ‘com’
In [161]:
string="user_name_123@gmail.com"
pattern=r"(\w+)@([A-Za-z]+\.com)"  # [A-Za-z] instead of [A-z], which also matches [ \ ] ^ _ `
m1=re.search(pattern,string)
m1.group(2)
Out[161]:
'gmail.com'