Tokenisation

The notebook covers four tokenisation techniques:

  1. Word tokenisation
  2. Sentence tokenisation
  3. Tweet tokenisation
  4. Custom tokenisation using regular expressions

1. Word tokenisation

In [1]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)
At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.

Tokenising on spaces using Python

In [2]:
print(document.split())
['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']

Tokenising using NLTK's word tokeniser

In [3]:
from nltk.tokenize import word_tokenize
words = word_tokenize(document)
In [4]:
print(words)
['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']

NLTK's word tokeniser not only breaks on whitespace but also splits contractions such as "he'll" into "he" and "'ll". On the other hand, it doesn't break "o'clock" and treats it as a single token.
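
Other contractions are split the same way. Here is a small check (assuming nltk and its tokeniser data are installed, as in the cells above); the example sentence is my own and the expected tokens are shown as a comment:

from nltk.tokenize import word_tokenize

# "can't" -> "ca" + "n't", "it's" -> "it" + "'s", but "o'clock" stays whole
print(word_tokenize("I can't believe it's already nine o'clock."))
# roughly: ['I', 'ca', "n't", 'believe', 'it', "'s", 'already', 'nine', "o'clock", '.']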

2. Sentence tokeniser

Tokenising a document into sentences means splitting on sentence-ending punctuation such as the period ('.'). A naive split on '.' would also break on abbreviations like "Dr.", so let's use NLTK's sentence tokeniser instead.

In [5]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)
In [6]:
print(sentences)
["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]

3. Tweet tokeniser

A problem with the word tokeniser is that it fails to tokenise emojis and other special character sequences such as hashtagged words. Emojis are common these days and people use them all the time.

In [7]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
In [8]:
print(word_tokenize(message))
['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '.', '#', 'bingewatching', '#', 'nothingtodo', '😎']

The word tokeniser breaks the emoticon '<3' into '<' and '3', which is something we don't want. Emojis and emoticons have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can prove to be a really good predictor of the sentiment. Similarly, each hashtag is broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook, so there you want to keep the hashtag as a single token.

Let's use NLTK's tweet tokeniser to tokenise this message.

In [9]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
In [10]:
tknzr.tokenize(message)
Out[10]:
['i',
 'recently',
 'watched',
 'this',
 'show',
 'called',
 'mindhunters',
 ':)',
 '.',
 'i',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'it',
 'was',
 'gr8',
 '<3',
 '.',
 '#bingewatching',
 '#nothingtodo',
 '😎']

As you can see, it handles all the emojis and the hashtags pretty well.
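
TweetTokenizer also accepts a few useful options (preserve_case, reduce_len and strip_handles). Here is a small sketch with an example tweet of my own; the rough output is shown as a comment:

from nltk.tokenize import TweetTokenizer

tknzr2 = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
print(tknzr2.tokenize("@moviebuff I loooooved it #bingewatching 😍"))
# roughly: ['i', 'loooved', 'it', '#bingewatching', '😍']
# the handle is stripped, case is lowered, and the elongated word is shortened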

4. Custom tokenisation using regular expressions

NLTK also provides a regular expression tokeniser: you pass it a pattern, and it returns the tokens that match that pattern.

Let's look at how you can use the regular expression tokeniser.

In [11]:
from nltk.tokenize import regexp_tokenize
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
pattern = r"#[\w]+"
In [12]:
regexp_tokenize(message, pattern)
Out[12]:
['#bingewatching', '#nothingtodo']
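
The same idea works with any pattern you like. For example, to pull out only the word-like tokens (the pattern below is my own choice, not from the notebook), you could do:

from nltk.tokenize import regexp_tokenize

# keep runs of word characters and apostrophes; punctuation and emojis are dropped
print(regexp_tokenize(message, r"[\w']+"))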