Learn practical skills, build real-world projects, and advance your career

Tokenisation

The notebook contains three types of tokenisation techniques:

  1. Word tokenisation
  2. Sentence tokenisation
  3. Tweet tokenisation
  4. Custom tokenisation using regular expressions

1. Word tokenisation

document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)
At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.

Tokenising on spaces using python

print(document.split())
['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']