Created 5 years ago
Tokenisation
The notebook covers four types of tokenisation techniques:
- Word tokenisation
- Sentence tokenisation
- Tweet tokenisation
- Custom tokenisation using regular expressions
1. Word tokenisation
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)
At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.
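Tokenising on whitespace using Python's built-in str.split(). Note that punctuation stays attached to the adjacent word, as in 'myself.' and 'mania,' in the output below.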
print(document.split())
['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']
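Splitting on whitespace leaves punctuation glued to words ('myself.', 'mania,', 'God.'). As a rough illustration of how a tokeniser might separate them, here is a minimal sketch using the standard-library re module. The pattern is an assumption made for this example, not the notebook's own method or any library's rule set: it keeps words with internal apostrophes (such as "o'clock" and "he'll") together and emits each punctuation mark as its own token.

```python
import re

document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."

# Illustrative pattern (an assumption for this sketch):
#   \w+(?:'\w+)*  a word, optionally followed by apostrophe + letters (o'clock, he'll)
#   [^\w\s]       any single character that is neither a word character nor whitespace
pattern = r"\w+(?:'\w+)*|[^\w\s]"

tokens = re.findall(pattern, document)
print(tokens)
```

With this pattern, 'myself.' becomes the two tokens 'myself' and '.', while contractions stay intact. A dedicated tokeniser typically handles many more cases (abbreviations, numbers, hyphenated words), so treat this only as a starting point.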