In this lecture, we are going to:
For this tutorial, I recommend having:
torch version >= 2.1.2
tiktoken version >= 0.6.0
matplotlib version >= 3.9.2
numpy version >= 1.24.4
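To check what is installed before starting, something like the following sketch may help. It only reports versions and does not compare them against the minimums above; the helper name installed_versions is hypothetical, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the four packages recommended for this tutorial.
for name, ver in installed_versions(["torch", "tiktoken", "matplotlib", "numpy"]).items():
    print(f"{name}: {ver or 'not installed'}")
```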
For this tutorial, we will use the following small text for training. Feel free to replace it with any other text file:
Create a Jupyter notebook and load the content of the text file into a string raw_text. How many characters does it contain?
<aside> 👉🏻
Expected output (on raw_text)
Total number of character: 20479
</aside>
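One possible way to approach this step is sketched below. The helper load_text and the file name demo.txt are assumptions made for illustration; the demo writes a short throwaway file so the snippet runs anywhere, and the count it prints is for that demo string, not for the tutorial text.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a whole text file into a single string."""
    return Path(path).read_text(encoding=encoding)


# Demo on a temporary file; in the notebook, pass the path of your own text file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"  # hypothetical file name
    demo_path.write_text("I HAD always thought", encoding="utf-8")
    demo_text = load_text(demo_path)

# len() on a str counts characters (Unicode code points), not bytes.
print("Total number of character:", len(demo_text))
```

In the notebook, the same call would be assigned to raw_text, and len(raw_text) gives the number of characters.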
Using the re library, create a simple tokenizer which splits raw_text whenever a whitespace character (space, tab, or newline) is encountered.
<aside> 👉🏻
Expected output (on raw_text[:299])
['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'h']
</aside>
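A minimal sketch of such a tokenizer, assuming the whitespace characters should be kept as tokens (as the expected output suggests). The function name split_on_whitespace is hypothetical. Wrapping \s in a capturing group makes re.split return the separators as well as the text between them.

```python
import re


def split_on_whitespace(text):
    """Split on every whitespace character and keep each one as its own token."""
    # The parentheses form a capturing group, so the matched separators
    # are included in the result list.
    return re.split(r"(\s)", text)


sample = "I HAD always thought Jack Gisburn"
print(split_on_whitespace(sample))
# ['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn']
```

Note that two whitespace characters in a row produce an empty string between them, which is one reason to filter tokens afterwards.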
Using the re library, create a simple tokenizer which splits raw_text when the following are encountered:
The tokenizer should return only non-empty tokens, so whitespace-only tokens are discarded.
<aside> 👉🏻
Expected output (on raw_text[:299])
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'h']
</aside>
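A possible sketch of this second tokenizer follows. The exact delimiter set is an assumption inferred from the expected output (commas, periods, parentheses, double dashes, and whitespace, plus a few other common punctuation marks); adjust the pattern to match the list given in the exercise. The function name simple_tokenize is hypothetical.

```python
import re

# Assumed delimiter set: common punctuation, a double dash, or any whitespace.
# The capturing group keeps punctuation as tokens of their own.
DELIMITERS = r'([,.:;?!"()]|--|\s)'


def simple_tokenize(text):
    """Split on the assumed delimiters and drop empty or whitespace-only pieces."""
    pieces = re.split(DELIMITERS, text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "genius--though a good fellow, (Though I"
print(simple_tokenize(sample))
# ['genius', '--', 'though', 'a', 'good', 'fellow', ',', '(', 'Though', 'I']
```

Stripping each piece before the emptiness check removes both the empty strings that re.split produces between adjacent delimiters and the whitespace tokens themselves.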