In this lecture, we are going to:

For this tutorial, I recommend having:

Understanding the tokenizer

For this tutorial, we will use the following small text for training. Feel free to replace it with any other text file:

the-verdict.txt

  1. Create a Jupyter notebook and load the content of the text file into a string raw_text. How many characters does it contain?

    <aside> 👉🏻

    Expected output (on raw_text)

    Total number of character: 20479

    </aside>

  2. Using the re library, create a simple tokenizer that splits raw_text whenever a whitespace character (space, tab, or newline) is encountered. The whitespace characters themselves are kept as tokens.

    <aside> 👉🏻

    Expected output (on raw_text[:299])

    ['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'h']

    </aside>

  3. Using the re library, create a simple tokenizer that splits raw_text whenever any of the following is encountered:

    1. any whitespace character (space, tab, or newline)
    2. special symbols (",", ".", ":", etc.)
    3. special sequences ("--")

    The tokenizer should return only non-empty tokens, so whitespace and empty strings are dropped while punctuation is kept.

    <aside> 👉🏻

    Expected output (on raw_text[:299])

    ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'h']

    </aside>
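One possible way to approach the three steps above is sketched below. It is a minimal illustration rather than a reference solution: the helper names `load_text`, `split_on_whitespace`, and `split_on_punctuation` are made up for this example, the punctuation set in the character class is an assumption (the exercise only says ",", ".", ":", etc.), and the file is assumed to be saved as `the-verdict.txt` next to the notebook.

```python
import re
from pathlib import Path


def load_text(path):
    """Read a UTF-8 text file into a single string (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8")


def split_on_whitespace(text):
    """Split on every whitespace character and keep the separators.

    The capturing group in the pattern makes re.split return the matched
    whitespace characters as tokens too.
    """
    return re.split(r"(\s)", text)


def split_on_punctuation(text):
    """Split on "--", punctuation, or whitespace; drop empty tokens.

    The punctuation set below is an assumption; extend it as needed.
    """
    pieces = re.split(r'(--|[,.:;?_!"()\']|\s)', text)
    # strip() turns whitespace-only tokens into "", which the filter removes.
    return [piece.strip() for piece in pieces if piece.strip()]


# Step 1: only runs if the file is present (the file name is an assumption).
path = Path("the-verdict.txt")
if path.exists():
    raw_text = load_text(path)
    print("Total number of character:", len(raw_text))

# Steps 2 and 3 on a short sample taken from the expected output.
sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow"
print(split_on_whitespace(sample))
print(split_on_punctuation(sample))
```

Putting `--` in the pattern as its own alternative keeps the double hyphen together as a single token, and filtering on `piece.strip()` is what removes both the empty strings that `re.split` produces between adjacent delimiters and the whitespace tokens themselves.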