2.2 Text Preprocessing and Feature Engineering/Extraction

As you know, typical machine learning approaches work with vectors of numbers rather than words.

As a result, text preprocessing and feature extraction are critical steps in natural language processing (NLP) tasks: they convert raw textual data into a numerical format that machine learning models can understand and use effectively. Getting these steps right is essential for accurate and meaningful results in NLP applications.
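To make the idea of turning words into numbers concrete, here is a minimal sketch of one simple feature extraction scheme, a bag-of-words count vector. The function names `build_vocabulary` and `to_count_vector` and the two sample sentences are assumptions made for illustration. Splitting on whitespace is a deliberate simplification, not a recommended tokenizer.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = {tok for doc in documents for tok in doc.lower().split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def to_count_vector(document, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Illustrative sample documents (not from any real dataset).
docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(docs)   # columns: cat, dog, mat, on, sat, the
vectors = [to_count_vector(d, vocab) for d in docs]

print(vectors[0])  # [1, 0, 0, 0, 1, 1]
print(vectors[1])  # [0, 1, 1, 1, 1, 2]
```

Each document becomes a fixed-length vector whose entries count how often each vocabulary word occurs, which is a form a typical machine learning model can consume.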

Text Preprocessing

Raw text data often contains noise, inconsistencies, and irrelevant information that can hinder the performance of NLP models. Text preprocessing involves several sub-tasks aimed at cleaning, normalizing, and preparing the text data before it is fed into a machine learning algorithm.

In this section, we introduce several techniques that may be useful to apply to raw data.

Removing HTML tags

Removing special characters and punctuation

Expanding contractions

Lemmatizing text

Removing stopwords (a, an, and, the, etc.)
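The techniques listed above can be chained into a single pipeline. The following is a rough sketch using only the Python standard library. The lookup tables `CONTRACTIONS`, `LEMMAS`, and `STOPWORDS` are tiny hypothetical stand-ins, and the regex-based tag stripping is a heuristic. In practice the lemmatizer and stopword list would more likely come from a library such as NLTK or spaCy, and HTML would be handled by a proper parser.

```python
import html
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot", "i'm": "i am"}
LEMMAS = {"cats": "cat", "running": "run", "is": "be", "are": "be", "was": "be"}
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on"}

# Matches any known contraction as a whole word.
CONTRACTION_RE = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")


def strip_html(text):
    """Drop anything that looks like a tag, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def expand_contractions(text):
    """Replace known contractions; unknown ones are left untouched."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def remove_special_characters(text):
    """Keep only lowercase letters, digits, and whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text)


def preprocess(text):
    """Apply the five steps in order and return a list of cleaned tokens."""
    text = strip_html(text).lower()
    text = expand_contractions(text)        # must run before apostrophes are removed
    text = remove_special_characters(text)
    tokens = [LEMMAS.get(tok, tok) for tok in text.split()]  # dictionary "lemmatizer"
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "<p>The cats don't like running, and it's a rainy day!</p>"
print(preprocess(sample))
# ['cat', 'do', 'not', 'like', 'run', 'it', 'be', 'rainy', 'day']
```

The order of the steps matters: contractions are expanded before punctuation is removed, because stripping the apostrophe first would turn "don't" into two tokens, "don" and "t", that no longer match the contraction table.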
