\n",
- " \n",
- "### Learning Objectives \n",
- " \n",
- "* Learn how to convert text data into numbers through a Bag-of-Words approach.\n",
- "* Understand the TF-IDF algorithm and how it complements the Bag-of-Words representation.\n",
- "* Implement Bag-of-Words and TF-IDF using the `sklearn` package and understand its parameter settings.\n",
- "* Use the numerical representations of text data to perform sentiment analysis.\n",
- "
\n",
- "\n",
- "### Icons Used in This Notebook\n",
- "🔔 **Question**: A quick question to help you understand what's going on. \n",
- "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop! \n",
- "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for! \n",
- "\n",
- "### Sections\n",
- "1. [Exploratory Data Analysis](#section1)\n",
- "2. [Preprocessing](#section2)\n",
- "3. [The Bag-of-Words Representation](#section3)\n",
- "4. [Term Frequency-Inverse Document Frequency](#section4)\n",
- "5. [Sentiment Classification Using the TF-IDF Representation](#section5)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "880e8a36-bd58-4c24-8593-03a0ea70deed",
- "metadata": {},
- "source": [
- "In the previous part, we learned how to perform text preprocessing. However, we didn't move beyond the text data itself. If we're interested in doing any computational analysis on the text data, we still need approaches to convert the text into a **numeric representation**.\n",
- "\n",
- "In Part 2 of our workshop series, we'll explore one of the most straightforward ways to generate a numeric representation from text: the **bag-of-words** (BoW). We will implement the BoW representation to transform our airline tweets data, and then build a classifier to explore what it tells us about the sentiment of the tweets. At the heart of the bag-of-words approach lies the assumption that the frequency of specific tokens is informative about the semantics and sentiment underlying the text.\n",
- "\n",
- "We'll make heavy use of the `scikit-learn` package to do so, as it provides a nice framework for constructing the numeric representation.\n",
- "\n",
- "Let's install `scikit-learn` first!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9e4a3a0d-66f4-44e5-8dd6-5f441146014d",
- "metadata": {
- "scrolled": true,
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Uncomment to install the package\n",
- "# %pip install scikit-learn"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "21ed437f-9767-43b7-abc5-159aa4339a31",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Uncomment to install the NLP packages introduced in Part 1\n",
- "# %pip install NLTK\n",
- "# %pip install spaCy\n",
- "# !python -m spacy download en_core_web_sm"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "f3862ffd-918f-4184-8c90-8a39a8a2a069",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import other packages\n",
- "import re\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "from string import punctuation\n",
- "%matplotlib inline"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "852ea4a5-7c28-4557-acdd-afe8a97b7235",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Exploratory Data Analysis\n",
- "\n",
- "Before we do any preprocessing or modeling, we should always perform exploratory data analysis to familiarize ourselves with the data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "4190e351-97b7-4c5b-866e-07aa6cbd42c2",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Read in data\n",
- "tweets_path = '../data/airline_tweets.csv'\n",
- "tweets = pd.read_csv(tweets_path, sep=',')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "79acbaf2-6625-4abb-b50f-97ea54ba0d11",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
+ " \n",
+ "### Learning Objectives\n",
+ " \n",
+ "* Learn how to convert text data into numbers through a Bag-of-Words approach.\n",
+ "* Understand the TF-IDF algorithm and how it complements the Bag-of-Words representation.\n",
+ "* Implement Bag-of-Words and TF-IDF using the `sklearn` package and understand its parameter settings.\n",
+ "* Use the numerical representations of text data to perform sentiment analysis.\n",
+ "
\n",
+ "\n",
+ "### Icons Used in This Notebook\n",
+ "🔔 **Question**: A quick question to help you understand what's going on. \n",
+ "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop! \n",
+ "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for! \n",
+ "\n",
+ "### Sections\n",
+ "1. [Exploratory Data Analysis](#section1)\n",
+ "2. [Preprocessing](#section2)\n",
+ "3. [The Bag-of-Words Representation](#section3)\n",
+ "4. [Term Frequency-Inverse Document Frequency](#section4)\n",
+ "5. [Sentiment Classification Using the TF-IDF Representation](#section5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "880e8a36-bd58-4c24-8593-03a0ea70deed",
+ "metadata": {
+ "id": "880e8a36-bd58-4c24-8593-03a0ea70deed"
+ },
+ "source": [
+ "In the previous part, we learned how to perform text preprocessing. However, we didn't move beyond the text data itself. If we're interested in doing any computational analysis on the text data, we still need approaches to convert the text into a **numeric representation**.\n",
+ "\n",
+ "In Part 2 of our workshop series, we'll explore one of the most straightforward ways to generate a numeric representation from text: the **bag-of-words** (BoW). We will implement the BoW representation to transform our airline tweets data, and then build a classifier to explore what it tells us about the sentiment of the tweets. At the heart of the bag-of-words approach lies the assumption that the frequency of specific tokens is informative about the semantics and sentiment underlying the text.\n",
+ "\n",
+ "We'll make heavy use of the `scikit-learn` package to do so, as it provides a nice framework for constructing the numeric representation.\n",
+ "\n",
+ "Let's install `scikit-learn` first!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This part of the code is a comment telling the user that, by removing the # symbol, they can run the magic command %pip install scikit-learn, which installs the scikit-learn machine learning library directly in the notebook environment; however, in the example it appears written as scikit-lea, which is a typographical error that must be corrected for the installation to work properly.*"
],
- "text/plain": [
- " tweet_id airline_sentiment airline_sentiment_confidence \\\n",
- "0 570306133677760513 neutral 1.0000 \n",
- "1 570301130888122368 positive 0.3486 \n",
- "2 570301083672813571 neutral 0.6837 \n",
- "3 570301031407624196 negative 1.0000 \n",
- "4 570300817074462722 negative 1.0000 \n",
- "\n",
- " negativereason negativereason_confidence airline \\\n",
- "0 NaN NaN Virgin America \n",
- "1 NaN 0.0000 Virgin America \n",
- "2 NaN NaN Virgin America \n",
- "3 Bad Flight 0.7033 Virgin America \n",
- "4 Can't Tell 1.0000 Virgin America \n",
- "\n",
- " airline_sentiment_gold name negativereason_gold retweet_count \\\n",
- "0 NaN cairdin NaN 0 \n",
- "1 NaN jnardino NaN 0 \n",
- "2 NaN yvonnalynn NaN 0 \n",
- "3 NaN jnardino NaN 0 \n",
- "4 NaN jnardino NaN 0 \n",
- "\n",
- " text tweet_coord \\\n",
- "0 @VirginAmerica What @dhepburn said. NaN \n",
- "1 @VirginAmerica plus you've added commercials t... NaN \n",
- "2 @VirginAmerica I didn't today... Must mean I n... NaN \n",
- "3 @VirginAmerica it's really aggressive to blast... NaN \n",
- "4 @VirginAmerica and it's a really big bad thing... NaN \n",
- "\n",
- " tweet_created tweet_location user_timezone \n",
- "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n",
- "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n",
- "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n",
- "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n",
- "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) "
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tweets.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "80232c78-ac41-4d74-a581-76c9dac3b8f6",
- "metadata": {},
- "source": [
- "As a refresher, each row in this dataframe corresponds to a tweet. The following columns are of main interest to us. There are other columns containing metadata of the tweet, such as the author of the tweet, when it was created, the timezone of the user, and others, which we will set aside for now. \n",
- "\n",
- "- `text` (`str`): the text of the tweet.\n",
- "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\" \n",
- "- `airline` (`str`): the airline that is tweeted about.\n",
- "- `retweet_count` (`int`): how many times the tweet was retweeted.\n",
- "\n",
- "To prepare us for sentiment classification, we'll partition the dataset to focus on the \"positive\" and \"negative\" tweets for now. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "a1faaf90-8c01-4d25-9468-90c01823f0d5",
- "metadata": {},
- "outputs": [],
- "source": [
- "tweets = tweets[tweets['airline_sentiment'] != 'neutral'].reset_index(drop=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7cb6b039-53e7-4afe-a9e0-b3522c12b2d7",
- "metadata": {},
- "source": [
- "Let's take a look at a few tweets first!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "438830e6-1064-47fe-b578-a1ca693a0ed0",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "@VirginAmerica plus you've added commercials to the experience... tacky.\n",
- "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse\n",
- "@VirginAmerica and it's a really big bad thing about it\n",
- "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\n",
- "it's really the only bad thing about flying VA\n",
- "@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)\n"
- ]
- }
- ],
- "source": [
- "# Print first five tweets\n",
- "for idx in range(5):\n",
- " print(tweets['text'].iloc[idx])"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d6746f8-b29c-40d4-bef6-b4afd4cd6cc1",
- "metadata": {},
- "source": [
- "We can already see that some of these tweets contain negative sentiment—how can we tell this is the case? \n",
- "\n",
- "Next, let's take a look at the distribution of sentiment labels in this dataset. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "01955158-6954-447a-acb6-2989d02a49c3",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Make a bar plot showing the count of tweet sentiments\n",
- "sns.countplot(data=tweets,\n",
- " x='airline_sentiment', \n",
- " color='cornflowerblue',\n",
- " order=['positive', 'negative']);"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "eab45abf-adf4-4f5e-ae09-75f6c4fd50d1",
- "metadata": {},
- "source": [
- "It looks like the majority of the tweets in this dataset are expressing negative sentiment!\n",
- "\n",
- "Let's see which sentiment gets retweeted more:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "428ddde7-af73-4eb6-92c9-041a1791ca59",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "airline_sentiment\n",
- "negative 0.093375\n",
- "positive 0.069403\n",
- "Name: retweet_count, dtype: float64"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Get the mean retweet count for each sentiment\n",
- "tweets.groupby('airline_sentiment')['retweet_count'].mean()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d31f3bc-257c-48a8-86a0-fd0d7c3e8cb3",
- "metadata": {},
- "source": [
- "On average, negative tweets are retweeted slightly more often than positive ones (about 0.09 versus 0.07 retweets per tweet).\n",
- "\n",
- "Let's see which airline receives the highest proportion of negative tweets:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "12aa9f2d-d655-494a-bb72-08ad973518f3",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML rendering of the table omitted; it shows the proportion of negative and positive tweets for each airline]"
+ "metadata": {
+ "id": "IJ2sHVf6bJBh"
+ },
+ "id": "IJ2sHVf6bJBh"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9e4a3a0d-66f4-44e5-8dd6-5f441146014d",
+ "metadata": {
+ "scrolled": true,
+ "tags": [],
+ "id": "9e4a3a0d-66f4-44e5-8dd6-5f441146014d"
+ },
+ "outputs": [],
+ "source": [
+ "# Uncomment to install the package\n",
+ " %pip install scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This code block tells the reader that removing the # will install the natural language processing libraries used in Part 1. Specifically, %pip install NLTK installs NLTK, a library for analyzing and manipulating text; %pip install spaCy installs spaCy, a very efficient tool for advanced language processing; and finally !python -m spacy download en_core_web_sm downloads a small English model (en_core_web_sm) that spaCy needs to identify words, sentences, and their grammatical structure.*"
],
- "text/plain": [
- "airline_sentiment negative positive\n",
- "airline \n",
- "US Airways 0.893760 0.106240\n",
- "American 0.853659 0.146341\n",
- "United 0.842560 0.157440\n",
- "Southwest 0.675399 0.324601\n",
- "Delta 0.637091 0.362909\n",
- "Virgin America 0.543544 0.456456"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Get the proportion of negative tweets by airline\n",
- "proportions = tweets.groupby(['airline', 'airline_sentiment']).size() / tweets.groupby('airline').size()\n",
- "proportions.unstack().sort_values('negative', ascending=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7042419e-9c41-40e7-8dbf-47bd1e2ad45a",
- "metadata": {},
- "source": [
- "It looks like people are most dissatisfied with US Airways, followed by American, both with over 85\\% negative tweets!\n",
- "\n",
- "A lot of interesting discoveries could be made if you want to explore more about the data. Now let's return to our task of sentiment analysis. Before that, we need to preprocess the text data so that it is in a standard format."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9e513930-2dc7-489c-bc5a-22eb09add5bf",
- "metadata": {},
- "source": [
- "\n",
- "# Preprocessing\n",
- "\n",
- "We spent much of Part 1 learning how to preprocess data. Let's apply what we learned! Looking at some of the tweets above, we can see that while they are in pretty good shape, we can do some additional processing on them.\n",
- "\n",
- "In our pipeline, we'll omit the tokenization process since we will perform it in a later step. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92a83ece-f3b2-4200-9d22-0788fbc07fa4",
- "metadata": {},
- "source": [
- "## 🥊 Challenge 1: Apply a Text Cleaning Pipeline\n",
- "\n",
- "Write a function called `preprocess` that performs the following steps on a text input:\n",
- "\n",
- "* Step 1: Lowercase the text input.\n",
- "* Step 2: Replace the following patterns with placeholders:\n",
- " * URLs → ` URL `\n",
- " * Digits → ` DIGIT `\n",
- " * Hashtags → ` HASHTAG `\n",
- " * Tweet handles → ` USER `\n",
- "* Step 3: Remove extra whitespace.\n",
- "\n",
- "Here are some hints to guide you through this challenge:\n",
- "\n",
- "* For Step 1, recall from Part 1 that a string method called [`.lower()`](https://docs.python.org/3.11/library/stdtypes.html#str.lower) can be used to convert text to lowercase. \n",
- "* We have integrated Step 2 into a function called `placeholder`. Run the cell below to import it into your notebook, and you can use it just like any other function.\n",
- "* For Step 3, we have provided the regex pattern for identifying whitespace characters as well as the correct replacement for extra whitespace. \n",
- "\n",
- "Run your `preprocess` function on `example_tweet` (three cells below) to check if it works. If it does, apply it to the entire `text` column in the tweets dataframe."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "21738b02-9ab9-4a61-b41f-ff75888aa747",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from utils import placeholder"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792",
- "metadata": {},
- "outputs": [],
- "source": [
- "blankspace_pattern = r'\\s+'\n",
- "blankspace_repl = ' '\n",
- "\n",
- "def preprocess(text):\n",
- " '''Create a preprocess pipeline that cleans the tweet data.'''\n",
- " \n",
- " # Step 1: Lowercase\n",
- " text = ...\n",
- "\n",
- " # Step 2: Replace patterns with placeholders\n",
- " text = ...\n",
- "\n",
- " # Step 3: Remove extra whitespace characters\n",
- " text = ...\n",
- "\n",
- " return text"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "8990cefd-5d04-46ba-ada2-29978c28cfe8",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo\n",
- "==================================================\n",
- "lol USER and USER are like soo DIGIT HASHTAG HASHTAG saw it on URL HASHTAG\n"
- ]
- }
- ],
- "source": [
- "example_tweet = 'lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo'\n",
- "\n",
- "# Print the example tweet\n",
- "print(example_tweet)\n",
- "print(f\"{'='*50}\")\n",
- "\n",
- "# Print the preprocessed tweet\n",
- "print(preprocess(example_tweet))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "id": "a5f7bb6a-f064-48cc-b650-12c4ef2fbb88",
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 USER plus you've added commercials to the expe...\n",
- "1 USER it's really aggressive to blast obnoxious...\n",
- "2 USER and it's a really big bad thing about it\n",
- "3 USER seriously would pay $ DIGIT a flight for ...\n",
- "4 USER yes, nearly every time i fly vx this “ear...\n",
- "Name: text_processed, dtype: object"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Apply the function to the text column and assign the preprocessed tweets to a new column\n",
- "tweets['text_processed'] = tweets['text'].apply(preprocess)\n",
- "tweets['text_processed'].head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "1576acc6-b305-492a-8fde-65b343cb779c",
- "metadata": {},
- "source": [
- "Congratulations! Preprocessing is done. Let's dive into the bag-of-words!"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "53282330-54da-4e1c-bfe5-e77cb8fa3add",
- "metadata": {},
- "source": [
- "\n",
- "# The Bag-of-Words Representation\n",
- "\n",
- "The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a document and toss it in a bag. The action of \"throwing\" the document in a bag disregards the relative position between words, so what is \"in the bag\" is essentially \"an unsorted set of words\" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). In return, we have a list of unique words and the frequency of each of them. \n",
- "\n",
- "For example, as shown in the following illustration, the word \"coffee\" appears twice. \n",
- "\n",
- "\n",
- "\n",
- "With a bag-of-words representation, we make heavy use of word frequency but not too much of word order. \n",
- "\n",
- "In the context of sentiment analysis, the sentiment of a tweet is conveyed more strongly by specific words. For example, if a tweet contains the word \"happy,\" it likely conveys positive sentiment, but not always (e.g., \"not happy\" denotes the opposite sentiment). When these words come up more often, they probably convey the sentiment more strongly."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b9d9bdbd-406d-469b-a8f6-41d1b3687c37",
- "metadata": {},
- "source": [
- "## Document Term Matrix\n",
- "\n",
- "Now let's implement the idea of bag-of-words. Before we dive deeper, let's step back for a moment. In practice, text analysis often involves handling many documents; from now on, we use the term **document** to represent a piece of text on which we perform analysis. It could be a phrase, a sentence, a tweet, or any other text; as long as it can be represented by a string, the length doesn't really matter. \n",
- "\n",
- "Imagine we have four documents (i.e., the four phrases shown above), and we toss them all in the bag. Instead of a word-frequency list, we'd expect a document-term matrix (DTM) in return. In a DTM, the word list is the **vocabulary** (V) that holds all unique words that occur across the documents. For each **document** (D), we count the number of occurrences of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we will construct is a $D \\times V$ matrix, where each row corresponds to a document, and each column corresponds to a token (or \"term\").\n",
- "\n",
- "The unique tokens in this set of documents, arranged in alphabetical order, form the columns. For each document, we mark the occurrence of each word present in the document. The numerical representation for each document is a row in the matrix. For example, the first document, \"the coffee roaster,\" has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.\n",
- "\n",
- "Note that the left index column now displays these documents as text, but typically we would just assign an index to each of them. \n",
- "\n",
- "$$\n",
- "\\begin{array}{c|cccccccccccc}\n",
- " & \\text{americano} & \\text{coffee} & \\text{iced} & \\text{light} & \\text{roast} & \\text{roaster} & \\text{the} & \\text{time} \\\\\\hline\n",
- "\\text{the coffee roaster} &0 &1\t&0\t&0\t&0\t&1\t&1\t&0 \\\\ \n",
- "\\text{light roast} &0 &0\t&0\t&1\t&1\t&0\t&0\t&0 \\\\\n",
- "\\text{iced americano} &1 &0\t&1\t&0\t&0\t&0\t&0\t&0 \\\\\n",
- "\\text{coffee time} &0 &1\t&0\t&0\t&0\t&0\t&0\t&1 \\\\\n",
- "\\end{array}\n",
- "$$\n",
- "\n",
- "To create a DTM, we will use `CountVectorizer` from the package `sklearn`."
- ]
- },
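To make the construction concrete, the toy DTM above can be rebuilt by hand in plain Python. This is only a minimal sketch, not how `CountVectorizer` is implemented: the regular expression is assumed to approximate its default tokenization (lowercased tokens of two or more word characters), and the names `docs`, `vocabulary`, and `dtm` are illustrative.

```python
import re
from collections import Counter

# The four toy documents from the table above
docs = ['the coffee roaster',
        'light roast',
        'iced americano',
        'coffee time']

# Assumption: lowercase the text and keep tokens of 2+ word characters
token_pattern = re.compile(r'\b\w\w+\b')
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is the alphabetically sorted set of unique tokens (the columns)
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Each row holds the count of every vocabulary term in one document
dtm = []
for toks in tokenized:
    counts = Counter(toks)  # a missing term counts as 0
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for doc, row in zip(docs, dtm):
    print(f'{doc:<20} {row}')
```

Each printed row should line up with the corresponding row of the matrix shown above.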
- {
- "cell_type": "code",
- "execution_count": 13,
- "id": "cd2adf56-ba93-459d-8cfa-16ce8dc9284b",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.feature_extraction.text import CountVectorizer"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4989781d-6b40-417a-be70-eeba05cd8a50",
- "metadata": {},
- "source": [
- "The following illustration depicts the three-step workflow of creating a DTM with `CountVectorizer`.\n",
- "\n",
- "\n",
- "\n",
- "Let's walk through these steps with the toy example shown above."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "34174034-46b9-43e2-a511-5972d378cb00",
- "metadata": {},
- "source": [
- "### A Toy Example"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "id": "4da2bd3d-0460-4b5f-9b9e-02940db0d7ca",
- "metadata": {},
- "outputs": [],
- "source": [
- "# A toy example containing four documents\n",
- "test = ['the coffee roaster',\n",
- " 'light roast',\n",
- " 'iced americano',\n",
- " 'coffee time']"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dff7c1d3-fcee-4e20-b9a7-17306ebd5fc2",
- "metadata": {},
- "source": [
- "The first step is to initialize a `CountVectorizer` object. Within the parentheses, we can specify parameter settings if desired. Let's take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available. \n",
- "\n",
- "For now we can just leave it blank to use the default settings. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "id": "9de3fe6a-9abf-4e11-aad1-e54c891567bb",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create a CountVectorizer object\n",
- "vectorizer = CountVectorizer()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "1b5a7d0d-0bfc-4fb9-8e5f-e91e39797fb5",
- "metadata": {},
- "source": [
- "The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. Thirdly, we `transform` our data according to the \"fitted\" `CountVectorizer` object, which means taking each document and counting the occurrences of tokens according to the vocabulary established during the \"fitting\" step.\n",
- "\n",
- "It may sound a bit complex but steps 2 and 3 can be done in one swoop using a `fit_transform` function."
- ]
- },
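The split between fitting and transforming can be sketched with two small helper functions in plain Python. This is a hedged illustration of the idea rather than scikit-learn's code: `tokenize`, `fit`, and `transform` are hypothetical helpers written for this example, and the tokenization rule is an assumption that approximates the `CountVectorizer` defaults.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\b\w\w+\b')  # assumed tokenization rule

def tokenize(doc):
    """Lowercase a document and split it into tokens."""
    return TOKEN.findall(doc.lower())

def fit(docs):
    """Learn the vocabulary: map each unique token to a column index."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocab):
    """Count tokens per document, using only the fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocab:  # tokens unseen during fitting are ignored
                row[vocab[tok]] = n
        rows.append(row)
    return rows

train = ['the coffee roaster', 'light roast', 'iced americano', 'coffee time']
vocab = fit(train)                    # step 2: build the vocabulary
train_rows = transform(train, vocab)  # step 3: count against that vocabulary

# A new document is counted against the same, fixed vocabulary
new_rows = transform(['iced coffee latte'], vocab)
print(new_rows)
```

Because the vocabulary is fixed at the fitting step, the token "latte" has no column and is dropped when the new document is transformed.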
- {
- "cell_type": "code",
- "execution_count": 16,
- "id": "da1bbad4-bb1a-4b92-9096-6e17558b4a42",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Fit and transform to create a DTM\n",
- "test_count = vectorizer.fit_transform(test)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "324d3b65-4e98-48bf-87d2-399457f4939c",
- "metadata": {},
- "source": [
- "`fit_transform` returns the DTM. \n",
- "\n",
- "Let's take a look at it!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "id": "cb044001-8eb2-4489-b025-2d8e2d4bfee2",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "<4x8 sparse matrix of type '<class 'numpy.int64'>'\n",
- "\twith 9 stored elements in Compressed Sparse Row format>"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "test_count"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f9817b09-a806-42c4-9436-822cc27a38b9",
- "metadata": {},
- "source": [
- "Apparently we've got a \"sparse matrix\"—a matrix that contains a lot of zeros. This makes sense. For each document, there are words that don't occur at all, and these are counted as zero in the DTM. This sparse matrix is stored in a \"Compressed Sparse Row\" format, a memory-saving format designed for handling sparse matrices. \n",
- "\n",
- "Let's convert it to a dense matrix, where those zeros are explicitly represented, as in a numpy array."
- ]
- },
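The "Compressed Sparse Row" idea can be sketched in plain Python: only the nonzero values are stored, together with their column positions and the offsets at which each row starts. The array names `data`, `indices`, and `indptr` follow the usual CSR terminology, but the loop below is an illustrative sketch, not SciPy's implementation.

```python
# The dense toy DTM (4 documents x 8 tokens)
dense = [[0, 1, 0, 0, 0, 1, 1, 0],
         [0, 0, 0, 1, 1, 0, 0, 0],
         [1, 0, 1, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0, 0, 1]]

data = []      # nonzero values, row by row
indices = []   # column index of each nonzero value
indptr = [0]   # row i occupies data[indptr[i]:indptr[i + 1]]

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

print(len(data), 'stored elements out of', len(dense) * len(dense[0]))
print('indices:', indices)
print('indptr: ', indptr)

# Rebuild the first row from the compressed form
first_row = [0] * len(dense[0])
for pos in range(indptr[0], indptr[1]):
    first_row[indices[pos]] = data[pos]
print(first_row)
```

Only 9 of the 32 cells are stored here; for the tweets DTM, where almost every cell is zero, the saving is far larger.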
- {
- "cell_type": "code",
- "execution_count": 18,
- "id": "bb03a238-87d8-40c9-b20e-66e7c9b6576b",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "matrix([[0, 1, 0, 0, 0, 1, 1, 0],\n",
- " [0, 0, 0, 1, 1, 0, 0, 0],\n",
- " [1, 0, 1, 0, 0, 0, 0, 0],\n",
- " [0, 1, 0, 0, 0, 0, 0, 1]])"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Convert DTM to a dense matrix \n",
- "test_count.todense()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "28b58a63-d7f6-4b9f-aadf-4d4fc7341336",
- "metadata": {},
- "source": [
- "So this is our DTM! The matrix is the same as shown above. To make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with the `get_feature_names_out` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "id": "714de5d3-e37d-4a19-9ade-3c6629e38d4e",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array(['americano', 'coffee', 'iced', 'light', 'roast', 'roaster', 'the',\n",
- " 'time'], dtype=object)"
- ]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Retrieve the vocabulary\n",
- "vectorizer.get_feature_names_out()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "id": "6a7729a2-ca2e-4de7-8795-74dfedb7a4d5",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create a DTM dataframe\n",
- "test_dtm = pd.DataFrame(data=test_count.todense(),\n",
- " columns=vectorizer.get_feature_names_out())"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "781da407-f394-40f2-9d45-1fac39f02047",
- "metadata": {},
- "source": [
- "Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "id": "e41dd243-cd2e-43c3-80f8-5eaab6e64210",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML rendering of the table omitted; it shows the toy document-term matrix with one row per document and one column per token]"
+ "metadata": {
+ "id": "TTSrSpDSb8o7"
+ },
+ "id": "TTSrSpDSb8o7"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "21ed437f-9767-43b7-abc5-159aa4339a31",
+ "metadata": {
+ "id": "21ed437f-9767-43b7-abc5-159aa4339a31"
+ },
+ "outputs": [],
+ "source": [
+ "# Uncomment to install the NLP packages introduced in Part 1\n",
+ "%pip install NLTK\n",
+ "%pip install spaCy\n",
+ "!python -m spacy download en_core_web_sm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ " *This code fragment shows that, after the comment, several libraries and modules needed for working with data and visualization are imported: **re** handles regular expressions, **numpy** (abbreviated as np) is used for numerical computation and array operations, **pandas** (abbreviated as pd) is used for handling tabular data, **matplotlib.pyplot** (abbreviated as plt) and **seaborn** (abbreviated as sns) are used to create plots, while punctuation from the string library provides a set of punctuation characters useful for processing text; finally, **%matplotlib inline** is a Jupyter magic command that displays plots directly inside the notebook.*"
],
- "text/plain": [
- " americano coffee iced light roast roaster the time\n",
- "0 0 1 0 0 0 1 1 0\n",
- "1 0 0 0 1 1 0 0 0\n",
- "2 1 0 1 0 0 0 0 0\n",
- "3 0 1 0 0 0 0 0 1"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "test_dtm"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d59a03b4-94fa-4fe7-8f5d-7280e31b9bc4",
- "metadata": {},
- "source": [
- "Hopefully this toy example provides a clear walkthrough of creating a DTM.\n",
- "\n",
- "Now it's time for our tweets data!\n",
- "\n",
- "### DTM for Tweets\n",
- "\n",
- "We'll begin by initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. These parameters are currently set to their default values.\n",
- "\n",
- "When we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them. The next three parameters are used to control the size of the vocabulary, which we'll return to in a minute."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "id": "783e44a4-4a22-4290-b222-282b02c080dc",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create a CountVectorizer object\n",
- "vectorizer = CountVectorizer(lowercase=True,\n",
- " stop_words=None,\n",
- " min_df=1,\n",
- " max_df=1.0, \n",
- " max_features=None)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "id": "f85e76ea-bc54-4775-bcda-432a03d2c96f",
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "<11541x8751 sparse matrix of type '<class 'numpy.int64'>'\n",
- "\twith 191139 stored elements in Compressed Sparse Row format>"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Fit and transform to create DTM\n",
- "counts = vectorizer.fit_transform(tweets['text_processed'])\n",
- "counts"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "id": "87119057-c78c-4eb2-a9d6-3e9f44e4c22b",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([[0, 0, 0, ..., 0, 0, 0],\n",
- " [0, 0, 0, ..., 0, 0, 0],\n",
- " [0, 0, 0, ..., 0, 0, 0],\n",
- " ...,\n",
- " [0, 0, 0, ..., 0, 0, 0],\n",
- " [0, 0, 0, ..., 0, 0, 0],\n",
- " [0, 0, 0, ..., 0, 0, 0]])"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Do not run if you have limited memory - this includes DataHub and Binder\n",
- "np.array(counts.todense())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "id": "99322b85-1a15-46a5-bb80-bb5eaa6eeb7b",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Extract tokens\n",
- "tokens = vectorizer.get_feature_names_out()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "id": "43620587-3795-4434-8f1f-145c81b93706",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(11541, 8751)\n"
- ]
- }
- ],
- "source": [
- "# Create DTM\n",
- "first_dtm = pd.DataFrame(data=counts.todense(),\n",
- " index=tweets.index,\n",
- " columns=tokens)\n",
- "\n",
- "# Print the shape of DTM\n",
- "print(first_dtm.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2dd257d5-4244-436c-afe7-5688232caf8f",
- "metadata": {},
- "source": [
- "If we leave the `CountVectorizer` to the default setting, the vocabulary size of the tweet data is 8751. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "id": "bb3604ec-d909-4238-9a3f-67e7d4ae2ac5",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr><th></th><th>_exact_</th><th>_wtvd</th><th>aa</th><th>aaaand</th><th>aadv</th><th>aadvantage</th><th>aal</th><th>aaron</th><th>ab</th><th>aback</th><th>...</th><th>zero</th><th>zig</th><th>zip</th><th>zippers</th><th>zone</th><th>zones</th><th>zoom</th><th>zukes</th><th>zurich</th><th>zz</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
- "    <tr><th>1</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
- "    <tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
- "    <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
- "    <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "<p>5 rows × 8751 columns</p>\n",
- "</div>"
+ "metadata": {
+ "id": "1lxnCapVcUCM"
+ },
+ "id": "1lxnCapVcUCM"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f3862ffd-918f-4184-8c90-8a39a8a2a069",
+ "metadata": {
+ "id": "f3862ffd-918f-4184-8c90-8a39a8a2a069"
+ },
+ "outputs": [],
+ "source": [
+ "# Import other packages\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "from string import punctuation\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "852ea4a5-7c28-4557-acdd-afe8a97b7235",
+ "metadata": {
+ "id": "852ea4a5-7c28-4557-acdd-afe8a97b7235"
+ },
+ "source": [
+ "<a id='section1'></a>\n",
+ "\n",
+ "# Exploratory Data Analysis\n",
+ "\n",
+ "Before we do any preprocessing or modeling, we should always perform exploratory data analysis to familiarize ourselves with the data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*The code cell below begins with a comment indicating that data will be read in. First, the variable `tweets_path` is defined to hold the path to the file `airline_tweets.csv`, located in the `../data/` folder. Then `pd.read_csv(tweets_path, sep=',')` uses the pandas library to load that CSV file into a DataFrame called `tweets`, specifying that the values are separated by commas. The data is then ready to be analyzed within the notebook.*"
],
- "text/plain": [
- " _exact_ _wtvd aa aaaand aadv aadvantage aal aaron ab aback ... \\\n",
- "0 0 0 0 0 0 0 0 0 0 0 ... \n",
- "1 0 0 0 0 0 0 0 0 0 0 ... \n",
- "2 0 0 0 0 0 0 0 0 0 0 ... \n",
- "3 0 0 0 0 0 0 0 0 0 0 ... \n",
- "4 0 0 0 0 0 0 0 0 0 0 ... \n",
- "\n",
- " zero zig zip zippers zone zones zoom zukes zurich zz \n",
- "0 0 0 0 0 0 0 0 0 0 0 \n",
- "1 0 0 0 0 0 0 0 0 0 0 \n",
- "2 0 0 0 0 0 0 0 0 0 0 \n",
- "3 0 0 0 0 0 0 0 0 0 0 \n",
- "4 0 0 0 0 0 0 0 0 0 0 \n",
- "\n",
- "[5 rows x 8751 columns]"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "first_dtm.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "095d34e2-52f8-4419-b4c7-ed20dbd5df89",
- "metadata": {},
- "source": [
- "Most of the tokens have zero occurrences, at least in the first five tweets.\n",
- "\n",
- "Let's take a closer look at the DTM!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "id": "f432154a-eae0-4723-a797-55f3cfdd71c4",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "user 12882\n",
- "to 6987\n",
- "digit 6927\n",
- "the 5088\n",
- "you 3635\n",
- "for 3386\n",
- "flight 3320\n",
- "and 3276\n",
- "on 3142\n",
- "my 2751\n",
- "dtype: int64"
- ]
- },
- "execution_count": 28,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Most frequent tokens\n",
- "first_dtm.sum().sort_values(ascending=False).head(10)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "id": "26c7f1c9-dd66-49f2-b337-01253da551d2",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "_exact_ 1\n",
- "mightmismybrosgraduation 1\n",
- "midterm 1\n",
- "midnite 1\n",
- "midland 1\n",
- "michelle 1\n",
- "michele 1\n",
- "michael 1\n",
- "mhtt 1\n",
- "mgmt 1\n",
- "dtype: int64"
- ]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Least frequent tokens\n",
- "first_dtm.sum().sort_values(ascending=True).head(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5d230f79-e752-4e32-93db-4f013287f8e2",
- "metadata": {},
- "source": [
- "It is not surprising to see \"user\" and \"digit\" among the most frequent tokens, as we replaced each idiosyncratic username and number with these placeholders. The rest of the most frequent tokens are mostly stop words.\n",
- "\n",
- "Perhaps a more interesting pattern is which token appears most often within each tweet:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "id": "efb8f4d8-4c88-4155-a6c5-c72a5b4e8bb8",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr><th></th><th>token</th><th>number</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>3127</th><td>lt</td><td>6</td></tr>\n",
- "    <tr><th>918</th><td>worst</td><td>6</td></tr>\n",
- "    <tr><th>10572</th><td>to</td><td>5</td></tr>\n",
- "    <tr><th>8148</th><td>the</td><td>5</td></tr>\n",
- "    <tr><th>10742</th><td>to</td><td>5</td></tr>\n",
- "    <tr><th>152</th><td>to</td><td>5</td></tr>\n",
- "    <tr><th>5005</th><td>to</td><td>5</td></tr>\n",
- "    <tr><th>10923</th><td>the</td><td>5</td></tr>\n",
- "    <tr><th>7750</th><td>to</td><td>5</td></tr>\n",
- "    <tr><th>355</th><td>to</td><td>5</td></tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
+ "metadata": {
+ "id": "-Gu0S6LUdNpw"
+ },
+ "id": "-Gu0S6LUdNpw"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4190e351-97b7-4c5b-866e-07aa6cbd42c2",
+ "metadata": {
+ "id": "4190e351-97b7-4c5b-866e-07aa6cbd42c2"
+ },
+ "outputs": [],
+ "source": [
+ "# Read in data\n",
+ "tweets_path = '../data/airline_tweets.csv'\n",
+ "tweets = pd.read_csv(tweets_path, sep=',')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*The command `tweets.head()` displays the first rows of the DataFrame called `tweets`. By default it shows the first five rows, which lets us quickly review how the data is organized, check the column names, and inspect a few initial values without displaying the entire dataset.*"
],
- "text/plain": [
- " token number\n",
- "3127 lt 6\n",
- "918 worst 6\n",
- "10572 to 5\n",
- "8148 the 5\n",
- "10742 to 5\n",
- "152 to 5\n",
- "5005 to 5\n",
- "10923 the 5\n",
- "7750 to 5\n",
- "355 to 5"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "counts = pd.DataFrame()\n",
- "\n",
- "# For each tweet, retrieve the token that appears most frequently\n",
- "counts['token'] = first_dtm.idxmax(axis=1)\n",
- "\n",
- "# Retrieve how many times that token occurs in the tweet\n",
- "counts['number'] = first_dtm.max(axis=1)\n",
- "\n",
- "# Filter out placeholders\n",
- "counts[(counts['token']!='digit')\n",
- " & (counts['token']!='hashtag')\n",
- " & (counts['token']!='user')].sort_values('number', ascending=False).head(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7cdac4ef-6b9d-4aad-9b24-c70f6c2eb8f0",
- "metadata": {},
- "source": [
- "It looks like, setting aside the placeholders, a token appears at most six times in any single tweet, and it is either the token \"lt\" or the word \"worst.\"\n",
- "\n",
- "Let's go back to our tweets dataframe and locate the tweet at index 918."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "id": "5e7cacd8-1fb3-4f0d-a744-4ee0994a089f",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "\"@united is the worst. Worst reservation policies. Worst costumer service. Worst worst worst. Congrats, @Delta you're not that bad!\""
- ]
- },
- "execution_count": 31,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Retrieve the tweet at index 918: \"worst\"\n",
- "tweets.iloc[918]['text']"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3dba8e37-4880-4565-b6fc-7e7c96958f0f",
- "metadata": {},
- "source": [
- "## Customize the `CountVectorizer`\n",
- "\n",
- "So far we've always used the default parameter setting to create our DTMs, but in many cases we may want to customize the `CountVectorizer` object. The purpose of doing so is to further filter out unnecessary tokens. In the example below, we tweak the following parameters:\n",
- "\n",
- "- `stop_words = 'english'`: ignore English stop words \n",
- "- `min_df = 2`: ignore words that appear in fewer than two documents\n",
- "- `max_df = 0.95`: ignore words if they appear in more than 95\\% of the documents\n",
- "\n",
- "🔔 **Question**: Let's pause for a minute to discuss whether it sounds reasonable to set these parameters! What do you think?\n",
- "\n",
- "Oftentimes, we are not interested in words whose frequencies are either too low or too high, so we use `min_df` and `max_df` to filter them out. Alternatively, we can define our vocabulary size as $N$ by setting `max_features`. In other words, we tell `CountVectorizer` to only consider the top $N$ most frequent tokens when constructing the DTM."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "id": "37a0a93e-9dd8-43dc-a82c-06a24bf02bc9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Customize the parameter setting\n",
- "vectorizer = CountVectorizer(lowercase=True,\n",
- " stop_words='english',\n",
- " min_df=2,\n",
- " max_df=0.95,\n",
- " max_features=None)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "id": "b53e5ecf-7be3-4915-9d11-fd3edb913400",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Fit, transform, and get tokens\n",
- "counts = vectorizer.fit_transform(tweets['text_processed'])\n",
- "tokens = vectorizer.get_feature_names_out()\n",
- "\n",
- "# Create the second DTM\n",
- "second_dtm = pd.DataFrame(data=counts.todense(),\n",
- " index=tweets.index,\n",
- " columns=tokens)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6d2e66bc-2eaa-4642-8848-74459948084b",
- "metadata": {},
- "source": [
- "Our second DTM has a substantially smaller vocabulary compared to the first one."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "id": "570fb598-fa81-4111-9e36-7172d8034713",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(11541, 8751)\n",
- "(11541, 4471)\n"
- ]
- }
- ],
- "source": [
- "print(first_dtm.shape)\n",
- "print(second_dtm.shape)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "id": "d8deabb2-20eb-4047-b592-48cb1564fd2a",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
"
+ ],
+ "text/plain": [
+ " tweet_id airline_sentiment airline_sentiment_confidence \\\n",
+ "0 570306133677760513 neutral 1.0000 \n",
+ "1 570301130888122368 positive 0.3486 \n",
+ "2 570301083672813571 neutral 0.6837 \n",
+ "3 570301031407624196 negative 1.0000 \n",
+ "4 570300817074462722 negative 1.0000 \n",
+ "\n",
+ " negativereason negativereason_confidence airline \\\n",
+ "0 NaN NaN Virgin America \n",
+ "1 NaN 0.0000 Virgin America \n",
+ "2 NaN NaN Virgin America \n",
+ "3 Bad Flight 0.7033 Virgin America \n",
+ "4 Can't Tell 1.0000 Virgin America \n",
+ "\n",
+ " airline_sentiment_gold name negativereason_gold retweet_count \\\n",
+ "0 NaN cairdin NaN 0 \n",
+ "1 NaN jnardino NaN 0 \n",
+ "2 NaN yvonnalynn NaN 0 \n",
+ "3 NaN jnardino NaN 0 \n",
+ "4 NaN jnardino NaN 0 \n",
+ "\n",
+ " text tweet_coord \\\n",
+ "0 @VirginAmerica What @dhepburn said. NaN \n",
+ "1 @VirginAmerica plus you've added commercials t... NaN \n",
+ "2 @VirginAmerica I didn't today... Must mean I n... NaN \n",
+ "3 @VirginAmerica it's really aggressive to blast... NaN \n",
+ "4 @VirginAmerica and it's a really big bad thing... NaN \n",
+ "\n",
+ " tweet_created tweet_location user_timezone \n",
+ "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n",
+ "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n",
+ "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n",
+ "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n",
+ "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) "
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " aa aadv aadvantage aal abandoned abc ability able aboard abq ... \\\n",
- "0 0 0 0 0 0 0 0 0 0 0 ... \n",
- "1 0 0 0 0 0 0 0 0 0 0 ... \n",
- "2 0 0 0 0 0 0 0 0 0 0 ... \n",
- "3 0 0 0 0 0 0 0 0 0 0 ... \n",
- "4 0 0 0 0 0 0 0 0 0 0 ... \n",
- "\n",
- " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
- "0 0 0 0 0 0 0 0 0 0 0 \n",
- "1 0 0 0 0 0 0 0 0 0 0 \n",
- "2 0 0 0 0 0 0 0 0 0 0 \n",
- "3 0 0 0 0 0 0 0 0 0 0 \n",
- "4 0 0 0 0 0 0 0 0 0 0 \n",
- "\n",
- "[5 rows x 4471 columns]"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "second_dtm.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "998fe2c3-ec90-4027-8c7f-417327a33a27",
- "metadata": {},
- "source": [
- "The most frequent token list now includes words that make more sense to us, such as \"cancelled\" and \"service.\" "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "id": "ffa7bf4e-640b-49bc-b64b-721140f67f76",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "digit 6927\n",
- "flight 3320\n",
- "hashtag 2633\n",
- "cancelled 956\n",
- "thanks 921\n",
- "service 910\n",
- "just 801\n",
- "customer 726\n",
- "time 695\n",
- "help 687\n",
- "dtype: int64"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "second_dtm.sum().sort_values(ascending=False).head(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3e8b5145-d505-4e36-9a39-a40d25d8ec6f",
- "metadata": {},
- "source": [
- "## 🥊 Challenge 2: Lemmatize the Text Input\n",
- "\n",
- "Recall from Part 1 that we introduced using `spaCy` to perform lemmatization, i.e., to \"recover\" the base form of a word. This process reduces vocabulary size by keeping word variations minimal. A smaller vocabulary may help improve model performance in sentiment classification.\n",
- "\n",
- "Now let's implement lemmatization on our tweet data and use the lemmatized text to create a third DTM. \n",
- "\n",
- "Complete the function `lemmatize_text`. It requires a text input and returns the lemmas of all tokens. \n",
- "\n",
- "Here are some hints to guide you through this challenge:\n",
- "\n",
- "- Step 1: initialize a list to hold lemmas\n",
- "- Step 2: apply the `nlp` pipeline to the input text\n",
- "- Step 3: iterate over tokens in the processed text and retrieve the lemma of the token\n",
- " - HINT: lemmatization is one of the linguistic annotations that the `nlp` pipeline automatically does for us. We can use `token.lemma_` to access the annotation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "id": "da610560-62c3-48ab-a1b2-25e0b589bc61",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import spaCy\n",
- "import spacy\n",
- "nlp = spacy.load('en_core_web_sm')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "98ead266-30f3-48ad-bc51-c1685487f000",
- "metadata": {
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "# Create a function to lemmatize text\n",
- "def lemmatize_text(text):\n",
- " '''Lemmatize the text input with spaCy annotations.'''\n",
- "\n",
- " # Step 1: Initialize an empty list to hold lemmas\n",
- " lemma = ...\n",
- "\n",
- " # Step 2: Apply the nlp pipeline to input text\n",
- " doc = ...\n",
- "\n",
- " # Step 3: Iterate over tokens in the text to get the token lemma\n",
- " for token in doc:\n",
- " lemma.append(...)\n",
- "\n",
- " # Step 4: Join lemmas together into a single string\n",
- " text_lemma = ' '.join(lemma)\n",
- " \n",
- " return text_lemma"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "cf36aab6-35dd-42a2-9b38-b7c432f021c6",
- "metadata": {},
- "source": [
- "Let's apply the function to the following example tweet first!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "id": "742e82bb-5c42-4fa8-9101-5a0ea908db25",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "USER wow this just blew my mind\n",
- "==================================================\n",
- "USER wow this just blow my mind\n"
- ]
- }
- ],
- "source": [
- "# Apply the function to an example tweet\n",
- "print(tweets.iloc[33][\"text_processed\"])\n",
- "print(f\"{'='*50}\")\n",
- "print(lemmatize_text(tweets.iloc[33]['text_processed']))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bbeda987-dc32-4979-b158-c24be7d1a420",
- "metadata": {},
- "source": [
- "And then let's lemmatize the tweet data and save the output to a new column `text_lemmatized`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "id": "1ac128d2-1be5-4ef5-bb50-5b8d44ef8ee9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# This may take a while!\n",
- "tweets['text_lemmatized'] = tweets['text_processed'].apply(lambda x: lemmatize_text(x))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2c02aad6-4e71-4afc-80cf-31d4f39498b2",
- "metadata": {},
- "source": [
- "Now with the `text_lemmatized` column, let's create a third DTM. The parameter setting is the same as the second DTM. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "id": "5f49d790-3c9d-4dc1-a5c9-72c306630412",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
aa
\n",
- "
aadv
\n",
- "
aadvantage
\n",
- "
aal
\n",
- "
abandon
\n",
- "
abc
\n",
- "
ability
\n",
- "
able
\n",
- "
aboard
\n",
- "
abq
\n",
- "
...
\n",
- "
yummy
\n",
- "
yup
\n",
- "
yvonne
\n",
- "
yvr
\n",
- "
yyj
\n",
- "
yyz
\n",
- "
zero
\n",
- "
zone
\n",
- "
zoom
\n",
- "
zurich
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
...
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
...
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
...
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
...
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
\n",
- "
\n",
- "
4
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
...
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
0
\n",
- "
\n",
- " \n",
- "
\n",
- "
5 rows × 3553 columns
\n",
- "
"
+ "source": [
+ "tweets.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80232c78-ac41-4d74-a581-76c9dac3b8f6",
+ "metadata": {
+ "id": "80232c78-ac41-4d74-a581-76c9dac3b8f6"
+ },
+ "source": [
+ "As a refresher, each row in this dataframe corresponds to a tweet. The following columns are of main interest to us. There are other columns containing metadata of the tweet, such as the author of the tweet, when it was created, the timezone of the user, and others, which we will set aside for now.\n",
+ "\n",
+ "- `text` (`str`): the text of the tweet.\n",
+ "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n",
+ "- `airline` (`str`): the airline that is tweeted about.\n",
+ "- `retweet_count` (`int`): how many times the tweet was retweeted.\n",
+ "\n",
+ "To prepare us for sentiment classification, we'll partition the dataset to focus on the \"positive\" and \"negative\" tweets for now."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In the line of code below, the `tweets` DataFrame is filtered to exclude every row whose `airline_sentiment` column has the value \"neutral\", so that only records with positive or negative sentiment remain. Then `.reset_index(drop=True)` resets the row index so that it is consecutive with no gaps, leaving the resulting DataFrame clean and organized for analysis.*"
],
- "text/plain": [
- " aa aadv aadvantage aal abandon abc ability able aboard abq ... \\\n",
- "0 0 0 0 0 0 0 0 0 0 0 ... \n",
- "1 0 0 0 0 0 0 0 0 0 0 ... \n",
- "2 0 0 0 0 0 0 0 0 0 0 ... \n",
- "3 0 0 0 0 0 0 0 0 0 0 ... \n",
- "4 0 0 0 0 0 0 0 0 0 0 ... \n",
- "\n",
- " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
- "0 0 0 0 0 0 0 0 0 0 0 \n",
- "1 0 0 0 0 0 0 0 0 0 0 \n",
- "2 0 0 0 0 0 0 0 0 0 0 \n",
- "3 0 0 0 0 0 0 0 0 0 0 \n",
- "4 0 0 0 0 0 0 0 0 0 0 \n",
- "\n",
- "[5 rows x 3553 columns]"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Create the vectorizer (the same param setting as previous)\n",
- "vectorizer = CountVectorizer(lowercase=True,\n",
- " stop_words='english',\n",
- " min_df=2,\n",
- " max_df=0.95,\n",
- " max_features=None)\n",
- "\n",
- "# Fit, transform, and get tokens\n",
- "counts = vectorizer.fit_transform(tweets['text_lemmatized'])\n",
- "tokens = vectorizer.get_feature_names_out()\n",
- "\n",
- "# Create the third DTM\n",
- "third_dtm = pd.DataFrame(data=counts.todense(),\n",
- " index=tweets.index,\n",
- " columns=tokens)\n",
- "third_dtm.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "id": "9859eb04-dbd2-4fa0-9798-65ed7496c297",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(11541, 8751)\n",
- "(11541, 4471)\n",
- "(11541, 3553)\n"
- ]
- }
- ],
- "source": [
- "# Print the shapes of three DTMs\n",
- "print(first_dtm.shape)\n",
- "print(second_dtm.shape)\n",
- "print(third_dtm.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fa94c8ac-e4f4-4b76-afdb-1d4af54a3eee",
- "metadata": {},
- "source": [
- "Let's print the top 10 most frequent tokens as usual. These tokens are now lemmas and their counts also change after lemmatization. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "id": "5745ca29-97ed-4fe1-81db-7e402c8da674",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "digit 6927\n",
- "flight 4043\n",
- "hashtag 2633\n",
- "thank 1455\n",
- "hour 1134\n",
- "cancel 948\n",
- "delay 937\n",
- "service 937\n",
- "customer 902\n",
- "time 856\n",
- "dtype: int64"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Get the most frequent tokens in the third DTM\n",
- "third_dtm.sum().sort_values(ascending=False).head(10)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "id": "16c63e6a-50c3-448a-9a56-a1d193cd6680",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "digit 6927\n",
- "flight 3320\n",
- "hashtag 2633\n",
- "cancelled 956\n",
- "thanks 921\n",
- "service 910\n",
- "just 801\n",
- "customer 726\n",
- "time 695\n",
- "help 687\n",
- "dtype: int64"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Compared to the most frequent tokens in the second DTM\n",
- "second_dtm.sum().sort_values(ascending=False).head(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "38363398-fdf5-456b-ae3d-cae9d5294140",
- "metadata": {},
- "source": [
- "<a id='section4'></a>\n",
- "\n",
- "# Term Frequency-Inverse Document Frequency \n",
- "\n",
- "So far, we're relying on word frequency to give us information about a document. This assumes if a word appears more often in a document, it's more informative. However, this may not always be the case. For example, we've already removed stop words because they are not informative, despite the fact that they appear many times in a document. We also know the word \"flight\" is among the most frequent words, but it is not that informative, because it appears in many documents. Since we're looking at airline tweets, we shouldn't be surprised to see the word \"flight\"!\n",
- "\n",
- "To remedy this, we use a weighting scheme called **tf-idf (term frequency-inverse document frequency)**. The big idea behind tf-idf is to weight a word not just by its frequency within a document, but also by its frequency in one document relative to the remaining documents. So, when we construct the DTM, we will be assigning each term a **tf-idf score**. Specifically, term $t$ in document $d$ is assigned a tf-idf score as follows:\n",
- "\n",
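- "$$\\text{tf-idf}(t, d) = \\text{tf}(t, d) \\times \\text{idf}(t), \\qquad \\text{idf}(t) = \\ln\\frac{N}{\\text{df}(t)} + 1$$\n",
- "\n",
- "Here $\\text{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $\\text{df}(t)$ is the number of documents that contain $t$.\n",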
- "\n",
- "In essence, the tf-idf score of a word in a document is the product of two components: **term frequency (tf)** and **inverse document frequency (idf)**. The idf acts as a scaling factor. If a word occurs in all documents, then idf equals 1 and no scaling happens. For any word that is missing from at least one document, idf is greater than 1, which raises the tf-idf score to highlight that the word is informative. In practice, we add 1 to both the numerator and the denominator of the ratio inside the idf (\"add-1 smoothing\") to prevent any issues with zero occurrences.\n",
- "\n",
- "We can also create a tf-idf DTM using `sklearn`. We'll use a `TfidfVectorizer` this time:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "id": "f5e32d8a-c42d-475f-aab4-21eca8b1aee8",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.feature_extraction.text import TfidfVectorizer"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "id": "d23916c1-5693-456c-b71d-6d9d78d1e2e4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create a tfidf vectorizer\n",
- "vectorizer = TfidfVectorizer(lowercase=True,\n",
- " stop_words='english',\n",
- " min_df=2,\n",
- " max_df=0.95,\n",
- " max_features=None)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "id": "7af5b342-ab18-4766-9561-e38e50cd1e9b",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "<11541x3553 sparse matrix of type '<class 'numpy.float64'>'\n",
- "\twith 88287 stored elements in Compressed Sparse Row format>"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Fit and transform \n",
- "tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])\n",
- "tf_dtm"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "id": "55e509c8-5402-4be0-9143-0e448fff7066",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
aa
\n",
- "
aadv
\n",
- "
aadvantage
\n",
- "
aal
\n",
- "
abandon
\n",
- "
abc
\n",
- "
ability
\n",
- "
able
\n",
- "
aboard
\n",
- "
abq
\n",
- "
...
\n",
- "
yummy
\n",
- "
yup
\n",
- "
yvonne
\n",
- "
yvr
\n",
- "
yyj
\n",
- "
yyz
\n",
- "
zero
\n",
- "
zone
\n",
- "
zoom
\n",
- "
zurich
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
...
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
...
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
...
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
...
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
\n",
- "
\n",
- "
4
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
...
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
0.0
\n",
- "
\n",
- " \n",
- "
\n",
- "
5 rows × 3553 columns
\n",
- "
"
+ "metadata": {
+ "id": "h1I96r9redwT"
+ },
+ "id": "h1I96r9redwT"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a1faaf90-8c01-4d25-9468-90c01823f0d5",
+ "metadata": {
+ "id": "a1faaf90-8c01-4d25-9468-90c01823f0d5"
+ },
+ "outputs": [],
+ "source": [
+ "tweets = tweets[tweets['airline_sentiment'] != 'neutral'].reset_index(drop=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7cb6b039-53e7-4afe-a9e0-b3522c12b2d7",
+ "metadata": {
+ "id": "7cb6b039-53e7-4afe-a9e0-b3522c12b2d7"
+ },
+ "source": [
+ "Let's take a look at a few tweets first!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In this code block, the comment indicates that the first five tweets will be displayed. The loop `for idx in range(5)` iterates over the indices 0 through 4, and on each iteration uses `tweets['text'].iloc[idx]` to access the corresponding tweet in the `text` column of the DataFrame and print it. This gives a quick look at the first five tweets in the dataset.*"
],
- "text/plain": [
- " aa aadv aadvantage aal abandon abc ability able aboard abq ... \\\n",
- "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
- "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
- "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
- "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
- "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
- "\n",
- " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
- "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
- "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
- "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
- "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
- "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
- "\n",
- "[5 rows x 3553 columns]"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Create a tf-idf dataframe\n",
- "tfidf = pd.DataFrame(tf_dtm.todense(),\n",
- " columns=vectorizer.get_feature_names_out(),\n",
- " index=tweets.index)\n",
- "tfidf.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "45ba13ea-c429-4ff1-a9a2-abf27c4d0888",
- "metadata": {},
- "source": [
- "You may have noticed that the vocabulary size is the same as we saw in Challenge 2. This is because we used the same parameter setting when creating the vectorizer. But the values in the matrix are different—they are tf-idf scores instead of raw counts. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fa58c360-5c55-4fa0-8c55-1f00e68baa9a",
- "metadata": {},
- "source": [
- "## Interpret TF-IDF Values"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bdad233d-ebc1-420f-9b67-c227c48f3e60",
- "metadata": {},
- "source": [
- "Let's take a look at the document in which each term has its highest tf-idf value. We'll use the `.idxmax()` method to find the index."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "id": "995b511a-d448-4cfb-a6a0-22a465efd8a8",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "aa 10077\n",
- "aadv 9285\n",
- "aadvantage 9974\n",
- "aal 10630\n",
- "abandon 7859\n",
- " ... \n",
- "yyz 1350\n",
- "zero 2705\n",
- "zone 3177\n",
- "zoom 3920\n",
- "zurich 10622\n",
- "Length: 3553, dtype: int64"
- ]
- },
- "execution_count": 49,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Retrieve the index of the document\n",
- "tfidf.idxmax()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fccc0249-7c68-42ee-8290-ff41715e346b",
- "metadata": {},
- "source": [
- "For example, the term \"worst\" has its highest tf-idf value in the tweet at index 918. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "id": "09b222fb-ad8c-4767-a974-dd261370a06e",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "918"
- ]
- },
- "execution_count": 50,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tfidf.idxmax()['worst']"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "955a48bc-dc93-481b-ba49-29876fc577fb",
- "metadata": {},
- "source": [
- "Recall that this is the tweet where the word \"worst\" appears six times!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "id": "079ee0e0-476f-4236-ba8a-615ba7a0efe8",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "\"USER is the worst. worst reservation policies. worst costumer service. worst worst worst. congrats, USER you're not that bad!\""
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tweets['text_processed'].iloc[918]"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9dd06bbc-e2fc-49e4-9354-efdaca5cfbd3",
- "metadata": {},
- "source": [
- "How about \"cancel\"? Let's take a look at another example. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "id": "f809df1a-1178-4272-a415-42edb20173b2",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "5945"
- ]
- },
- "execution_count": 52,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tfidf.idxmax()['cancel']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "id": "8093b6a7-54ca-468a-9376-b3c0be0b6f9b",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'USER cancelled flighted 😢'"
- ]
- },
- "execution_count": 53,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tweets['text_processed'].iloc[5945]"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "163dcecd-dc8c-43a9-952d-5bc84a307b07",
- "metadata": {},
- "source": [
- "## 🥊 Challenge 3: Words with Highest Mean TF-IDF scores\n",
- "\n",
- "We have obtained tf-idf values for each term in each document. But what do these values tell us about the sentiments of tweets? Are there any words that are particularly informative for positive/negative tweets? \n",
- "\n",
- "To explore this, let's gather the indices of all positive/negative tweets and calculate the mean tf-idf scores of the words that appear in each category. \n",
- "\n",
- "We've provided the following starter code to guide you:\n",
- "- Subset the `tweets` dataframe according to the `airline_sentiment` label and retrieve the index of each subset (`.index`). Assign the index to `positive_index` or `negative_index`.\n",
- "- For each subset:\n",
- " - Retrieve the tf-idf representation \n",
- " - Take the mean tf-idf values across the subset using `.mean()`\n",
- " - Sort the mean values in descending order using `.sort_values()`\n",
- " - Get the top 10 terms using `.head()`\n",
- "\n",
- "Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2bfbf838-9ff6-48b8-ad5d-5e75304fe060",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Complete the boolean masks \n",
- "positive_index = tweets[...].index\n",
- "negative_index = tweets[...].index"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8c67ea1f-de9e-49a9-94f2-a3351446e364",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Complete the following two lines\n",
- "pos = tfidf.loc[...].mean().sort_values(...).head(...)\n",
- "neg = tfidf.loc[...].mean().sort_values(...).head(...)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f1e29043-8c78-4e41-81d2-b4552030b457",
- "metadata": {},
- "outputs": [],
- "source": [
- "pos.plot(kind='barh', \n",
- " xlim=(0, 0.18),\n",
- " color='cornflowerblue',\n",
- " title='Top 10 terms with the highest mean tf-idf values for positive tweets');"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e8b25940-2372-4755-818e-f75e4d23daf9",
- "metadata": {},
- "outputs": [],
- "source": [
- "neg.plot(kind='barh', \n",
- " xlim=(0, 0.18),\n",
- " color='darksalmon',\n",
- " title='Top 10 terms with the highest mean tf-idf values for negative tweets');"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "77bca876-9649-46f3-bd4f-f9f68fea649a",
- "metadata": {},
- "source": [
- "🔔 **Question**: How would you interpret these results? Share your thoughts in the chat!"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "da410cb3-a452-441b-a94d-8f751d59d7a6",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 🎬 **Demo**: Sentiment Classification Using the TF-IDF Representation\n",
- "\n",
- "Now that we have a tf-idf representation of the text, we are ready to do sentiment analysis!\n",
- "\n",
- "In this demo, we will use a logistic regression model to perform the classification task. Here we briefly step through how logistic regression works as one of the supervised Machine Learning methods, but feel free to explore our workshop on [Python Machine Learning Fundamentals](https://github.com/dlab-berkeley/Python-Machine-Learning) if you want to learn more about it.\n",
- "\n",
- "Logistic regression is a linear model that we use to predict the label of a tweet from a set of features ($x_1, x_2, \\ldots, x_T$), as shown below:\n",
- "\n",
- "$$\n",
- "L = \\beta_1 x_1 + \\beta_2 x_2 + \\cdots + \\beta_T x_T\n",
- "$$\n",
- "\n",
- "The list of features we'll pass to the model is the vocabulary of the DTM. We also feed the model a portion of the data, known as the training set, along with other model specifications, to learn the coefficient ($\\beta_1, \\beta_2, \\ldots, \\beta_T$) of each feature. The coefficients tell us whether a feature contributes positively or negatively to the predicted value. The predicted value $L$ is the sum of all features multiplied by their coefficients; it is then passed to a [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), which converts it into a probability $p$. The predicted label is positive when $p>0.5$ and negative when $p<0.5$. \n",
- "\n",
- "The remaining portion of the data, known as the test set, is used to test whether the learned coefficients could be generalized to unseen data. \n",
- "\n",
- "Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!"
- ]
- },
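The scoring step described above can be sketched in plain Python. This is a minimal illustration, not the `sklearn` implementation: the tokens, coefficients, and tf-idf values are made up for the example, and the sketch omits the intercept term to match the equation in the text.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, coefficients):
    """Return (probability, label) for one document.

    The score is the dot product of features and coefficients,
    matching L = beta_1 * x_1 + ... + beta_T * x_T.
    """
    score = sum(b * x for b, x in zip(coefficients, features))
    p = sigmoid(score)
    label = "positive" if p > 0.5 else "negative"
    return p, label

# Hypothetical vocabulary with made-up coefficients (assumptions for illustration)
tokens = ["thank", "great", "rude"]
coefficients = [2.0, 1.5, -3.0]

# Hypothetical tf-idf rows for two tweets
p_pos, label_pos = predict([0.6, 0.4, 0.0], coefficients)  # score = 1.8
p_neg, label_neg = predict([0.0, 0.0, 0.7], coefficients)  # score = -2.1

print(label_pos, round(p_pos, 3))
print(label_neg, round(p_neg, 3))
```

A positive coefficient pushes the score, and therefore the probability, up whenever its token is present; a negative coefficient pushes it down.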
- {
- "cell_type": "code",
- "execution_count": 55,
- "id": "33413d63-87eb-489f-b374-3cfeaa51cf3c",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import LogisticRegressionCV\n",
- "from sklearn.model_selection import train_test_split"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ee87ff74-3fbb-472a-b795-6f4d18fab215",
- "metadata": {},
- "source": [
- "We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "id": "64cec8b9-14d9-4897-9c02-cc89fcf7b3c6",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Train-test split\n",
- "X = tfidf\n",
- "y = tweets['airline_sentiment']\n",
- "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "066771d8-2f31-4646-9a1b-6d2b1b9b208c",
- "metadata": {},
- "source": [
- "The `fit_logistic_regression` function is written below to streamline the training process."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "id": "d46de0b2-af00-4a1d-b4cd-31b96ce545d1",
- "metadata": {},
- "outputs": [],
- "source": [
- "def fit_logistic_regression(X, y):\n",
- " '''Fits a logistic regression model to provided data.'''\n",
- " model = LogisticRegressionCV(Cs=10,\n",
- " penalty='l1',\n",
- " cv=5,\n",
- " solver='liblinear',\n",
- " class_weight='balanced',\n",
- " random_state=42,\n",
- " refit=True).fit(X, y)\n",
- " return model"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "124aa7ea-1bc1-43e2-beeb-0ba2da9b2df9",
- "metadata": {},
- "source": [
- "We'll fit the model and compute the training and test accuracy."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "id": "773963bd-6603-4fad-884b-09ce60afab18",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Fit the logistic regression model\n",
- "model = fit_logistic_regression(X_train, y_train)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "id": "e10d06c1-d884-45d4-a03d-dd5d40bf70aa",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Training accuracy: 0.9455601998164951\n",
- "Test accuracy: 0.894919168591224\n"
- ]
- }
- ],
- "source": [
- "# Get the training and test accuracy\n",
- "print(f\"Training accuracy: {model.score(X_train, y_train)}\")\n",
- "print(f\"Test accuracy: {model.score(X_test, y_test)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d4e186c5-1719-4deb-bdb4-614a9980f058",
- "metadata": {},
- "source": [
- "The model achieved ~95% accuracy on the training set and ~89% on the test set, which is pretty good! The model generalizes reasonably well to the test data."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "310dac39-4753-4ae8-8dfa-e65e5824cccb",
- "metadata": {},
- "source": [
- "Next, let's take a look at the fitted coefficients to check whether they make sense. \n",
- "\n",
- "We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 60,
- "id": "6dcb6ef1-13b3-437e-813c-7118911847a4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get coefs of all features\n",
- "coefs = model.coef_.ravel()\n",
- "\n",
- "# Get all tokens\n",
- "tokens = vectorizer.get_feature_names_out()\n",
- "\n",
- "# Create a token-coef dataframe\n",
- "importance = pd.DataFrame()\n",
- "importance['token'] = tokens\n",
- "importance['coefs'] = coefs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "id": "3e63814e-9c0d-4f7a-a5e0-72cca2758d71",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<!-- HTML rendering of neg_coef omitted: the 10 tokens with the lowest coefficients (rude, screw, ruin, hour, break, pay, cancel, bad, luggage, strand) -->"
+ "metadata": {
+ "id": "nSGYf82xexRJ"
+ },
+ "id": "nSGYf82xexRJ"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "438830e6-1064-47fe-b578-a1ca693a0ed0",
+ "metadata": {
+ "id": "438830e6-1064-47fe-b578-a1ca693a0ed0",
+ "outputId": "ae408e8f-86fe-4502-c316-b3129c5d76ed"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "@VirginAmerica plus you've added commercials to the experience... tacky.\n",
+ "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse\n",
+ "@VirginAmerica and it's a really big bad thing about it\n",
+ "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\n",
+ "it's really the only bad thing about flying VA\n",
+ "@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)\n"
+ ]
+ }
],
- "text/plain": [
- " token coefs\n",
- "2724 rude -11.138668\n",
- "2784 screw -9.962456\n",
- "2727 ruin -9.849836\n",
- "1505 hour -9.282416\n",
- "389 break -7.949487\n",
- "2280 pay -7.823908\n",
- "458 cancel -7.534084\n",
- "264 bad -7.357206\n",
- "1872 luggage -7.093317\n",
- "3034 strand -7.046890"
- ]
- },
- "execution_count": 61,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Get the top 10 tokens with lowest coefs\n",
- "neg_coef = importance.sort_values('coefs').head(10)\n",
- "neg_coef"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "id": "0d596bf7-753c-40cd-ac52-4a37163650ae",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<!-- HTML rendering of pos_coef omitted: the 10 tokens with the highest coefficients (thankful, exceptional, impressed, compliment, great, wonderful, excellent, awesome, kudo, thank) -->"
+ "source": [
+ "# Print first five tweets\n",
+ "for idx in range(5):\n",
+ " print(tweets['text'].iloc[idx])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d6746f8-b29c-40d4-bef6-b4afd4cd6cc1",
+ "metadata": {
+ "id": "0d6746f8-b29c-40d4-bef6-b4afd4cd6cc1"
+ },
+ "source": [
+ "We can already see that some of these tweets contain negative sentiment—how can we tell this is the case?\n",
+ "\n",
+ "Next, let's take a look at the distribution of sentiment labels in this dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In this code block, the comment indicates that we will create a bar plot showing the number of tweets for each sentiment. The `sns.countplot` function from seaborn counts how many tweets of each type appear in the `airline_sentiment` column. We specify that column as the x-axis, set the bar color to `cornflowerblue`, and order the categories with \"positive\" first and \"negative\" second. This makes it easy to see how many positive and negative tweets the dataset contains.*"
],
- "text/plain": [
- " token coefs\n",
- "3165 thankful 8.002975\n",
- "1091 exceptional 8.136278\n",
- "1563 impressed 8.501364\n",
- "648 compliment 8.981360\n",
- "1373 great 9.080558\n",
- "3498 wonderful 9.401606\n",
- "1089 excellent 10.147230\n",
- "250 awesome 10.315909\n",
- "1746 kudo 11.623828\n",
- "3164 thank 16.027534"
- ]
- },
- "execution_count": 62,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Get the top 10 tokens with highest coefs\n",
- "pos_coef = importance.sort_values('coefs').tail(10)\n",
- "pos_coef "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7b3b7893-caa0-4281-98f0-92c9e7b31953",
- "metadata": {},
- "source": [
- "Let's plot the top 10 tokens with the highest/lowest coefficients. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "id": "17b1223b-e5c1-4992-bb7e-0a99651c3729",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Make a bar plot showing the count of tweet sentiments\n",
+ "sns.countplot(data=tweets,\n",
+ " x='airline_sentiment',\n",
+ " color='cornflowerblue',\n",
+ " order=['positive', 'negative']);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eab45abf-adf4-4f5e-ae09-75f6c4fd50d1",
+ "metadata": {
+ "id": "eab45abf-adf4-4f5e-ae09-75f6c4fd50d1"
+ },
+ "source": [
+ "It looks like the majority of the tweets in this dataset are expressing negative sentiment!\n",
+ "\n",
+ "Let's see which sentiment gets retweeted more:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In this code block, the comment indicates that we want to compute the mean number of retweets for each sentiment. First, `tweets.groupby('airline_sentiment')` groups the data by the `airline_sentiment` column, separating positive tweets from negative ones. Then `['retweet_count'].mean()` selects the `retweet_count` column of each group and computes its mean. The result is the average number of retweets for positive and negative tweets, which helps us understand which sentiment tends to generate more interaction.*"
+ ],
+ "metadata": {
+ "id": "tN7Gp0RwfJS-"
+ },
+ "id": "tN7Gp0RwfJS-"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "428ddde7-af73-4eb6-92c9-041a1791ca59",
+ "metadata": {
+ "id": "428ddde7-af73-4eb6-92c9-041a1791ca59",
+ "outputId": "6bcfd9f3-38cb-4234-c76b-3b36e8633a32"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "airline_sentiment\n",
+ "negative 0.093375\n",
+ "positive 0.069403\n",
+ "Name: retweet_count, dtype: float64"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the mean retweet count for each sentiment\n",
+ "tweets.groupby('airline_sentiment')['retweet_count'].mean()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d31f3bc-257c-48a8-86a0-fd0d7c3e8cb3",
+ "metadata": {
+ "id": "0d31f3bc-257c-48a8-86a0-fd0d7c3e8cb3"
+ },
+ "source": [
+ "On average, negative tweets are retweeted slightly more often than positive ones (about 0.09 versus 0.07 retweets per tweet).\n",
+ "\n",
+ "Let's see which airline receives the highest proportion of negative tweets:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In this code block, the comment indicates that we want to compute the proportion of negative tweets for each airline. First, `tweets.groupby(['airline', 'airline_sentiment']).size()` counts the tweets of each sentiment for each airline; dividing by `tweets.groupby('airline').size()`, the total number of tweets per airline, gives the proportion of each sentiment. Then `proportions.unstack()` reshapes the data so that each sentiment becomes a column, and `sort_values('negative', ascending=False)` sorts the airlines from the highest to the lowest proportion of negative tweets, showing which ones receive the most criticism.*"
+ ],
+ "metadata": {
+ "id": "q2aUAry5fgUP"
+ },
+ "id": "q2aUAry5fgUP"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12aa9f2d-d655-494a-bb72-08ad973518f3",
+ "metadata": {
+ "id": "12aa9f2d-d655-494a-bb72-08ad973518f3",
+ "outputId": "613c5a79-5468-441e-d9a9-1367bfe0bf5e"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<!-- HTML rendering omitted: proportion of negative and positive tweets by airline, sorted by the negative proportion -->"
+ ],
+ "text/plain": [
+ "airline_sentiment negative positive\n",
+ "airline \n",
+ "US Airways 0.893760 0.106240\n",
+ "American 0.853659 0.146341\n",
+ "United 0.842560 0.157440\n",
+ "Southwest 0.675399 0.324601\n",
+ "Delta 0.637091 0.362909\n",
+ "Virgin America 0.543544 0.456456"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the proportion of negative tweets by airline\n",
+ "proportions = tweets.groupby(['airline', 'airline_sentiment']).size() / tweets.groupby('airline').size()\n",
+ "proportions.unstack().sort_values('negative', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7042419e-9c41-40e7-8dbf-47bd1e2ad45a",
+ "metadata": {
+ "id": "7042419e-9c41-40e7-8dbf-47bd1e2ad45a"
+ },
+ "source": [
+ "It looks like people are most dissatisfied with US Airways, followed by American Airlines, both having over 85\\% negative tweets!\n",
+ "\n",
+ "There are many more interesting discoveries to be made if you explore the data further. For now, let's return to our task of sentiment analysis. Before that, we need to preprocess the text data so that it is in a standard format."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e513930-2dc7-489c-bc5a-22eb09add5bf",
+ "metadata": {
+ "id": "9e513930-2dc7-489c-bc5a-22eb09add5bf"
+ },
+ "source": [
+ "\n",
+ "# Preprocessing\n",
+ "\n",
+ "We spent much of Part 1 learning how to preprocess data. Let's apply what we learned! Looking at some of the tweets above, we can see that while they are in pretty good shape, we can do some additional processing on them.\n",
+ "\n",
+ "In our pipeline, we'll omit the tokenization process since we will perform it in a later step."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "92a83ece-f3b2-4200-9d22-0788fbc07fa4",
+ "metadata": {
+ "id": "92a83ece-f3b2-4200-9d22-0788fbc07fa4"
+ },
+ "source": [
+ "## 🥊 Challenge 1: Apply a Text Cleaning Pipeline\n",
+ "\n",
+ "Write a function called `preprocess` that performs the following steps on a text input:\n",
+ "\n",
+ "* Step 1: Lowercase the text input.\n",
+ "* Step 2: Replace the following patterns with placeholders:\n",
+ " * URLs → ` URL `\n",
+ " * Digits → ` DIGIT `\n",
+ " * Hashtags → ` HASHTAG `\n",
+ " * Tweet handles → ` USER `\n",
+ "* Step 3: Remove extra blankspace.\n",
+ "\n",
+ "Here are some hints to guide you through this challenge:\n",
+ "\n",
+ "* For Step 1, recall from Part 1 that a string method called [`.lower()`](https://docs.python.org/3.11/library/stdtypes.html#str.lower) can be used to convert text to lowercase.\n",
+ "* We have integrated Step 2 into a function called `placeholder`. Run the cell below to import it into your notebook, and you can use it just like any other function.\n",
+ "* For Step 3, we have provided the regex pattern for identifying whitespace characters as well as the correct replacement for extra whitespace.\n",
+ "\n",
+ "Run your `preprocess` function on `example_tweet` (three cells below) to check if it works. If it does, apply it to the entire `text` column in the tweets dataframe."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This line of code imports something called `placeholder` from a module called `utils`. This means that the file or package `utils.py` defines a function, class, or variable named `placeholder`, and we want to use it in the current code. Essentially, it gives us access to functionality defined in another file without having to repeat its implementation.*"
+ ],
+ "metadata": {
+ "id": "a0AM2TmKf38o"
+ },
+ "id": "a0AM2TmKf38o"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "21738b02-9ab9-4a61-b41f-ff75888aa747",
+ "metadata": {
+ "tags": [],
+ "id": "21738b02-9ab9-4a61-b41f-ff75888aa747"
+ },
+ "outputs": [],
+ "source": [
+ "from utils import placeholder"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This code block first defines two variables: `blankspace_pattern = r'\\s+'`, a regular expression that matches one or more whitespace characters, and `blankspace_repl = ' '`, which indicates that those characters will be replaced by a single space. It then defines the function `preprocess(text)`, whose purpose is to build a preprocessing pipeline that cleans the tweet data. Inside the function, the comments lay out the steps: first convert all text to lowercase, then replace certain patterns with placeholders, and finally remove extra whitespace. At the end, the function returns the clean, standardized text, ready for analysis or further processing.*"
+ ],
+ "metadata": {
+ "id": "CQqYp2MEf6S4"
+ },
+ "id": "CQqYp2MEf6S4"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792",
+ "metadata": {
+ "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792"
+ },
+ "outputs": [],
+ "source": [
+ "blankspace_pattern = r'\\s+'\n",
+ "blankspace_repl = ' '\n",
+ "\n",
+ "def preprocess(text):\n",
+ " '''Create a preprocess pipeline that cleans the tweet data.'''\n",
+ "\n",
+ " # Step 1: Lowercase\n",
+ " text = ...\n",
+ "\n",
+ " # Step 2: Replace patterns with placeholders\n",
+ " text = ...\n",
+ "\n",
+ " # Step 3: Remove extra whitespace characters\n",
+ " text = ...\n",
+ "\n",
+ " return text"
+ ]
+ },
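One possible way to fill in the three steps is sketched below. The `utils.placeholder` function is not shown in this notebook, so `placeholder_sketch` is a hypothetical stand-in with assumed regex patterns; the real helper may use different ones. The final `.strip()` is also an assumption, added so that the result has no leading or trailing space.

```python
import re

def placeholder_sketch(text):
    """Hypothetical stand-in for utils.placeholder (patterns are assumptions)."""
    # URLs go first because they can contain digits and '#' characters
    text = re.sub(r'https?://\S+|www\.\S+', ' URL ', text)
    text = re.sub(r'@\w+', ' USER ', text)      # tweet handles
    text = re.sub(r'#\w+', ' HASHTAG ', text)   # hashtags
    text = re.sub(r'\d+', ' DIGIT ', text)      # runs of digits
    return text

blankspace_pattern = r'\s+'
blankspace_repl = ' '

def preprocess_sketch(text):
    """Lowercase, insert placeholders, and collapse extra whitespace."""
    # Step 1: Lowercase (done first so the uppercase placeholders survive)
    text = text.lower()
    # Step 2: Replace patterns with placeholders
    text = placeholder_sketch(text)
    # Step 3: Collapse runs of whitespace and trim the ends
    text = re.sub(blankspace_pattern, blankspace_repl, text).strip()
    return text

example_tweet = ('lol @justinbeiber and @BillGates are like soo 2000 '
                 '#yesterday #amiright saw it on https://twitter.com #yolo')
print(preprocess_sketch(example_tweet))
```

Lowercasing before inserting the placeholders matters: doing it afterwards would turn `USER` into `user` and make the placeholders indistinguishable from ordinary words.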
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8990cefd-5d04-46ba-ada2-29978c28cfe8",
+ "metadata": {
+ "id": "8990cefd-5d04-46ba-ada2-29978c28cfe8",
+ "outputId": "61e6c6b8-05df-4acc-db84-b4b3151615ef"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo\n",
+ "==================================================\n",
+ "lol USER and USER are like soo DIGIT HASHTAG HASHTAG saw it on URL HASHTAG\n"
+ ]
+ }
+ ],
+ "source": [
+ "example_tweet = 'lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo'\n",
+ "\n",
+ "# Print the example tweet\n",
+ "print(example_tweet)\n",
+ "print(f\"{'='*50}\")\n",
+ "\n",
+ "# Print the preprocessed tweet\n",
+ "print(preprocess(example_tweet))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*In this code block, the cleaning function `preprocess` is applied to each tweet in the `text` column using `.apply()`, and the result is stored in a new column called `text_processed`. This means that each tweet is transformed according to the preprocessing pipeline defined earlier (for example, lowercasing, replacing patterns, and removing extra whitespace). Finally, `tweets['text_processed'].head()` shows the first five processed tweets, so we can quickly check what the cleaned text looks like.*"
+ ],
+ "metadata": {
+ "id": "Hr0Bxs3JgPqi"
+ },
+ "id": "Hr0Bxs3JgPqi"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a5f7bb6a-f064-48cc-b650-12c4ef2fbb88",
+ "metadata": {
+ "scrolled": true,
+ "id": "a5f7bb6a-f064-48cc-b650-12c4ef2fbb88",
+ "outputId": "ca00ca39-d1b8-4570-bb16-cbe8c16cae2c"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 USER plus you've added commercials to the expe...\n",
+ "1 USER it's really aggressive to blast obnoxious...\n",
+ "2 USER and it's a really big bad thing about it\n",
+ "3 USER seriously would pay $ DIGIT a flight for ...\n",
+ "4 USER yes, nearly every time i fly vx this “ear...\n",
+ "Name: text_processed, dtype: object"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Apply the function to the text column and assign the preprocessed tweets to a new column\n",
+ "tweets['text_processed'] = tweets['text'].apply(preprocess)\n",
+ "tweets['text_processed'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1576acc6-b305-492a-8fde-65b343cb779c",
+ "metadata": {
+ "id": "1576acc6-b305-492a-8fde-65b343cb779c"
+ },
+ "source": [
+ "Congratulations! Preprocessing is done. Let's dive into the bag-of-words!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53282330-54da-4e1c-bfe5-e77cb8fa3add",
+ "metadata": {
+ "id": "53282330-54da-4e1c-bfe5-e77cb8fa3add"
+ },
+ "source": [
+ "\n",
+ "# The Bag-of-Words Representation\n",
+ "\n",
+ "The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a document and toss it in a bag. The action of \"throwing\" the document in a bag disregards the relative position between words, so what is \"in the bag\" is essentially \"an unsorted set of words\" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). In return, we have a list of unique words and the frequency of each of them.\n",
+ "\n",
+ "For example, as shown in the following illustration, the word \"coffee\" appears twice.\n",
+ "\n",
+ "\n",
+ "\n",
+ "A bag-of-words representation relies heavily on word frequency and discards word order.\n",
+ "\n",
+ "In the context of sentiment analysis, the sentiment of a tweet is often conveyed by specific words. For example, if a tweet contains the word \"happy,\" it likely conveys positive sentiment, but not always (e.g., \"not happy\" denotes the opposite sentiment). The more often such words appear, the more strongly they are likely to signal that sentiment."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9d9bdbd-406d-469b-a8f6-41d1b3687c37",
+ "metadata": {
+ "id": "b9d9bdbd-406d-469b-a8f6-41d1b3687c37"
+ },
+ "source": [
+ "## Document Term Matrix\n",
+ "\n",
+ "Now let's implement the idea of bag-of-words. Before we dive deeper, let's step back for a moment. In practice, text analysis often involves handling many documents; from now on, we use the term **document** to represent a piece of text on which we perform analysis. It could be a phrase, a sentence, a tweet, or any other text. As long as it can be represented by a string, the length doesn't really matter.\n",
+ "\n",
+ "Imagine we have four documents (i.e., the four phrases shown above), and we toss them all in the bag. Instead of a word-frequency list, we'd expect a document-term matrix (DTM) in return. In a DTM, the word list is the **vocabulary** (V) that holds all unique words that occur across the documents. For each **document** (D), we count the number of occurrences of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we will construct is a $D \\times V$ matrix (with $D$ the number of documents and $V$ the vocabulary size), where each row corresponds to a document, and each column corresponds to a token (or \"term\").\n",
+ "\n",
+ "The unique tokens in this set of documents, arranged in alphabetical order, form the columns. For each document, we record the number of occurrences of each word present in the document. The numerical representation for each document is a row in the matrix. For example, the first document, \"the coffee roaster,\" has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.\n",
+ "\n",
+ "Note that the left index column now displays these documents as text, but typically we would just assign an index to each of them.\n",
+ "\n",
+ "$$\n",
+ "\\begin{array}{c|cccccccccccc}\n",
+ " & \\text{americano} & \\text{coffee} & \\text{iced} & \\text{light} & \\text{roast} & \\text{roaster} & \\text{the} & \\text{time} \\\\\\hline\n",
+ "\\text{the coffee roaster} &0 &1\t&0\t&0\t&0\t&1\t&1\t&0 \\\\\n",
+ "\\text{light roast} &0 &0\t&0\t&1\t&1\t&0\t&0\t&0 \\\\\n",
+ "\\text{iced americano} &1 &0\t&1\t&0\t&0\t&0\t&0\t&0 \\\\\n",
+ "\\text{coffee time} &0 &1\t&0\t&0\t&0\t&0\t&0\t&1 \\\\\n",
+ "\\end{array}\n",
+ "$$\n",
+ "\n",
+ "To create a DTM, we will use `CountVectorizer` from the package `sklearn`."
+ ]
+ },
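+ {
+ "cell_type": "markdown",
+ "id": "sketch-manual-dtm",
+ "metadata": {
+ "id": "sketch-manual-dtm"
+ },
+ "source": [
+ "To make the construction of the matrix concrete, here is a minimal sketch that builds the same DTM by hand using only the Python standard library. It assumes simple whitespace tokenization, and the names `docs`, `vocab`, and `dtm` are illustrative rather than part of the workshop code.\n",
+ "\n",
+ "```python\n",
+ "from collections import Counter\n",
+ "\n",
+ "# The four toy documents from the example above\n",
+ "docs = ['the coffee roaster', 'light roast', 'iced americano', 'coffee time']\n",
+ "\n",
+ "# Vocabulary: every unique word across all documents, in alphabetical order\n",
+ "vocab = sorted({word for doc in docs for word in doc.split()})\n",
+ "\n",
+ "# One row per document, one column per vocabulary word\n",
+ "dtm = [[Counter(doc.split())[word] for word in vocab] for doc in docs]\n",
+ "\n",
+ "print(vocab)\n",
+ "print(dtm[0])  # row for 'the coffee roaster'\n",
+ "```\n",
+ "\n",
+ "`CountVectorizer` performs the same counting for us and also takes care of tokenization and lowercasing."
+ ]
+ },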
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This line imports `CountVectorizer` from the `sklearn.feature_extraction.text` module of the scikit-learn library. `CountVectorizer` is a tool that converts text into numbers by building a matrix in which each row represents a document (for example, a tweet) and each column represents a word, with values indicating how many times each word appears. This makes text data usable by machine learning models.*"
+ ],
+ "metadata": {
+ "id": "A1t6-BDPgfmH"
+ },
+ "id": "A1t6-BDPgfmH"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cd2adf56-ba93-459d-8cfa-16ce8dc9284b",
+ "metadata": {
+ "id": "cd2adf56-ba93-459d-8cfa-16ce8dc9284b"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_extraction.text import CountVectorizer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4989781d-6b40-417a-be70-eeba05cd8a50",
+ "metadata": {
+ "id": "4989781d-6b40-417a-be70-eeba05cd8a50"
+ },
+ "source": [
+ "The following illustration depicts the three-step workflow of creating a DTM with `CountVectorizer`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Let's walk through these steps with the toy example shown above."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "34174034-46b9-43e2-a511-5972d378cb00",
+ "metadata": {
+ "id": "34174034-46b9-43e2-a511-5972d378cb00"
+ },
+ "source": [
+ "### A Toy Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4da2bd3d-0460-4b5f-9b9e-02940db0d7ca",
+ "metadata": {
+ "id": "4da2bd3d-0460-4b5f-9b9e-02940db0d7ca"
+ },
+ "outputs": [],
+ "source": [
+ "# A toy example containing four documents\n",
+ "test = ['the coffee roaster',\n",
+ " 'light roast',\n",
+ " 'iced americano',\n",
+ " 'coffee time']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dff7c1d3-fcee-4e20-b9a7-17306ebd5fc2",
+ "metadata": {
+ "id": "dff7c1d3-fcee-4e20-b9a7-17306ebd5fc2"
+ },
+ "source": [
+ "The first step is to initialize a `CountVectorizer` object. Within the parentheses, we can specify parameter settings if desired. Let's take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available.\n",
+ "\n",
+ "For now we can just leave it blank to use the default settings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9de3fe6a-9abf-4e11-aad1-e54c891567bb",
+ "metadata": {
+ "id": "9de3fe6a-9abf-4e11-aad1-e54c891567bb"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a CountVectorizer object\n",
+ "vectorizer = CountVectorizer()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b5a7d0d-0bfc-4fb9-8e5f-e91e39797fb5",
+ "metadata": {
+ "id": "1b5a7d0d-0bfc-4fb9-8e5f-e91e39797fb5"
+ },
+ "source": [
+ "The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. The third step is to `transform` our data according to the \"fitted\" `CountVectorizer` object, which means taking each document and counting the occurrences of tokens according to the vocabulary established during the \"fitting\" step.\n",
+ "\n",
+ "It may sound a bit complex, but steps 2 and 3 can be done in a single call using the `fit_transform` method."
+ ]
+ },
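+ {
+ "cell_type": "markdown",
+ "id": "sketch-fit-then-transform",
+ "metadata": {
+ "id": "sketch-fit-then-transform"
+ },
+ "source": [
+ "As a sketch of why fitting and transforming are separate steps, the vocabulary learned by `fit` can be reused to `transform` documents that were not part of the fitting data. The variable names and the new document below are illustrative assumptions.\n",
+ "\n",
+ "```python\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "\n",
+ "docs = ['the coffee roaster', 'light roast', 'iced americano', 'coffee time']\n",
+ "\n",
+ "# Step 2: learn the vocabulary from the documents\n",
+ "sketch_vectorizer = CountVectorizer()\n",
+ "sketch_vectorizer.fit(docs)\n",
+ "\n",
+ "# Step 3: count the tokens of a new document against that fixed vocabulary\n",
+ "new_counts = sketch_vectorizer.transform(['iced coffee latte'])\n",
+ "\n",
+ "# 'latte' was never seen during fitting, so it has no column and is ignored\n",
+ "print(sketch_vectorizer.get_feature_names_out())\n",
+ "print(new_counts.toarray())\n",
+ "```\n",
+ "\n",
+ "This separation matters later for classification, where a vocabulary is typically fitted on training data and then applied unchanged to held-out data."
+ ]
+ },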
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "da1bbad4-bb1a-4b92-9096-6e17558b4a42",
+ "metadata": {
+ "id": "da1bbad4-bb1a-4b92-9096-6e17558b4a42"
+ },
+ "outputs": [],
+ "source": [
+ "# Fit and transform to create a DTM\n",
+ "test_count = vectorizer.fit_transform(test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "324d3b65-4e98-48bf-87d2-399457f4939c",
+ "metadata": {
+ "id": "324d3b65-4e98-48bf-87d2-399457f4939c"
+ },
+ "source": [
+ "The return value of `fit_transform` is the DTM.\n",
+ "\n",
+ "Let's take a look at it!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cb044001-8eb2-4489-b025-2d8e2d4bfee2",
+ "metadata": {
+ "id": "cb044001-8eb2-4489-b025-2d8e2d4bfee2",
+ "outputId": "db26bb61-3db0-4a87-c0bf-dec1a19391f6"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<4x8 sparse matrix of type '<class 'numpy.int64'>'\n",
+ "\twith 9 stored elements in Compressed Sparse Row format>"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_count"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9817b09-a806-42c4-9436-822cc27a38b9",
+ "metadata": {
+ "id": "f9817b09-a806-42c4-9436-822cc27a38b9"
+ },
+ "source": [
+ "Apparently we've got a \"sparse matrix\"—a matrix that contains a lot of zeros. This makes sense. For each document, there are words that don't occur at all, and these are counted as zero in the DTM. This sparse matrix is stored in a \"Compressed Sparse Row\" format, a memory-saving format designed for handling sparse matrices.\n",
+ "\n",
+ "Let's convert it to a dense matrix, where those zeros are explicitly represented, as in a NumPy array."
+ ]
+ },
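+ {
+ "cell_type": "markdown",
+ "id": "sketch-dtm-density",
+ "metadata": {
+ "id": "sketch-dtm-density"
+ },
+ "source": [
+ "To get a feel for how sparse a DTM is, one option is to compare the number of stored (non-zero) entries with the total number of cells. The sketch below rebuilds the toy DTM so that it runs on its own; `toy_counts` and `density` are illustrative names.\n",
+ "\n",
+ "```python\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "\n",
+ "docs = ['the coffee roaster', 'light roast', 'iced americano', 'coffee time']\n",
+ "toy_counts = CountVectorizer().fit_transform(docs)\n",
+ "\n",
+ "n_rows, n_cols = toy_counts.shape\n",
+ "\n",
+ "# nnz is the number of entries actually stored in the sparse matrix\n",
+ "density = toy_counts.nnz / (n_rows * n_cols)\n",
+ "print(f'{toy_counts.nnz} of {n_rows * n_cols} cells are non-zero ({density:.1%})')\n",
+ "```\n",
+ "\n",
+ "Real corpora are usually far sparser than this toy example, which is why storing only the non-zero entries saves memory."
+ ]
+ },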
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb03a238-87d8-40c9-b20e-66e7c9b6576b",
+ "metadata": {
+ "id": "bb03a238-87d8-40c9-b20e-66e7c9b6576b",
+ "outputId": "97a84c1a-a0c4-4237-845f-a48ddf5d4181"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "matrix([[0, 1, 0, 0, 0, 1, 1, 0],\n",
+ " [0, 0, 0, 1, 1, 0, 0, 0],\n",
+ " [1, 0, 1, 0, 0, 0, 0, 0],\n",
+ " [0, 1, 0, 0, 0, 0, 0, 1]])"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Convert DTM to a dense matrix\n",
+ "test_count.todense()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b58a63-d7f6-4b9f-aadf-4d4fc7341336",
+ "metadata": {
+ "id": "28b58a63-d7f6-4b9f-aadf-4d4fc7341336"
+ },
+ "source": [
+ "So this is our DTM! The matrix is the same as shown above. To make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with the `get_feature_names_out` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "714de5d3-e37d-4a19-9ade-3c6629e38d4e",
+ "metadata": {
+ "id": "714de5d3-e37d-4a19-9ade-3c6629e38d4e",
+ "outputId": "fd976dd8-e521-4be1-d543-192e13845b1d"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array(['americano', 'coffee', 'iced', 'light', 'roast', 'roaster', 'the',\n",
+ " 'time'], dtype=object)"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Retrieve the vocabulary\n",
+ "vectorizer.get_feature_names_out()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6a7729a2-ca2e-4de7-8795-74dfedb7a4d5",
+ "metadata": {
+ "id": "6a7729a2-ca2e-4de7-8795-74dfedb7a4d5"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a DTM dataframe\n",
+ "test_dtm = pd.DataFrame(data=test_count.todense(),\n",
+ " columns=vectorizer.get_feature_names_out())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "781da407-f394-40f2-9d45-1fac39f02047",
+ "metadata": {
+ "id": "781da407-f394-40f2-9d45-1fac39f02047"
+ },
+ "source": [
+ "Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e41dd243-cd2e-43c3-80f8-5eaab6e64210",
+ "metadata": {
+ "id": "e41dd243-cd2e-43c3-80f8-5eaab6e64210",
+ "outputId": "ddb94d16-441b-4ebe-dd45-2db08cf39ac3"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>americano</th><th>coffee</th><th>iced</th><th>light</th><th>roast</th><th>roaster</th><th>the</th><th>time</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>\n",
+ "    <tr><th>1</th><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>2</th><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>3</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " americano coffee iced light roast roaster the time\n",
+ "0 0 1 0 0 0 1 1 0\n",
+ "1 0 0 0 1 1 0 0 0\n",
+ "2 1 0 1 0 0 0 0 0\n",
+ "3 0 1 0 0 0 0 0 1"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_dtm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d59a03b4-94fa-4fe7-8f5d-7280e31b9bc4",
+ "metadata": {
+ "id": "d59a03b4-94fa-4fe7-8f5d-7280e31b9bc4"
+ },
+ "source": [
+ "Hopefully this toy example provides a clear walkthrough of creating a DTM.\n",
+ "\n",
+ "Now it's time for our tweets data!\n",
+ "\n",
+ "### DTM for Tweets\n",
+ "\n",
+ "We'll begin by initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. These parameters are currently set to their default values.\n",
+ "\n",
+ "When we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them. The next three parameters are used to control the size of the vocabulary, which we'll return to in a minute."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*The variable `vectorizer` stores this object, which is configured to convert text into a matrix of word counts. The parameters mean the following: `lowercase=True` converts all text to lowercase, `stop_words=None` keeps common words, `min_df=1` includes words that appear in at least one document, `max_df=1.0` allows words that appear in every document, and `max_features=None` places no limit on the number of words considered. This object will be used to transform the tweets into numeric data that machine learning models can process.*"
+ ],
+ "metadata": {
+ "id": "J-byaxyLhCtS"
+ },
+ "id": "J-byaxyLhCtS"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "783e44a4-4a22-4290-b222-282b02c080dc",
+ "metadata": {
+ "id": "783e44a4-4a22-4290-b222-282b02c080dc"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a CountVectorizer object\n",
+ "vectorizer = CountVectorizer(lowercase=True,\n",
+ " stop_words=None,\n",
+ " min_df=1,\n",
+ " max_df=1.0,\n",
+ " max_features=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*The line `counts = vectorizer.fit_transform(tweets['text_processed'])` does two things: first, `fit` learns the vocabulary from the processed tweets; then `transform` converts each tweet into a vector of word counts based on that vocabulary.*"
+ ],
+ "metadata": {
+ "id": "BcEdpBovhST8"
+ },
+ "id": "BcEdpBovhST8"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f85e76ea-bc54-4775-bcda-432a03d2c96f",
+ "metadata": {
+ "scrolled": true,
+ "id": "f85e76ea-bc54-4775-bcda-432a03d2c96f",
+ "outputId": "761cbe0a-1936-4e19-a90f-7ba9ed8c8ec2"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<11541x8751 sparse matrix of type '<class 'numpy.int64'>'\n",
+ "\twith 191139 stored elements in Compressed Sparse Row format>"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Fit and transform to create DTM\n",
+ "counts = vectorizer.fit_transform(tweets['text_processed'])\n",
+ "counts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This line converts the sparse matrix `counts` into a dense `numpy.ndarray`. Calling `counts.todense()` expands the compressed representation of the word counts into a full matrix with every zero stored explicitly, and `np.array(...)` then converts it into a NumPy array. This makes the data easier to manipulate directly for further analysis or processing.*"
+ ],
+ "metadata": {
+ "id": "r3b5VJBvhiNA"
+ },
+ "id": "r3b5VJBvhiNA"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "87119057-c78c-4eb2-a9d6-3e9f44e4c22b",
+ "metadata": {
+ "id": "87119057-c78c-4eb2-a9d6-3e9f44e4c22b",
+ "outputId": "d313cfd4-66c0-4e96-d955-083f73b78567"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0]])"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Do not run if you have limited memory - this includes DataHub and Binder\n",
+ "np.array(counts.todense())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99322b85-1a15-46a5-bb80-bb5eaa6eeb7b",
+ "metadata": {
+ "id": "99322b85-1a15-46a5-bb80-bb5eaa6eeb7b"
+ },
+ "outputs": [],
+ "source": [
+ "# Extract tokens\n",
+ "tokens = vectorizer.get_feature_names_out()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "43620587-3795-4434-8f1f-145c81b93706",
+ "metadata": {
+ "id": "43620587-3795-4434-8f1f-145c81b93706",
+ "outputId": "a55de36b-27e8-45a3-e52b-941b008b0ca8"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(11541, 8751)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Create DTM\n",
+ "first_dtm = pd.DataFrame(data=counts.todense(),\n",
+ " index=tweets.index,\n",
+ " columns=tokens)\n",
+ "\n",
+ "# Print the shape of DTM\n",
+ "print(first_dtm.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2dd257d5-4244-436c-afe7-5688232caf8f",
+ "metadata": {
+ "id": "2dd257d5-4244-436c-afe7-5688232caf8f"
+ },
+ "source": [
+ "With `CountVectorizer` left at its default settings, the vocabulary size of the tweet data is 8751."
+ ]
+ },
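+ {
+ "cell_type": "markdown",
+ "id": "sketch-sparse-column-sums",
+ "metadata": {
+ "id": "sketch-sparse-column-sums"
+ },
+ "source": [
+ "If converting the full DTM to a dense dataframe is too memory-hungry, one alternative is to compute summaries directly on the sparse matrix. The sketch below uses the toy documents so that it is self-contained; applying the same calls to the tweet matrix is an assumption about how you might adapt it, and the variable names are illustrative.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "\n",
+ "docs = ['the coffee roaster', 'light roast', 'iced americano', 'coffee time']\n",
+ "sketch_vectorizer = CountVectorizer()\n",
+ "sparse_counts = sketch_vectorizer.fit_transform(docs)\n",
+ "\n",
+ "# Sum each column without densifying; the result is a 1 x V matrix, so flatten it\n",
+ "totals = np.asarray(sparse_counts.sum(axis=0)).ravel()\n",
+ "\n",
+ "# Pair each token with its total count and sort from most to least frequent\n",
+ "ranked = sorted(zip(sketch_vectorizer.get_feature_names_out(), totals),\n",
+ "                key=lambda pair: pair[1], reverse=True)\n",
+ "print(ranked[:3])\n",
+ "```\n",
+ "\n",
+ "Only the vector of column totals is materialized here, which has one entry per vocabulary token rather than one per document-token pair."
+ ]
+ },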
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb3604ec-d909-4238-9a3f-67e7d4ae2ac5",
+ "metadata": {
+ "id": "bb3604ec-d909-4238-9a3f-67e7d4ae2ac5",
+ "outputId": "71611a7b-3fb7-4019-977a-b0c8464143fc"
+ },
+ "outputs": [
+ {
+ "data": {
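+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>token</th><th>number</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>3127</th><td>lt</td><td>6</td></tr>\n",
+ "    <tr><th>918</th><td>worst</td><td>6</td></tr>\n",
+ "    <tr><th>10572</th><td>to</td><td>5</td></tr>\n",
+ "    <tr><th>8148</th><td>the</td><td>5</td></tr>\n",
+ "    <tr><th>10742</th><td>to</td><td>5</td></tr>\n",
+ "    <tr><th>152</th><td>to</td><td>5</td></tr>\n",
+ "    <tr><th>5005</th><td>to</td><td>5</td></tr>\n",
+ "    <tr><th>10923</th><td>the</td><td>5</td></tr>\n",
+ "    <tr><th>7750</th><td>to</td><td>5</td></tr>\n",
+ "    <tr><th>355</th><td>to</td><td>5</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"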
+ ],
+ "text/plain": [
+ " token number\n",
+ "3127 lt 6\n",
+ "918 worst 6\n",
+ "10572 to 5\n",
+ "8148 the 5\n",
+ "10742 to 5\n",
+ "152 to 5\n",
+ "5005 to 5\n",
+ "10923 the 5\n",
+ "7750 to 5\n",
+ "355 to 5"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "counts = pd.DataFrame()\n",
+ "\n",
+ "# For each tweet, retrieve the token that appears most frequently\n",
+ "counts['token'] = first_dtm.idxmax(axis=1)\n",
+ "\n",
+ "# Retrieve how many times that token occurs in the tweet\n",
+ "counts['number'] = first_dtm.max(axis=1)\n",
+ "\n",
+ "# Filter out placeholders\n",
+ "counts[(counts['token']!='digit')\n",
+ " & (counts['token']!='hashtag')\n",
+ " & (counts['token']!='user')].sort_values('number', ascending=False).head(10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7cdac4ef-6b9d-4aad-9b24-c70f6c2eb8f0",
+ "metadata": {
+ "id": "7cdac4ef-6b9d-4aad-9b24-c70f6c2eb8f0"
+ },
+ "source": [
+ "It looks like no token appears more than six times in a single tweet, and the tokens that reach six are \"lt\" and \"worst.\"\n",
+ "\n",
+ "Let's go back to our tweets dataframe and locate the 918th tweet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5e7cacd8-1fb3-4f0d-a744-4ee0994a089f",
+ "metadata": {
+ "id": "5e7cacd8-1fb3-4f0d-a744-4ee0994a089f",
+ "outputId": "bc9ca1e9-8f52-4958-a3e9-3cb3e992201c"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"@united is the worst. Worst reservation policies. Worst costumer service. Worst worst worst. Congrats, @Delta you're not that bad!\""
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Retrieve 918th tweet: \"worst\"\n",
+ "tweets.iloc[918]['text']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3dba8e37-4880-4565-b6fc-7e7c96958f0f",
+ "metadata": {
+ "id": "3dba8e37-4880-4565-b6fc-7e7c96958f0f"
+ },
+ "source": [
+ "## Customize the `CountVectorizer`\n",
+ "\n",
+ "So far we've always used the default parameter setting to create our DTMs, but in many cases we may want to customize the `CountVectorizer` object. The purpose of doing so is to further filter out unnecessary tokens. In the example below, we tweak the following parameters:\n",
+ "\n",
+ "- `stop_words = 'english'`: ignore English stop words\n",
+ "- `min_df = 2`: ignore words that appear in fewer than two documents\n",
+ "- `max_df = 0.95`: ignore words if they appear in more than 95\\% of the documents\n",
+ "\n",
+ "🔔 **Question**: Let's pause for a minute to discuss whether it sounds reasonable to set these parameters! What do you think?\n",
+ "\n",
+ "Oftentimes, we are not interested in words whose frequencies are either too low or too high, so we use `min_df` and `max_df` to filter them out. Alternatively, we can define our vocabulary size as $N$ by setting `max_features`. In other words, we tell `CountVectorizer` to only consider the top $N$ most frequent tokens when constructing the DTM."
+ ]
+ },
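+ {
+ "cell_type": "markdown",
+ "id": "sketch-min-df-document-frequency",
+ "metadata": {
+ "id": "sketch-min-df-document-frequency"
+ },
+ "source": [
+ "Note that `min_df` and `max_df` refer to document frequency, that is, the number (or share) of documents containing a token, not its total number of occurrences. A minimal sketch with two made-up documents illustrates the difference; the documents and variable names are assumptions for illustration only.\n",
+ "\n",
+ "```python\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "\n",
+ "# Two made-up documents: 'worst' occurs three times, but only in one document\n",
+ "docs = ['worst worst worst flight', 'great flight']\n",
+ "\n",
+ "# min_df=2 keeps only tokens that appear in at least two documents\n",
+ "sketch_vectorizer = CountVectorizer(min_df=2)\n",
+ "sketch_vectorizer.fit(docs)\n",
+ "\n",
+ "print(sketch_vectorizer.get_feature_names_out())\n",
+ "```\n",
+ "\n",
+ "Only `flight` survives, because it is the only token found in both documents; the repeated `worst` is dropped despite its high count."
+ ]
+ },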
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "37a0a93e-9dd8-43dc-a82c-06a24bf02bc9",
+ "metadata": {
+ "id": "37a0a93e-9dd8-43dc-a82c-06a24bf02bc9"
+ },
+ "outputs": [],
+ "source": [
+ "# Customize the parameter setting\n",
+ "vectorizer = CountVectorizer(lowercase=True,\n",
+ " stop_words='english',\n",
+ " min_df=2,\n",
+ " max_df=0.95,\n",
+ " max_features=None)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b53e5ecf-7be3-4915-9d11-fd3edb913400",
+ "metadata": {
+ "id": "b53e5ecf-7be3-4915-9d11-fd3edb913400"
+ },
+ "outputs": [],
+ "source": [
+ "# Fit, transform, and get tokens\n",
+ "counts = vectorizer.fit_transform(tweets['text_processed'])\n",
+ "tokens = vectorizer.get_feature_names_out()\n",
+ "\n",
+ "# Create the second DTM\n",
+ "second_dtm = pd.DataFrame(data=counts.todense(),\n",
+ " index=tweets.index,\n",
+ " columns=tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d2e66bc-2eaa-4642-8848-74459948084b",
+ "metadata": {
+ "id": "6d2e66bc-2eaa-4642-8848-74459948084b"
+ },
+ "source": [
+ "Our second DTM has a substantially smaller vocabulary compared to the first one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "570fb598-fa81-4111-9e36-7172d8034713",
+ "metadata": {
+ "id": "570fb598-fa81-4111-9e36-7172d8034713",
+ "outputId": "4194f1fd-f2d4-4037-8e74-d0e9cc0a12da"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(11541, 8751)\n",
+ "(11541, 4471)\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(first_dtm.shape)\n",
+ "print(second_dtm.shape)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8deabb2-20eb-4047-b592-48cb1564fd2a",
+ "metadata": {
+ "id": "d8deabb2-20eb-4047-b592-48cb1564fd2a",
+ "outputId": "da080b87-ec96-4e49-9a47-d1c63a3e7635"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>aa</th><th>aadv</th><th>aadvantage</th><th>aal</th><th>abandoned</th><th>abc</th><th>ability</th><th>able</th><th>aboard</th><th>abq</th><th>...</th><th>yummy</th><th>yup</th><th>yvonne</th><th>yvr</th><th>yyj</th><th>yyz</th><th>zero</th><th>zone</th><th>zoom</th><th>zurich</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>1</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>5 rows × 4471 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " aa aadv aadvantage aal abandoned abc ability able aboard abq ... \\\n",
+ "0 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "1 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "2 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "3 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "4 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "\n",
+ " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
+ "0 0 0 0 0 0 0 0 0 0 0 \n",
+ "1 0 0 0 0 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 0 0 0 0 0 \n",
+ "\n",
+ "[5 rows x 4471 columns]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "second_dtm.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "998fe2c3-ec90-4027-8c7f-417327a33a27",
+ "metadata": {
+ "id": "998fe2c3-ec90-4027-8c7f-417327a33a27"
+ },
+ "source": [
+ "The most frequent token list now includes words that make more sense to us, such as \"cancelled\" and \"service.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ffa7bf4e-640b-49bc-b64b-721140f67f76",
+ "metadata": {
+ "id": "ffa7bf4e-640b-49bc-b64b-721140f67f76",
+ "outputId": "99601ba3-8894-465d-bd35-4d73897933bd"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "digit 6927\n",
+ "flight 3320\n",
+ "hashtag 2633\n",
+ "cancelled 956\n",
+ "thanks 921\n",
+ "service 910\n",
+ "just 801\n",
+ "customer 726\n",
+ "time 695\n",
+ "help 687\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "second_dtm.sum().sort_values(ascending=False).head(10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3e8b5145-d505-4e36-9a39-a40d25d8ec6f",
+ "metadata": {
+ "id": "3e8b5145-d505-4e36-9a39-a40d25d8ec6f"
+ },
+ "source": [
+ "## 🥊 Challenge 2: Lemmatize the Text Input\n",
+ "\n",
+ "Recall from Part 1 that we introduced using `spaCy` to perform lemmatization, i.e., to \"recover\" the base form of a word. This process will reduce vocabulary size by keeping word variations minimal; a smaller vocabulary may help improve model performance in sentiment classification.\n",
+ "\n",
+ "Now let's implement lemmatization on our tweet data and use the lemmatized text to create a third DTM.\n",
+ "\n",
+ "Complete the function `lemmatize_text`. It requires a text input and returns the lemmas of all tokens.\n",
+ "\n",
+ "Here are some hints to guide you through this challenge:\n",
+ "\n",
+ "- Step 1: initialize a list to hold lemmas\n",
+ "- Step 2: apply the `nlp` pipeline to the input text\n",
+ "- Step 3: iterate over tokens in the processed text and retrieve the lemma of the token\n",
+ " - HINT: lemmatization is one of the linguistic annotations that the `nlp` pipeline automatically does for us. We can use `token.lemma_` to access the annotation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*`nlp = spacy.load('en_core_web_sm')` loads a small English language model called `en_core_web_sm`, which lets spaCy recognize words, sentences, parts of speech, and other features of the text. The resulting `nlp` object can be used to analyze and process the tweets in more advanced ways.*"
+ ],
+ "metadata": {
+ "id": "oc1FLlG7h2Oi"
+ },
+ "id": "oc1FLlG7h2Oi"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "da610560-62c3-48ab-a1b2-25e0b589bc61",
+ "metadata": {
+ "id": "da610560-62c3-48ab-a1b2-25e0b589bc61"
+ },
+ "outputs": [],
+ "source": [
+ "# Import spaCy\n",
+ "import spacy\n",
+ "nlp = spacy.load('en_core_web_sm')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*The function `lemmatize_text(text)` transforms each word into its base form, or lemma. First, an empty list called `lemma` is initialized to hold the lemma of each word. Next, the spaCy pipeline is applied to the input text (`doc = nlp(text)`) to analyze it. Then each token in `doc` is visited and its lemma is extracted and appended to the list. Finally, all lemmas are joined into a single string with `' '.join(lemma)`, and the lemmatized text is returned, ready for further analysis.*"
+ ],
+ "metadata": {
+ "id": "nYmmg3bYiGan"
+ },
+ "id": "nYmmg3bYiGan"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "98ead266-30f3-48ad-bc51-c1685487f000",
+ "metadata": {
+ "scrolled": true,
+ "id": "98ead266-30f3-48ad-bc51-c1685487f000"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a function to lemmatize text\n",
+ "def lemmatize_text(text):\n",
+ "    '''Lemmatize the text input with spaCy annotations.'''\n",
+ "\n",
+ "    # Step 1: Initialize an empty list to hold lemmas\n",
+ "    lemma = ...\n",
+ "\n",
+ "    # Step 2: Apply the nlp pipeline to input text\n",
+ "    doc = ...\n",
+ "\n",
+ "    # Step 3: Iterate over tokens in the text to get the token lemma\n",
+ "    for token in doc:\n",
+ "        lemma.append(...)\n",
+ "\n",
+ "    # Step 4: Join lemmas together into a single string\n",
+ "    text_lemma = ' '.join(lemma)\n",
+ "\n",
+ "    return text_lemma"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf36aab6-35dd-42a2-9b38-b7c432f021c6",
+ "metadata": {
+ "id": "cf36aab6-35dd-42a2-9b38-b7c432f021c6"
+ },
+ "source": [
+ "Let's apply the function to the following example tweet first!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This code block shows what a tweet looks like after preprocessing and lemmatization. First the cleaned text of the tweet is printed, then a separator line to make the output easier to read, and finally the same text after applying the lemmatization function, which converts each word to its base form. This makes it easy to compare the text before and after lemmatization.*"
+ ],
+ "metadata": {
+ "id": "JyMbnArsiY0a"
+ },
+ "id": "JyMbnArsiY0a"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "742e82bb-5c42-4fa8-9101-5a0ea908db25",
+ "metadata": {
+ "id": "742e82bb-5c42-4fa8-9101-5a0ea908db25",
+ "outputId": "25800351-d172-47ba-9ed7-dbd8491ac726"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "USER wow this just blew my mind\n",
+ "==================================================\n",
+ "USER wow this just blow my mind\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Apply the function to an example tweet\n",
+ "print(tweets.iloc[33][\"text_processed\"])\n",
+ "print(f\"{'='*50}\")\n",
+ "print(lemmatize_text(tweets.iloc[33]['text_processed']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbeda987-dc32-4979-b158-c24be7d1a420",
+ "metadata": {
+ "id": "bbeda987-dc32-4979-b158-c24be7d1a420"
+ },
+ "source": [
+ "And then let's lemmatize the tweet data and save the output to a new column `text_lemmatized`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1ac128d2-1be5-4ef5-bb50-5b8d44ef8ee9",
+ "metadata": {
+ "id": "1ac128d2-1be5-4ef5-bb50-5b8d44ef8ee9"
+ },
+ "outputs": [],
+ "source": [
+ "# This may take a while!\n",
+ "tweets['text_lemmatized'] = tweets['text_processed'].apply(lemmatize_text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2c02aad6-4e71-4afc-80cf-31d4f39498b2",
+ "metadata": {
+ "id": "2c02aad6-4e71-4afc-80cf-31d4f39498b2"
+ },
+ "source": [
+ "Now with the `text_lemmatized` column, let's create a third DTM. The parameter settings are the same as for the second DTM."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5f49d790-3c9d-4dc1-a5c9-72c306630412",
+ "metadata": {
+ "id": "5f49d790-3c9d-4dc1-a5c9-72c306630412",
+ "outputId": "af1310d9-518d-4815-d4c9-b2f775786583"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>aa</th><th>aadv</th><th>aadvantage</th><th>aal</th><th>abandon</th><th>abc</th><th>ability</th><th>able</th><th>aboard</th><th>abq</th><th>...</th><th>yummy</th><th>yup</th><th>yvonne</th><th>yvr</th><th>yyj</th><th>yyz</th><th>zero</th><th>zone</th><th>zoom</th><th>zurich</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>1</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "    <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>5 rows × 3553 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " aa aadv aadvantage aal abandon abc ability able aboard abq ... \\\n",
+ "0 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "1 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "2 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "3 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "4 0 0 0 0 0 0 0 0 0 0 ... \n",
+ "\n",
+ " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
+ "0 0 0 0 0 0 0 0 0 0 0 \n",
+ "1 0 0 0 0 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 0 0 0 0 0 \n",
+ "\n",
+ "[5 rows x 3553 columns]"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Create the vectorizer (the same param setting as previous)\n",
+ "vectorizer = CountVectorizer(lowercase=True,\n",
+ " stop_words='english',\n",
+ " min_df=2,\n",
+ " max_df=0.95,\n",
+ " max_features=None)\n",
+ "\n",
+ "# Fit, transform, and get tokens\n",
+ "counts = vectorizer.fit_transform(tweets['text_lemmatized'])\n",
+ "tokens = vectorizer.get_feature_names_out()\n",
+ "\n",
+ "# Create the third DTM\n",
+ "third_dtm = pd.DataFrame(data=counts.todense(),\n",
+ " index=tweets.index,\n",
+ " columns=tokens)\n",
+ "third_dtm.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9859eb04-dbd2-4fa0-9798-65ed7496c297",
+ "metadata": {
+ "id": "9859eb04-dbd2-4fa0-9798-65ed7496c297",
+ "outputId": "02b8a80e-93ed-4159-8f02-d20f3357f614"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(11541, 8751)\n",
+ "(11541, 4471)\n",
+ "(11541, 3553)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Print the shapes of three DTMs\n",
+ "print(first_dtm.shape)\n",
+ "print(second_dtm.shape)\n",
+ "print(third_dtm.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fa94c8ac-e4f4-4b76-afdb-1d4af54a3eee",
+ "metadata": {
+ "id": "fa94c8ac-e4f4-4b76-afdb-1d4af54a3eee"
+ },
+ "source": [
+ "Let's print the top 10 most frequent tokens as usual. These tokens are now lemmas, so inflected forms are merged into one token (for example, \"thanks\" becomes \"thank\") and the counts change accordingly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5745ca29-97ed-4fe1-81db-7e402c8da674",
+ "metadata": {
+ "id": "5745ca29-97ed-4fe1-81db-7e402c8da674",
+ "outputId": "2a9f8f6e-4543-4bcb-ab88-bc509010acf0"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "digit 6927\n",
+ "flight 4043\n",
+ "hashtag 2633\n",
+ "thank 1455\n",
+ "hour 1134\n",
+ "cancel 948\n",
+ "delay 937\n",
+ "service 937\n",
+ "customer 902\n",
+ "time 856\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the most frequent tokens in the third DTM\n",
+ "third_dtm.sum().sort_values(ascending=False).head(10)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "16c63e6a-50c3-448a-9a56-a1d193cd6680",
+ "metadata": {
+ "id": "16c63e6a-50c3-448a-9a56-a1d193cd6680",
+ "outputId": "cb11c807-409c-475f-f4d2-d848329ffce7"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "digit 6927\n",
+ "flight 3320\n",
+ "hashtag 2633\n",
+ "cancelled 956\n",
+ "thanks 921\n",
+ "service 910\n",
+ "just 801\n",
+ "customer 726\n",
+ "time 695\n",
+ "help 687\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Compared to the most frequent tokens in the second DTM\n",
+ "second_dtm.sum().sort_values(ascending=False).head(10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38363398-fdf5-456b-ae3d-cae9d5294140",
+ "metadata": {
+ "id": "38363398-fdf5-456b-ae3d-cae9d5294140"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "# Term Frequency-Inverse Document Frequency\n",
+ "\n",
+ "So far, we're relying on word frequency to give us information about a document. This assumes that if a word appears more often in a document, it's more informative. However, this may not always be the case. For example, we've already removed stop words because they are not informative, despite the fact that they appear many times in a document. We also know the word \"flight\" is among the most frequent words, but it is not that informative, because it appears in many documents. Since we're looking at airline tweets, we shouldn't be surprised to see the word \"flight\"!\n",
+ "\n",
+ "To remedy this, we use a weighting scheme called **tf-idf (term frequency-inverse document frequency)**. The big idea behind tf-idf is to weight a word not just by its frequency within a document, but also by its frequency in one document relative to the remaining documents. So, when we construct the DTM, we will be assigning each term a **tf-idf score**. Specifically, term $t$ in document $d$ is assigned a tf-idf score as follows:\n",
+ "\n",
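+ "$$\n",
+ "\\text{tf-idf}(t, d) = \\text{tf}(t, d) \\times \\text{idf}(t), \\qquad \\text{idf}(t) = \\log \\frac{N}{\\text{df}(t)} + 1\n",
+ "$$\n",
+ "\n",
+ "Here $\\text{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $\\text{df}(t)$ is the number of documents that contain $t$.\n",
+ "\n",
+ "In essence, the tf-idf score of a word in a document is the product of two components: **term frequency (tf)** and **inverse document frequency (idf)**. The idf acts as a scaling factor. If a word occurs in all documents, then idf equals 1 and no scaling happens. For a word that occurs in fewer documents, idf is greater than 1, which raises the tf-idf score and highlights the word as informative. In practice, we add 1 to both the numerator and the denominator of the fraction (\"add-one smoothing\") to prevent any issues with zero occurrences.\n",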
+ "\n",
+ "We can also create a tf-idf DTM using `sklearn`. We'll use a `TfidfVectorizer` this time:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This line of code imports TfidfVectorizer, a tool that converts text into numbers using the TF-IDF technique, which measures how important each word is within a collection of documents. This makes it easier for machine learning models to process and analyze the text effectively.*"
+ ],
+ "metadata": {
+ "id": "RCYzWtd6isU9"
+ },
+ "id": "RCYzWtd6isU9"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f5e32d8a-c42d-475f-aab4-21eca8b1aee8",
+ "metadata": {
+ "id": "f5e32d8a-c42d-475f-aab4-21eca8b1aee8"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_extraction.text import TfidfVectorizer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d23916c1-5693-456c-b71d-6d9d78d1e2e4",
+ "metadata": {
+ "id": "d23916c1-5693-456c-b71d-6d9d78d1e2e4"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a tfidf vectorizer\n",
+ "vectorizer = TfidfVectorizer(lowercase=True,\n",
+ " stop_words='english',\n",
+ " min_df=2,\n",
+ " max_df=0.95,\n",
+ " max_features=None)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7af5b342-ab18-4766-9561-e38e50cd1e9b",
+ "metadata": {
+ "id": "7af5b342-ab18-4766-9561-e38e50cd1e9b",
+ "outputId": "be9232f3-cb80-4d0f-fd62-7cfd7c59efbe"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<11541x3553 sparse matrix of type '<class 'numpy.float64'>'\n",
+ "\twith 88287 stored elements in Compressed Sparse Row format>"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Fit and transform\n",
+ "tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])\n",
+ "tf_dtm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "55e509c8-5402-4be0-9143-0e448fff7066",
+ "metadata": {
+ "id": "55e509c8-5402-4be0-9143-0e448fff7066",
+ "outputId": "0277c92c-66ca-4e99-dbdc-a2addb2a7f7c"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " aa aadv aadvantage aal abandon abc ability able aboard abq ... \\\n",
+ "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
+ "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
+ "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
+ "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
+ "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n",
+ "\n",
+ " yummy yup yvonne yvr yyj yyz zero zone zoom zurich \n",
+ "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "\n",
+ "[5 rows x 3553 columns]"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Create a tf-idf dataframe\n",
+ "tfidf = pd.DataFrame(tf_dtm.todense(),\n",
+ " columns=vectorizer.get_feature_names_out(),\n",
+ " index=tweets.index)\n",
+ "tfidf.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45ba13ea-c429-4ff1-a9a2-abf27c4d0888",
+ "metadata": {
+ "id": "45ba13ea-c429-4ff1-a9a2-abf27c4d0888"
+ },
+ "source": [
+ "You may have noticed that the vocabulary size is the same as we saw in Challenge 2. This is because we used the same parameter settings when creating the vectorizer. But the values in the matrix are different: they are tf-idf scores instead of raw counts."
+ ]
+ },
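To see where such scores come from, here is a minimal pure-Python sketch of computing tf-idf by hand. It assumes raw counts for tf, add-one smoothed idf with a natural log, and L2 normalization of each document vector, which to our understanding mirrors the defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The toy corpus and the helper names `smoothed_idf` and `tfidf_vector` are hypothetical and not part of the workshop code.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized "tweets"
docs = [
    ["flight", "delay", "delay"],
    ["flight", "thank"],
    ["flight", "cancel"],
]

def smoothed_idf(term, docs):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1, i.e. add-one smoothing."""
    n_docs = len(docs)
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc, docs):
    """Return {term: tf-idf} for one document, scaled to unit L2 norm."""
    counts = Counter(doc)  # raw term frequencies
    raw = {t: tf * smoothed_idf(t, docs) for t, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

idf_flight = smoothed_idf("flight", docs)  # appears in every document
idf_delay = smoothed_idf("delay", docs)    # appears in only one document
first_doc = tfidf_vector(docs[0], docs)

print(f"idf('flight') = {idf_flight:.3f}")
print(f"idf('delay')  = {idf_delay:.3f}")
print({t: round(v, 3) for t, v in first_doc.items()})
```

Because "flight" appears in every toy document, its idf is exactly 1, while the rarer "delay" receives a larger weight. Real vectorizer output may still differ from a hand computation like this one because of tokenization and stop-word removal.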
+ {
+ "cell_type": "markdown",
+ "id": "fa58c360-5c55-4fa0-8c55-1f00e68baa9a",
+ "metadata": {
+ "id": "fa58c360-5c55-4fa0-8c55-1f00e68baa9a"
+ },
+ "source": [
+ "## Interpret TF-IDF Values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bdad233d-ebc1-420f-9b67-c227c48f3e60",
+ "metadata": {
+ "id": "bdad233d-ebc1-420f-9b67-c227c48f3e60"
+ },
+ "source": [
+ "Let's take a look at the document in which each term has its highest tf-idf value. We'll use the `.idxmax()` method to find the index."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "995b511a-d448-4cfb-a6a0-22a465efd8a8",
+ "metadata": {
+ "id": "995b511a-d448-4cfb-a6a0-22a465efd8a8",
+ "outputId": "cca8c541-474f-47f4-ad72-cc2041cff708"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "aa 10077\n",
+ "aadv 9285\n",
+ "aadvantage 9974\n",
+ "aal 10630\n",
+ "abandon 7859\n",
+ " ... \n",
+ "yyz 1350\n",
+ "zero 2705\n",
+ "zone 3177\n",
+ "zoom 3920\n",
+ "zurich 10622\n",
+ "Length: 3553, dtype: int64"
+ ]
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Retrieve the index of the document\n",
+ "tfidf.idxmax()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fccc0249-7c68-42ee-8290-ff41715e346b",
+ "metadata": {
+ "id": "fccc0249-7c68-42ee-8290-ff41715e346b"
+ },
+ "source": [
+ "For example, the term \"worst\" has its highest tf-idf score in the tweet at index 918."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "09b222fb-ad8c-4767-a974-dd261370a06e",
+ "metadata": {
+ "id": "09b222fb-ad8c-4767-a974-dd261370a06e",
+ "outputId": "7becaa64-4264-4bb5-cb23-dd1e613925c4"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "918"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tfidf.idxmax()['worst']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "955a48bc-dc93-481b-ba49-29876fc577fb",
+ "metadata": {
+ "id": "955a48bc-dc93-481b-ba49-29876fc577fb"
+ },
+ "source": [
+ "Recall that this is the tweet where the word \"worst\" appears six times!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "079ee0e0-476f-4236-ba8a-615ba7a0efe8",
+ "metadata": {
+ "id": "079ee0e0-476f-4236-ba8a-615ba7a0efe8",
+ "outputId": "05bf355c-82be-4b01-f7d6-07536e6ae1c6"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"USER is the worst. worst reservation policies. worst costumer service. worst worst worst. congrats, USER you're not that bad!\""
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tweets['text_processed'].loc[918]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9dd06bbc-e2fc-49e4-9354-efdaca5cfbd3",
+ "metadata": {
+ "id": "9dd06bbc-e2fc-49e4-9354-efdaca5cfbd3"
+ },
+ "source": [
+ "How about \"cancel\"? Let's take a look at another example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f809df1a-1178-4272-a415-42edb20173b2",
+ "metadata": {
+ "id": "f809df1a-1178-4272-a415-42edb20173b2",
+ "outputId": "3c7739f9-df94-429a-e62e-f4a1f550d019"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5945"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tfidf.idxmax()['cancel']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8093b6a7-54ca-468a-9376-b3c0be0b6f9b",
+ "metadata": {
+ "id": "8093b6a7-54ca-468a-9376-b3c0be0b6f9b",
+ "outputId": "d82f97e5-52cb-417b-f3e9-23b23cf7ccab"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'USER cancelled flighted 😢'"
+ ]
+ },
+ "execution_count": 53,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tweets['text_processed'].loc[5945]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "163dcecd-dc8c-43a9-952d-5bc84a307b07",
+ "metadata": {
+ "id": "163dcecd-dc8c-43a9-952d-5bc84a307b07"
+ },
+ "source": [
+ "## 🥊 Challenge 3: Words with Highest Mean TF-IDF Scores\n",
+ "\n",
+ "We have obtained tf-idf values for each term in each document. But what do these values tell us about the sentiments of tweets? Are there any words that are particularly informative for positive/negative tweets?\n",
+ "\n",
+ "To explore this, let's gather the indices of all positive/negative tweets and calculate the mean tf-idf scores of the words that appear in each category.\n",
+ "\n",
+ "We've provided the following starter code to guide you:\n",
+ "- Subset the `tweets` dataframe according to the `airline_sentiment` label and retrieve the index of each subset (`.index`). Assign the index to `positive_index` or `negative_index`.\n",
+ "- For each subset:\n",
+ " - Retrieve the tf-idf representation\n",
+ " - Take the mean tf-idf values across the subset using `.mean()`\n",
+ " - Sort the mean values in descending order using `.sort_values()`\n",
+ " - Get the top 10 terms using `.head()`\n",
+ "\n",
+ "Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2bfbf838-9ff6-48b8-ad5d-5e75304fe060",
+ "metadata": {
+ "id": "2bfbf838-9ff6-48b8-ad5d-5e75304fe060"
+ },
+ "outputs": [],
+ "source": [
+ "# Complete the boolean masks\n",
+ "positive_index = tweets[...].index\n",
+ "negative_index = tweets[...].index"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8c67ea1f-de9e-49a9-94f2-a3351446e364",
+ "metadata": {
+ "id": "8c67ea1f-de9e-49a9-94f2-a3351446e364"
+ },
+ "outputs": [],
+ "source": [
+ "# Complete the following two lines\n",
+ "pos = tfidf.loc[...].mean().sort_values(...).head(...)\n",
+ "neg = tfidf.loc[...].mean().sort_values(...).head(...)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f1e29043-8c78-4e41-81d2-b4552030b457",
+ "metadata": {
+ "id": "f1e29043-8c78-4e41-81d2-b4552030b457"
+ },
+ "outputs": [],
+ "source": [
+ "pos.plot(kind='barh',\n",
+ " xlim=(0, 0.18),\n",
+ " color='cornflowerblue',\n",
+ " title='Top 10 terms with the highest mean tf-idf values for positive tweets');"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e8b25940-2372-4755-818e-f75e4d23daf9",
+ "metadata": {
+ "id": "e8b25940-2372-4755-818e-f75e4d23daf9"
+ },
+ "outputs": [],
+ "source": [
+ "neg.plot(kind='barh',\n",
+ " xlim=(0, 0.18),\n",
+ " color='darksalmon',\n",
+ " title='Top 10 terms with the highest mean tf-idf values for negative tweets');"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "77bca876-9649-46f3-bd4f-f9f68fea649a",
+ "metadata": {
+ "id": "77bca876-9649-46f3-bd4f-f9f68fea649a"
+ },
+ "source": [
+ "🔔 **Question**: How would you interpret these results? Share your thoughts in the chat!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da410cb3-a452-441b-a94d-8f751d59d7a6",
+ "metadata": {
+ "id": "da410cb3-a452-441b-a94d-8f751d59d7a6"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## 🎬 **Demo**: Sentiment Classification Using the TF-IDF Representation\n",
+ "\n",
+ "Now that we have a tf-idf representation of the text, we are ready to do sentiment analysis!\n",
+ "\n",
+ "In this demo, we will use a logistic regression model to perform the classification task. Here we briefly step through how logistic regression works as one of the supervised Machine Learning methods, but feel free to explore our workshop on [Python Machine Learning Fundamentals](https://github.com/dlab-berkeley/Python-Machine-Learning) if you want to learn more about it.\n",
+ "\n",
+ "Logistic regression is a linear model that we use to predict the label of a tweet based on a set of features ($x_1, x_2, \\ldots, x_T$), as shown below:\n",
+ "\n",
+ "$$\n",
+ "L = \\beta_1 x_1 + \\beta_2 x_2 + \\cdots + \\beta_T x_T\n",
+ "$$\n",
+ "\n",
+ "The list of features we'll pass to the model is the vocabulary of the DTM. We feed the model a portion of the data, known as the training set, along with other model specifications, to learn the coefficient ($\\beta_1, \\beta_2, \\ldots, \\beta_T$) of each feature. The coefficients tell us whether a feature contributes positively or negatively to the predicted value. The predicted value is the sum of all features, each multiplied by its coefficient. It is then passed to a [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) to be converted into a probability $p$, which tells us whether the predicted label is positive (when $p>0.5$) or negative (when $p<0.5$).\n",
+ "\n",
+ "The remaining portion of the data, known as the test set, is used to test whether the learned coefficients could be generalized to unseen data.\n",
+ "\n",
+ "Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!"
+ ]
+ },
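Before specifying the model, the scoring step described above can be sketched in a few lines of plain Python: compute the weighted sum of the features, pass it through the sigmoid, and threshold the probability at 0.5. The coefficients, the tf-idf values, and the helper names `sigmoid` and `predict_proba` are hypothetical and chosen only for illustration. The optional `intercept` term is an assumption that does not appear in the equation above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefs, intercept=0.0):
    """Linear score L = intercept + sum(beta_j * x_j), passed through the sigmoid."""
    score = intercept + sum(coefs[t] * x for t, x in features.items() if t in coefs)
    return sigmoid(score)

# Hypothetical coefficients (one per token) and tf-idf values for two tweets
coefs = {"thank": 2.0, "delay": -1.5, "flight": 0.1}
tweet_a = {"thank": 0.8, "flight": 0.3}
tweet_b = {"delay": 0.9, "flight": 0.4}

for name, tweet in [("tweet_a", tweet_a), ("tweet_b", tweet_b)]:
    p = predict_proba(tweet, coefs)
    label = "positive" if p > 0.5 else "negative"
    print(f"{name}: p = {p:.3f} -> {label}")
```

A token with a positive coefficient pushes the probability above 0.5, and a token with a negative coefficient pushes it below. This is why the sign and size of the fitted coefficients are worth inspecting after training.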
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*These lines of code import two tools from scikit-learn: one to build a logistic regression model with cross-validation (LogisticRegressionCV) and another to split the data into training and test sets (train_test_split), which lets us train and evaluate the model properly.*"
+ ],
+ "metadata": {
+ "id": "All0kUqFjKkY"
+ },
+ "id": "All0kUqFjKkY"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33413d63-87eb-489f-b374-3cfeaa51cf3c",
+ "metadata": {
+ "id": "33413d63-87eb-489f-b374-3cfeaa51cf3c"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import LogisticRegressionCV\n",
+ "from sklearn.model_selection import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee87ff74-3fbb-472a-b795-6f4d18fab215",
+ "metadata": {
+ "id": "ee87ff74-3fbb-472a-b795-6f4d18fab215"
+ },
+ "source": [
+ "We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This block of code separates the data into two parts: the tweet features (X) and the sentiments we want to predict (y). It then splits everything into training and test sets, using 15% of the data to test the model and the rest to train it, so that its performance can be evaluated reliably.*"
+ ],
+ "metadata": {
+ "id": "bWmC1zmxjXoK"
+ },
+ "id": "bWmC1zmxjXoK"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "64cec8b9-14d9-4897-9c02-cc89fcf7b3c6",
+ "metadata": {
+ "id": "64cec8b9-14d9-4897-9c02-cc89fcf7b3c6"
+ },
+ "outputs": [],
+ "source": [
+ "# Train-test split\n",
+ "X = tfidf\n",
+ "y = tweets['airline_sentiment']\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)"
+ ]
+ },
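Conceptually, a train-test split shuffles the rows and holds out a fraction of them. The sketch below illustrates the idea in plain Python. `simple_train_test_split` is a hypothetical helper, not the `sklearn` implementation, and the fixed `seed` is an assumption added for reproducibility. Note that the call above does not set `random_state`, so the exact split, and therefore the reported accuracies, will vary from run to run.

```python
import random

def simple_train_test_split(rows, test_size=0.15, seed=42):
    """Shuffle the rows with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_test = round(test_size * len(rows))  # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

# Hypothetical data: 20 numbered "tweets"
rows = [f"tweet_{i}" for i in range(20)]
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and rerunning with the same seed reproduces the same split.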
+ {
+ "cell_type": "markdown",
+ "id": "066771d8-2f31-4646-9a1b-6d2b1b9b208c",
+ "metadata": {
+ "id": "066771d8-2f31-4646-9a1b-6d2b1b9b208c"
+ },
+ "source": [
+ "The `fit_logistic_regression` function is written below to streamline the training process."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*This function trains a logistic regression model on the data passed to it. It automatically tunes its parameters to improve performance and returns the fitted model, ready to make predictions.*"
+ ],
+ "metadata": {
+ "id": "IYoQQ-fbjhWY"
+ },
+ "id": "IYoQQ-fbjhWY"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d46de0b2-af00-4a1d-b4cd-31b96ce545d1",
+ "metadata": {
+ "id": "d46de0b2-af00-4a1d-b4cd-31b96ce545d1"
+ },
+ "outputs": [],
+ "source": [
+ "def fit_logistic_regression(X, y):\n",
+ " '''Fits a logistic regression model to provided data.'''\n",
+ " model = LogisticRegressionCV(Cs=10,\n",
+ " penalty='l1',\n",
+ " cv=5,\n",
+ " solver='liblinear',\n",
+ " class_weight='balanced',\n",
+ " random_state=42,\n",
+ " refit=True).fit(X, y)\n",
+ " return model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "124aa7ea-1bc1-43e2-beeb-0ba2da9b2df9",
+ "metadata": {
+ "id": "124aa7ea-1bc1-43e2-beeb-0ba2da9b2df9"
+ },
+ "source": [
+ "We'll fit the model and compute the training and test accuracy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "773963bd-6603-4fad-884b-09ce60afab18",
+ "metadata": {
+ "id": "773963bd-6603-4fad-884b-09ce60afab18"
+ },
+ "outputs": [],
+ "source": [
+ "# Fit the logistic regression model\n",
+ "model = fit_logistic_regression(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e10d06c1-d884-45d4-a03d-dd5d40bf70aa",
+ "metadata": {
+ "id": "e10d06c1-d884-45d4-a03d-dd5d40bf70aa",
+ "outputId": "1ff541ca-c3e9-4791-9f0c-088eef6a9eed"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training accuracy: 0.9455601998164951\n",
+ "Test accuracy: 0.894919168591224\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Get the training and test accuracy\n",
+ "print(f\"Training accuracy: {model.score(X_train, y_train)}\")\n",
+ "print(f\"Test accuracy: {model.score(X_test, y_test)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d4e186c5-1719-4deb-bdb4-614a9980f058",
+ "metadata": {
+ "id": "d4e186c5-1719-4deb-bdb4-614a9980f058"
+ },
+ "source": [
+ "The model achieved ~95% accuracy on the training set and ~89% on the test set, which is pretty good! The gap of about five percentage points suggests slight overfitting, but the model generalizes reasonably well to the test data."
+ ]
+ },
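An accuracy score is easier to interpret next to a simple baseline, especially when one class is more common than the other. The sketch below compares a model's accuracy with the accuracy of always predicting the most frequent training label. The labels and the helper names `accuracy` and `majority_baseline` are hypothetical. They are not taken from the airline tweets data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Hypothetical labels, chosen only for illustration
y_train = ["negative"] * 16 + ["positive"] * 4
y_test = ["negative"] * 8 + ["positive"] * 2
y_pred = ["negative"] * 7 + ["positive"] * 3  # one negative tweet is misclassified

model_acc = accuracy(y_test, y_pred)
baseline_acc = majority_baseline(y_train, y_test)
print(f"model: {model_acc:.2f}, baseline: {baseline_acc:.2f}")
```

If the model's test accuracy is only slightly above the baseline, most of the score comes from the class imbalance and not from what the model has learned.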
+ {
+ "cell_type": "markdown",
+ "id": "310dac39-4753-4ae8-8dfa-e65e5824cccb",
+ "metadata": {
+ "id": "310dac39-4753-4ae8-8dfa-e65e5824cccb"
+ },
+ "source": [
+ "Next, let's take a look at the fitted coefficients to check whether they make sense.\n",
+ "\n",
+ "We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6dcb6ef1-13b3-437e-813c-7118911847a4",
+ "metadata": {
+ "id": "6dcb6ef1-13b3-437e-813c-7118911847a4"
+ },
+ "outputs": [],
+ "source": [
+ "# Get coefs of all features\n",
+ "coefs = model.coef_.ravel()\n",
+ "\n",
+ "# Get all tokens\n",
+ "tokens = vectorizer.get_feature_names_out()\n",
+ "\n",
+ "# Create a token-coef dataframe\n",
+ "importance = pd.DataFrame()\n",
+ "importance['token'] = tokens\n",
+ "importance['coefs'] = coefs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3e63814e-9c0d-4f7a-a5e0-72cca2758d71",
+ "metadata": {
+ "id": "3e63814e-9c0d-4f7a-a5e0-72cca2758d71",
+ "outputId": "cfad72bc-f484-4226-d4dd-3ffc498c3e54"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " token coefs\n",
+ "3165 thankful 8.002975\n",
+ "1091 exceptional 8.136278\n",
+ "1563 impressed 8.501364\n",
+ "648 compliment 8.981360\n",
+ "1373 great 9.080558\n",
+ "3498 wonderful 9.401606\n",
+ "1089 excellent 10.147230\n",
+ "250 awesome 10.315909\n",
+ "1746 kudo 11.623828\n",
+ "3164 thank 16.027534"
+ ]
+ },
+ "execution_count": 62,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the top 10 tokens with highest coefs\n",
+ "pos_coef = importance.sort_values('coefs').tail(10)\n",
+ "pos_coef"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b3b7893-caa0-4281-98f0-92c9e7b31953",
+ "metadata": {
+ "id": "7b3b7893-caa0-4281-98f0-92c9e7b31953"
+ },
+ "source": [
+ "Let's plot the top 10 tokens with the highest/lowest coefficients."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "17b1223b-e5c1-4992-bb7e-0a99651c3729",
+ "metadata": {
+ "id": "17b1223b-e5c1-4992-bb7e-0a99651c3729",
+ "outputId": "265aeff7-4356-442c-daeb-366fc93a5050"
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Plot the top 10 tokens that have the highest coefs\n",
+ "pos_coef.sort_values('coefs', ascending=False) \\\n",
+ " .plot(kind='barh',\n",
+ " xlim=(0, 18),\n",
+ " x='token',\n",
+ " color='cornflowerblue',\n",
+ " title='Top 10 tokens with highest coefficient values');"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "159e00c6-8a9f-484f-aea2-853fd5512083",
+ "metadata": {
+ "id": "159e00c6-8a9f-484f-aea2-853fd5512083",
+ "outputId": "98ff22a3-1474-4cfe-8979-99a4efd966ed"
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Get the 10 tokens with the lowest coefs, then plot them\n",
+ "neg_coef = importance.sort_values('coefs').head(10)\n",
+ "neg_coef.plot(kind='barh',\n",
+ " xlim=(0, -18),\n",
+ " x='token',\n",
+ " color='darksalmon',\n",
+ " title='Top 10 tokens with lowest coefficient values');"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2eed48ea-fd35-4585-9b98-90456aaee447",
+ "metadata": {
+ "id": "2eed48ea-fd35-4585-9b98-90456aaee447"
+ },
+ "source": [
+ "Words like \"ruin,\" \"rude,\" and \"hour\" are strong indicators of negative sentiment, while \"thank,\" \"awesome,\" and \"wonderful\" are associated with positive sentiment.\n",
+ "\n",
+ "We will wrap up Part 2 with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4430fbd-108f-4a02-ab64-ef36c5949e56",
+ "metadata": {
+ "id": "a4430fbd-108f-4a02-ab64-ef36c5949e56"
+ },
+ "source": [
+ "## ❗ Key Points\n",
+ "\n",
+ "* A Bag-of-Words representation is a simple method to transform our text data to numbers. It focuses on word frequency but not word order.\n",
+ "* A TF-IDF representation goes a step further; it also considers whether a certain word appears distinctively in one document or occurs uniformly across all documents.\n",
+ "* With a numerical representation, we can perform a range of text classification tasks, such as sentiment analysis."
+ ]
}
- ],
- "source": [
- "# Plot the top 10 tokens that have the highest coefs\n",
- "pos_coef.sort_values('coefs', ascending=False) \\\n",
- " .plot(kind='barh', \n",
- " xlim=(0, 18),\n",
- " x='token',\n",
- " color='cornflowerblue',\n",
- " title='Top 10 tokens with highest coeffient values');"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 64,
- "id": "159e00c6-8a9f-484f-aea2-853fd5512083",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.4"
+ },
+ "colab": {
+ "provenance": []
}
- ],
- "source": [
- "# Plot the top 10 tokens that have the lowest coefs\n",
- "neg_coef.plot(kind='barh', \n",
- " xlim=(0, -18),\n",
- " x='token',\n",
- " color='darksalmon',\n",
- " title='Top 10 tokens with lowest coeffient values');"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2eed48ea-fd35-4585-9b98-90456aaee447",
- "metadata": {},
- "source": [
- "Words like \"ruin,\" \"rude,\" and \"hour\" are strong indicators of negative sentiment, while \"thank,\" \"awesome,\" and \"wonderful\" are associated with positive sentiment. \n",
- "\n",
- "We will wrap up Part 2 with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a4430fbd-108f-4a02-ab64-ef36c5949e56",
- "metadata": {},
- "source": [
- "
\n",
- "\n",
- "## ❗ Key Points\n",
- "\n",
- "* A Bag-of-Words representation is a simple method to transform our text data to numbers. It focuses on word frequency but not word order. \n",
- "* A TF-IDF representation is a step further; it also considers if a certain word distinctively appears in one document or occurs uniformally across all documents. \n",
- "* With a numerical representation, we can perform a range of text classification task, such as sentiment analysis. \n",
- "\n",
- "