Remixing Writing Pedagogies with Writing Code and Data through Exploratory Data Analysis¶
CCCC 2025 | Computer Love | Session 2151 | Room 350
- Time: 04/10/2025 4:45PM - 6:00PM EDT
- Abstract: What is our discipline’s role in teaching and advocating for critical approaches to writing data? The facilitator and attendees will work through a computational notebook together that conducts exploratory data analysis. This coding and data process is meant to facilitate discussion to learn more about what writing and technical communication can bring to data practices.
The goal will be to learn more about the problems and potential for writing and technical and professional communication (TPC) scholars and teachers to teach coding and data practices, such as machine learning and data modeling, with a humanities-driven, advocacy approach.
- Who Should Come?: Novices and experts alike are welcome and encouraged to attend. No previous coding experience is required, since this ELE is meant to facilitate a discussion about the role that writing and TPC play pedagogically beyond using (or not using) generative AI tools.
- Accompanying Google Slides
- License: CC BY-NC-SA
Citation¶
- Lindgren, C. A. (10 Apr. 2025). Remixing Writing Pedagogies with Writing Code and Data through Exploratory Data Analysis. NCTE CCCC 2025. Baltimore, MD, United States. https://doi.org/10.5281/zenodo.15177209
Chapter 4. Text Classification with Logistic Regression¶
Our first machine-learning challenge is to build a prediction model that assigns news articles to the appropriate category from a set of 31 categories: politics, entertainment, etc. The model uses logistic regression techniques in Python on a dataset with headlines, short descriptions, and URLs.
NOTE: Be sure to watch and read the materials posted in the Canvas module before and while you work through this notebook.
Learning Objectives
- Import and run EDA techniques to understand the potential limits and affordances of the dataset with our ML goal in mind.
- Learn about the basic mechanics of logistic regression (LR).
- Apply LR to this text classification goal of categorizing the news genre of articles based on potential "features" in the data, such as the article's headline, short description, and URL.
Sources
- Notebook modified from Ganesan's LR example exercise: Text Classification with Logistic Regression
- Dataset: HuffPost News Headlines & Categories in ../data/news_category_dataset.json
Import Libraries¶
# You may need to install some of the libraries below
# If so, uncomment any of the below commands
# %pip install pandas==2.1.4
# %pip install matplotlib==3.10.0
# %pip install scipy==1.13.1
# %pip install seaborn==0.13.0
# %pip install wquantiles==0.6
# %pip install statsmodels==0.14.4
# %pip install scikit-learn==1.6.0
# %pip install mplcyberpunk==0.7.6
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import logging
# Custom utility functions
from utils import (
    _reciprocal_rank, compute_accuracy, compute_mrr_at_k, collect_preds,
    extract_features, get_top_k_predictions, train_model,
    roc_curve_per_category, plot_class_roc_curve,
)
%config InlineBackend.figure_formats = ['svg']
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
0. Refresher on pandas' dataframes¶
We can use the pandas library to read in, review, and revise the data set.
pandas, i.e., *panel data*, organizes information in rows for **what is observed** and columns for **properties of those observations**, as well as an index column to make it easier to reference each row uniquely.
Pandas Series: One-dimensional array akin to a column OR a row in a spreadsheet. They are essentially a special kind of array list with special methods and functions.
Pandas DataFrame: Two-dimensional array akin to the combination of columns AND rows in a spreadsheet. Like Series, the DataFrame has special methods and functions that we can use to explore and analyze data. We will focus much more on DataFrames than on Series, but you should know about Series, because you sometimes create new Series (columns or rows) to add to your existing DataFrame. It's pretty rad.
Indices: Notice the color-coding going on in the figure: red, blue, and yellow. The red and blue represent the indices of the rows and columns, while the yellow squares represent the values in the dataset.
Recall how pandas' "panel data" helps us more easily transform multiple types of structured data into what's called a DataFrame. DataFrames are two-dimensional tabular data that are mutable (transformable) with labeled axes (rows and columns). See chapter 3 to refresh your memory, if needed. You can also reference pandas in general, or its DataFrame datatype.
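To make the Series/DataFrame distinction concrete, here is a minimal sketch with a toy, invented dataset (the column names and values below are just for illustration, not from the HuffPost data):

```python
import pandas as pd

# A Series: a one-dimensional labeled array, akin to a single column
headlines = pd.Series(["Storm Hits Coast", "Team Wins Title"], name="headline")

# A DataFrame: a two-dimensional table of labeled rows and columns
toy_df = pd.DataFrame({
    "headline": ["Storm Hits Coast", "Team Wins Title"],
    "category": ["WEATHER", "SPORTS"],
})

# Add a new Series to the existing DataFrame as a column
toy_df["headline_length"] = toy_df["headline"].apply(len)
print(toy_df.shape)  # (2, 3): two rows, now three columns
```

Note how the new column is itself a Series: that's the pattern we use later when we add `headline_char_length` and friends to `df`.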
1. Import the Data¶
# Imports the noted JSON file as a pandas DataFrame based on the path below
df = pd.read_json("./data/news_category_dataset.json", lines=True)
2. Exploratory Data Analysis¶
.shape returns a tuple, e.g., (rows, columns), with info about the two basic dimensions of panel data:
- the number of rows, and
- the number of columns.
df.shape
(124989, 6)
The pandas DataFrame .info() method returns a more comprehensive description of the data set.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 124989 entries, 0 to 124988 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 short_description 124989 non-null object 1 headline 124989 non-null object 2 date 124989 non-null datetime64[ns] 3 link 124989 non-null object 4 authors 124989 non-null object 5 category 124989 non-null object dtypes: datetime64[ns](1), object(5) memory usage: 5.7+ MB
Looks like there are no null values in the dataset.
Let's dig deeper into the columns and values.
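`.info()` already reports non-null counts, but an explicit check can be reassuring. A minimal sketch of that check using a toy DataFrame (the real `df` would be checked the same way):

```python
import pandas as pd

# Toy stand-in for df, with one missing headline for demonstration
toy_df = pd.DataFrame({
    "headline": ["A big story", None, "Another story"],
    "category": ["NEWS", "NEWS", "SPORTS"],
})

# Count missing values per column
null_counts = toy_df.isnull().sum()
print(null_counts)
```

On our actual dataset, `df.isnull().sum()` should report zero for every column, matching what `.info()` told us.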
2.1 Review particular columns¶
Be sure to review the data per column. The goals of this modeling may be set for you, but you will be asked to perform similar EDA work on your final project, so you can better understand the modeling possibilities and boundaries of your data. The section below includes a series of exercises to help you understand this dataset before you conduct the actual modeling work.
# List the columns for reference
df.columns
Index(['short_description', 'headline', 'date', 'link', 'authors', 'category'], dtype='object')
2.2 Describe and examine the headline column¶
We can use .describe() on a Series to surface any potential quirks.
Let's check out the headline column, since that's an important column for training our model.
What are the summary stats for the headline?¶
df.headline.describe()
count 124989 unique 124560 top Sunday Roundup freq 90 Name: headline, dtype: object
Some initial observations:
- There is a decently sized difference between the `count` and `unique` values.
- The top headline, "Sunday Roundup", appears 90 times (`freq`) in the `headline` column.
Let's review rows with the "Sunday Roundup" value in the headline column by using our new skills with pandas. Below I chain:

- `df.loc[]`: query the DataFrame with the location method.
- `df.headline`: specify what slice of the data I want to isolate.
- `.str.contains('Sunday Roundup')`: since `headline` values are Strings, and I want to isolate any rows with the "Sunday Roundup" headline, search the isolated Series, `headline`, with the `.str.contains()` method. It takes a string as its main parameter.
df.loc[df.headline.str.contains('Sunday Roundup')].sample(10)
| short_description | headline | date | link | authors | category | |
|---|---|---|---|---|---|---|
| 107742 | We don't know who will be victorious in Tuesda... | Sunday Roundup | 2014-11-02 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 116983 | This week, President Obama's $3.7 billion requ... | Sunday Roundup | 2014-07-20 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 82168 | This week we saw how dissimilar appeals to our... | Sunday Roundup | 2015-08-23 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 96792 | This week proved that while the arc of the mor... | Sunday Roundup | 2015-03-08 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 92626 | This week, the White House revealed it really ... | Sunday Roundup | 2015-04-26 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 62390 | This week the nation got to experience March M... | Sunday Roundup | 2016-04-03 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 88321 | This week, the nation waited in breathless ant... | Sunday Roundup | 2015-06-14 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 113253 | This week got off to a horrifying start as a 9... | Sunday Roundup | 2014-08-31 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 92027 | This week, the nation's eyes were on Baltimore... | Sunday Roundup | 2015-05-03 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
| 124201 | This week, Mika Brzezinski and I hosted our se... | Sunday Roundup | 2014-04-27 | https://www.huffingtonpost.com/entry/sunday-ro... | Arianna Huffington, Contributor | POLITICS |
Let's review the top 5 repeating headlines, in case there's something worth noting.
df.groupby(['headline'])['headline'].count().sort_values(ascending=True)[-5:].plot(
kind='barh',
figsize=(3,7)
)
<Axes: ylabel='headline'>
2.3 How long are the headlines? (What's the distribution?)¶
I'm also curious about the distribution of headline lengths, because that can tell us how much semantic info may be available in the headlines. Specifically, if a large share of headlines are only 3-4 words long, that may cap how much contextual info each headline will actually supply the LR model.
So, let's add a new column to the pandas DataFrame, df, by apply()ing the len (length) function to the headline column. Below, I do so with the apply() method and assign the values per row to a new Series (column) that I call headline_char_length.
df['headline_char_length'] = df.headline.apply(len)
df.head()
| short_description | headline | date | link | authors | category | headline_char_length | |
|---|---|---|---|---|---|---|---|
| 0 | She left her husband. He killed their children... | There Were 2 Mass Shootings In Texas Last Week... | 2018-05-26 | https://www.huffingtonpost.com/entry/texas-ama... | Melissa Jeltsen | CRIME | 64 |
| 1 | Of course it has a song. | Will Smith Joins Diplo And Nicky Jam For The 2... | 2018-05-26 | https://www.huffingtonpost.com/entry/will-smit... | Andy McDonald | ENTERTAINMENT | 75 |
| 2 | The actor and his longtime girlfriend Anna Ebe... | Hugh Grant Marries For The First Time At Age 57 | 2018-05-26 | https://www.huffingtonpost.com/entry/hugh-gran... | Ron Dicker | ENTERTAINMENT | 47 |
| 3 | The actor gives Dems an ass-kicking for not fi... | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | 2018-05-26 | https://www.huffingtonpost.com/entry/jim-carre... | Ron Dicker | ENTERTAINMENT | 69 |
| 4 | The "Dietland" actress said using the bags is ... | Julianna Margulies Uses Donald Trump Poop Bags... | 2018-05-26 | https://www.huffingtonpost.com/entry/julianna-... | Ron Dicker | ENTERTAINMENT | 71 |
Ok, we counted characters in the headline to capture one angle of headline length. Now, let's add an approximate word length Series as a column in the DataFrame.
Use apply() again, but with a little built-in Python method magic to create a simple function that:

- Assigns a variable to the row's headline String value: `lambda hl:`
- Splits the `hl` String value into a List array delimited by spaces: `hl.split(' ')`
- Returns the length of the List array as an Integer: `len()`
BOOM! We got a great approximation of the number of words in a new column.
df['headline_word_length'] = df.headline.apply(lambda hl: len(hl.split(' ')) )
df.head()
| short_description | headline | date | link | authors | category | headline_char_length | headline_word_length | |
|---|---|---|---|---|---|---|---|---|
| 0 | She left her husband. He killed their children... | There Were 2 Mass Shootings In Texas Last Week... | 2018-05-26 | https://www.huffingtonpost.com/entry/texas-ama... | Melissa Jeltsen | CRIME | 64 | 14 |
| 1 | Of course it has a song. | Will Smith Joins Diplo And Nicky Jam For The 2... | 2018-05-26 | https://www.huffingtonpost.com/entry/will-smit... | Andy McDonald | ENTERTAINMENT | 75 | 14 |
| 2 | The actor and his longtime girlfriend Anna Ebe... | Hugh Grant Marries For The First Time At Age 57 | 2018-05-26 | https://www.huffingtonpost.com/entry/hugh-gran... | Ron Dicker | ENTERTAINMENT | 47 | 10 |
| 3 | The actor gives Dems an ass-kicking for not fi... | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | 2018-05-26 | https://www.huffingtonpost.com/entry/jim-carre... | Ron Dicker | ENTERTAINMENT | 69 | 11 |
| 4 | The "Dietland" actress said using the bags is ... | Julianna Margulies Uses Donald Trump Poop Bags... | 2018-05-26 | https://www.huffingtonpost.com/entry/julianna-... | Ron Dicker | ENTERTAINMENT | 71 | 13 |
'''
.hist() -- Creates a histogram chart with a dataframe's Series
Histograms place a metric in bins -- headline character lengths in this case -- to understand the distribution of the data.
'''
df.headline_char_length.hist(
figsize=(12,6),
color='#86bf91',
)
<Axes: >
Looks like there's some variance, with some headlines over 100 characters long and up to ~320.

Let's also plot the word-length distribution, then use .describe() on both new columns as tables of values to review.
'''
.hist() -- Creates a histogram chart with a dataframe's Series
Histograms place a metric in bins -- headline word lengths in this case -- to understand the distribution of the data.
'''
df.headline_word_length.hist(
figsize=(12,6),
color='#86bf91',
)
<Axes: >
df.headline_char_length.describe()
count 124989.000000 mean 60.023194 std 17.274685 min 0.000000 25% 49.000000 50% 62.000000 75% 71.000000 max 320.000000 Name: headline_char_length, dtype: float64
df.headline_word_length.describe()
count 124989.000000 mean 9.863868 std 2.886560 min 1.000000 25% 8.000000 50% 10.000000 75% 12.000000 max 44.000000 Name: headline_word_length, dtype: float64
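That max of 320 characters suggests a few outliers. If you want to inspect them directly, a boolean mask on the new column works; here is a minimal sketch with toy values (the 100-character threshold is just illustrative, not a rule from the dataset):

```python
import pandas as pd

# Toy headlines: one short, two artificially long
toy_df = pd.DataFrame({"headline": ["A short one", "x" * 120, "y" * 310]})
toy_df["headline_char_length"] = toy_df["headline"].apply(len)

# Isolate unusually long headlines with a boolean mask
long_headlines = toy_df.loc[toy_df["headline_char_length"] > 100]
print(len(long_headlines))
```

On the real `df`, the same `.loc[]` pattern would surface the actual outlier rows so you can eyeball what kinds of headlines run long.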
2.3.1 Exercise -- Observations about the headlines column¶
- ENTER YOUR FIRST OBSERVATION HERE
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
- ENTER YOUR SECOND OBSERVATION HERE
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
- ENTER MORE OBSERVATIONS
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
2.4 What are the range and distribution of dates?¶
Since we have a date column stored in a standardized datetime format, we can quickly plot the date range in a histogram figure.
Let's jot down some observations in our notebook.
NOTE: Be sure to respond to the questions and add at least one more observation, a question about it, and a potential explanation.
2.4.1 EXERCISE -- Notes on the distribution of the data based on the date column¶
'''
.hist() -- Creates a histogram chart with a dataframe's Series
Histograms place a metric in bins -- dates in this case -- to understand the distribution of the data. In this case, the distribution of the data over time
'''
df.date.hist(
figsize=(8,3),
color='#86bf91',
)
<Axes: >
- Articles' publishing dates range between July 2014 and July 2018
- Question: How might this impact the model's output?
- YOUR RESPONSE HERE
- Fewer 2018 articles compared to the rest of the dataset.
- Question: How might this impact the model's output?
- YOUR RESPONSE HERE
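One quick way to double-check the date-range observation above is to call .min() and .max() on the date column; a minimal sketch with a few toy dates standing in for `df.date`:

```python
import pandas as pd

# Toy dates standing in for the real df.date column
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-07-20", "2016-04-03", "2018-05-26"]),
})

# The earliest and latest publication dates in the data
print(toy_df["date"].min(), toy_df["date"].max())
```

Running `df.date.min()` and `df.date.max()` on the real DataFrame gives you exact endpoints, which is more precise than reading them off a histogram.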
2.7 Describe and examine the category column of news genres in the data¶
Use dot notation and a column name with sample() to sample values in that column (Series).
Remember: The dataframe stays the same because we are not altering df.
df.category.sample(5)
42740 POLITICS 56263 THE WORLDPOST 104391 ENTERTAINMENT 24967 CRIME 121039 FIFTY Name: category, dtype: object
Now, let's read and understand the news genre categories (politics, entertainment, etc.) in the dataset, as well as their distribution.
We can use the len() and set() functions in Python to isolate the unique number of values in a column.
len() by itself would count all values, repeats included. By applying set() to the column first, we tell Python to reduce the column to its unique set of values; len() then counts that set.
len(
set(df['category'].values)
)
31
set(
df['category'].values
)
{'ARTS',
'ARTS & CULTURE',
'BLACK VOICES',
'BUSINESS',
'COLLEGE',
'COMEDY',
'CRIME',
'EDUCATION',
'ENTERTAINMENT',
'FIFTY',
'GOOD NEWS',
'GREEN',
'HEALTHY LIVING',
'IMPACT',
'LATINO VOICES',
'MEDIA',
'PARENTS',
'POLITICS',
'QUEER VOICES',
'RELIGION',
'SCIENCE',
'SPORTS',
'STYLE',
'TASTE',
'TECH',
'THE WORLDPOST',
'TRAVEL',
'WEIRD NEWS',
'WOMEN',
'WORLD NEWS',
'WORLDPOST'}
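pandas also has built-ins for this, so you can skip the set() wrapper: .nunique() counts distinct values and .unique() returns them. A minimal sketch with toy categories:

```python
import pandas as pd

# Toy categories standing in for the real df['category']
toy_df = pd.DataFrame({"category": ["POLITICS", "SPORTS", "POLITICS", "TECH"]})

# Equivalent to len(set(df['category'].values))
n_categories = toy_df["category"].nunique()

# .unique() returns the distinct values themselves
labels = sorted(toy_df["category"].unique())
print(n_categories, labels)
```

Both approaches give the same answer; the len(set()) version is worth seeing once because it shows the plain-Python idea underneath the pandas shortcut.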
2.7.1 category by count¶
Let's review the distribution of categories, since this could impact the model.
In the code below, we tell Python to:

- Select the `category` column,
- Count the occurrences of each value in the column with `value_counts()`,
- Sort the counts with `sort_values()`, and
- Plot the results with `.plot()`, passing the value `'barh'` (horizontal bar) for its `kind` parameter.
df['category'].value_counts().sort_values(ascending=True).plot(
kind='barh'
)
<Axes: ylabel='category'>
2.7.2 EXERCISE -- Notes on the distribution of the data based on the category column¶
- ENTER YOUR FIRST OBSERVATION HERE
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
- ENTER YOUR SECOND OBSERVATION HERE
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
- ENTER MORE OBSERVATIONS
- Question: How might this impact our model's output?
- YOUR RESPONSE HERE
2.8 Describe and examine the short_description column¶
df.short_description.describe()
count 124989 unique 103905 top freq 19590 Name: short_description, dtype: object
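One quirk in that summary: the top value appears to be an empty string, with a freq of 19590, which suggests many articles have no short description at all. A minimal sketch of how to count those, using a toy DataFrame in place of the real one:

```python
import pandas as pd

# Toy stand-in with two empty short descriptions
toy_df = pd.DataFrame({"short_description": ["Some text", "", "More text", ""]})

# Count rows whose short_description is an empty string
empty_count = (toy_df["short_description"] == "").sum()
print(empty_count)
```

Empty descriptions matter for our modeling goal: any row with a blank `short_description` contributes nothing from that feature, so the model would lean entirely on the headline and URL for those articles.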
df['sd_char_length'] = df.short_description.apply(len)
df[['short_description','sd_char_length']].sample(5)
| short_description | sd_char_length | |
|---|---|---|
| 91240 | In major cities, car-hailing apps like Uber, L... | 295 |
| 83273 | The word "alien" will no longer be used to ref... | 103 |
| 105089 | Andre Allen (Rock) is a hip New York-based sta... | 201 |
| 41362 | Trump’s unsubstantiated claims renew fundament... | 115 |
| 108430 |  | 0 |
df['sd_word_length'] = df.short_description.apply(lambda sd: len(sd.split(' ')) )
df.head()
| short_description | headline | date | link | authors | category | headline_char_length | headline_word_length | sd_char_length | sd_word_length | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | She left her husband. He killed their children... | There Were 2 Mass Shootings In Texas Last Week... | 2018-05-26 | https://www.huffingtonpost.com/entry/texas-ama... | Melissa Jeltsen | CRIME | 64 | 14 | 76 | 13 |
| 1 | Of course it has a song. | Will Smith Joins Diplo And Nicky Jam For The 2... | 2018-05-26 | https://www.huffingtonpost.com/entry/will-smit... | Andy McDonald | ENTERTAINMENT | 75 | 14 | 24 | 6 |
| 2 | The actor and his longtime girlfriend Anna Ebe... | Hugh Grant Marries For The First Time At Age 57 | 2018-05-26 | https://www.huffingtonpost.com/entry/hugh-gran... | Ron Dicker | ENTERTAINMENT | 47 | 10 | 87 | 15 |
| 3 | The actor gives Dems an ass-kicking for not fi... | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | 2018-05-26 | https://www.huffingtonpost.com/entry/jim-carre... | Ron Dicker | ENTERTAINMENT | 69 | 11 | 86 | 14 |
| 4 | The "Dietland" actress said using the bags is ... | Julianna Margulies Uses Donald Trump Poop Bags... | 2018-05-26 | https://www.huffingtonpost.com/entry/julianna-... | Ron Dicker | ENTERTAINMENT | 71 | 13 | 87 | 13 |
# Print out a short comparison report
print(
'# Short Desc CHAR Length',
'\n', df.sd_char_length.describe(),
'\n\n# Short Desc WORD Length',
'\n', df.sd_word_length.describe(),
)
# Short Desc CHAR Length count 124989.000000 mean 92.415373 std 84.832972 min 0.000000 25% 33.000000 50% 76.000000 75% 123.000000 max 1472.000000 Name: sd_char_length, dtype: float64 # Short Desc WORD Length count 124989.000000 mean 15.935050 std 14.447419 min 1.000000 25% 6.000000 50% 13.000000 75% 21.000000 max 243.000000 Name: sd_word_length, dtype: float64
df.sd_char_length.hist(
figsize=(7,3),
color='#86bf91',
)
<Axes: >
df.sd_word_length.hist(
figsize=(7,3),
color='#86bf91',
)
<Axes: >
# Scatter plot using pandas
ax = df.plot(
kind='scatter',
x='sd_word_length',
y='sd_char_length',
color='red',
title='Relationship between word and character length of SDs'
)
# Customizing plot elements
ax.set_xlabel("SD Word Length")
ax.set_ylabel("SD Char Length")
plt.show()
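Beyond eyeballing the scatter plot, you can quantify the word-length/character-length relationship with a correlation coefficient via .corr(); a minimal sketch on toy lengths (the numbers below are invented, but we'd expect the real columns to correlate strongly, since more words generally means more characters):

```python
import pandas as pd

# Toy word/character lengths standing in for the real sd_* columns
toy_df = pd.DataFrame({
    "sd_word_length": [6, 13, 21, 40],
    "sd_char_length": [24, 76, 123, 250],
})

# Pearson correlation between the two length measures
r = toy_df["sd_word_length"].corr(toy_df["sd_char_length"])
print(round(r, 3))
```

A correlation near 1 would confirm the two columns carry largely redundant information, which is worth noting before treating both as separate features.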