July 1, 2023

Scraping CNN Articles with Python

Extracting Text, Images, and Author Details

Use Python to scrape cnn.com's homepage for the most recent news articles.

Photo by Rubaitul Azad on Unsplash

Introduction

Today we'll scrape cnn.com's homepage for the most recent news articles using static web scraping. In practice, you'd want to use static web scraping (the requests library) because it's much faster.

CNN.com has numerous detailed articles on many subjects, making it a useful source to stay up to date on politics and world events.

Getting Started

This article will use Python code, along with several libraries that'll need to be installed:

pip install requests
pip install bs4
pip install pandas

Also, if you want to save the output data in the Excel (xlsx) format, download openpyxl .

pip install openpyxl

The requests library allows us to request HTML, and the bs4 library allows us to parse that data.

To start, let's get the HTML for the cnn.com homepage and get all of the links on the page:

import requests
from bs4 import BeautifulSoup

all_urls = []
url = 'https://www.cnn.com'
data = requests.get(url).text
soup = BeautifulSoup(data, features="html.parser")
for a in soup.find_all('a', href=True):
    if a['href'] and a['href'][0] == '/' and a['href'] != '#':
        a['href'] = url + a['href']
    all_urls.append(a['href'])

The code above, specifically soup.find_all('a', href=True), uses the HTML parser BeautifulSoup to look for all of the <a> tags with an href value.

a['href'] = url + a['href']

The above line extends the shortened URLs to include the domain and protocol information. /world becomes https://www.cnn.com/world.

The result:

['https://www.cnn.com', 
'https://www.cnn.com/us', 
'https://www.cnn.com/world', 
'https://www.cnn.com/politics', 
...]

We're only interested in article URLs. Let's only look for those:

def url_is_article(url, current_year='2023'):
    if url:
        if 'cnn.com/{}/'.format(current_year) in url and '/gallery/' not in url:
            return True
    return False

article_urls = [url for url in all_urls if url_is_article(url)]

Here's what the code found:

['https://www.cnn.com/2023/07/01/politics/supreme-court-term-takeaways/index.html', 
'https://www.cnn.com/2023/07/01/world/france-protests-arrests-hundreds-intl-hnk/index.html', 
'https://www.cnn.com/2023/06/30/us/california-woman-guilty-kidnapping-false-report-jail/index.html',
...]

Scraping Articles - Title, Author, and Text

Let's scrape the first of the article URLs from the previous code snippet.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.cnn.com/2023/07/01/politics/supreme-court-term-takeaways/index.html'
data = requests.get(url).text
print(data)

<!DOCTYPE html>
<html lang="en" data-uri="cms.cnn.com/...

First, the article's title. Using the Developer Tools (Google Chrome), we see that the title element is a <h1> tag with the class headline__text. Let's use BeautifulSoup again to get this value

def parse(html):
    soup = BeautifulSoup(html, features="html.parser")
    title = soup.find('h1', {'class': 'headline__text'})
    if title:
        title = title.text.strip()
    else:
        title = ''
    return title
print(parse(data))

# Takeaways from the latest controversial and contentious Supreme Court term

Notice how we had to check if the title existed or not, and then replaced it with '' if it didn't? We'll have to do that for most of the values we need, so let's make a function to do that:

def return_text_if_not_none(element):
    if element:
        return element.text.strip()
    else:
        return ''

So the title code now becomes:

def parse(html):
    soup = BeautifulSoup(html, features="html.parser")
    title = return_text_if_not_none(soup.find('h1', {'class': 'headline__text'}))
    return title

In a similar fashion, let's find the author's name:

def parse(html):
    ...
    author = soup.find('span', {'class': 'byline__name'})
    if not author:
        author = soup.find('span', {'class': 'byline__names'})
    author = return_text_if_not_none(author)
    return author

# Ariane de Vogue

In my testing, I found that the author, which is always in a span tag, can either have the class byline__name or byline__names (to denote several authors, I figured). At any rate, the above code first searches for the first class, and if that doesn't return anything, it assumes that the second class works.

Article text is pretty easy to get. All of the text is stored in a <div> element with the class article__content.

def parse(html):
    ...
    article_content = return_text_if_not_none(soup.find('div', {'class': 'article__content'}))
    return article_content
# CNN - Last fall, just when the Supreme Court was gearing up to start a new term, Chief Justice John Roberts...

Now let's scrape the timestamp and images

Scraping Timestamp

The timestamp and the images are a bit more difficult to extract correctly, for several reasons. Let's start with the timestamp.

The timestamp is stored in a <div> tag with the class timestamp. Let's get the text:

def parse(html):
    ...
    timestamp = return_text_if_not_none(soup.find('div', {'class': 'timestamp'}))
    return timestamp

"""Updated
        2:27 PM EDT, Sat July 1, 2023"""

There are several parts to this timestamp: The part where it says the article was "Updated" (or "Published") and the tile, date, and year. What if we want to get these parts of the timestamp individually?

Let's make a new function, parse_timestamp, to do just that

def parse_timestamp(timestamp):
    if 'Published' in timestamp:
        timestamp_type = 'Published'
    elif 'Updated' in timestamp:
        timestamp_type = 'Updated'
    else:
        timestamp_type = ''


    article_time, article_day, article_year = timestamp.replace('Published', '').replace('Updated', '').split(', ')
    return timestamp_type, article_time.strip(), article_day.strip(), article_year.strip()



def parse(html):
    ...
    if timestamp:
        timestamp = parse_timestamp(timestamp)
    else:
        timestamp = ['', '', '', '']
    return timestamp

# ('Updated', '2:27 PM EDT', 'Sat July 1', '2023')

The first part of the new function checks if "Published" or "Updated" is in the timestamp text and updates the timestamp_type variable accordingly. The second part splits the string by its commas, getting the time of day, date, and year that the CNN article was published.

Scraping Images

Finally, let's get all of the images in the article. All of the <img> tags that we want have the class image__dam-img and contain the src attribute. Let's get those elements

def parse(html):
    ...
    all_images = soup.find_all('img', {'class': 'image__dam-img'}, src=True)
    return all_images
# [<img alt="FILE PHOTO: U.S. ...

Now, to get the src attribute:

def parse(html):
    ...
    all_img_srcs = ''
    if all_images:
        all_img_srcs = [x['src'] for x in all_images if x['src']]
        all_img_srcs = '
'.join(all_img_srcs)
    return all_img_srcs
# "https://media.cnn.com/api/v1/images/stellar/prod/230630224519-supreme-court-term-takeaways.jpg?c=16x9&q=w_850,c_fill
..."

Putting Everything Together

Now that we've made a function to parse the articles, let's use the URLs we found from the homepage to test it. Once we have all of the data, we'll use the pandas library to save the data into an excel file:

def parse(html):
    ...
    data = [title, author, *timestamp, all_img_srcs, article_content] # the * flattens the tuple
    return [x.strip() for x in data]


all_data = []
article_urls_duplicates_removed = list(set(article_urls))
for article_url in article_urls_duplicates_removed:
    article_data = requests.get(article_url).text
    parsed_data = parse(article_data)
    all_data.append(parsed_data)
    print(parsed_data)


df = pd.DataFrame(all_data, columns=['Title', 'Author', 'Timestamp Type', 'Article Year', 'Article Day', 'Article Time', 'Images', 'Article Text'])
df.to_excel('cnn.xlsx', index=False)

# you may need to install openpyxl (pip install openpyxl) to save in the Excel format.

And here's a sample of the results:

['What markets are saying about when to expect a recession', 'Krystal Hur', 'Published' ...]

['An earthquake destroyed their home days before Christmas. They still have to pay the mortgage'...]

Please click here to be sent an email containing the code referenced in this blog post.