How do I extract text from a URL in Python?

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates character references (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
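
The snippet above is Python 2 code (the HTMLParser module, htmlentitydefs and unichr only exist there). A rough Python 3 sketch of the same HTML-to-text idea (not the original answer's code) is shorter, because the standard library's html.parser decodes entities and character references itself when convert_charrefs is left at its default of True, so the entityref/charref handlers can be dropped:

from html.parser import HTMLParser
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=True (the default) decodes &amp;, &#39;, etc. and passes
        # the plain characters straight to handle_data.
        super().__init__(convert_charrefs=True)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            # Suppress output while inside <script> or <style>:
            self.hide_output = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    parser = _HTMLToText()
    parser.feed(html)
    parser.close()
    return parser.get_text()

print(html_to_text('<p>Hello &amp; welcome</p><script>var x = 1;</script>'))
# -> '\nHello & welcome\n' (script contents skipped, &amp; decoded to &)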

When performing content analysis at scale, you’ll need to automatically extract text content from web pages.

In this article you’ll learn how to extract the text content from single and multiple web pages using Python.


!pip install beautifulsoup4
!pip install numpy
!pip install requests
!pip install spacy
!pip install trafilatura

NB: If you're writing this in a standard Python file, you won't need to include the ! symbol; it's only needed because this tutorial is written in a Jupyter Notebook.
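
The spaCy section towards the end also assumes the small English model is available for spacy.load("en_core_web_sm"); if you haven't downloaded it yet, you'll need one more command:

!python -m spacy download en_core_web_sm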


Firstly we’ll break the problem down into several stages:

  1. Extract all of the HTML content using requests into a Python dictionary.
  2. Pass every single HTML page to Trafilatura to parse the text content.
  3. Add error and exception handling so that if Trafilatura fails, we can still extract the content, albeit with a less accurate approach.

from bs4 import BeautifulSoup
import json
import numpy as np
import requests
from requests.models import MissingSchema
import spacy
import trafilatura

Collect The HTML Content From The Website

urls = ['https://understandingdata.com/',
        'https://sempioneer.com/',]
data = {}

for url in urls:
    # 1. Obtain the response:
    resp = requests.get(url)
    
    # 2. If the response status code is 200 (OK), save the HTML content:
    if resp.status_code == 200:
        data[url] = resp.text
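
Note that, as written, an invalid or unreachable URL would raise an exception and a very slow host could hang the loop. A minimal optional variation (the 10-second timeout is just an example value) that skips such URLs:

for url in urls:
    try:
        # Give up after 10 seconds instead of hanging indefinitely:
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            data[url] = resp.text
    except requests.RequestException:
        # Skip URLs that are invalid or unreachable:
        pass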

After collecting all of the responses that had a status_code of 200, we can now make several attempts to extract the text content from every response.

Firstly we'll try to use trafilatura; however, if this library is unable to extract the text, then we'll use BeautifulSoup4 as a fallback.

def beautifulsoup_extract_text_fallback(response_content):
    
    '''
    This is a fallback function, so that we can always return a value for text content,
    even when Trafilatura is unable to extract the text from a single URL.
    '''
    
    # Create the beautifulsoup object:
    soup = BeautifulSoup(response_content, 'html.parser')
    
    # Finding the text:
    text = soup.find_all(text=True)
    
    # Remove unwanted tag elements:
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        'style',]

    # Loop over every item in the extracted text and keep it only if its parent tag
    # is NOT in the blacklist:
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
            
    # Remove any tab separation and strip the text:
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()
    

def extract_text_from_single_web_page(url):
    
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # This handles any failures in both the Trafilatura and BeautifulSoup4 functions:
                return np.nan
        # Handle any URLs that don't have the correct protocol (e.g. a missing https://):
        except MissingSchema:
            return np.nan

single_url = 'https://understandingdata.com/'
text = extract_text_from_single_web_page(url=single_url)
print(text)

Let’s use a list comprehension with our extract_text_from_single_web_page function to easily extract the text from many web pages:


urls = urls + ['fake_url']

text_content = [extract_text_from_single_web_page(url) for url in urls]

print(text_content[1])
print(text_content[-1:])

Notice that any URL that failed can easily be removed later, because we returned np.nan (not a number) for it.


Cleaning Our Raw Text From Multiple Web Pages

After you’ve successfully extracted the raw text documents, let’s remove any web pages that failed:

cleaned_textual_content = [text for text in text_content if str(text) != 'nan']

Also, you might want to clean the text for further analysis. For example, tokenising the text content allows you to analyse the sentiment, the sentence structure, semantic dependencies and also the word count.

nlp = spacy.load("en_core_web_sm")
for cleaned_text in cleaned_textual_content:
    # 1. Create an NLP document with Spacy:
    doc = nlp(cleaned_text)
    # 2. Spacy has tokenised the text content:
    print(f"This is a spacy token: {doc[0]}")
    # 3. Extracting the word count per text document:
    print(f"The estimated word count for this document is: {len(doc)}.")
    # 4. Extracting the number of sentences:
    print(f"The estimated number of sentences in the document is: {len(list(doc.sents))}")
    print('\n')

Conclusion

Hopefully you can now easily extract text content from either a single URL or multiple URLs.

We’ve also included BeautifulSoup as a failsafe/fallback function. This ensures that our code is less fragile and is able to withstand the following:

  • Invalid URLs.
  • URLs that returned a failed status code (not 200).
  • URLs that we were unable to extract any text content from (these return np.nan and are filtered out).


How do I extract textual data from a URL in Python?

Approach (a minimal sketch of these steps follows below):

  1. Create a text file.
  2. Now for the program, import the required module and pass the URL and … .
  3. Make a requests instance and pass in the URL.
  4. Open the file in read mode and pass the required parameter(s).
  5. Pass the request's content into a BeautifulSoup() function.
  6. Create another file (or write/append to an existing file).
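
A minimal sketch of that approach, assuming requests and beautifulsoup4 are installed (the URL and file name here are just example placeholders):

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the HTML:
resp = requests.get('https://understandingdata.com/')
soup = BeautifulSoup(resp.text, 'html.parser')

# Write the extracted plain text into a new text file:
with open('extracted_text.txt', 'w', encoding='utf-8') as f:
    f.write(soup.get_text(separator='\n'))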

How do I extract text from a URL?

Click and drag to select the text on the Web page you want to extract and press “Ctrl-C” to copy the text. Open a text editor or document program and press “Ctrl-V” to paste the text from the Web page into the text file or document window. Save the text file or document to your computer.

How do I read a text file from a URL in Python?

Using urllib from the standard library:

import urllib.request

url = "http://textfiles.com/adventure/aencounter.txt"
file = urllib.request.urlopen(url)
for line in file:
    decoded_line = line.decode("utf-8")
    print(decoded_line)
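
Since the article above already uses requests, an equivalent one-liner with that library is:

import requests
print(requests.get("http://textfiles.com/adventure/aencounter.txt").text)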

How do I extract specific text in Python?

How to extract specific portions of a text file using Python (a short sketch follows the list):

  • Make sure you're using Python 3.
  • Reading data from a text file.
  • Using "with open".
  • Reading text files line-by-line.
  • Storing text data in a variable.
  • Searching text for a substring.
  • Incorporating regular expressions.
  • Putting it all together.
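
A rough illustration of those steps (the file name and pattern below are made up for the example):

import re

# A regular expression describing the portion of each line we want to keep:
pattern = re.compile(r'ERROR: (.+)')

extracted = []  # store the extracted text in a variable

# Read the file line by line using "with open":
with open('example.log', encoding='utf-8') as f:
    for line in f:
        match = pattern.search(line)  # search each line for the substring/pattern
        if match:
            extracted.append(match.group(1))

print(extracted)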