Creating eBooks from Webpages using Python






Chaitanya Tejaswi


21st April, 2020

Objectives

Create eBooks from Webpages

Assumptions

Dependencies

Motivation for This Talk

“Can I read this on my Kindle?”

The Solution: On Amazon Kindle

The Solution: On Android Device

How To Do It?

“Can I read this on my Kindle?”

  1. Send an HTTP request to the server for the file.
  2. Get the file and extract the “title” & “judgement” (summary).
  3. Save the result to a text/HTML file.
  4. Convert this file to an eBook format that works on both Android & Kindle (see the preview sketch below).
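
Put together, the four steps amount to only a short script. A rough preview, assuming a placeholder document id (the real, working version is built up step by step later in the talk):

#!/usr/bin/env python3
# Preview of the whole pipeline; each step is explained on the next slides.
from urllib import request
from bs4 import BeautifulSoup

url = 'https://indiankanoon.org/doc/1234567'              # [1] request the page (placeholder id)
page = request.urlopen(url).read().decode('utf-8')

soup = BeautifulSoup(page, 'lxml')                        # [2] extract title & judgement
judgement = soup.find('div', class_='judgments')
title = judgement.find('div', class_='doc_title').text

with open('judgement.html', 'w', encoding='utf-8') as f:  # [3] save as an HTML file
    f.write(f'<html><head><title>{title}</title></head><body>{judgement}</body></html>')

# [4] convert to EPUB/MOBI with pandoc & kindlegen (shown later in the talk)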

But First, Some Prerequisites

Send A Request, Retrieve A File

from urllib import request
...
response = request.urlopen(url).read().decode('utf-8')

Create An HTML Object

from bs4 import BeautifulSoup
...
html = BeautifulSoup(response, 'lxml')
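
Once the response has been parsed, the resulting object can be queried directly. A minimal, self-contained example (the URL is just a placeholder):

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://example.com').read().decode('utf-8')
html = BeautifulSoup(response, 'lxml')
print(html.title.text)           # the page's <title>
print(len(html.find_all('a')))   # number of links on the page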

Finding <tags>

# Find headline of text
headline = article.h2.a.text
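
Here `article` is assumed to be a Tag returned by an earlier find()/find_all() call. On a small, made-up HTML snippet the same dotted navigation looks like this:

from bs4 import BeautifulSoup

snippet = '''
<div class="article">
    <h2><a href="/post/42">Scraping 101</a></h2>
    <p>Body text ...</p>
</div>
'''
article = BeautifulSoup(snippet, 'lxml').div
print(article.h2.a.text)     # Scraping 101
print(article.h2.a['href'])  # /post/42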

Syntax

.find(tag, attributes, recursive, text, keywords)
.find_all(tag, attributes, recursive, text, limit, keywords)
# [tag] Find all headings in the page
.find_all('h1')
.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# [attributes] Find all <span> that contain green/red colored text
.find_all('span', {'class': {'green', 'red'}})

# [text] How many times is "Happy Birthday" displayed on the webpage?
#        (take len() of the list returned below to get the count)
.find_all(text='Happy Birthday')

# [keywords] Filter on attributes passed as keyword arguments
.find_all(id='title', class_={'green', 'red'})

Note: Values grouped in a set are OR'd; separate attribute filters are AND'd.
# Find all title/summary <div>s that are colored green or red
.find_all('div', id={'title', 'summary'}, class_={'green', 'red'})
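
A small self-contained demonstration of these filters, using made-up HTML: values grouped in a set are OR'd, while separate attribute filters are AND'd.

from bs4 import BeautifulSoup

page = '''
<div id="title"   class="green">Accepted</div>
<div id="summary" class="red">Rejected</div>
<div id="footer"  class="green">Irrelevant</div>
'''
bs = BeautifulSoup(page, 'lxml')

# OR within a set: any <div> whose class is green or red (all three match)
print(len(bs.find_all('div', class_={'green', 'red'})))        # 3

# AND across filters: the id must ALSO be title or summary (footer drops out)
print(len(bs.find_all('div', id={'title', 'summary'},
                      class_={'green', 'red'})))               # 2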

Observations

“Can I read this on my Kindle?”

Source Code: Let’s Jump In!

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import sys
from urllib import request

urlBase = 'https://indiankanoon.org'


def generateHtml(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        # Fetch the page and decode it as text
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        # Pull out the judgement block and its title
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        # Wrap the extracted content in a minimal HTML document
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        # Save the page next to the script
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateHtml(sys.argv[1])

Source Code: Let’s Jump In!

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import sys
from urllib import request

urlBase = 'https://indiankanoon.org'


def generateHtml(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        # [1] <-- Process the links
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        # [2] <-- Automatically open the file
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateHtml(sys.argv[1])

Source Code: Let’s Jump In!

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
import subprocess
import sys
from urllib import request

urlBase = 'https://indiankanoon.org'


def generateHtml(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                         fr'''\1{urlBase}\2"''', content)
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        subprocess.run(f'''start {urlId}.html''', shell=True)
        # [3] <-- Save ebook (epub/mobi)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateHtml(sys.argv[1])
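
The new re.sub call rewrites the page's relative links (href="/doc/...") into absolute URLs that still work from the saved local copy. In isolation, with a made-up href:

import re

urlBase = 'https://indiankanoon.org'
content = '<a href="/doc/1234567/">cited case</a>'
content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                 fr'''\1{urlBase}\2"''', content)
print(content)   # <a href="https://indiankanoon.org/doc/1234567/">cited case</a>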

Source Code: Let’s Jump In!

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
import subprocess
import sys
from urllib import request

urlBase = 'https://indiankanoon.org'


def generateMobi(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                         fr'''\1{urlBase}\2"''', content)
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        subprocess.run(f'''pandoc {urlId}.html --epub-cover-image="resources\supreme_court_india.jpg" -o {urlId}.epub''', shell=True)
        subprocess.run(f'''kindlegen {urlId}.epub''', shell=True)
        subprocess.run(f'''start {urlId}.epub''', shell=True)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateMobi(sys.argv[1])

Disclaimer: Don’t Use This In Production Code

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
import subprocess
import sys
from urllib import request

urlBase = 'https://indiankanoon.org'


def generateMobi(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8') # [1]
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                         fr'''\1{urlBase}\2"''', content)
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        subprocess.run(f'''pandoc {urlId}.html --epub-cover-image="resources\supreme_court_india.jpg" -o {urlId}.epub''', shell=True)
        subprocess.run(f'''kindlegen {urlId}.epub''', shell=True)
        subprocess.run(f'''start {urlId}.epub''', shell=True)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateMobi(sys.argv[1])

Slightly Better

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
import subprocess
import sys
from urllib import request, error

urlBase = 'https://indiankanoon.org'


def generateMobi(urlId):
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8') # [1]
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                         fr'''\1{urlBase}\2"''', content)
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        subprocess.run(f'''pandoc {urlId}.html --epub-cover-image="resources\supreme_court_india.jpg" -o {urlId}.epub''', shell=True)
        subprocess.run(f'''kindlegen {urlId}.epub''', shell=True)
        subprocess.run(f'''start {urlId}.epub''', shell=True)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateMobi(sys.argv[1])
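
The urlopen call is not the only thing that can fail quietly: the pandoc/kindlegen/start commands run with shell=True and their exit codes are never checked. One possible way to tighten that up (a sketch, not part of the talk's code; file names are placeholders):

import shutil
import subprocess

# Fail early if the external tools are not on the PATH
for tool in ('pandoc', 'kindlegen'):
    if shutil.which(tool) is None:
        raise SystemExit(f'{tool} not found; install it first')

# Passing a list avoids shell=True, and check=True raises CalledProcessError
# if pandoc exits with a non-zero status
subprocess.run(['pandoc', 'doc.html', '-o', 'doc.epub'], check=True)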

Document Your Code (using docstrings)

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
import subprocess
import sys
from urllib import request, error

urlBase = 'https://indiankanoon.org'


def generateMobi(urlId=None):
    '''
    Scrapes & Generates html/epub/mobi versions of document.
    '''
    url = f'{urlBase}/doc/{urlId}'
    try:
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        judgement = html.find('div', class_='judgments')
        title = judgement.find('div', class_='doc_title').text
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>{judgement}</body>
        </html>
        '''
        content = re.sub(r'''(href=")([a-zA-Z0-9/]+)"''',
                         fr'''\1{urlBase}\2"''', content)
        with open(f'{urlId}.html', 'w') as f:
            f.write(content)
        subprocess.run(f'''pandoc {urlId}.html --epub-cover-image="resources\supreme_court_india.jpg" -o {urlId}.epub''', shell=True)
        subprocess.run(f'''kindlegen {urlId}.epub''', shell=True)
        subprocess.run(f'''start {urlId}.epub''', shell=True)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateMobi(sys.argv[1])
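
The one-line docstring already shows up in help() and editor tooltips; a slightly fuller version spelling out the argument and side effects is just as easy (a suggestion, not part of the talk's code; the example id is made up):

def generateMobi(urlId=None):
    '''
    Scrape an Indian Kanoon judgement and save HTML/EPUB/MOBI copies of it.

    Args:
        urlId: the numeric document id from the page URL,
               e.g. the 1234567 in https://indiankanoon.org/doc/1234567/

    Side effects:
        Writes <urlId>.html, <urlId>.epub and <urlId>.mobi to the current
        directory and opens the EPUB with the system's default viewer.
    '''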

Use Virtual Environments (venv)

Why Virtual Environments?
To isolate packages used in a project from the packages installed on the system.

Steps
python -m venv myProject\venv
cd myProject
venv\Scripts\activate.bat
pip list
pip install [package-name]
pip install -r requirements.txt
pip freeze > requirements.txt
deactivate
rmdir /s venv
Example

Homework: Problem

Make Your Own eBook

Scrape the article on this webpage, and create your own eBook using the code from this talk.

Steps
  1. Visit the Dependencies page of this talk and install all necessary software.
  2. Modify the final code to capture the article’s heading & main content.
  3. Create an HTML file & save it locally.
  4. Convert this HTML file to EPUB, and try opening it on your phone using Google’s Play Books app.
Solutions will be posted on Saturday (25-04-2020)

Homework: Solution

Use this source code as a reference for the proposed problem.

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import subprocess
import sys
from urllib import request

urlBase = 'https://www.fullstackpython.com/blog/'


def generateHtml(urlId):
    '''
    Scrapes & generates html version of a document.
    '''
    url = f'{urlBase}{urlId}'
    try:
        # Get page, filter the contents & save as a new HTML page
        response = request.urlopen(url).read().decode('utf-8')
        html = BeautifulSoup(response, 'lxml')
        entries = html.find('div', class_='cn').find_all('div', class_='row')
        title = entries[1].h1.text
        author = entries[1].a.text
        blog = entries[2]
        content = f'''
        <html>
            <head><title>{title}</title></head>
            <body>
                <h1>{title}</h1>
                <h2>{author}</h2>
                {blog}
            </body>
        </html>
        '''
        # Save & Open HTML file
        with open('TEST.html', 'w', encoding='utf-8') as f:
            f.write(content)
        subprocess.run('start TEST.html', shell=True)
    except Exception as e:
        print(e)
    return None


if __name__ == '__main__':
    generateHtml(sys.argv[1])

# py test.py "first-steps-gitpython.html"
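
The solution stops at the HTML stage; step 4 of the homework (HTML to EPUB) can reuse the pandoc call from earlier in the talk, roughly:

import subprocess

# Convert the saved page to EPUB (pandoc must be installed and on the PATH)
subprocess.run('pandoc TEST.html -o TEST.epub', shell=True)
# Copy TEST.epub to your phone and open it with Google Play Books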

References

[1] “Web Scraping using Python” by Corey Schafer
[2] “RegEx using Python” by Corey Schafer
[3] “Python venv (Windows)” by Corey Schafer
[4] “Web Scraping with Python” by Ryan Mitchell
[5] “Legal Aspects” by Data Carpentry