Web scraping project

This is my project: using web scraping to download all the books from a site and upload them to the Internet Archive. I have already posted several programs related to this project; this post gives the final project details.

The books are downloaded from: http://www.e-books-chennaimuseum.tn.gov.in/chennaimuseum/index.php?option=com_abook&view=search&Itemid=101

Python libraries required:

  • Selenium webdriver
  • beautifulsoup and requests
  • internetarchive

Steps:

  • Collect all the book links using Selenium.
  • Download all the books using BeautifulSoup and requests.
  • Upload the books using “internetarchive”, a command-line and Python interface to archive.org.

Get all the book links:

There are 378 books in total.

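from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; the driver path is a raw string so the backslashes stay literal
web = webdriver.Chrome(service=Service(r"C:\drivers\chromedriver.exe"))
web.get("http://www.e-books-chennaimuseum.tn.gov.in/chennaimuseum/index.php?option=com_abook&view=search&Itemid=101")
with open("tamilbooks.txt", "a") as f:
    while True:
        # save every book link on the current results page
        for link in web.find_elements(By.XPATH, '//h3[@class="book-title"]/a'):
            href = link.get_attribute("href")
            print(href)
            f.write(href + "\n")
        try:
            # go to the next page; stop when there is no "»" link left
            web.find_element(By.LINK_TEXT, "»").click()
        except NoSuchElementException:
            break
web.quit()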

This program collects all the book links and saves them to tamilbooks.txt.
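
Because the script opens tamilbooks.txt in append mode, running it twice stores every link twice. Here is a minimal sketch of a clean-up step, assuming one link per line. The helper names unique_links and dedupe_file are my own and are not part of the original project.

```python
def unique_links(lines):
    """Return the links in first-seen order, dropping blank lines and repeats."""
    seen = set()
    result = []
    for line in lines:
        link = line.strip()
        if link and link not in seen:
            seen.add(link)
            result.append(link)
    return result


def dedupe_file(path):
    """Rewrite the link file at `path` so that each link appears only once."""
    with open(path, "r", encoding="utf-8") as f:
        links = unique_links(f)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(links) + "\n")
    return links


# Small in-memory demo; call dedupe_file("tamilbooks.txt") on the real file.
sample = ["http://example.org/book/1\n", "http://example.org/book/2\n",
          "http://example.org/book/1\n", "\n"]
print(unique_links(sample))
```

Running dedupe_file once before the download step should keep the same book from being fetched twice.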

Download all the books:

This program needs only the book page links. The direct PDF links are not required.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# read the book page links, dropping the trailing newline and blank lines
with open("tamilbooks.txt", "r") as f:
    book_urls = [line.strip() for line in f if line.strip()]

# make sure the download folder exists
os.makedirs("D:\\Tamil", exist_ok=True)
for url in book_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # every anchor whose href ends in .pdf points to a book file
    for link in soup.select("a[href$='.pdf']"):
        filename = os.path.join("D:\\Tamil", link["href"].split("/")[-1])
        print(filename)
        # the href may be relative, so resolve it against the page URL
        with open(filename, "wb") as pdf:
            pdf.write(requests.get(urljoin(url, link["href"])).content)
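
The download program relies on urljoin to turn each PDF href into a full address, and on the last part of the href for the file name. The sketch below shows that step on its own, using only the standard library. The function name pdf_target and the example.org addresses are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def pdf_target(page_url, href):
    """Return (absolute_url, file_name) for a PDF link found on page_url."""
    # strip() removes the newline that readlines() leaves on each link
    absolute = urljoin(page_url.strip(), href)
    # take the last path segment and decode escapes such as %20
    file_name = posixpath.basename(unquote(urlsplit(absolute).path))
    return absolute, file_name


# Hypothetical book page and hrefs, only to show the behaviour.
page = "http://example.org/museum/index.php?option=com_abook&view=book&id=1\n"
print(pdf_target(page, "images/books/sample%20book.pdf"))
print(pdf_target(page, "/files/another.pdf"))
```

A relative href is resolved against the folder of the book page, while an href that starts with "/" is resolved against the site root.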

In case you want the PDF links too, try this.

The book links are already in “tamilbooks”; this program stores all the PDF links in the “tamilpdflink” file.

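from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; the driver path is a raw string so the backslashes stay literal
web = webdriver.Chrome(service=Service(r"C:\drivers\chromedriver.exe"))
# read the book page links, skipping blank lines
with open("tamilbooks.txt", "r") as f:
    book_urls = [line.strip() for line in f if line.strip()]
with open("tamilpdflink.txt", "a") as out:
    for url in book_urls:
        web.get(url)
        # save the href of every anchor with class "tooltip" on the book page
        for link in web.find_elements(By.XPATH, '//a[@class="tooltip"]'):
            href = link.get_attribute("href")
            print(href)
            out.write(href + "\n")
web.quit()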

Upload books using “internetarchive”

Uploading books with internetarchive is easy. I have already registered with the Internet Archive, so let's move on to the main content. For more detail, see:

Internetarchive: https://tamilvelanpython.wordpress.com/2020/06/25/upload-file-using-internetarchive/

  1. First, the command line to upload books:
$ ia upload tamilvelanpython tamil --metadata="mediatype:texts"
  • ia is the internetarchive command-line tool
  • tamilvelanpython is my identifier
  • tamil is the folder on my system where I stored all the books. Note: “ia” and “tamil” must be in the same location.
  • --metadata="mediatype:texts" sets the media type of the item to texts

  2. Python program to upload books:

from internetarchive import upload

print("uploading files...")
# upload.txt lists the locations of the books stored in the "tamil" folder, one per line
with open("upload.txt", "r") as x:
    files = [line.strip() for line in x if line.strip()]
uploadfile = upload("tamilvelanpython", files=files)
print("Finished")
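
The post does not show how upload.txt is created. One possible way, assuming it should hold one PDF path per line, is to list the download folder with pathlib. The function name write_upload_list is my own, and the demo uses a temporary folder in place of the real “tamil” folder.

```python
import tempfile
from pathlib import Path


def write_upload_list(folder, list_file="upload.txt"):
    """Write the path of every .pdf in `folder` to `list_file`, one per line."""
    pdfs = sorted(str(p) for p in Path(folder).glob("*.pdf"))
    text = "\n".join(pdfs) + ("\n" if pdfs else "")
    Path(list_file).write_text(text, encoding="utf-8")
    return pdfs


# Demo in a temporary folder so that nothing on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.pdf").write_bytes(b"")
    (Path(tmp) / "a.pdf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a book", encoding="utf-8")
    list_path = Path(tmp) / "upload.txt"
    demo_paths = write_upload_list(tmp, list_path)
    demo_lines = list_path.read_text(encoding="utf-8").splitlines()
    print([Path(p).name for p in demo_paths])
```

For the real folder, something like write_upload_list("D:\\Tamil") would produce the list that the upload program reads.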

This is the whole process of downloading the books and uploading them to the Internet Archive.

Github link: https://github.com/tamilvelan7/Web-scraping-using-download-books-and-Upload-to-internetarchive-api
