How Do I Obtain Redirected Urls?
I am trying to get the redirected URL that https://trade.ec.europa.eu/doclib/html/153814.htm leads to (a pdf file). I've so far tried r = requests.get('https://trade.ec.europa.eu/d
Solution 1:
Please try this code and see whether it works for you:
import io
import re
from urllib.parse import urljoin

import requests
from PyPDF2 import PdfReader  # PyPDF2 >= 2.0; older releases call this PdfFileReader
from requests_html import HTMLSession

url = "https://trade.ec.europa.eu/doclib/html/153814.htm"

# Fetch the intermediate HTML page
session = HTMLSession()
r = session.get(url)

# Extract every link target on the page
jlinks = r.html.xpath('//a/@href')

# Drop mail, javascript and tel links; make relative paths absolute
updated_links = []
for link in jlinks:
    if re.search(r"@|javascript:|tel:", link):
        continue
    # urljoin leaves absolute URLs unchanged and resolves relative ones against url
    updated_links.append(urljoin(url, link))

# Download the first remaining link (the PDF) and print the text of its first page
r = requests.get(updated_links[0])
f = io.BytesIO(r.content)
reader = PdfReader(f)
contents = reader.pages[0].extract_text()
print(contents)
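Taking `updated_links[0]` assumes the PDF is the first usable link on the page. If that assumption is a concern, one option is to keep only links whose path ends in `.pdf`. The following is a minimal sketch using only the standard library; the helper name `pdf_links` is hypothetical and not part of the answer above.

```python
from urllib.parse import urljoin, urlparse


def pdf_links(hrefs, base_url):
    """Return absolute URLs from hrefs whose path ends in .pdf, in page order."""
    result = []
    for href in hrefs:
        # Resolve relative paths against the page URL; absolute URLs pass through
        absolute = urljoin(base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            result.append(absolute)
    return result
```

With the variables from the code above, `pdf_links(jlinks, url)` could replace the filtering loop, and its first element would be the download candidate.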
Solution 2:
I think you have to extract the redirect link yourself (I didn't find any way to do this through the redirect handling). When you request https://trade.ec.europa.eu/doclib/html/153814.htm, it returns an HTML page containing the redirect link, which you can extract like this, for example:
import requests
from lxml import html

# The response is a plain HTML page; the PDF is the target of its first link
tree = html.fromstring(requests.get('https://trade.ec.europa.eu/doclib/html/153814.htm').text)
print(tree.xpath('.//a/@href')[0])
The output will be:
https://trade.ec.europa.eu/doclib/docs/2015/september/tradoc_153814.pdf
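To handle both cases in one place (a real HTTP redirect, or an HTML page that merely links to the target), something like the following sketch might help. It uses only the standard library instead of `requests` and `lxml`; the names `FirstLinkParser`, `first_link`, and `resolve_target` are hypothetical, and it assumes, as in the answer above, that the first link on the page is the one you want.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class FirstLinkParser(HTMLParser):
    """Remember the href of the first <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def first_link(html_text, base_url):
    """Return the first link in html_text as an absolute URL, or None."""
    parser = FirstLinkParser()
    parser.feed(html_text)
    return urljoin(base_url, parser.href) if parser.href else None


def resolve_target(url):
    """Return the final URL after HTTP redirects, else the first link in the page."""
    with urlopen(url) as response:
        final_url = response.geturl()
        if final_url != url:
            return final_url  # the server issued a real HTTP redirect
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return first_link(body, final_url)


# Example call (needs network access):
# print(resolve_target("https://trade.ec.europa.eu/doclib/html/153814.htm"))
```

Keeping the parsing step in `first_link` means it can be checked against a saved copy of the page without making a network request.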