Beautifulsoup Different Parsers

January 13, 2024

Could anyone elaborate on the difference between parsers like html.parser and html5lib? I've stumbled across a weird behavior where, when using html.parser, it ignores all th

Solution 1:

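You can use lxml, which is very fast, and then call either find_all or select to collect the tags. lxml is a third-party parser, so it has to be installed separately (for example with pip install lxml).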

from bs4 import BeautifulSoup

# Sample markup: four links surrounded by IE conditional comments.
html = """
<html><head></head><body><!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a><a href="test"></a><a href="test"></a><a href="test"></a><!--[if lte IE 8]>
  <![endif]--></body></html>
"""

# Parse with lxml (requires the lxml package to be installed).
soup = BeautifulSoup(html, 'lxml')

# Option 1: find_all matches by tag name.
tags = soup.find_all('a')
print(tags)

# Option 2: select matches with a CSS selector.
tags = soup.select('a')
print(tags)
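
As for the difference between parsers: each one recovers from malformed markup in its own way, so the same document can produce a different tree depending on which parser you pass to BeautifulSoup. A quick way to see this for your own markup is to parse it with every parser you have installed and compare the results. The sketch below is only an illustration: the helper name count_links is made up for this example, and the counts it prints are not guaranteed, since they depend on the parser and on the Python and library versions in use.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same kind of markup as in the answer above: links mixed with
# IE conditional comments, which parsers may handle differently.
HTML = """
<html><head></head><body><!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a><a href="test"></a><a href="test"></a><a href="test"></a><!--[if lte IE 8]>
  <![endif]--></body></html>
"""


def count_links(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: count <a> tags found by each parser.

    Returns a dict mapping parser name to the number of <a> tags,
    or None when that parser is not installed.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


counts = count_links(HTML)
for name, found in counts.items():
    print(name, "not installed" if found is None else found)
```

If the numbers differ between parsers, the markup is being repaired differently, and switching parser (as suggested above) is usually the simplest fix.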
