
Beautifulsoup Different Parsers

Could anyone elaborate on the difference between parsers like html.parser and html5lib? I've stumbled across a weird behavior where, when using html.parser, it ignores all th

Solution 1:

You can use the lxml parser, which is very fast (it is a separate package that has to be installed alongside bs4), and then call either find_all or select to collect all the a tags.

from bs4 import BeautifulSoup

# Sample markup with IE conditional comments around the links
html = """
<html><head></head><body><!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a><a href="test"></a><a href="test"></a><a href="test"></a><!--[if lte IE 8]>
  <![endif]--></body></html>
"""

# Parse with lxml instead of html.parser
soup = BeautifulSoup(html, 'lxml')
# find_all returns every <a> tag in the document
tags = soup.find_all('a')
print(tags)

Or, using a CSS selector:

from bs4 import BeautifulSoup

# Same sample markup as in the find_all version
html = """
<html><head></head><body><!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a><a href="test"></a><a href="test"></a><a href="test"></a><!--[if lte IE 8]>
  <![endif]--></body></html>
"""

soup = BeautifulSoup(html, 'lxml')
# select takes a CSS selector; 'a' matches every <a> tag
tags = soup.select('a')
print(tags)
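
As for the difference between the parsers: each one recovers from invalid markup in its own way, so the same input can give different trees. A minimal sketch for checking this on your own markup is below. The helper name parse_with_each and the sample string are made up for illustration, and any parser that is not installed is skipped rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return the serialized trees.

    Hypothetical helper: maps parser name -> str(soup), or None when
    that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages
            results[name] = None
            continue
        results[name] = str(soup)
    return results


# Deliberately invalid markup: an unclosed <a> followed by a stray </p>
sample = '<a href="test"></p>'

for name, tree in parse_with_each(sample).items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

Running the same helper on the markup from the question should show whether the missing tags are specific to html.parser on your setup.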
