How To Convert <br> Tag To A Comma/new Column When Scraping Website With Python?
Solution 1:
import re
from io import StringIO

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1'
page_html = requests.get(url).text
page_soup = BeautifulSoup(page_html, "html.parser")
# Each section of the page is a <table id="finder">
tables = page_soup.find_all("table", id = "finder")

# Replace every run of <br> tags with a single <td>
reg = re.compile(r"(<[\/]?br[\/]?>)+")
reformattable = []
for table in tables:
    reformattable.append(re.sub(reg, "<td>", str(table)))

# Each section parses into two DataFrames: [0] is info, [1] is stats
dflist = []
for table in reformattable:
    # StringIO: recent pandas versions expect a file-like object, not a raw HTML string
    dflist.append(pd.read_html(StringIO(str(table))))
info = [dflist[i][0] for i in np.arange(len(dflist))]
stats = [dflist[i][1] for i in np.arange(len(dflist))]

# Stack the info columns into one column and drop the empty cells
adjInfo = []
for df in info:
    adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop = True))

# Join each stats label (column 0) with its value (column 2)
adjStats = []
for df in stats:
    df.drop(columns = 1, inplace = True)
    df.dropna(inplace = True)
    df[3] = df[0] + ' ' + df[2]
    adjStats.append(df[3])

# One combined column per section
combo = []
for p1, p2 in zip(adjInfo, adjStats):
    combo.append(pd.concat([p1, p2]))
finaldf = pd.concat([combo[i] for i in np.arange(len(combo))], axis = 1)
finaldf
So this gives you exactly what you want. Let's go over it.
After inspecting the website, we can see that each section is a "table" with the id of finder, so we look for those using Beautiful Soup. Next, the <br> tags had to be reformatted to make the page easier to load into a df, so I replaced every run of consecutive <br> tags with a single <td> tag. That way each line break starts a new cell.
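To see the substitution on its own, here is a minimal sketch that applies the same pattern to a made-up fragment. The sample HTML and the names in it are hypothetical, not taken from the real page. Note that the pattern does not match a tag written with a space, such as <br />.

```python
import re

# Hypothetical fragment standing in for one cell of the real page
sample_html = (
    "<table><tr><td>Example University<br>School of Nursing"
    "<br/><br>Sample City, AL</td></tr></table>"
)

# Same pattern as in the answer: one or more consecutive <br>, <br/> or </br> tags
reg = re.compile(r"(<[\/]?br[\/]?>)+")

# Each run of line breaks collapses into a single <td>
converted = reg.sub("<td>", sample_html)
print(converted)
```

The doubled break between the second and third lines becomes one <td>, so no empty cell is created there.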
Another issue with the website was that each section is broken up into 2 tables, so we get 2 df per section. To make cleaning easier, I separated them into two lists of dataframes, info and stats.
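The split can be sketched on hand-built data. The frames below are hypothetical stand-ins for what pd.read_html returns per section, and the list comprehensions over pairs are just a shorter way to write the np.arange indexing.

```python
import pandas as pd

# Hypothetical stand-in for dflist: one [info, stats] pair per section
dflist = [
    [pd.DataFrame({0: ["Example University"]}),
     pd.DataFrame({0: ["Accredited:"], 2: ["2015"]})],
    [pd.DataFrame({0: ["Sample College"]}),
     pd.DataFrame({0: ["Accredited:"], 2: ["2018"]})],
]

# First table of every section goes to info, second to stats
info = [pair[0] for pair in dflist]
stats = [pair[1] for pair in dflist]
print(len(info), len(stats))
```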
adjInfo and adjStats simply clean the dataframes and put them in a list. Next we recombine the information into a single column for each section and put it in combo.
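Here is a small sketch of the cleaning and recombining for one section, using invented tables whose layout is an assumption about what the page produces. Unlike the answer's code, it passes ignore_index=True when joining the two pieces so the combined column gets a fresh 0..n-1 index; that is my choice for the example, not something the original does.

```python
import numpy as np
import pandas as pd

# Hypothetical info table: values scattered across columns with gaps
info_df = pd.DataFrame({0: ["Example University", np.nan],
                        1: ["School of Nursing", "Dean: J. Doe"]})
# Stack the columns end to end, drop the gaps, renumber the rows
flat_info = (pd.concat([info_df[c] for c in info_df.columns])
             .dropna()
             .reset_index(drop=True))

# Hypothetical stats table: label in column 0, empty column 1, value in column 2
stats_df = pd.DataFrame({0: ["Accreditation date:", "Next review:"],
                         1: [np.nan, np.nan],
                         2: ["2015", "2025"]})
kept = stats_df.drop(columns=1).dropna()
flat_stats = kept[0] + " " + kept[2]

# One column for the section: info first, then stats
section = pd.concat([flat_info, flat_stats], ignore_index=True)
print(section.tolist())
```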
Finally we take all the columns in combo and concat them to get our finaldf.
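As a rough illustration of that last step, the sketch below concatenates two made-up section columns of different lengths. The values are hypothetical. With axis=1, pandas aligns on the index and pads the shorter column with NaN; transposing then gives one row per section.

```python
import pandas as pd

# Hypothetical per-section columns, already cleaned
combo = [
    pd.Series(["Example University", "School of Nursing", "Accredited: 2015"]),
    pd.Series(["Sample College", "Accredited: 2018"]),
]

# Side by side: one column per section, shorter sections padded with NaN
finaldf = pd.concat(combo, axis=1)
print(finaldf.shape)

# Transposed: one row per section
one_row_per_section = finaldf.T
print(one_row_per_section.shape)
```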
EDIT
To loop:
from io import StringIO  # recent pandas versions expect a file-like object in read_html

urls = [
    # fix this list to however you manipulated the url for your loop
    'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1',
]
reg = re.compile(r"(<[\/]?br[\/]?>)+")
frames = []
for url in urls:
    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    tables = page_soup.find_all("table", id = "finder")
    reformattable = []
    for table in tables:
        reformattable.append(re.sub(reg, "<td>", str(table)))
    dflist = []
    for table in reformattable:
        dflist.append(pd.read_html(StringIO(str(table))))
    info = [dflist[i][0] for i in np.arange(len(dflist))]
    stats = [dflist[i][1] for i in np.arange(len(dflist))]
    adjInfo = []
    for df in info:
        adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop = True))
    adjStats = []
    for df in stats:
        df.drop(columns = 1, inplace = True)
        df.dropna(inplace = True)
        df[3] = df[0] + ' ' + df[2]
        adjStats.append(df[3])
    combo = []
    for p1, p2 in zip(adjInfo, adjStats):
        combo.append(pd.concat([p1, p2]))
    # Transpose so each section becomes one row
    df = pd.concat([combo[i] for i in np.arange(len(combo))], axis = 1).reset_index(drop = True).T
    frames.append(df)
# DataFrame.append never modified finaldf in place (and was removed in pandas 2.0),
# so collect the per-page frames and concatenate once at the end
finaldf = pd.concat(frames)
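One possible way to produce the URLs for the loop is to rebuild the query string per state. This is only a sketch: it assumes the other state pages use the same three parameters as the Alabama URL, which I have not checked, and the states mapping is a hypothetical two-entry sample.

```python
from urllib.parse import urlencode

base = "http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp"

# Hypothetical sample; extend with the states you need
states = {"AL": "Alabama", "AK": "Alaska"}

# Assumes every state page takes the same parameters as the Alabama one
urls = [
    base + "?" + urlencode({"state": code, "sFullName": name, "sProgramType": 1})
    for code, name in states.items()
]
for u in urls:
    print(u)
```

The first generated URL matches the Alabama address used above, so the list can be dropped into the loop as is.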