
How To Convert <br> Tag To A Comma/new Column When Scraping Website With Python?

I'm trying to scrape the website below. I can get all of the data I need off of it by using the code below. However, the 'br' tags are creating issues for me. I'd prefer for them t

Solution 1:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re


url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1'

page_html = requests.get(url).text
page_soup = BeautifulSoup(page_html, "html.parser")
tables  = page_soup.find_all("table", id = "finder")

reformattable = []
reg = re.compile(r"(<[\/]?br[\/]?>)+")
for table in tables:
    reformattable.append(re.sub(reg, "<td>", str(table)))

dflist = []
for table in reformattable:
    dflist.append(pd.read_html(str(table)))

info = [dflist[i][0] for i in np.arange(len(dflist))]
stats = [dflist[i][1] for i in np.arange(len(dflist))]

adjInfo = []
for df in info:
    adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop = True))

adjStats= []
for df in stats:
    df.drop(columns = 1, inplace = True)
    df.dropna(inplace = True)
    df[3] = df[0]+' ' + df[2]
    adjStats.append(df[3])

combo = []
for p1, p2 in zip(adjInfo, adjStats):
    combo.append(pd.concat([p1,p2]))

finaldf = pd.concat([combo[i] for i in np.arange(len(combo))], axis = 1)

finaldf

This gives you the output you asked for. Let's go over it.

Inspecting the page shows that each section is a table with the id finder, so we look those up with BeautifulSoup. Next, the <br> tags have to be reformatted so the data loads cleanly into a DataFrame: each run of consecutive <br> tags is replaced with a single <td> tag, which the HTML parser treats as the start of a new table cell.
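The substitution step can be tried on its own, without fetching the page. The sketch below reuses the pattern from the answer; the sample markup is made up for illustration and is not taken from the site.

```python
import re

# Same pattern as in the answer: one or more consecutive <br>, <br/> or </br> tags.
BR_RUN = re.compile(r"(<[\/]?br[\/]?>)+")

# Hypothetical cell markup, invented for this example.
sample = "<td>Some University<br>123 Main St<br/><br>Anytown, AL</td>"

# Each run of <br> tags collapses into a single <td>.
converted = BR_RUN.sub("<td>", sample)
print(converted)
```

Note that the pattern is case-sensitive and does not match a tag written with a space, such as <br />, so it may need adjusting for other pages.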

Another issue with the website is that each section is split into two tables, so pd.read_html returns two DataFrames per section. To make cleaning easier, I separated them into two lists of DataFrames, info and stats.

adjInfo and adjStats hold the cleaned DataFrames. Next we recombine the information into a single column for each section and put it in combo.

Finally we take all the columns in combo and concat them to get our finaldf.
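The reshaping described above can be seen on a toy example. This is only a sketch: the two small frames are hypothetical stand-ins for one parsed section, not data from the site, and the extra reset_index on the combined Series is my own addition to keep row labels unique.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one parsed "info" table: values spread over columns, with gaps.
info = pd.DataFrame({0: ["Some University", np.nan],
                     1: ["123 Main St", "Anytown, AL"]})

# Stack the columns on top of each other, drop the empty cells, renumber the rows.
stacked = pd.concat([info[c] for c in info.columns]).dropna().reset_index(drop=True)

# Hypothetical stand-in for the cleaned "stats" column of the same section.
stats = pd.Series(["Baccalaureate 2025"])

# One section becomes one Series; several sections then sit side by side as columns.
section = pd.concat([stacked, stats]).reset_index(drop=True)
final = pd.concat([section, section], axis=1)
print(final)
```

Here the same section is used twice only to show the side-by-side layout; in the real code each column comes from a different section of the page.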

EDIT

To loop:

finaldf = pd.DataFrame()
# urls is a placeholder: build this list however you generate the URLs for your loop
urls = ['http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1']
for url in urls:

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    tables  = page_soup.find_all("table", id = "finder")

    reformattable = []
    reg = re.compile(r"(<[\/]?br[\/]?>)+")
    for table in tables:
        reformattable.append(re.sub(reg, "<td>", str(table)))

    dflist = []
    for table in reformattable:
        dflist.append(pd.read_html(str(table)))

    info = [dflist[i][0] for i in np.arange(len(dflist))]
    stats = [dflist[i][1] for i in np.arange(len(dflist))]

    adjInfo = []
    for df in info:
        adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop = True))

    adjStats= []
    for df in stats:
        df.drop(columns = 1, inplace = True)
        df.dropna(inplace = True)
        df[3] = df[0]+' ' + df[2]
        adjStats.append(df[3])

    combo = []
    for p1, p2 in zip(adjInfo, adjStats):
        combo.append(pd.concat([p1,p2]))

    df = pd.concat([combo[i] for i in np.arange(len(combo))], axis = 1).reset_index(drop = True).T

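    # DataFrame.append returned a new frame instead of modifying finaldf,
    # and it was removed in pandas 2.0, so reassign with pd.concat.
    finaldf = pd.concat([finaldf, df])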
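One way to drive such a loop is to build the URLs from a mapping of states and to collect each page's frame in a list, concatenating once at the end. The sketch below is an assumption-laden outline: it assumes other states use the same query parameters as the Alabama URL, and scrape_one is a hypothetical placeholder for the per-page scraping code, so nothing is actually downloaded here.

```python
from urllib.parse import urlencode

import pandas as pd

BASE = "http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp"

# Assumption: every state follows the same query pattern as the Alabama URL.
states = {"AL": "Alabama", "AK": "Alaska"}
urls = [
    BASE + "?" + urlencode({"state": abbr, "sFullName": name, "sProgramType": 1})
    for abbr, name in states.items()
]


def scrape_one(url):
    # Hypothetical placeholder: the real version would run the scraping and
    # cleaning steps for this url and return one row per section.
    return pd.DataFrame({"source": [url]})


# Collect the per-page frames, then concatenate once at the end.
frames = [scrape_one(url) for url in urls]
finaldf = pd.concat(frames, ignore_index=True)
print(finaldf)
```

Concatenating once avoids copying the growing frame on every iteration, which matters when looping over many pages.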
