Template for visualizing web scraped datasets (800 Venture Capital demo)

Dataset Aug 09, 2020

Web scraping is great, but it also tends to blow up rather easily. You need to set up proxies, get around CAPTCHAs, and a lot more. Often you need databases, …

It gets complex rather fast, and then you don’t even have your results yet. Here is a minimal example of how to analyze and plot a web-scraped dataset.

We scraped a dataset that looks like:

When analyzing the dataset for quality and bias, we want to know how many VCs are in the dataset grouped by Investmentsize.

To load the data, we use a JSON endpoint, as the data will be scraped and monitored every day. Let’s use https://api.scraper.ai/api/website/02209169-a5fb-4a90-b540-9b86a751de95?api_key=10c5a982abeefecc50a68f134ff470ec&json as an example; this dataset contains the data shown above in the table.

# getting the scraped data
url = 'https://api.scraper.ai/api/website/02209169-a5fb-4a90-b540-9b86a751de95?api_key=10c5a982abeefecc50a68f134ff470ec&json'
r = requests.get(url).json()
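Because the scrape runs every day, the shape of the response may drift over time. The sketch below is a hypothetical helper (rows_from_payload is not part of any library) that assumes the rows sit under a top-level "data" key, as the later code does, and drops rows without an "Investmentsize" value. The sample payload and its "Name" field are made up for illustration.

```python
def rows_from_payload(payload, field="Investmentsize"):
    """Return the rows under payload["data"] that carry a non-empty `field`."""
    rows = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(rows, list):
        raise ValueError('expected a JSON object with a "data" list')
    return [row for row in rows if isinstance(row, dict) and row.get(field)]


# Made-up payload for illustration; the real endpoint's values may differ.
sample_payload = {
    "data": [
        {"Name": "Fund A", "Investmentsize": "Seed"},
        {"Name": "Fund B", "Investmentsize": ""},
        {"Name": "Fund C"},
        {"Name": "Fund D", "Investmentsize": "Series A"},
    ]
}

clean_rows = rows_from_payload(sample_payload)
print(len(clean_rows))
```

Passing the parsed response through such a check before building the data frame would make a missing key fail early with a readable message instead of a KeyError in the middle of the analysis.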

Now we have a dictionary with the data. To convert it to a data frame, we use pandas and take the count of each investment size.

Note that map is used to pull the Investmentsize value out of each record, yielding the investment sizes that fill the investment_size column.

# preparing the data frame
df = pd.DataFrame({"investment_size": map(lambda val: val["Investmentsize"], r["data"])})
df = df.value_counts().rename_axis('investment_size').reset_index(name='count')

The data frame now looks like:
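To inspect that shape without calling the endpoint, the same counting step can be run on a small made-up payload. This is only a sketch: the investment-size labels are hypothetical, and selecting the values into a Series before calling value_counts() is a small variation on the code above, not the original method.

```python
import pandas as pd

# Made-up payload with the same assumed shape as r; labels are hypothetical.
sample = {
    "data": [
        {"Investmentsize": "Seed"},
        {"Investmentsize": "Series A"},
        {"Investmentsize": "Seed"},
        {"Investmentsize": "Series B"},
        {"Investmentsize": "Seed"},
        {"Investmentsize": "Series A"},
    ]
}

# Pull out one investment size per record, then count each distinct value.
sizes = [row["Investmentsize"] for row in sample["data"]]
counts = (
    pd.Series(sizes, name="investment_size")
    .value_counts()
    .rename_axis("investment_size")
    .reset_index(name="count")
)
print(counts)
```

The result has one row per investment size and two columns, investment_size and count, which is the layout the plotting step expects.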

After that, we can plot the counts, for example as a bar plot. In this example seaborn is used.

# plotting
sns.set(style="ticks", palette="pastel")
sns_plot = sns.barplot(x="investment_size", y="count", data=df)
sns.despine(offset=10, trim=True)
sns_plot.get_figure().savefig("output.png")

And you’ll see an image like:

Everything put together:

import requests
import seaborn as sns
import pandas as pd

# getting the scraped data
url = 'https://api.scraper.ai/api/website/02209169-a5fb-4a90-b540-9b86a751de95?api_key=10c5a982abeefecc50a68f134ff470ec&json'
r = requests.get(url).json()

# preparing the data frame
df = pd.DataFrame({"investment_size": map(lambda val: val["Investmentsize"], r["data"])})
df = df.value_counts().rename_axis('investment_size').reset_index(name='count')

# plotting
sns.set(style="ticks", palette="pastel")
sns_plot = sns.barplot(x="investment_size", y="count", data=df)
sns.despine(offset=10, trim=True)
sns_plot.get_figure().savefig("output.png")