Gathering Company Insights by Analysing Job Postings with NLP

Sep 29, 2020

Companies evolve every day, and it's your job as a company to know what your competitors are doing. One of the most insightful ways to find out is to go through the job postings they publish.

Job postings are often overlooked, but they contain a lot of information about how a company is growing and which services it uses internally. What data can you expect to find? Consider the following:

  • Technologies used (e.g. Java, C#, ... or Oracle, MSSQL, Postgres, ...)
  • Industries being tackled or expanded towards (manufacturing, telecom, ...)
  • Locations being expanded towards geographically
  • Sales / Marketing / ... team growth
  • ...

This information matters because it lets different teams act on it:

  • Sales Team: They can find out what the customer uses internally and tune the Proof of Concept (POC) or Proof of Value (POV) accordingly to improve the sales pitch
  • Marketing Team: Which keywords should be targeted so that your sales targets pick them up?
  • Strategy: Which areas should you personally invest in to stay ahead?

So let's get started and see how we can use the Scraper.AI Proxy extractor to analyse a job posting from Google!

Creating an Account

First you need to create an account at Scraper.AI with the Proxy API feature enabled (see the Pricing tiers). This allows you to download the HTML from any page while we take care of:

  • Proxy Rotation
  • CAPTCHA solving
  • Infrastructure automation
  • JavaScript Rendering
  • Optional: Property Loading
  • Optional: Authenticated Pages (cookie support)

All you have to do is call a simple REST API endpoint and wait until the HTML is returned. The URL has the format:

https://proxy.scraper.ai/?api_key=<YOUR_API_KEY>&url=<YOUR_URL>

Downloading our Job Posting Page

Once we have an account, we can call the REST API endpoint to download the HTML of a job posting page. In this case, I chose the careers posting for a "Customer Engineer" at Google.

The URL we will target will then look like this:

https://proxy.scraper.ai/?api_key=<YOUR_API_KEY>&url=https://careers.google.com/jobs/results/120678649603990214-customer-engineer-looker-google-cloud/?company=Google&company=Google%20Fiber&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=customer%20engineer&sort_by=relevance
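Note that the target URL carries its own query string (`company=...`, `q=...`, and so on). When it is appended as-is, those `&`-separated parameters can be read as parameters of the proxy call rather than of the job page. A minimal sketch of one way to avoid that, assuming the proxy decodes the `url` parameter like any standard query value (the helper name `buildProxyUrl` is ours, not part of the Scraper.AI API):

```javascript
// Hypothetical helper: builds the proxy URL and percent-encodes the target URL.
// Assumption: the proxy decodes the `url` query parameter in the standard way.
function buildProxyUrl(apiKey, targetUrl) {
    const proxyUrl = new URL('https://proxy.scraper.ai/');
    // searchParams.set() encodes reserved characters such as & ? = for us
    proxyUrl.searchParams.set('api_key', apiKey);
    proxyUrl.searchParams.set('url', targetUrl);
    return proxyUrl.toString();
}

// Shortened example target; the real job URL works the same way
const jobUrl = 'https://careers.google.com/jobs/results/?q=customer%20engineer&sort_by=relevance';
console.log(buildProxyUrl('YOUR_API_KEY', jobUrl));
```

Parsing the result back with `new URL(...).searchParams.get('url')` should return the original job URL unchanged, which is a quick way to check the encoding.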

Once we have constructed this URL, we can download it in the language of our choice. In my case I will use JavaScript through Node.js and the node-fetch library (version 2; version 3 is ESM-only and cannot be loaded with require).

const fetch = require('node-fetch');

async function start() {
    const apiKey = "YOUR_API_KEY";
    const url = "https://careers.google.com/jobs/results/120678649603990214-customer-engineer-looker-google-cloud/?company=Google&company=Google%20Fiber&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=customer%20engineer&sort_by=relevance";
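    // Percent-encode the target URL so its own query string is not parsed as proxy
    // parameters (assumes the proxy decodes `url` as a standard query value)
    const fullURL = `https://proxy.scraper.ai/?api_key=${apiKey}&url=${encodeURIComponent(url)}`;

    const res = await fetch(fullURL);
    if (!res.ok) {
        throw new Error(`Proxy request failed with status ${res.status}`);
    }
    const html = await res.text();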
    console.log(html);
}

start().catch(e => console.error(e));

Extracting the Job Text

Once we have the HTML, we have to pre-process it to get the text that we want to analyse for keywords.

To make this easy, we can use a small Node.js library named textract, which we install with:

npm install textract

We then call it to return the text from our HTML:

const fetch = require('node-fetch');
const textract = require('textract');

async function start() {
    const apiKey = "YOUR_API_KEY";
    const url = "https://careers.google.com/jobs/results/120678649603990214-customer-engineer-looker-google-cloud/?company=Google&company=Google%20Fiber&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=customer%20engineer&sort_by=relevance";
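    // Percent-encode the target URL so its own query string is not parsed as proxy
    // parameters (assumes the proxy decodes `url` as a standard query value)
    const fullURL = `https://proxy.scraper.ai/?api_key=${apiKey}&url=${encodeURIComponent(url)}`;

    const res = await fetch(fullURL);
    if (!res.ok) {
        throw new Error(`Proxy request failed with status ${res.status}`);
    }
    const html = await res.text();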

    const text = await extractText(html);
    console.log(text);
}

async function extractText(html, mime = "text/html") {
    const buffer = Buffer.from(html, 'utf8');

    return new Promise((resolve, reject) => {
        textract.fromBufferWithMime(mime, buffer, (err, txt) => {
            if (err) {
                return reject(err);
            }

            return resolve(txt);
        })
    });
}

start().catch(e => console.error(e));

Analysing the Key Phrases

Once the text has been extracted, the only thing left to do is to call a text analyser to extract the key phrases. For this article, I chose an Azure Cognitive Service named "Text Analytics" that can do this for us.
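The article does not include the code for this call, so here is a minimal sketch of what it could look like against the Text Analytics v3.0 REST API. The endpoint and key are placeholders for your own Azure resource, the function names are ours, and the sketch assumes Node.js 18+ where `fetch` is available globally:

```javascript
// Placeholders: replace with the endpoint and key of your own Azure resource
const AZURE_ENDPOINT = 'https://<YOUR_RESOURCE>.cognitiveservices.azure.com';
const AZURE_KEY = 'YOUR_AZURE_KEY';

// Build the JSON body for the keyPhrases operation (one document per posting)
function buildKeyPhraseBody(text, language = 'en') {
    return { documents: [{ id: '1', language, text }] };
}

// Send the text to the service and return the array of key phrases.
// Assumes Node.js 18+ (global fetch); with node-fetch v2, require it instead.
async function extractKeyPhrases(text) {
    const res = await fetch(`${AZURE_ENDPOINT}/text/analytics/v3.0/keyPhrases`, {
        method: 'POST',
        headers: {
            'Ocp-Apim-Subscription-Key': AZURE_KEY,
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(buildKeyPhraseBody(text)),
    });
    if (!res.ok) {
        throw new Error(`Text Analytics request failed with status ${res.status}`);
    }

    const result = await res.json();
    // Per-document problems are reported in `errors`, not via the HTTP status
    if (result.errors && result.errors.length > 0) {
        throw new Error(`Text Analytics error: ${JSON.stringify(result.errors[0])}`);
    }
    return result.documents[0].keyPhrases;
}
```

In the earlier script this would be called as `const phrases = await extractKeyPhrases(text);`. The service limits the size of each document, so a long posting may need to be shortened or split before sending; check the limits documented for your tier.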

We can then send the text from the previous step to this service and extract the key phrases, which results in:

{
    "documents": [
        {
            "id": "1",
            "keyPhrases": [
                "data analysis",
                "Google employees",
                "Location Google",
                "data visualization",
                "Google Cloud Google Dublin",
                "data discovery",
                "powers data experiences",
                "data-driven decisions",
                "job Looker",
                "live product demonstrations",
                "Qualifications Minimum qualifications",
                "actionable business insights",
                "value of Looker's products",
                "Business Intelligence tool",
                "disability",
                "Preferred qualifications",
                "SQL",
                "customized presentations",
                "Customer Engineer",
                "unsolicited resumes",
                "forward resumes",
                "agency resumes",
                "Proficiency",
                "customer pilots",
                "compiling C-level presentations",
                "equal opportunity workplace",
                "equal employment opportunity",
                "complex spreadsheets",
                "marital status",
                "unified platform",
                "better insights",
                "intricate spreadsheets",
                "Veteran status",
                "technical prospects",
                "services teams",
                "organization location",
                "daily workflows of users",
                "Google's EEO Policy",
                "Ireland",
                "national origin",
                "sexual orientation",
                "criminal histories",
                "religion",
                "color",
                "race",
                "citizenship",
                "phone",
                "multiple channels",
                "gender identity",
                "DataStudio",
                "ancestry",
                "professionals",
                "prototypes",
                "cookies",
                "software",
                "jobs alias",
                "organizations",
                "mission",
                "depth of knowledge",
                "use cases",
                "developers",
                "legal requirements",
                "affirmative action employer",
                "company",
                "site",
                "databases",
                "Tableau",
                "fees",
                "market needs",
                "qualified applicants",
                "Privacy Terms",
                "database manipulation",
                "architects",
                "sales",
                "marketing",
                "special need",
                "accommodation",
                "architectural concepts",
                "direction",
                "Law",
                "buyers",
                "breadth",
                "point of decision",
                "Responsibilities",
                "web-scale",
                "variety of audiences",
                "recruitment agencies",
                "reporting dashboards",
                "German",
                "resulting solution",
                "technologists",
                "trials",
                "English",
                "traffic",
                "role"
            ],
            "warnings": []
        }
    ],
    "errors": [],
    "modelVersion": "2020-07-01"
}

This shows us the key phrases detected for this job posting.

Of course this does not show everything, but when we apply this to multiple job postings, we can detect trends in what is happening or changing over time.
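Many of these phrases come from the legal footer of the page rather than from the role itself. One possible way to get to the interesting ones is to keep only the phrases that mention a term from a watch-list. The sketch below assumes a hand-picked list; both the list and the function name `matchWatchlist` are illustrative:

```javascript
// Hypothetical watch-list of technologies we care about (lowercase)
const WATCHLIST = ['sql', 'tableau', 'datastudio', 'looker'];

// Keep only the key phrases that contain a watch-list term as a whole word
function matchWatchlist(keyPhrases, watchlist = WATCHLIST) {
    return keyPhrases.filter((phrase) => {
        // Split on anything that is not a letter, digit, '#' or '+' (keeps "c#", "c++")
        const words = phrase.toLowerCase().split(/[^a-z0-9#+]+/);
        return watchlist.some((term) => words.includes(term));
    });
}

// A few phrases taken from the response above
const sample = ['data analysis', 'SQL', 'job Looker', 'Tableau', 'marital status', 'DataStudio'];
console.log(matchWatchlist(sample));
// [ 'SQL', 'job Looker', 'Tableau', 'DataStudio' ]
```

Matching on whole words avoids false hits such as "sql" inside an unrelated longer word, at the cost of missing multi-word terms; those would need a substring check instead.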
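To make the idea of trends a bit more concrete, here is a minimal sketch that counts in how many postings each key phrase appears. It assumes you have already collected the `keyPhrases` array of every posting; the function name `countPhrases` and the sample data are illustrative only:

```javascript
// Count in how many postings each key phrase occurs (case-insensitive).
// `postings` is an array of keyPhrases arrays, one per job posting.
function countPhrases(postings) {
    const counts = new Map();
    for (const keyPhrases of postings) {
        // A Set ensures a phrase is counted once per posting
        const unique = new Set(keyPhrases.map((phrase) => phrase.toLowerCase()));
        for (const phrase of unique) {
            counts.set(phrase, (counts.get(phrase) || 0) + 1);
        }
    }
    // Most frequent phrases first
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Illustrative input, not real data
const postings = [
    ['SQL', 'Tableau'],
    ['sql', 'Looker'],
    ['SQL', 'SQL'],
];
console.log(countPhrases(postings)[0]);
// [ 'sql', 3 ]
```

Running this per month or per quarter and comparing the counts is one simple way to see which technologies or locations a company is starting to mention more often.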

We want to hear from you

Let us know what you think of this article and how you are using our services! We are constantly striving to improve our services and would love to learn more about your use cases.
