Self-healing released

Oct 22, 2020

How we ensure that scraped records stay functional.

When you scrape a website, it's inevitable that the data you selected will change. Not only the data itself changes, but also the structure behind it: the DOM.

But why does it change?

Changes to websites are inevitable. Just as we update Scraper.AI regularly, the engineering team behind a website will update it too. Those adjustments are needed to introduce new features, fix bugs, and so on.

Changes that website engineers often make can be categorized as:

  • Structural: The structure of the document changes: containers are moved, text is added, ...
  • Styling: The appearance of elements changes (e.g. the text is given a new color).
  • Interactive: The way you interact with something changes. For example, clicking a button now opens a popup instead of shifting the page.

However, even tiny structural alterations can stop a previously working scrape from working, requiring you to manually update everything all over again.

This is where self-healing comes in

When a path stops working, self-healing will detect this and try to fix it. It might do so by finding the exact piece of text you asked for, or by searching the area close to the original location when your selection is no longer where it used to be.
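To make the first strategy concrete, here is a minimal sketch in Python. It is not Scraper.AI's implementation. It assumes the expected text is known and unchanged, uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath), and the helper names `build_path` and `heal` as well as the two sample pages are hypothetical. The proximity search is not covered.

```python
import xml.etree.ElementTree as ET


def build_path(root, target):
    """Build a positional path from root to target, e.g. './body[1]/div[1]/p[1]'."""
    # ElementTree has no parent pointers, so map every child to its parent first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        siblings = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{siblings.index(node) + 1}]")
        node = parent
    return "/".join(["."] + steps[::-1])


def heal(root, stored_path, expected_text):
    """Return (path, was_healed); path is None when the text cannot be found."""
    node = root.find(stored_path)
    if node is not None and (node.text or "").strip() == expected_text:
        return stored_path, False  # the stored path still works
    # Fallback: scan the whole document for the exact text that was selected.
    for candidate in root.iter():
        if (candidate.text or "").strip() == expected_text:
            return build_path(root, candidate), True
    return None, False


# Hypothetical page before and after a site update that removed the first container.
BEFORE = "<html><body><div><p>Ad</p></div><div><p>Price: 10</p></div></body></html>"
AFTER = "<html><body><div><p>Price: 10</p><span>New</span></div></body></html>"

print(heal(ET.fromstring(BEFORE), "./body/div[2]/p", "Price: 10"))
print(heal(ET.fromstring(AFTER), "./body/div[2]/p", "Price: 10"))
```

The first call keeps the stored path because it still resolves to the expected text. The second call no longer finds the second container, so it falls back to the text search and returns a freshly built path.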

In technical terms

Technically, it might look like this. Let's assume that we selected some text in a container and obtained the following XPath:

/html/body/div[8]/div[3]/div/div[3]/div[1]/div[1]/div[1]/ul/li/div/section/div/div/div[2]
The previously working XPath

However, the site pushes an update and changes its structure, which breaks that XPath on the next scrape. Our systems will now detect this and try to self-heal it.

The "healed" path now looks like this:

/html/body/div[7]/div[3]/div/div[3]/div/div/div/ul/li[1]/div/section/div/div/div[2]/div[1]
The adjusted path

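If we compare the two paths, quite a lot has changed. The altered steps are the div[7] near the start (previously div[8]), the three div steps that lost their [1] index, li[1] (previously li), and the new trailing div[1]:

/html/body/div[7]/div[3]/div/div[3]/div/div/div/ul/li[1]/div/section/div/div/div[2]/div[1]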

These are subtle changes, but they make a world of difference. In our terms, they are the difference between a scrape that works and one that doesn't.
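The comparison between the two paths can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `steps` and `changed_steps` are hypothetical, and comparing steps position by position is a simplification that works here because the two paths line up, apart from the extra step at the end.

```python
from itertools import zip_longest

OLD = "/html/body/div[8]/div[3]/div/div[3]/div[1]/div[1]/div[1]/ul/li/div/section/div/div/div[2]"
NEW = "/html/body/div[7]/div[3]/div/div[3]/div/div/div/ul/li[1]/div/section/div/div/div[2]/div[1]"


def steps(xpath):
    """Split an absolute XPath into its location steps."""
    return [step for step in xpath.split("/") if step]


def changed_steps(old, new):
    """Return (position, old_step, new_step) for every differing step (1-based)."""
    pairs = zip_longest(steps(old), steps(new))
    return [(i, a, b) for i, (a, b) in enumerate(pairs, start=1) if a != b]


for position, before, after in changed_steps(OLD, NEW):
    print(f"step {position}: {before or '(none)'} -> {after or '(none)'}")
```

Note that in XPath a step such as div selects every div child, while div[1] selects only the first one, so the two are interchangeable only when the parent has a single div child.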

Conclusion

Self-healing was introduced to boost reliability. It's an important feature that will prevent you from having to make updates to previously selected paths.
