Web Scraping

Web Scraping Definition

Web scraping is the process of pulling data from a website using bots. Unlike screen scraping, which merely copies the pixels displayed onscreen, web scraping extracts the underlying HTML code and the data stored with it, so whoever holds that data can replicate the content of the whole website.

Image shows web scraping bot extracting data from websites in an unstructured html format.

Web Scraping FAQs

What is Web Scraping?

Web scraping is a technique for extracting massive amounts of data from websites automatically—mostly in an unstructured, HTML format. Web scraping is also called web data extraction or web harvesting.

Although web scraping and screen scraping are sometimes referred to as if they are the same, they each extract different kinds of data, with only web scraping collecting data past the screen level.

There are also online web scraping services, as well as custom code for building a web scraper from scratch. Many larger sites, such as Google, provide access to their data in a structured format through APIs.

Web scraping is a kind of data mining. Users of web scraping software are often hoping to harness data for their own purposes, but they may also aim to sell data or use it for promotional purposes. Items such as auction prices and details, weather reports, product listings and descriptions, or other data are commonly sought.

Because some websites do not allow this kind of data mining or web scraping, the practice is controversial. However, web scraping remains an important way of using aggregated data resources.

How Does Web Scraping Work?

There are two pieces to web scraping: the crawler and the scraper. The crawler, an automated program sometimes described as an artificial intelligence algorithm, follows links to browse the web for specific data. The scraper extracts data from the website, and its design affects how accurately and quickly it works. In general, the more specific the scraper is about the data it targets, the faster it runs.

As an example, you might want to scrape a consumer sales web page for the kinds of smartphones available. You certainly want the model numbers, but you probably don’t need customer reviews; limiting the scrape improves the speed.
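
As a minimal sketch of such a targeted scrape, the Python below pulls only the model numbers out of a product listing and skips the reviews. The sample markup, the class names, and the model numbers are all hypothetical, and the standard-library html.parser module is used only to keep the example self-contained.

```python
from html.parser import HTMLParser

# Hypothetical product listing; the markup and class names are invented.
SAMPLE_HTML = """
<ul>
  <li class="product">
    <span class="model">X100</span>
    <div class="reviews"><p>Great battery life.</p><p>Screen scratches easily.</p></div>
  </li>
  <li class="product">
    <span class="model">X200 Pro</span>
    <div class="reviews"><p>Excellent camera.</p></div>
  </li>
</ul>
"""


class ModelNumberScraper(HTMLParser):
    """Collects text only from <span class="model"> elements."""

    def __init__(self):
        super().__init__()
        self.models = []
        self._in_model = False

    def handle_starttag(self, tag, attrs):
        # Start capturing only when the targeted element opens.
        if tag == "span" and dict(attrs).get("class") == "model":
            self._in_model = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_model = False

    def handle_data(self, data):
        # Review text never reaches this branch, so it is never stored.
        if self._in_model and data.strip():
            self.models.append(data.strip())


def scrape_models(html):
    parser = ModelNumberScraper()
    parser.feed(html)
    parser.close()
    return parser.models


print(scrape_models(SAMPLE_HTML))
```

Because the parser stores text only while it is inside the targeted element, the review paragraphs are never collected, which is the kind of limiting the example above describes.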

This knowledge also helps detect and stop web scraping. Understanding the specific kinds of data scrapers are targeting on a site and how they are likely to target it helps to protect it.

There are several types of web scraping:

  • Self-built or manual web scrapers. These demand sophisticated programming skills, but can also achieve many different goals and may be difficult to detect.
  • Browser extensions. These are simply added to and integrated with a web browser, so they are limited by the browser itself.
  • Cloud web scraping. These web scrapers run on an off-site server in the cloud, independent of local resources.
  • Real-time or dynamic web scraping. This is the process of scraping data from sites as they change in real time.

Any of these can be deployed against any site.

Why is Python a popular programming language for web scraping?

Python with or without regular expressions (RegEx) is among the most popular web scraping languages. Python has a variety of libraries that were created specifically for web scraping and it easily handles most of the processes.

In addition, Beautiful Soup is a Python library designed for web scraping, and Scrapy is a popular open-source web scraping framework. Both are well suited to extracting data from HTML documents.

What Are the Advantages of Web Scraping?

What is web scraping used for? Web scraping has advantages in various applications across many industries, and it has legitimate and illegal uses.

The benefits of web scraping include some of its legitimate uses:

  • Search engine bots crawl sites to analyze, index, and rank their content
  • News and financial sites use web scraping for news monitoring
  • Price comparison sites use bots to auto-fetch product descriptions and prices from allied seller websites
  • Market research professionals collect social media posts and other scraped data from forums for sentiment analysis
  • Real estate bots extract listings from site to site
  • Email marketing businesses use web scraping to build contact lists

However, web scraping is also used for malicious purposes, including theft of intellectual property and undercutting prices. The baseline test for malicious web scraping is permission: when data is extracted without the permission of its owners, as in content theft, IP theft, or price scraping, the scraping is malicious. Here are some examples:

Price scraping. This kind of web scraping allows perpetrators to target websites of competitors. They can scrape prices in real-time to continuously undercut the competition and mimic their strategies.

Content scraping. Large-scale content theft is common, especially targeting news sites and online product catalogs. For organizations that rely on generating fresh content, enterprise web scraping can be a serious problem.

Phishing and other attacks. Web scraping leaves sites and businesses vulnerable to spear phishing, social engineering, and other attacks, and can otherwise leak sensitive data. It can allow hackers to learn details like supervisor names, titles of ongoing projects, and the identities of trusted partners, vendors, and organizations.

Web Scraping vs Web Crawling

What is the difference between web scraping and web crawling? In brief, while web crawling is about discovering or finding URLs on the web, web scraping focuses on data extraction from one or more websites. The web scraping process typically combines web crawling and scraping.
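
The split between the two roles can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production crawler: the three-page site is hypothetical and held in memory so the example runs offline, and the `fetch` argument stands in for whatever HTTP client a real tool would use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site held in memory; a real crawler would fetch over HTTP.
SITE = {
    "https://shop.example/": '<title>Home</title><a href="/phones">Phones</a>',
    "https://shop.example/phones": (
        '<title>Phones</title><a href="/phones/x100">X100</a><a href="/">Home</a>'
    ),
    "https://shop.example/phones/x100": "<title>X100</title><p>Price: 199</p>",
}


class PageParser(HTMLParser):
    """Collects outgoing links (for crawling) and the title (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_and_scrape(start_url, fetch):
    """Breadth-first crawl from start_url; returns {url: page title}."""
    seen, queue, scraped = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        scraped[url] = parser.title           # scraping: extract data
        for href in parser.links:             # crawling: discover URLs
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return scraped


titles = crawl_and_scrape("https://shop.example/", SITE.get)
print(titles)
```

The `seen` set keeps the crawler from revisiting a page when links loop back, as the link from the phones page to the home page does here.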

And what is the difference between data scraping and web scraping? Data scraping merely refers to detecting and extracting data, so essentially these are two ways of saying the same thing.

Data Mining vs Web Scraping?

Data mining, however, is slightly different from either data scraping or web scraping.

While web scraping involves gathering data from websites and structuring it in a usable format, it does not involve reviewing or analyzing that data. In contrast, data mining refers to the analysis of large data sets to identify useful patterns and other information. Data mining does not itself require extracting the data.

How to Detect Web Scraping

Web scraping tools are bots or other software programmed to extract information from websites and the databases behind them. Many are entirely customizable: they can identify unique HTML structures, extract data from APIs, extract and transform content, and store scraped data.

However, it can be difficult to distinguish between legitimate and malicious bots because they all share the same basic feature: web scraping automation to more easily access website data.

Still, there are two important differences:

  • Legitimate bots identify themselves with the organization they scrape for, rather than impersonating ordinary traffic. Malicious bots, for example, may send false HTTP user agents.
  • Legitimate bots follow the rules set by a site’s robots.txt file, while malicious scrapers crawl anywhere on the website.
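
To make the second difference concrete, here is a minimal sketch of how a well-behaved bot might check robots.txt rules before fetching a page, using Python's standard urllib.robotparser module. The rules, the user-agent string, and the URLs are hypothetical, and the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary shop.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /checkout/
"""

# A legitimate bot names itself honestly instead of posing as a browser.
AGENT = "ExampleResearchBot/1.0 (+https://bot.example/info)"


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object that can be queried."""
    policy = RobotFileParser()
    policy.parse(robots_txt.splitlines())
    return policy


policy = build_policy(ROBOTS_TXT)

for url in ("https://shop.example/phones",
            "https://shop.example/account/orders"):
    verdict = "fetch" if policy.can_fetch(AGENT, url) else "skip"
    print(verdict, url)
```

A malicious scraper simply omits this check, which is why requests for disallowed paths are a useful signal when trying to tell the two kinds of bot apart.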

Users require substantial resources to run web scraper bots. Legitimate operators invest heavily in servers that can process the data they extract. Malicious operators might instead rely on specialized, distributed web scraping architecture—computers infected with the same malware that are geographically dispersed but controlled from a central location.

This allows a single perpetrator with a lower budget to run a botnet with the combined power of infected systems. It also provides a pattern for sophisticated web scraping protection tools to look for.

Advanced Web Scraping Solutions and Web Scraping Tools

Malicious web scraping software and scraper bots are becoming more advanced. This renders many common security measures ineffective. Real-time, granular traffic analysis helps ensure that both human and bot traffic to a website is legitimate. This might include close inspection of HTTP headers and header signatures, IP reputation analysis, behavior analysis (including tracking illogical browsing patterns and the rate of HTTP requests), and the use of progressive challenges.
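
One of the signals mentioned above, the rate of HTTP requests, can be tracked with a sliding window per client. The sketch below is illustrative only: the class name, the thresholds, and the IP addresses are assumptions, and timestamps are passed in explicitly so the behavior is deterministic. A real deployment would combine this signal with the others listed above.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flags clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client IP -> recent timestamps

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client is over the limit."""
        window = self._history[client_ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_requests


monitor = RequestRateMonitor(max_requests=5, window_seconds=1.0)

# A scraper-like client: ten requests, 50 ms apart.
bot_flags = [monitor.record("203.0.113.7", i * 0.05) for i in range(10)]

# A human-like client: one request every two seconds.
human_flags = [monitor.record("198.51.100.4", t) for t in (0.0, 2.0, 4.0)]

print(bot_flags)
print(human_flags)
```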

Legitimate web scraping techniques support threat intelligence and enable security teams to meet these challenges. These kinds of web scraping companies target sites including darknet sites and forums to scrape for cyber threat intelligence.

Does Avi Protect Against Web Scraping?

Even if your competitors or attackers automate web scraping, Avi offers customized protection and visibility into attacks in order to stop ongoing ones. By delivering software load balancers, container ingress, and web application firewall services, Avi also keeps applications available, secure, and responsive. Avi also provides scaling capacity that natively mitigates DDoS attacks.

Learn more about how Avi protects against web scraping here.

For more on the actual implementation of load balancing, security applications and web application firewalls check out our Application Delivery How-To Videos.

Web Application API Protection (WAAP)

Web Application API Protection (WAAP) Definition

According to Gartner, cloud web application and API protection (WAAP) services are an evolution of cloud WAF services. WAAP services combine a subscription model with as-a-service, cloud-delivered deployment of bot mitigation, WAF, API security, and DDoS protection.

Some WAAP security providers offer managed services as core components of the product. Many vendors offer multiple versions of their WAAP products, sometimes divided into highly configurable offerings and ready-to-go, simple-to-use versions.

This image depicts the core components of web application api protection (WAAP): subscription model with as-a-service, cloud-delivered deployment of bot mitigation, WAF, API security, and DDoS protection.

WAAP FAQs

What is Web Application API Protection?

Web applications are a core component of the cloud infrastructure for many organizations. A web application is a program that users can access via a web browser, and it may also provide programmatic access to the application’s key capabilities via application programming interfaces (APIs). For this reason, web applications are central to cloud services, but also present a serious set of performance and security challenges.

Gartner analysts Adam Hils and Jeremy D’Hoinne coined the term WAAP to describe any suite of cloud-based services designed with the protection of APIs and web applications as its primary goal.

Cloud web application and API protection services offer multiple models for security, based on a multi-tenant, auto-scaling cloud infrastructure. Core cloud WAAP security features include API protection, bot mitigation, protection against DDoS, and web application firewalls (WAFs).

Cloud WAAP services sometimes provide additional features that can enhance the performance of web applications. Each module can have its own strategy for security protection.

Why is WAAP Important?

APIs and web applications are a primary target for attackers because they provide access to sensitive data and are available via the public Internet. WAAP is essential because traditional security solutions don’t protect these applications effectively.

As enterprise web applications evolve, WAF vendors are enhancing their cloud WAF tools and services to meet WAAP requirements. There are several reasons why traditional solutions fail to effectively protect web applications:

Port-based blocking is ineffective

Traditional firewalls filter traffic based on the ports and protocols in use. However, attackers use the same web ports and protocols as legitimate users, such as HTTP(S), against web APIs and web applications, so filtering out malicious traffic by this method alone is infeasible. To distinguish legitimate traffic from potential attacks against web applications and APIs, a more granular level of inspection is required.

Signature-based attack detection also fails

Threats to web applications change continuously, so signature-based solutions do not scale. With real-time insights and continuous self-learning, WAAP solutions help organizations stay ahead of a constantly developing application security threat environment.

Encrypted traffic inspection is critical

Over half of all modern web traffic uses TLS encryption, which heightens privacy but presents a challenge for detecting malicious content such as malware. WAAP solutions can identify malicious content and sensitive data hidden in encrypted traffic as they inspect TLS connections.

HTTP traffic is complex

Web applications are intricate, and cybercriminals use this complexity to conceal malicious content. Conventional intrusion detection and prevention systems (IDS/IPS) offer inadequate tools for guarding against these threats.

Cloud hosting architecture is popular

Hosting web applications in the cloud offers clear benefits, particularly when they serve users across disparate geographic regions, because it minimizes potential latency and bottlenecks. This popularity also prompts solution providers to offer cloud-native application security solutions.

Positive security models have not been effective

Traditional WAF technology has demanded serious manual tuning and configuration, rather than learning in real time to create usable parameters and allow lists for URLs automatically.
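
As a rough illustration of what learning an allow list automatically could look like, the Python sketch below records the URL paths and parameter names seen during a learning phase and then rejects anything outside them. The class and method names are hypothetical, and real WAF learning modes are far more involved, so treat this as a sketch of the positive security idea only.

```python
from urllib.parse import parse_qsl, urlsplit


def _path_and_params(url):
    """Split a URL into its path and the set of query parameter names."""
    parts = urlsplit(url)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return parts.path, names


class AllowListLearner:
    """Learns allowed paths and parameter names from known-good traffic."""

    def __init__(self):
        self.allowed = {}  # path -> set of permitted parameter names

    def observe(self, url):
        """Learning phase: remember what legitimate requests look like."""
        path, names = _path_and_params(url)
        self.allowed.setdefault(path, set()).update(names)

    def is_allowed(self, url):
        """Enforcement phase: permit only what was seen while learning."""
        path, names = _path_and_params(url)
        return path in self.allowed and names <= self.allowed[path]


learner = AllowListLearner()
for good_url in ("/search?q=phone&page=2", "/product?id=7"):
    learner.observe(good_url)

print(learner.is_allowed("/search?q=tablet"))       # known path and parameter
print(learner.is_allowed("/search?q=x&debug=1"))    # unknown parameter
print(learner.is_allowed("/admin"))                 # unknown path
```

Everything not explicitly learned is denied, which is what distinguishes a positive security model from signature matching against known attacks.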

Modern web applications change often

DevOps and agile practices mean that modern APIs and web applications are always in flux. The manual tuning and custom rule creation that traditional WAFs demand are not well suited to the way that applications constantly and quickly evolve.

A multi-cloud strategy is essential

Each cloud provider uses a unique architecture and offers different features. To achieve effective security controls, organizations operating across multiple clouds need to weave an intricate matrix of cross-provider capabilities. Cloud-based WAAP services are more adapted to a multi-cloud strategy and environment.

Key Features of WAAP Services

Complete web application and API protection services ensure APIs and web applications remain safe from a wide range of attacks. A WAAP service must identify and analyze requests before they access the application or API endpoint.

The core features and capabilities of a comprehensive, effective WAAP strategy include:

Next-generation web application firewall (next-gen WAF)

A next-gen WAF monitors and protects web applications from a wide range of attacks at the application layer, where they are deployed. In contrast to a traditional WAF, a next-gen WAF uses artificial intelligence (AI), machine learning (ML), and/or behavioral analysis, not just manual security rules or known attack patterns, to prevent attacks on apps and APIs.

Malicious bot protection

This type of protection isolates suspicious bots and stops them from attacking while allowing safe bot traffic to reach the application.

Runtime application self-protection (RASP)

RASP is embedded in the application runtime environment and defends web applications and APIs in real time.

Comprehensive protection against distributed denial-of-service (DDoS) attacks

WAAP solutions scale up to safeguard against massive DDoS attacks targeting the application and network layers of APIs, applications, and microservices.

Individual protection for APIs and microservices

WAAP strategies retain security within the application, microservice, or serverless function to surround all individual services with micro perimeters that are data- and context-aware.

Load balancing

Load balancing distributes incoming traffic across multiple servers so that applications remain available and responsive, even under heavy load.

Advanced rate limiting

This enhances API and website performance by preventing abusive activity at the application level.
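
A token bucket is one common way to implement rate limiting. The text above does not say which algorithm any particular WAAP service uses, so the sketch below is an illustrative assumption; the class name and the numbers are hypothetical, and time is passed in explicitly to keep the example deterministic (a live system might use time.monotonic() instead).

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = None               # time of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        if self.last is not None:
            elapsed = now - self.last
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=3)

# Four requests at the same instant: the burst of three passes, the fourth waits.
burst = [bucket.allow(0.0) for _ in range(4)]

# Two seconds later, two tokens have been refilled.
later = [bucket.allow(2.0) for _ in range(3)]

print(burst)
print(later)
```

In practice a limiter like this would be kept per client or per API key, so one abusive caller cannot exhaust the allowance of everyone else.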

Account takeover protections

This aspect of web application and API protection uses an application’s customer-facing authentication process or authentication APIs to detect unauthorized access to customer accounts. Account takeover protection prevents cybercriminals from using lost, stolen, or otherwise compromised credentials from password lists and data dumps.

How to Implement Web Application and API Protection (WAAP)

There are several challenges to implementing WAAP strategies and tools.

Concern about legal liability, cultural and regulatory constraints and old-fashioned organizational pushback can all hamper the adoption of cloud WAAP services and other cloud-based security services. Finding enough common ground between the budget and the pricing model and SLAs of possible providers is another key hurdle.

Another sensitive area is the need to allow a third-party cloud solution to manage application secret keys, decrypt TLS connections, and log sensitive client data, which might fall under the purview of data residency conditions.

Any cloud WAAP solution adopted by an organization ultimately has to be integrated into the current incident response workflow. The ease or possibility of this will be based on which security information and event management (SIEM) tool is already in place.

Along these lines, technical architecture presents an additional challenge, especially for bespoke WAAP services that are not built on established WAF solutions. These WAAP solutions can lack integration with SIEM, application security testing (AST), and the rest of the enterprise ecosystem. Many also offer only limited configuration and log retention options. Cloud consoles for WAAP monitoring may not offer access to logs in real time.

Finally, solution maturity is a factor in how effective cloud WAAP services are. Many are missing some key features that WAF appliances provided, such as cookie signing, form protection, and cross-site request forgery (CSRF) tokens. For organizations searching for a lift-and-shift approach to their cloud application security strategy, this slows uptake, because they already rely on these techniques.

Does VMware NSX Advanced Load Balancer Offer a WAAP Security Solution?

Yes. VMware NSX Advanced Load Balancer’s software-defined application services platform provides a comprehensive web application security architecture, including DDoS mitigation, SSL/TLS encryption, load balancing, bot management, ACLs, and application rate limiting. It also features an Intelligent Web Application Firewall with a distributed security fabric that enforces security through closed-loop analytics, and a WAF learning mode. The WAF covers Open Web Application Security Project (OWASP) Core Rule Set (CRS) protection, support for compliance regulations such as PCI DSS, HIPAA, and GDPR, and signature-based detection.

Pulse cloud services provide new threat updates including IP reputation, bot detection, CRS signatures and more, and minimize false positives with advanced application security analytics, detection, and enforcement modes to detect common application vulnerabilities. VMware NSX Advanced Load Balancer provides an optimized security pipeline to maximize the efficiency for traditionally resource-intensive operations. With real-time app security insights and analytics, the VMware NSX Advanced Load Balancer provides actionable insights on performance, end-user interactions and security events in a single dashboard with end-to-end visibility.

Learn more about how VMware NSX Advanced Load Balancer’s platform delivers comprehensive protection for APIs, applications, and microservices.

Web Performance

Web performance is the speed at which web pages load in a client’s web browser. Enterprises commonly use application delivery controllers to optimize and accelerate web performance.

Using an application delivery controller like Avi can lead to drastic improvements in web performance with application acceleration, autoscaling and highly efficient network traffic management.

Web Application Firewall (WAF)

Web Application Firewall Definition

A Web Application Firewall (WAF) protects online services from malicious Internet traffic. WAFs detect and filter out threats, such as those in the OWASP Top 10, which could degrade, compromise, or bring down online applications.

Diagram depicting a web application firewall protecting web application servers from common threats such as the OWASP Top 10 which could compromise web application security.

FAQs

What Are Web Application Firewalls?

Web application firewalls assist load balancing by examining HTTP traffic before it reaches the application server. They also protect against web application vulnerabilities and unauthorized transfer of data from the web server at a time when security breaches are on the rise. According to the Verizon Data Breach Investigations Report, web application attacks were the most prevalent breaches in 2017 and 2018.

The PCI Security Standards Council defines a web application firewall as “a security policy enforcement point positioned between a web application and the client endpoint. This functionality can be implemented in software or hardware, running in an appliance device, or in a typical server running a common operating system. WAF security may be implemented using a stand-alone device or integrated into other network components.”

How Does A Web Application Firewall Work?

A web application firewall (WAF) intercepts and inspects all HTTP requests using a security model based on a set of customized policies to weed out bogus traffic. WAFs block bad traffic outright or can challenge a visitor with a CAPTCHA test that humans can pass but a malicious bot or computer program cannot.

WAFs follow rules or policies customized to specific vulnerabilities. The same rule-based filtering is how WAFs help mitigate application-layer DDoS attacks. Creating the rules on a traditional WAF can be complex and require expert administration. The Open Web Application Security Project maintains the OWASP Top 10 list of web application security flaws for WAF policies to address.
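
To give a feel for rule-based inspection, here is a deliberately tiny Python sketch that checks a request's path and query string against two hand-written patterns. The rule names and regular expressions are hypothetical simplifications invented for this example; production rule sets are far larger and more careful about false positives and evasion.

```python
import re
from urllib.parse import unquote_plus

# Toy rules for illustration only; not taken from any real rule set.
RULES = [
    (
        "sql-injection",
        re.compile(
            r"\bunion\b.+\bselect\b|'\s*or\s+'?\d+'?\s*=\s*'?\d+",
            re.IGNORECASE,
        ),
    ),
    ("xss", re.compile(r"<\s*script\b", re.IGNORECASE)),
]


def inspect_request(path_and_query):
    """Return the names of all rules that the request matches."""
    # Decode percent-encoding first so encoded payloads are still inspected.
    decoded = unquote_plus(path_and_query)
    return [name for name, pattern in RULES if pattern.search(decoded)]


def decide(path_and_query):
    """Block the request if any rule matches; otherwise allow it."""
    return "block" if inspect_request(path_and_query) else "allow"


for request in (
    "/products?id=42",
    "/products?id=1%27%20OR%20%271%27%3D%271",
    "/search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E",
):
    print(decide(request), inspect_request(request), request)
```

Even this toy version shows why hand-maintained rules take expert effort: every pattern has to be broad enough to catch variants of an attack and narrow enough not to block legitimate requests.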

WAFs come in the form of hardware appliances, server-side software, or traffic filtering delivered as a service. A WAF can be considered a reverse proxy, i.e. the opposite of a forward proxy server. Forward proxy servers protect devices from malicious applications, while WAFs protect web applications from malicious endpoints.

What Are Some Web Application Firewall Benefits?

A web application firewall (WAF) prevents attacks that try to take advantage of the vulnerabilities in web-based applications. The vulnerabilities are common in legacy applications or applications with poor coding or designs. WAFs handle the code deficiencies with custom rules or policies.

Intelligent WAFs provide real-time insights into application traffic, performance, security and threat landscape. This visibility gives administrators the flexibility to respond to the most sophisticated attacks on protected applications.

When the Open Web Application Security Project identifies the OWASP top vulnerabilities, WAFs allow administrators to create custom security rules to combat the list of potential attack methods. An intelligent WAF analyzes the security rules matching a particular transaction and provides a real-time view as attack patterns evolve. Based on this intelligence, the WAF can reduce false positives.

What Is the Difference Between a Firewall and a Web Application Firewall?

A traditional firewall protects the flow of information between servers while a web application firewall is able to filter traffic for a specific web application. Network firewalls and web application firewalls are complementary and can work together.

Traditional security methods include network firewalls, intrusion detection systems (IDS) and intrusion prevention systems (IPS). They are effective at blocking bad traffic at the perimeter on the lower layers (L3-L4) of the Open Systems Interconnection (OSI) model. Traditional firewalls cannot detect attacks in web applications because they do not understand Hypertext Transfer Protocol (HTTP), which operates at layer 7 of the OSI model. They can also only open or close the port that sends and receives requested web pages from an HTTP server. This is why web application firewalls are effective for preventing attacks like SQL injections, session hijacking and Cross-Site Scripting (XSS).

When Should You Use a Web Application Firewall?

Any business that uses a website to generate revenue should use a web application firewall to protect business data and services. Organizations that use online vendors should especially deploy web application firewalls because the security of outside groups cannot be controlled or trusted.

How Do You Use a Web Application Firewall?

A web application firewall requires correct positioning, configuration, administration and monitoring. Web application firewall installation must include the following four steps: secure, monitor, test and improve. This should be a continuous process to ensure application specific protection.

The configuration of the firewall should be determined by the business rules and the guardrails set by the company’s security policy. This approach lets the policy drive the rules and filters in the web application firewall.

Does VMware NSX Advanced Load Balancer Offer a Web Application Firewall?

Yes. The VMware NSX Advanced Load Balancer’s Web Application Firewall (WAF) delivers high-performance web application security with point-and-click simplicity. It enables customized policy configurations and helps achieve compliance with GDPR, HIPAA and PCI DSS. It simplifies security rules, minimizes false positives with advanced analytics and protects applications from DDoS attacks and OWASP Top 10 threats with real-time insights.

For more information on WAFs see the following resources: