What is AWS Glue and Why are Pipelines Important?

A service and its data sources are the most important components of any system, but they are of little use without the connective tissue between them. AWS Glue is a serverless data integration service that helps discover data sources, prepare their data, and route it to where it's needed. It's easy to use and supports multiple processing and workload types.

From Source to Analytics

Preparing data is a necessary step toward ensuring that analysis results are useful and arrive in a convenient format.  With AWS Glue, analysis that once took weeks can be completed in minutes, and setup requires minimal effort.

Discover

    • An Easy Catalog: AWS Glue logs all metadata kept in data stores on an AWS account, regardless of location.  The catalog covers all categories and formats, including table definitions, schemas, and other control information.  It can compute statistics, manage partitions, and run queries efficiently, and it records how data changes over time with snapshots.
    • Data Discovery: AWS Glue provides crawlers that connect a source to a target data store.  A crawler works through a prioritized list of classifiers to determine the schema for the data, then generates metadata for the catalog.  Crawlers can run on a schedule, on demand, or in response to event triggers.
    • Schema Controls: The AWS Glue Schema Registry validates and controls the evolution of streaming data using Apache Avro schemas.  It uses Apache-licensed serializers and deserializers to integrate with Java applications built for Apache Kafka, Apache Flink, and AWS Lambda.  This helps AWS Glue significantly improve data quality and avoid data loss caused by unexpected schema changes.
    • Rapid Scaling: Like many other AWS services, AWS Glue supports autoscaling, sharply increasing or decreasing resources to match workload demand.  Resources are procured only as needed, and AWS Glue avoids over-provisioning, which helps prevent paying for unused capacity.
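As a rough illustration of how a crawler and its schedule can be defined programmatically, the sketch below assembles the kind of request you might pass to the Glue `create_crawler` API via boto3. The crawler name, IAM role, S3 path, and database here are hypothetical examples, not values from this article:

```python
# Sketch: building a crawler definition for the AWS Glue Data Catalog.
# All names, ARNs, and paths below are hypothetical placeholders.

def build_crawler_request(name: str, role_arn: str, s3_path: str,
                          database: str, cron: str) -> dict:
    """Assemble the request shape expected by glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,               # catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": cron,                       # crawlers can also run on demand
    }

request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    s3_path="s3://example-bucket/sales/",
    database="sales_db",
    cron="cron(0 2 * * ? *)",                   # nightly at 02:00 UTC
)

# With AWS credentials configured, the request could then be sent with:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
print(request["Name"])
```

Building the request as a plain dictionary first keeps the configuration easy to review and test before anything touches the AWS API.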

Preparation

    • Duplicate Reduction: It's not uncommon to have duplicate records across data sources.  AWS Glue cleans out these superfluous instances through a pre-trained machine learning feature called FindMatches.  Even if two records differ slightly, FindMatches can still recognize them as a match and remove the duplicate, retaining one version across the databases within its control.
    • ETL Code: AWS Glue ETL (extract, transform, and load) provides development endpoints for editing, debugging, and testing code, with options for which IDE or notebook to use.  There are also customizable reader and writer templates that can be imported into an AWS Glue job or shared with other developers via GitHub.
    • Normalizing Data: AWS Glue DataBrew offers an interactive, point-and-click UI for data analysts and scientists who may not be familiar with programming, making it simple to move data between database types and AWS services.
    • Automated Sorting of Critical Data: Sensitive data that AWS Glue processes in pipelines and data lakes is automatically identified, with options to hide or replace it.  This can include names, Social Security numbers, driver's license numbers, and IP addresses.
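The idea behind fuzzy duplicate detection can be sketched in plain Python. This toy uses simple string similarity from the standard library, not the pre-trained ML model behind FindMatches, but it shows the shape of the problem: slightly different records collapse to one retained version.

```python
from difflib import SequenceMatcher

# Toy sketch of fuzzy duplicate removal, loosely in the spirit of FindMatches.
# Plain string similarity only -- not AWS Glue's actual ML model.

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two records as duplicates if their text similarity exceeds threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(records: list) -> list:
    """Keep one representative of each fuzzy-duplicate group."""
    kept = []
    for rec in records:
        if not any(similar(rec, k) for k in kept):
            kept.append(rec)
    return kept

rows = [
    "Jane Doe, 100 Main St",
    "Jane  Doe, 100 Main Street",   # near-duplicate of the first row
    "John Smith, 5 Oak Ave",
]
print(dedupe(rows))
```

The threshold is the interesting knob: too low and distinct records merge, too high and near-duplicates slip through, which is exactly the judgment a trained matching model replaces.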

Integration

    • Simplification: Interactive Sessions streamline the development of data integration jobs.  Engineers can experiment fully with data from either an IDE or a notebook.
    • Notebooks: Studio Job Notebooks help developers get started quickly by allowing notebook code to be scheduled as an AWS Glue job.
    • Job Scheduling: Job scheduling in AWS Glue is flexible, allowing jobs to run on a schedule, in response to event triggers, or on demand.  The service automatically handles dependencies, filters out bad data, and retries failed jobs, while progress can be monitored through Amazon CloudWatch.
    • Git Integration: AWS Glue also offers Git integration with popular platforms like GitHub and AWS CodeCommit.  This retains job change history and enables updates through DevOps practices before deployment.  Whether a job is code-based or visually built, automation tools such as AWS CodeDeploy can deploy AWS Glue jobs easily.
    • Data Lake Ease of Access: For data lakes, AWS Glue provides ease of access and consistency.  It natively supports Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake, keeping storage consistent within an S3-based data lake.
    • Quality Assurance: AWS Glue helps ensure data quality and consistency across data lakes and pipelines, providing confidence in the integrity of the data.
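To make the automatic-retry behavior concrete, here is a minimal, self-contained sketch of the bounded retry loop a scheduler applies to a failing job. The job function and retry count are illustrative only, not AWS Glue internals:

```python
import time

# Minimal sketch of bounded job retries, illustrating the idea behind a
# scheduler's automatic retry handling; not AWS Glue's internal mechanism.

def run_with_retries(job, max_retries: int = 2, delay: float = 0.0):
    """Run `job`, retrying up to max_retries times on failure."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            if attempts > max_retries:
                raise                    # exhausted retries: surface the error
            time.sleep(delay)            # real schedulers typically back off here

calls = {"n": 0}

def flaky_job():
    """Simulated job that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result, attempts = run_with_retries(flaky_job)
print(result, attempts)   # done 3
```

A managed scheduler layers monitoring (e.g. CloudWatch metrics) on top of this loop, which is why retries and dependency handling need no user code in Glue itself.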

Shape

    • Ease of Use: AWS Glue's ease of use shows in its job editor, where defining ETL is as simple as clicking and dragging.  The service auto-generates code to extract and transform any volume of data, simplifying the process for users.
    • Shape it Whenever: Data can be consumed continuously and cleaned at the same time without interrupting the stream, letting users shape data to their needs whenever required.
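The "clean while streaming" idea can be sketched with a generator pipeline: records are normalized one at a time as they flow through, without buffering or stopping the stream. The cleaning rules below are illustrative, not actual AWS Glue transforms:

```python
# Sketch of cleaning records as they stream, without interrupting the stream.
# The normalization rules are illustrative only, not AWS Glue transforms.

def clean_stream(records):
    """Yield cleaned records one at a time as they arrive."""
    for rec in records:
        name = rec.get("name", "").strip().title()
        if not name:                     # drop records with no usable name
            continue
        yield {"name": name, "amount": round(float(rec.get("amount", 0)), 2)}

incoming = [
    {"name": "  alice ", "amount": "10.5"},
    {"name": "", "amount": "3"},         # dropped: no name
    {"name": "BOB", "amount": 7},
]
print(list(clean_stream(incoming)))
```

Because `clean_stream` is a generator, it works the same whether `incoming` is a small list or an unbounded feed: each record is shaped the moment it arrives.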

      Discover

        • An Easy Catalog: AWS Glue logs all metadata kept in data stores on an AWS account regardless of location.  The catalog constitutes all categories and formats including table definitions, schemas, and other control information.  It has the capacity to compute statistics, manage partitions, and run queries efficiently.  It also catalogs how data changes over time with snapshots.
        • Data Discovery: AWS Glue has a feature called crawlers that develop interconnectivity between a source and a target data store.  Afterward, it sifts through a prioritized list of classifiers to determine the schema for the data and generates subsequent metadata for the catalog.  Crawlers can be set up to run on certain time triggers, including on a schedule, on-demand, or after event triggers.
        • Schema Controls: The AWS Glue Schema Registry is a set of manual controls for validating the stream of data using Apache Avro schemas.  It uses Apache-licensed serializers and deserializes to integrate Java applications intended for Apache Kafka, Apache Flink, and AWS Lambda (link).  This helps AWS Glue significantly improve data quality and avoid data loss as a result of sudden changes.
        • Rapid Scaling: Just like many other services on AWS, AWS Glue does also support autoscaling for either sharply increasing or decreasing resources depending on the demand of the workload.  Resources are only procured as needed and AWS Glue will never over-provision more resources than are necessary, helping to avoid needing to pay for unused assets.

        Preparation

          • Duplicate Reduction: It’s not uncommon to have duplicated bits of data across data sources.  AWS Glue helps to clean out these superfluous instances through pre-trained machine learning using another service feature called FindMatches.  Even if the two records are slightly different, FindMatches can still recognize both as matching or not and will delete the match, retaining at least one version across the databases within its control.
          • ETL Code: AWS Glue ETL (extract, transform, and load) has development endpoints for editing, debugging code, and stress-testing with options for what IDE or notebook it’s done on.  On the side, there are a bunch of customizable templates for readers and writers that can be imported to an AWS Glue job or shared with other developers via GitHub.
          • Normalizing Code: AWS Glue’s DataBrew has an interactive and basic UI for data analysts and scientists who may not be familiar with programming.  It’s point-and-click, so moving data between database types and AWS services is very simple.
          • Automated Sorting of Critical Data: Any critical data that AWS Glue sorts through in the pipeline and data lakes are automatically identified and provided options for hiding or replacing.  This can include persons’ names, SSNs, driver’s licenses, and IP addresses.

          Integration

            • Simplification: Interactive Sessions help to streamline the development of data integration.  Engineers can fully experiment with data on either an IDE or notebook.
            • Notebooks: Studio Job Notebooks is another integral service that can help developers get started quickly by scheduling notebook code as an AWS Glue job.
            • Job Scheduling: Job Scheduling in AWS Glue is flexible and allows for easy planning based on schedule, event triggers, or manual initiation. The service handles dependencies, bad filter data, and job retries automatically, while progress can be monitored through Amazon CloudWatch.
            • Git Integration: AWS Glue also offers Git Integration with popular platforms like GitHub and AWS CodeCommit. This integration allows for the retention of job change history and enables updates through DevOps practices before deployment. Whether the job is code-based or visually implemented, automation tools such as AWS CodeDeploy can easily deploy AWS Glue jobs.
            • Data Lake Ease of Access: When it comes to data lakes, AWS Glue provides ease of access and consistency. It natively supports Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake, ensuring that storage remains consistent within an S3-based data lake.
            • Quality Assurance: Quality Assurance is another important aspect of AWS Glue. The service ensures data quality and consistency across data lakes and pipelines, providing confidence in the integrity of the data.

          Shape

            • Ease of Use: AWS Glue’s ease of use is evident in its job editor, where defining ETL is as simple as clicking and dragging. The service auto-generates code to extract and transform any volume of data, simplifying the process for users.
            • Shape it Whenever: Continuous data consumption while simultaneously cleaning the data without interrupting the streaming process is easy to achieve. This flexibility enables users to shape the data according to their needs whenever required.

AWS Glue Price

Like other AWS services, AWS Glue does not charge for over-provisioning: the system provides only what the account needs, with minor pricing variances by region.  AWS Glue charges an hourly rate, billed by the second, for crawlers, ETL jobs, and other processing.  The Data Catalog has a simplified monthly fee for storing and accessing metadata, and DataBrew charges by the number of active sessions.  Otherwise, each part of the service charges for the duration it is active.  A Free Tier is available, in which the first million objects stored and the first million requests to the Data Catalog each month are free.  The AWS Glue Schema Registry adds no additional charge.  For help anticipating costs, AWS provides a free pricing calculator.
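As a worked example of per-second billing, assume a hypothetical rate of $0.44 per DPU-hour and a one-minute minimum billed duration. Both numbers are assumptions for the arithmetic only; check the AWS Glue pricing page for the actual rates and minimums in your region and job type:

```python
# Worked example of per-second DPU billing.
# The $0.44/DPU-hour rate and 60-second minimum are assumed figures for
# illustration only; actual AWS Glue pricing varies by region and job type.

RATE_PER_DPU_HOUR = 0.44

def job_cost(dpus: int, seconds: int, min_billed_seconds: int = 60) -> float:
    """Cost of a job billed per second with a minimum billed duration."""
    billed = max(seconds, min_billed_seconds)
    return round(dpus * (billed / 3600) * RATE_PER_DPU_HOUR, 4)

# A 10-DPU job running 15 minutes (900 s): 10 * 0.25 h * $0.44 = $1.10
print(job_cost(10, 900))
```

The minimum-duration parameter matters for very short jobs: a 10-second run is still billed as if it ran for the full minimum.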

Dolan Cleary

I am a recent graduate from the University of Wisconsin - Stout and am now working with AllCode as a web technician, currently within the marketing department.
