What is AWS Glue and Why are Pipelines Important?

A service and its data sources are the most important components of a system, but those parts are worth little if they sit apart without the necessary connective tissue between them. AWS Glue is a serverless data integration service that helps find data sources, prepare their data, and route it to wherever it is needed. It is easy to use and supports multiple processing and workload types, including batch and streaming ETL.

From Source to Analytics

Preparing data is a necessary step toward ensuring that analysis produces useful results in a format that is easy and fast to consume. With AWS Glue handling discovery and preparation, analysis that once took weeks of setup can be completed in minutes with minimal effort.

Discover

    • An Easy Catalog: The AWS Glue Data Catalog records the metadata for data stores across an AWS account, regardless of where they live. It holds table definitions, schemas, and other control information in every category and format, can compute statistics, manage partitions, and run queries efficiently, and keeps snapshots of how data changes over time.
    • Data Discovery: AWS Glue crawlers connect to a source or target data store, work through a prioritized list of classifiers to determine the schema of the data, and then generate the corresponding metadata for the catalog. Crawlers can run on a schedule, on demand, or in response to an event; a minimal sketch of creating a scheduled crawler follows this list.
    • Schema Controls: The AWS Glue Schema Registry provides centralized controls for validating streaming data with Apache Avro schemas. It uses Apache-licensed serializers and deserializers to integrate with Java applications built for Apache Kafka, Apache Flink, and AWS Lambda. This helps AWS Glue significantly improve data quality and avoid data loss from unexpected schema changes; a registration sketch also follows this list.
    • Rapid Scaling: Like many other AWS services, AWS Glue supports autoscaling, sharply increasing or decreasing resources to match workload demand. Resources are procured only as needed, and AWS Glue avoids over-provisioning, so you are not paying for unused capacity.
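
The crawler workflow above can be driven entirely through the AWS SDK. Below is a minimal sketch, using Python and boto3, of creating and starting a crawler that scans an S3 path on a daily schedule; the crawler name, IAM role, database, and bucket path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix once a day and writes the
# inferred table definitions into a Data Catalog database.
glue.create_crawler(
    Name="daily-orders-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="sales_db",      # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    Schedule="cron(0 3 * * ? *)",  # every day at 03:00 UTC
)

# Crawlers can also be run on demand instead of waiting for the schedule.
glue.start_crawler(Name="daily-orders-crawler")
```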
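
Similarly, a schema can be registered with the AWS Glue Schema Registry so producers and consumers validate records against it. This sketch assumes a registry named streaming-registry and a trivial Avro schema, both placeholders:

```python
import json
import boto3

glue = boto3.client("glue")

# A trivial Avro schema describing one record type.
avro_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Register the schema. BACKWARD compatibility means new versions must
# remain readable by consumers using the previous version.
glue.create_schema(
    SchemaName="orders-value",  # placeholder schema name
    RegistryId={"RegistryName": "streaming-registry"},  # placeholder registry
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=json.dumps(avro_schema),
)
```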

Preparation

    • Duplicate Reduction: It is not uncommon for bits of data to be duplicated across data sources. AWS Glue cleans out these superfluous records through pre-trained machine learning with a feature called FindMatches. Even when two records differ slightly, FindMatches can still recognize whether they match and flag the duplicates, retaining at least one version across the databases in its control; see the sketch after this list.
    • ETL Code: AWS Glue ETL (extract, transform, and load) provides development endpoints for editing, debugging, and testing code, with a choice of which IDE or notebook to work in. There is also a set of customizable reader and writer templates that can be imported into an AWS Glue job or shared with other developers via GitHub.
    • Normalizing Data: AWS Glue DataBrew offers an interactive, visual UI for data analysts and scientists who may not be familiar with programming. It is point-and-click, which makes moving data between database types and AWS services very simple.
    • Automated Handling of Sensitive Data: Sensitive data that AWS Glue encounters in pipelines and data lakes is automatically identified, with options to hide or replace it. This can include names, Social Security numbers, driver's license numbers, and IP addresses.
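
As referenced above, a trained FindMatches transform can be applied inside a Glue ETL (PySpark) job. This is a minimal sketch under assumptions: the database, table, transform ID, and output path are placeholders, and the transform must already have been created and trained in the Glue console.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the source table from the Data Catalog as a DynamicFrame.
records = glue_context.create_dynamic_frame.from_catalog(
    database="customer_db",        # placeholder database
    table_name="customer_records"  # placeholder table
)

# Apply the pre-trained FindMatches transform; it groups likely
# duplicates by adding a match_id column to the output.
matched = FindMatches.apply(
    frame=records,
    transformId="tfm-0123456789abcdef"  # placeholder transform ID
)

# Write the labeled records back to S3 for downstream deduplication.
glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/deduped/"},
    format="parquet",
)
```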

Integration

    • Simplification: Interactive Sessions streamline the development of data integration, letting engineers experiment with data freely from either an IDE or a notebook.
    • Notebooks: AWS Glue Studio Job Notebooks help developers get started quickly and allow notebook code to be scheduled as an AWS Glue job.
    • Job Scheduling: Jobs can be initiated on a schedule, by event triggers, or manually. AWS Glue handles dependencies, filters bad data, and retries jobs automatically, and progress can be monitored through Amazon CloudWatch; a scheduling sketch follows this list.
    • Git Integration: GitHub and AWS CodeCommit can retain the history of changes to a job, whether it is code-based or visually built, so jobs can be updated with standard DevOps practices before deployment. This also makes it simpler for automation tools like AWS CodeDeploy to deploy AWS Glue jobs.
    • Data Lake Ease of Access: AWS Glue natively supports Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake, and works well for keeping storage consistent in an S3-based data lake.
    • Quality Assurance: AWS Glue Data Quality keeps data quality and confidence consistent across data lakes and pipelines.
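
As an illustration of the scheduling options above, here is a minimal boto3 sketch that attaches a time-based trigger to an existing job; the trigger and job names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Run an existing job every weekday at 06:00 UTC. Type could instead be
# "ON_DEMAND", "EVENT", or "CONDITIONAL" (fire after other jobs succeed).
glue.create_trigger(
    Name="weekday-orders-trigger",          # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 6 ? * MON-FRI *)",
    Actions=[{"JobName": "clean-orders-job"}],  # placeholder job name
    StartOnCreation=True,
)
```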

Shape

    • Ease of Use: Defining an ETL job is as simple as clicking and dragging in the job editor, which auto-generates the code to extract and transform any volume of data; a pared-down example of such a script follows this list.
    • Shape it Whenever: AWS Glue can consume and clean data continuously, without stopping the streaming process.
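
For a sense of what the editor generates, here is a pared-down sketch of a Glue ETL (PySpark) script that reads a catalog table, renames and casts columns, and writes Parquet to S3; the database, table, and bucket names are placeholders.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve the job name and set up contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the raw table registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",      # placeholder database
    table_name="raw_orders"   # placeholder table
)

# Transform: keep two columns, casting amount from string to double.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: write the cleaned data to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```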

AWS Glue Price

As with other AWS services, you pay only for what your account actually consumes, with minor variances in pricing by region, so over-provisioning does not inflate the bill. AWS Glue charges an hourly rate, billed by the second, for crawlers, ETL jobs, and processing. The Data Catalog has a simplified monthly fee for storing and accessing metadata, and DataBrew charges by the number of active sessions and nodes; every other part of the service charges for the duration it is active. A Free Tier covers the first million objects stored in the Data Catalog and the first million accesses to it, and the AWS Glue Schema Registry adds no additional charge. For clarification on price calculations, AWS provides a free calculator for anticipating what prices will look like; a rough worked example follows.
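
As a rough illustration of per-second, capacity-based billing, assume a Spark job that runs with 10 DPUs (data processing units) for 15 minutes at a rate of $0.44 per DPU-hour; the rate is an assumption based on published us-east-1 pricing and varies by region and Glue version, so check the current pricing page.

```python
# Back-of-the-envelope Glue job cost estimate.
rate_per_dpu_hour = 0.44   # USD; assumed rate, varies by region and version
dpus = 10                  # data processing units allocated to the job
runtime_hours = 15 / 60    # a 15-minute run, billed per second

cost = dpus * runtime_hours * rate_per_dpu_hour
print(f"Estimated job cost: ${cost:.2f}")  # -> Estimated job cost: $1.10
```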

Dolan Cleary

I am a recent graduate of the University of Wisconsin-Stout and am now working with AllCode as a web technician in the marketing department.

Related Articles

Models of Migration on AWS

Cloud computing offers many benefits to users who are just starting to put together applications and solutions, and having an existing solution will not preclude an organization from taking advantage of the cloud. Migrating those solutions to a cloud environment, however, can prove tricky for users who haven't planned in advance.

What is DevOps and How Developers Benefit

DevOps is a composition of best practices, principles, and company cultural concepts tailored to improve coordination between development and IT teams in an organization. These standards help streamline and automate the delivery cycle, allowing teams to deploy applications sooner and, when issues arise, to respond and develop fixes faster.

AWS Migration Acceleration Program

The AWS Migration Acceleration Program helps organizations migrate existing applications and workloads to the Amazon cloud more efficiently. It includes tools, resources, and guidance on migration best practices and on facilitating changes properly without disrupting business operations.