Course
Digicomp Code SDPDF
Serverless Data Processing with Dataflow («SDPDF»)
Course facts
- Demonstrating how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs
- Summarizing the benefits of the Beam Portability Framework and enabling it for your Dataflow pipelines
- Enabling Shuffle and Streaming Engine, for batch and streaming pipelines respectively, for maximum performance
- Enabling Flexible Resource Scheduling for more cost-efficient performance
- Selecting the right combination of IAM permissions for your Dataflow job
- Implementing best practices for a secure data processing environment
- Selecting and tuning the I/O of your choice for your Dataflow pipeline
- Using schemas to simplify your Beam code and improve the performance of your pipeline
- Developing a Beam pipeline using SQL and DataFrames
- Performing monitoring, troubleshooting, testing and CI/CD on Dataflow pipelines
Beginning with foundations, this training explains how Apache Beam and Dataflow work together to meet your data processing needs without the risk of vendor lock-in. The section on developing pipelines covers how you convert your business logic into data processing applications that can run on Dataflow.
This training culminates with a focus on operations, which reviews the most important lessons for operating a data application on Dataflow, including monitoring, troubleshooting, testing, and reliability.
1 Introduction
- Beam and Dataflow Refresher
- Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs
2 Beam Portability
- Runner v2
- Container Environments
- Cross-Language Transforms
- Summarize the benefits of the Beam Portability Framework
- Customize the data processing environment of your pipeline using custom containers
- Review use cases for cross-language transformations
- Enable the Portability framework for your Dataflow pipelines (sketched below)
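A minimal sketch, assuming the Python SDK, of the pipeline options that opt a job into Runner v2 and a custom SDK container; the project, bucket, and image names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and container image values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    experiments=["use_runner_v2"],  # portable Runner v2
    sdk_container_image="gcr.io/my-project/my-beam-sdk:latest",  # custom container
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "portability"])
     | "Upper" >> beam.Map(str.upper)
     | "Print" >> beam.Map(print))
```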
3 Separating Compute and Storage with Dataflow
- Dataflow Shuffle Service
- Dataflow Streaming Engine
- Flexible Resource Scheduling
- Enable Shuffle and Streaming Engine, for batch and streaming pipelines respectively, for maximum performance
- Enable Flexible Resource Scheduling for more cost-efficient performance (sketched below)
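A minimal sketch of the corresponding pipeline options, assuming the Python SDK and placeholder resource names: the first set is for a batch job using the Shuffle service and FlexRS, the second for a streaming job using Streaming Engine.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values throughout.
batch_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    experiments=["shuffle_mode=service"],  # Dataflow Shuffle service (batch)
    flexrs_goal="COST_OPTIMIZED",          # Flexible Resource Scheduling
)

streaming_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    enable_streaming_engine=True,          # Streaming Engine (streaming)
)
```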
4 IAM, Quotas, and Permissions
- IAM
- Quota
- Select the right combination of IAM permissions for your Dataflow job (sketched below)
- Determine your capacity needs by inspecting the relevant quotas for your Dataflow jobs
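A minimal sketch, assuming the Python SDK and a hypothetical worker service account, of pinning a job to a dedicated service account; the account still needs the Dataflow Worker role plus access to the job's sources and sinks.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical worker service account; grant it roles/dataflow.worker plus
# read/write access to the pipeline's sources and sinks.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    service_account_email="dataflow-worker@my-project.iam.gserviceaccount.com",
)
```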
5 Security
- Data Locality
- Shared VPC
- Private IPs
- CMEK
- Select your zonal data processing strategy using Dataflow, depending on your data locality needs
- Implement best practices for a secure data processing environment (sketched below)
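A minimal sketch of the security-related pipeline options, assuming the Python SDK; the subnetwork, key ring, and key names are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    # Shared VPC subnetwork in a host project (placeholder URL).
    subnetwork=("https://www.googleapis.com/compute/v1/projects/host-project/"
                "regions/us-central1/subnetworks/dataflow-subnet"),
    use_public_ips=False,  # workers get private IPs only
    # Customer-managed encryption key (CMEK), placeholder resource name.
    dataflow_kms_key=("projects/my-project/locations/us-central1/"
                      "keyRings/my-ring/cryptoKeys/my-key"),
)
```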
6 Beam Concepts Review
- Beam Basics
- Utility Transforms
- DoFn Lifecycle
- Review the main Apache Beam concepts (Pipeline, PCollections, PTransforms, Runner, reading/writing, Utility PTransforms, side inputs), bundles, and the DoFn lifecycle (sketched below)
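A minimal word-count style sketch, assuming the Python SDK, touching the concepts listed above: a Pipeline, PCollections, core and utility PTransforms, and a side input.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.Create(["the cat", "the dog", "a cat"])
    words = lines | "Split" >> beam.FlatMap(str.split)

    counts = (words
              | "PairWithOne" >> beam.Map(lambda w: (w, 1))
              | "SumPerWord" >> beam.CombinePerKey(sum))

    total = words | "CountAll" >> beam.combiners.Count.Globally()

    # Side input: express each word count as a share of the total word count.
    shares = counts | "Share" >> beam.Map(
        lambda kv, total: (kv[0], kv[1] / total),
        total=beam.pvalue.AsSingleton(total))

    shares | "Print" >> beam.Map(print)
```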
7 Windows, Watermarks, Triggers
- Windows, Watermarks, Triggers
- Implement logic to handle your late data (sketched below)
- Review different types of triggers
- Review core streaming concepts (unbounded PCollections, windows)
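A minimal sketch of fixed windows with an early/late trigger and allowed lateness, the pattern used to handle late data; Python SDK assumed, and the bounded input with a hard-coded timestamp only stands in for a real unbounded source.

```python
import apache_beam as beam
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([("user1", 1), ("user2", 1), ("user1", 1)])
     # A real streaming source would supply event timestamps; here they are faked.
     | "Timestamp" >> beam.Map(
         lambda kv: beam.window.TimestampedValue(kv, 1609459200))
     | "Window" >> beam.WindowInto(
         beam.window.FixedWindows(60),            # 1-minute fixed windows
         trigger=AfterWatermark(
             early=AfterProcessingTime(30),       # speculative early results
             late=AfterCount(1)),                 # re-fire for each late element
         allowed_lateness=300,                    # accept data up to 5 minutes late
         accumulation_mode=AccumulationMode.ACCUMULATING)
     | "SumPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```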
8 Sources and Sinks
- Text IO and File IO
- BigQuery IO
- PubSub IO
- Kafka IO
- Bigtable IO
- Avro IO
- Splittable DoFn
- Write the I/O of your choice for your Dataflow pipeline (sketched below)
- Tune your source/sink transformation for maximum performance
- Create custom sources and sinks using SDF
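A minimal sketch using two of the built-in connectors, Text IO on the read side and BigQuery IO on the write side; the bucket, dataset, table, and schema below are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Parse" >> beam.Map(
         lambda line: dict(zip(["name", "score"], line.split(","))))
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.scores",
         schema="name:STRING,score:STRING",
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```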
9 Schemas
- Beam Schemas
- Code Examples
- Introduce schemas, which give developers a way to express structured data in their Beam pipelines
- Use schemas to simplify your Beam code and improve the performance of your pipeline (sketched below)
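A minimal sketch of schemas in the Python SDK: rows are declared as a NamedTuple and registered with RowCoder, so downstream transforms can address fields by name.

```python
import typing
import apache_beam as beam

class Purchase(typing.NamedTuple):
    user_id: str
    amount: float

# Registering RowCoder gives the PCollection a schema.
beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([
         Purchase(user_id="u1", amount=9.99),
         Purchase(user_id="u2", amount=4.50),
       ]).with_output_types(Purchase)
     # Schema-aware transform: fields are addressed by name, not by position.
     | "PerUser" >> beam.GroupBy("user_id").aggregate_field("amount", sum, "total")
     | "Print" >> beam.Map(print))
```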
10 State and Timers
- State API
- Timer API
- Summary
- Identify use cases for state and timer API implementations
- Select the right type of state and timers for your pipeline (sketched below)
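A minimal sketch of a stateful DoFn that buffers values per key and flushes them with an event-time timer; Python SDK assumed, and the buffer holds integers only to keep the example small.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferPerKey(beam.DoFn):
    BUFFER = BagStateSpec("buffer", VarIntCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(self,
                element,                      # expects (key, int) pairs
                window=beam.DoFn.WindowParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element
        buffer.add(value)
        flush.set(window.end)                 # fire once the watermark passes the window

    @on_timer(FLUSH)
    def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())
        buffer.clear()
```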
11 Best Practices
- Schemas
- Handling Unprocessable Data
- Error Handling
- AutoValue Code Generator
- JSON Data Handling
- Utilize DoFn Lifecycle
- Pipeline Optimizations
- Implement best practices for Dataflow pipelines (sketched below)
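A minimal sketch of the dead-letter pattern covered under error handling: elements that fail to parse are routed to a tagged side output instead of failing the pipeline.

```python
import json
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class ParseJson(beam.DoFn):
    DEAD_LETTER = "dead_letter"

    def process(self, element):
        try:
            yield json.loads(element)
        except (ValueError, TypeError):
            # Route unprocessable elements to the dead-letter output.
            yield TaggedOutput(self.DEAD_LETTER, element)

with beam.Pipeline() as p:
    results = (p
               | "Create" >> beam.Create(['{"id": 1}', "not json"])
               | "Parse" >> beam.ParDo(ParseJson()).with_outputs(
                   ParseJson.DEAD_LETTER, main="parsed"))

    results.parsed | "Good" >> beam.Map(print)
    results.dead_letter | "Bad" >> beam.Map(lambda e: print("dead letter:", e))
```

In production the dead-letter output would typically be written to a sink such as BigQuery or Cloud Storage for later inspection and reprocessing.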
12 Dataflow SQL and DataFrames
- Dataflow and Beam SQL
- Windowing in SQL
- Beam DataFrames
- Develop a Beam pipeline using SQL and DataFrames (sketched below)
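A minimal sketch of the DataFrames side, assuming the Python SDK: a schema'd PCollection is converted to a deferred DataFrame, aggregated with pandas-style operations, and converted back (Beam SQL follows a similar pattern via SqlTransform).

```python
import typing
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

class Sale(typing.NamedTuple):
    region: str
    amount: float

with beam.Pipeline() as p:
    sales = p | beam.Create(
        [Sale("EU", 10.0), Sale("US", 7.5), Sale("EU", 2.5)]
    ).with_output_types(Sale)

    df = to_dataframe(sales)                        # deferred, pandas-like DataFrame
    per_region = df.groupby("region")["amount"].sum()

    to_pcollection(per_region, include_indexes=True) | beam.Map(print)
```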
13 Beam Notebooks
- Prototype your pipeline in Python using a Beam notebook (sketched below)
- Launch a job to Dataflow from a notebook
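A minimal sketch of the interactive pattern used in Beam notebooks: the InteractiveRunner caches results so PCollections can be inspected cell by cell.

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())
words = p | beam.Create(["notebook", "prototype", "dataflow"])
lengths = words | beam.Map(len)

ib.show(lengths)       # renders the PCollection in the notebook
# ib.collect(lengths)  # or materialize it as a pandas DataFrame
```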
14 Monitoring
- Job List
- Job Info
- Job Graph
- Job Metrics
- Metrics Explorer
- Navigate the Dataflow Job Details UI
- Interpret Job Metrics charts to diagnose pipeline regressions
- Set alerts on Dataflow jobs using Cloud Monitoring
15 Logging and Error Reporting
- Logging & Error Reporting
- Use the Dataflow logs and diagnostics widgets to troubleshoot pipeline issues
16 Troubleshooting and Debug
- Troubleshooting Workflow
- Types of Troubles
- Use a structured approach to debug your Dataflow pipelines
- Examine common causes for pipeline failures
17 Performance
- Pipeline Design
- Data Shape
- Source, Sinks, and External Systems
- Shuffle and Streaming Engine
- Understand performance considerations for pipelines
- Consider how the shape of your data can affect pipeline performance
18 Testing and CI/CD
- Unit Testing
- Integration Testing
- Artifact Building
- Deployment
- Review testing approaches for your Dataflow pipeline (sketched below)
- Review frameworks and features available to streamline your CI/CD workflow for Dataflow pipelines
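A minimal sketch of a unit test built on TestPipeline and the assert_that/equal_to matchers from the Beam testing utilities.

```python
import unittest
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

class UpperCaseTest(unittest.TestCase):
    def test_upper_case(self):
        with TestPipeline() as p:
            output = (p
                      | beam.Create(["beam", "dataflow"])
                      | beam.Map(str.upper))
            assert_that(output, equal_to(["BEAM", "DATAFLOW"]))

if __name__ == "__main__":
    unittest.main()
```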
19 Reliability
- Monitoring
- Geolocation
- Disaster Recovery
- High Availability
- Implement reliability best practices for your Dataflow pipelines
20 Flex Templates
- Classic Templates
- Flex Templates
- Using Flex Templates
- Google-provided Templates
- Use Flex Templates to standardize and reuse Dataflow pipeline code (sketched below)
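A minimal sketch of the pipeline-side pattern behind a Flex Template, assuming the Python SDK and hypothetical option names: runtime parameters are declared as custom PipelineOptions so the same container image can be launched with different arguments (the template itself is built and launched with gcloud).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class WordCountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Hypothetical runtime parameters supplied at template launch time.
        parser.add_argument("--input_path", required=True)
        parser.add_argument("--output_path", required=True)

def run():
    options = WordCountOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText(options.input_path)
         | beam.FlatMap(str.split)
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
         | beam.io.WriteToText(options.output_path))

if __name__ == "__main__":
    run()
```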
Target audience
- Data Engineers
- Data Analysts and Data Scientists aspiring to develop Data Engineering skills
Prerequisites
- Completed Building Batch Data Pipelines
- Completed Building Resilient Streaming Analytics Systems
Products
- Dataflow
- Cloud Operations