Google Professional Data Engineer Orchestration and Data Management

Use for pipeline scheduling, workflow dependencies, managed transfers, transformation frameworks, and data management tooling such as Composer, Workflows, and Dataform.

Exams
PROFESSIONAL-DATA-ENGINEER
Questions
27
Comments
393

1. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 335

Sequence
2
Discussion ID
382508
Source URL
https://www.examtopics.com/discussions/google/view/382508-exam-professional-data-engineer-topic-1-question-335/
Posted By
67bdb19
Posted At
Jan. 16, 2026, 8:15 a.m.

Question

You have a data analyst team member who needs to analyze data by using BigQuery. The data analyst wants to create a data pipeline that would load 200 CSV files with an average size of 15MB from a Cloud Storage bucket into BigQuery daily. The data needs to be ingested and transformed before being accessed in BigQuery for analysis. You need to recommend a fully managed, no-code solution for the data analyst. What should you do?

  • A. Create a Cloud Run function and schedule it to run daily using Cloud Scheduler to load the data into BigQuery.
  • B. Use the BigQuery Data Transfer Service to load files from Cloud Storage to BigQuery, create a BigQuery job which transforms the data using BigQuery SQL and schedule it to run daily.
  • C. Build a custom Apache Beam pipeline and run it on Dataflow to load the file from Cloud Storage to BigQuery and schedule it to run daily using Cloud Composer.
  • D. Create a pipeline by using BigQuery pipelines and schedule it to load the data into BigQuery daily.

Suggested Answer

B

Comments
1 comment

Comment 1

ID: 1716794 User: mpuche3 Badges: - Relative Date: 2 weeks, 3 days ago Absolute Date: Mon 23 Feb 2026 20:08 Selected Answer: D Upvotes: 2

BigQuery Pipelines is a real, generally available product. It was first released in September 2024 under the name "BigQuery Workflows," then renamed to "BigQuery Pipelines" when it reached General Availability in March 2025.
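As a hedged sketch of what the suggested answer B sets up behind the scenes: a Cloud Storage transfer config in the BigQuery Data Transfer Service carries roughly the parameters below. The bucket path is hypothetical, and the field names follow the `google_cloud_storage` data source as an assumption to verify against current docs.

```python
# Hedged sketch: parameters a Cloud Storage -> BigQuery transfer config
# would carry (field names per the google_cloud_storage data source;
# treat them as assumptions and verify against current documentation).
transfer_params = {
    "data_path_template": "gs://my-bucket/daily/*.csv",  # hypothetical bucket
    "destination_table_name_template": "raw_events",     # hypothetical table
    "file_format": "CSV",
    "skip_leading_rows": "1",       # skip the CSV header row
    "write_disposition": "APPEND",  # add each day's files to the table
}

# Daily CSV volume from the question: 200 files at ~15 MB each.
daily_mb = 200 * 15
print(daily_mb)  # 3000 MB/day -- comfortably within batch-load limits
```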

2. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 328

Sequence
8
Discussion ID
382506
Source URL
https://www.examtopics.com/discussions/google/view/382506-exam-professional-data-engineer-topic-1-question-328/
Posted By
67bdb19
Posted At
Jan. 16, 2026, 8:15 a.m.

Question

You need to load a dataset with multiple terabytes of clickstream data into BigQuery. The data arrives each day as compressed JSON files in a Cloud Storage bucket. You need a low-cost, programmatic, and scalable solution to load the data into BigQuery. What should you do?

  • A. Create an external table in BigQuery pointing to the Cloud Storage bucket and run the INSERT INTO ... SELECT * FROM external_table command.
  • B. Use the BigQuery Data Transfer Service from Cloud Storage.
  • C. Create a Cloud Run function to run a Python script to read and parse each JSON file, and use the BigQuery streaming insert API.
  • D. Use Cloud Data Fusion to create a pipeline to load the JSON files into BigQuery.

Suggested Answer

A

Comments
1 comment

Comment 1

ID: 1708990 User: NickForDiscussions Badges: - Relative Date: 1 month, 2 weeks ago Absolute Date: Fri 23 Jan 2026 10:32 Selected Answer: B Upvotes: 1

Should be B. When you use the BigQuery Data Transfer Service, you don't pay for the load job; you only pay for the storage in BigQuery. If you go with option A, you are paying for the query plus the storage.
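The cost point in this comment can be sketched numerically: batch load jobs use a free shared slot pool, while an `INSERT INTO ... SELECT` over an external table bills the bytes scanned at on-demand rates. The $6.25/TiB figure below is an assumption based on published on-demand pricing; check current pricing before relying on it.

```python
# Hedged cost sketch: load jobs (option B) are free; querying an external
# table (option A) bills bytes scanned at on-demand rates.
ON_DEMAND_USD_PER_TIB = 6.25  # assumed on-demand price; verify current rates

def query_cost_usd(bytes_scanned: int) -> float:
    """On-demand cost of scanning `bytes_scanned` via INSERT INTO ... SELECT."""
    return bytes_scanned / 2**40 * ON_DEMAND_USD_PER_TIB

load_job_cost_usd = 0.0  # load jobs run in a free shared pool

# e.g. 5 TiB of clickstream data read through an external table:
print(round(query_cost_usd(5 * 2**40), 2))  # 31.25
```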

3. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 270

Sequence
20
Discussion ID
130220
Source URL
https://www.examtopics.com/discussions/google/view/130220-exam-professional-data-engineer-topic-1-question-270/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 6:43 p.m.

Question

You need to create a SQL pipeline. The pipeline runs an aggregate SQL transformation on a BigQuery table every two hours and appends the result to another existing BigQuery table. You need to configure the pipeline to retry if errors occur. You want the pipeline to send an email notification after three consecutive failures. What should you do?

  • A. Use the BigQueryUpsertTableOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.
  • B. Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.
  • C. Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable email notifications.
  • D. Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable notification to Pub/Sub topic. Use Pub/Sub and Cloud Functions to send an email after three failed executions.

Suggested Answer

B

Comments
18 comments

Comment 1

ID: 1114688 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 17:18 Selected Answer: B Upvotes: 8

- It provides a direct and controlled way to manage the SQL pipeline using Cloud Composer (Apache Airflow).
- The BigQueryInsertJobOperator is well-suited for running SQL jobs in BigQuery, including aggregate transformations and handling of results.
- The retry and email_on_failure parameters align with the requirements for error handling and notifications.
- Cloud Composer requires more setup than using BigQuery's scheduled queries directly, but it offers robust workflow management, retry logic, and notification capabilities, making it suitable for more complex and controlled data pipeline requirements.

Comment 1.1

ID: 1154870 User: SuperVan Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 17:27 Selected Answer: - Upvotes: 6

The prompt wants an email notification sent after three failed attempts. Is there any concern that the retry parameter is set to 3, wouldn't this mean that the email is sent after 4 failed attempts (1 original + 3 retries)?

Comment 2

ID: 1700273 User: lmch Badges: Most Recent Relative Date: 2 months, 3 weeks ago Absolute Date: Thu 18 Dec 2025 10:03 Selected Answer: B Upvotes: 1

This requirement describes a standard orchestration workflow involving a schedule (every two hours), a transformation (SQL aggregation), an append operation, and custom error handling (retries and conditional notifications). Cloud Composer (managed Apache Airflow) is the standard tool for this level of logic.

Comment 3

ID: 1582366 User: 56d02cd Badges: - Relative Date: 8 months, 1 week ago Absolute Date: Wed 02 Jul 2025 06:39 Selected Answer: D Upvotes: 1

The retries in A and B are only for task failures/retries within the DAG. The question asks for a notification after three consecutive failures of the DAG. In other words, the retries parameter handles failures within a single task instance, and email_on_failure would typically fire after the final retry of that single task instance fails, not after three consecutive DAG runs fail.

Comment 4

ID: 1574202 User: 22c1725 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Mon 02 Jun 2025 15:10 Selected Answer: B Upvotes: 1

There is no retry in (D)
https://cloud.google.com/bigquery/docs/scheduling-queries

Comment 5

ID: 1562812 User: gabbferreira Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Tue 22 Apr 2025 21:49 Selected Answer: D Upvotes: 1

It's D.
"Notifications on operator failure
Set email_on_failure to True to send an email notification when an operator in the DAG fails. To send email notifications from a Cloud Composer environment, you must configure your environment to use SendGrid."

Setting email_on_failure = True will send the email after the FIRST failure

https://cloud.google.com/composer/docs/composer-2/write-dags#notifications_on_operator_failure

Comment 6

ID: 1562811 User: gabbferreira Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Tue 22 Apr 2025 21:39 Selected Answer: D Upvotes: 1

ChatGPT and Gemini said it is D.

Comment 7

ID: 1358656 User: MarcoPellegrino Badges: - Relative Date: 1 year ago Absolute Date: Wed 19 Feb 2025 11:04 Selected Answer: D Upvotes: 2

A) Wrong, Upsert is not for appending
B) Wrong, doesn't mention the 2 hours scheduling
C) Wrong, doesn't mention the emailing
D) Correct

Comment 8

ID: 1349556 User: Augustax Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Fri 31 Jan 2025 15:33 Selected Answer: D Upvotes: 1

The retry concerns with B and the clearly mentioned two-hour schedule in D make me think D is the better option.

Comment 9

ID: 1348970 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 30 Jan 2025 10:27 Selected Answer: D Upvotes: 3

"You want the pipeline to send an email notification after three consecutive failures" - it is not about retries which are configurable via Composer operator - it is about 3 consecutive executions which could be for different hours.

Comment 10

ID: 1337620 User: b3e59c2 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 07 Jan 2025 17:05 Selected Answer: D Upvotes: 1

Terrible options as usual. Whilst B is the most elegant, it doesn't explicitly address the 2 hour scheduling (you can schedule within Composer, but the answer doesn't mention it).

If we take these answers on the surface level, D is the only option that actually achieves our goal.

Comment 11

ID: 1330200 User: e593506 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 22 Dec 2024 00:30 Selected Answer: D Upvotes: 1

The prompt wants an email notification sent after three failed attempts
Option B does not meet that condition

Comment 12

ID: 1213945 User: josech Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Tue 19 Nov 2024 21:58 Selected Answer: B Upvotes: 1

https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/bigquery/index.html#airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator
https://cloud.google.com/composer/docs/composer-2/write-dags#notifications_on_operator_failure

Comment 13

ID: 1190866 User: joao_01 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 07 Oct 2024 10:57 Selected Answer: - Upvotes: 1

It's B (though for me it's an incomplete answer, because it does not address the schedule of every 2 hours).

It's not C or D because BigQuery scheduled queries by default do not retry queries when errors occur. Link: https://cloud.google.com/bigquery/docs/scheduling-queries

Comment 14

ID: 1155167 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 02:21 Selected Answer: B Upvotes: 1

Option B

Comment 15

ID: 1134774 User: datapassionate Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 29 Jul 2024 08:53 Selected Answer: D Upvotes: 2

D. Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable notification to Pub/Sub topic. Use Pub/Sub and Cloud Functions to send an email after three failed executions

This method utilizes BigQuery's native scheduling capabilities for running the SQL job and leverages Pub/Sub and Cloud Functions for customized notification handling, including the specific requirement of sending an email after three consecutive failures.

Comment 15.1

ID: 1156755 User: RenePetersen Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 21:34 Selected Answer: - Upvotes: 3

Option D mentions nothing about how the job retrying is put in place, so for that reason I don't think this is the correct option.

Comment 16

ID: 1112985 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 17:43 Selected Answer: B Upvotes: 1

B. Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.
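Several commenters debate whether option D can count three consecutive failures. The counting logic itself is simple and can be sketched as pure code; in practice it would live in a Cloud Function subscribed to the scheduled query's Pub/Sub notifications. All names here are hypothetical, and real code would parse the transfer-run message and call an email API instead of incrementing a counter.

```python
# Hedged sketch of the option-D logic: count consecutive failed runs and
# alert on the third. A Cloud Function subscribed to the scheduled query's
# Pub/Sub topic would drive this; the names are hypothetical.
class ConsecutiveFailureAlerter:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0
        self.alerts_sent = 0

    def on_run(self, succeeded: bool) -> bool:
        """Process one run notification; return True if an email was sent."""
        if succeeded:
            self.streak = 0        # any success resets the streak
            return False
        self.streak += 1
        if self.streak == self.threshold:
            self.alerts_sent += 1  # stand-in for the real email send
            return True
        return False

alerter = ConsecutiveFailureAlerter()
results = [alerter.on_run(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

Note that this fires only on the third *consecutive* failure, which is the distinction commenters draw against Airflow's per-task `retries`/`email_on_failure` parameters.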

4. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 231

Sequence
26
Discussion ID
130340
Source URL
https://www.examtopics.com/discussions/google/view/130340-exam-professional-data-engineer-topic-1-question-231/
Posted By
raaad
Posted At
Jan. 4, 2024, 5:09 p.m.

Question

You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in the total workers memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do? (Choose two.)

  • A. Increase the directed acyclic graph (DAG) file parsing interval.
  • B. Increase the Cloud Composer 2 environment size from medium to large.
  • C. Increase the maximum number of workers and reduce worker concurrency.
  • D. Increase the memory available to the Airflow workers.
  • E. Increase the memory available to the Airflow triggerer.

Suggested Answer

C

Comments
17 comments

Comment 1

ID: 1152041 User: ML6 Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Fri 16 Aug 2024 13:58 Selected Answer: D Upvotes: 8

If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow. The majority of issues with worker pod evictions happen because of out-of-memory situations in workers.
You might want to:
- (D) Increase the memory available to workers.
- (C) Reduce worker concurrency. In this way, a single worker handles fewer tasks at once. This provides more memory or storage to each individual task. If you change worker concurrency, you might also want to increase the maximum number of workers. In this way, the number of tasks that your environment can handle at once stays the same. For example, if you reduce worker Concurrency from 12 to 6, you might want to double the maximum number of workers.

Source: https://cloud.google.com/composer/docs/composer-2/optimize-environments

Comment 2

ID: 1625433 User: af17139 Badges: Most Recent Relative Date: 3 months, 4 weeks ago Absolute Date: Thu 13 Nov 2025 08:45 Selected Answer: D Upvotes: 1

D. Increase the memory available to the Airflow workers.

Here's why:

Directly Addresses the Root Cause: Worker pod evictions are happening because the pods are exceeding their memory limits. Increasing the memory allocation for each worker pod directly provides more resources to handle the demands of the tasks.

Handles Memory-Intensive Tasks: If individual tasks themselves require a significant amount of memory, reducing concurrency (as in option C) might not prevent OOMs. A single task could still consume more memory than the worker has, even if it's the only task running.

Simpler Initial Step: Adjusting the memory per worker is often a more straightforward change to make and observe the impact.

Comment 3

ID: 1587580 User: imrane1995 Badges: - Relative Date: 7 months, 4 weeks ago Absolute Date: Wed 16 Jul 2025 18:31 Selected Answer: C Upvotes: 1

BD

B. Increase the Cloud Composer 2 environment size from medium to large.
Why: Increasing the environment size automatically adjusts the GKE cluster's node resources (CPU and memory), which can help avoid pod evictions caused by memory pressure.

Effect: This increases the capacity available to all components, including workers, helping to accommodate the memory needs of your jobs.

✅ D. Increase the memory available to the Airflow workers.
Why: If workers are being evicted due to memory pressure, increasing their memory limit directly addresses the root cause.

How: In Cloud Composer 2, this can be done via the resources settings under workloadsConfig (like worker.resources.limits.memory).

Comment 4

ID: 1410708 User: taka5094 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Thu 27 Mar 2025 03:16 Selected Answer: C Upvotes: 2

CD
On the Monitoring dashboard, in the Workers section, observe the Worker Pods evictions graphs for your environment.
The Total workers memory usage graph shows a total perspective of the environment. A single worker can still exceed the memory limit, even if the memory utilization is healthy at the environment level.
According to your observations, you might want to:
- Increase the memory available to workers.
- Reduce worker concurrency.
In this way, a single worker handles fewer tasks at once. This provides more memory or storage to each individual task. If you change worker concurrency, you might also want to increase the maximum number of workers. In this way, the number of tasks that your environment can handle at once stays the same. For example, if you reduce worker Concurrency from 12 to 6, you might want to double the maximum number of workers.

Comment 5

ID: 1402280 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sun 23 Mar 2025 13:51 Selected Answer: - Upvotes: 2

Answer is B,D

Comment 6

ID: 1216178 User: Anudeep58 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 23 Nov 2024 06:36 Selected Answer: D Upvotes: 1

Answer C,D

According to your observations, you might want to:

Increase the memory available to workers.
Reduce worker concurrency. In this way, a single worker handles fewer tasks at once. This provides more memory or storage to each individual task. If you change worker concurrency, you might also want to increase the maximum number of workers. In this way, the number of tasks that your environment can handle at once stays the same. For example, if you reduce worker Concurrency from 12 to 6, you might want to double the maximum number of workers.

Comment 6.1

ID: 1402279 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sun 23 Mar 2025 13:51 Selected Answer: - Upvotes: 1

Reducing concurrency can reduce memory pressure per worker, but won't help if the memory limit per pod is too low

Comment 7

ID: 1214301 User: virat_kohli Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 20 Nov 2024 14:31 Selected Answer: D Upvotes: 1

C. Increase the maximum number of workers and reduce worker concurrency. Most Voted
D. Increase the memory available to the Airflow workers.

Comment 8

ID: 1152039 User: ML6 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 16 Aug 2024 13:56 Selected Answer: D Upvotes: 2

If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow. The majority of issues with worker pod evictions happen because of out-of-memory situations in workers.
You might want to:
- Increase the memory available to workers.
- Reduce worker concurrency. In this way, a single worker handles fewer tasks at once. This provides more memory or storage to each individual task. If you change worker concurrency, you might also want to increase the maximum number of workers. In this way, the number of tasks that your environment can handle at once stays the same. For example, if you reduce worker Concurrency from 12 to 6, you might want to double the maximum number of workers.
Source: https://cloud.google.com/composer/docs/composer-2/optimize-environments

Comment 9

ID: 1125093 User: qq589539483084gfrgrgfr Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 17 Jul 2024 15:40 Selected Answer: C Upvotes: 3

C and D. It is clear.

Comment 10

ID: 1121551 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 11:19 Selected Answer: C Upvotes: 2

C & D to me

Comment 11

ID: 1119126 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 10 Jul 2024 23:00 Selected Answer: C Upvotes: 4

C and D
Check ref for memory optimization - https://cloud.google.com/composer/docs/composer-2/optimize-environments

Comment 11.1

ID: 1122176 User: AllenChen123 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 14 Jul 2024 02:01 Selected Answer: - Upvotes: 4

Agree. Straightforward.
https://cloud.google.com/composer/docs/composer-2/optimize-environments#monitor-scheduler
-> Figure 3. Graph that displays worker pod evictions

Comment 12

ID: 1116511 User: qq589539483084gfrgrgfr Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 08 Jul 2024 09:24 Selected Answer: B Upvotes: 2

B & D. See this:
https://cloud.google.com/composer/docs/composer-2/troubleshooting-dags#task-fails-without-logs
Go through the suggested fixes under "If there are airflow-worker pods that show Evicted".

Comment 13

ID: 1115296 User: Jordan18 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 16:10 Selected Answer: C Upvotes: 2

C and D

Comment 14

ID: 1113853 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 16:09 Selected Answer: B Upvotes: 3

B&D:
B :
- Scaling up the environment size can provide more resources, including memory, to the Airflow workers. If worker pod evictions are occurring due to insufficient memory, increasing the environment size to allocate more resources could alleviate the problem and improve the stability of your data processing jobs.

D:
- Increase the memory available to the Airflow workers. - Directly increasing the memory allocation for Airflow workers can address the issue of high memory usage and worker pod evictions. More memory per worker means that each worker can handle more demanding tasks or a higher volume of tasks without running out of memory.

Comment 14.1

ID: 1125731 User: GCP001 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 18 Jul 2024 10:25 Selected Answer: - Upvotes: 2

Why not B? It's not decreasing concurrency, which may cause the issue again.
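The sizing rule quoted repeatedly in this thread from the Composer optimization docs (cut concurrency from 12 to 6, double the maximum workers) amounts to keeping total task capacity constant. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
# Hedged sketch of the rebalancing rule from the Composer optimization docs:
# if worker concurrency is reduced, raise the maximum number of workers so
# that total capacity (workers x concurrency) stays the same.
def rebalanced_max_workers(max_workers: int, old_concurrency: int,
                           new_concurrency: int) -> int:
    """New worker ceiling that preserves max_workers * old_concurrency slots."""
    total_slots = max_workers * old_concurrency
    # ceiling division, so capacity never shrinks
    return -(-total_slots // new_concurrency)

# The doc's example: concurrency cut from 12 to 6 -> double the workers.
print(rebalanced_max_workers(3, 12, 6))  # 6
```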

5. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 235

Sequence
27
Discussion ID
130178
Source URL
https://www.examtopics.com/discussions/google/view/130178-exam-professional-data-engineer-topic-1-question-235/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 1:24 p.m.

Question

You want to schedule a number of sequential load and transformation jobs. Data files will be added to a Cloud Storage bucket by an upstream process. There is no fixed schedule for when the new data arrives. Next, a Dataproc job is triggered to perform some transformations and write the data to BigQuery. You then need to run additional transformation jobs in BigQuery. The transformation jobs are different for every table. These jobs might take hours to complete. You need to determine the most efficient and maintainable workflow to process hundreds of tables and provide the freshest data to your end users. What should you do?

  • A. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators.
    2. Use a single shared DAG for all tables that need to go through the pipeline.
    3. Schedule the DAG to run hourly.
  • B. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators.
    2. Create a separate DAG for each table that needs to go through the pipeline.
    3. Schedule the DAGs to run hourly.
  • C. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.
    2. Use a single shared DAG for all tables that need to go through the pipeline.
    3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
  • D. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.
    2. Create a separate DAG for each table that needs to go through the pipeline.
    3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.

Suggested Answer

D

Comments
16 comments

Comment 1

ID: 1160059 User: cuadradobertolinisebastiancami Badges: Highly Voted Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 21:41 Selected Answer: - Upvotes: 9

D

* Transformations are in Dataproc and BigQuery, so you don't need operators for GCS (A and B can be discarded).
* "There is no fixed schedule for when the new data arrives," so you trigger the DAG when a file arrives.
* "The transformation jobs are different for every table," so you need a DAG for each table.

Then, D is the most suitable answer

Comment 2

ID: 1625434 User: af17139 Badges: Most Recent Relative Date: 3 months, 4 weeks ago Absolute Date: Thu 13 Nov 2025 08:55 Selected Answer: D Upvotes: 1

A & B: Hourly schedules don't fit the event-driven nature of data arrival.
C: A single shared DAG for hundreds of tables with different transformations is not maintainable or scalable from a logic perspective. It would require complex branching and conditional logic within the DAG, making it hard to manage.

Comment 3

ID: 1587582 User: imrane1995 Badges: - Relative Date: 7 months, 4 weeks ago Absolute Date: Wed 16 Jul 2025 18:36 Selected Answer: C Upvotes: 1

✅ C.
Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.

Use a single shared DAG for all tables that need to go through the pipeline.

Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.

Comment 4

ID: 1573890 User: 22c1725 Badges: - Relative Date: 9 months, 2 weeks ago Absolute Date: Sun 01 Jun 2025 10:54 Selected Answer: D Upvotes: 1

programmatic DAG generation.

Comment 5

ID: 1573091 User: 22c1725 Badges: - Relative Date: 9 months, 2 weeks ago Absolute Date: Wed 28 May 2025 18:31 Selected Answer: D Upvotes: 1

"maintainable" this is an clear. Whould creating one dag for each transformtion is better or having large code with thousend of lines would be better? there is no clear right or wrong here. But I would go with "D" becuse there is "Jobs" not "Job"

Comment 6

ID: 1351116 User: choprat1 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 03 Feb 2025 22:30 Selected Answer: D Upvotes: 1

Managing individual DAGs is the best way when they're too different.

Comment 7

ID: 1330668 User: f74ca0c Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 23 Dec 2024 05:00 Selected Answer: C Upvotes: 3

A single shared DAG is efficient to manage, and table-specific transformations can be handled using parameters (e.g., passing table names and configurations dynamically).
Triggering the DAG using a Cloud Storage object notification and a Cloud Function ensures the workflow starts immediately upon data arrival.
Event-driven architecture minimizes delays and provides the freshest data to users.
Efficient, maintainable, and event-driven.

Comment 8

ID: 1237569 User: 8ad5266 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 26 Jun 2024 18:15 Selected Answer: C Upvotes: 3

This explains why it's not D:
maintainable workflow to process hundreds of tables and provide the freshest data to your end users

How is creating a DAG for each of the hundreds of tables maintainable?

Comment 8.1

ID: 1354290 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 10 Feb 2025 07:21 Selected Answer: - Upvotes: 1

It's possible to generate multiple DAGs programmatically. That's the reason for C. https://cloud.google.com/blog/products/data-analytics/optimize-cloud-composer-via-better-airflow-dags -> look at #5

Comment 9

ID: 1153238 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 12:53 Selected Answer: D Upvotes: 1

Option D

Comment 10

ID: 1121564 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:35 Selected Answer: D Upvotes: 3

Option D, which gets triggered when the data comes in and accounts for the fact that each table has its own set of transformations

Comment 11

ID: 1115315 User: Jordan18 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 06 Jan 2024 17:50 Selected Answer: - Upvotes: 3

why not C?

Comment 11.1

ID: 1160058 User: cuadradobertolinisebastiancami Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 21:40 Selected Answer: - Upvotes: 2

It says that the transformations for each table are very different

Comment 11.2

ID: 1122224 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 14 Jan 2024 04:03 Selected Answer: - Upvotes: 5

Same question: why not use a single DAG to manage them, since there are hundreds of tables?

Comment 12

ID: 1113963 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 19:16 Selected Answer: D Upvotes: 2

- Option D: Tailored handling and scheduling for each table; triggered by data arrival for more timely and efficient processing.

Comment 13

ID: 1112736 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 13:24 Selected Answer: D Upvotes: 1

D.
1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.
2. Create a separate DAG for each table that needs to go through the pipeline.
3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
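Several comments note that the "separate DAG per table" in option D need not mean hundreds of hand-written files: DAGs can be generated from a config loop. A minimal sketch of that pattern, using plain dicts as stand-ins for DAG objects so it stays self-contained (in real Airflow code each iteration would build a DAG and register it, e.g. `globals()[dag_id] = dag`; the table names and SQL are hypothetical):

```python
# Hedged sketch of programmatic per-table DAG generation. Dicts stand in
# for Airflow DAG objects; table configs and SQL are hypothetical.
TABLES = {
    "clicks":   {"transform_sql": "SELECT ... FROM staging.clicks"},
    "sessions": {"transform_sql": "SELECT ... FROM staging.sessions"},
}

def build_dag(table: str, cfg: dict) -> dict:
    return {
        "dag_id": f"load_{table}",
        # sequential tasks mirroring option D: Dataproc, then BigQuery
        "tasks": ["dataproc_transform", "bq_transform"],
        "transform_sql": cfg["transform_sql"],
        "schedule": None,  # event-driven: fired by the GCS object notification
    }

dags = {t: build_dag(t, c) for t, c in TABLES.items()}
print(sorted(d["dag_id"] for d in dags.values()))  # ['load_clicks', 'load_sessions']
```

This keeps per-table logic isolated (the maintainability argument for D) while still being driven from one piece of code (the maintainability argument for C).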

6. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 6

Sequence
33
Discussion ID
16639
Source URL
https://www.examtopics.com/discussions/google/view/16639-exam-professional-data-engineer-topic-1-question-6/
Posted By
-
Posted At
March 15, 2020, 8:43 a.m.

Question

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?

  • A. Issue a command to restart the database servers.
  • B. Retry the query with exponential backoff, up to a cap of 15 minutes.
  • C. Retry the query every second until it comes back online to minimize staleness of data.
  • D. Reduce the query frequency to once every hour until the database comes back online.

Suggested Answer

B

Comments
19 comments

Comment 1

ID: 213170 User: Radhika7983 Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Thu 05 Nov 2020 04:24 Selected Answer: - Upvotes: 56

The correct answer is B. App Engine applications should use Cloud SQL database connections effectively. Below is what is written in the Google Cloud documentation.

If your application attempts to connect to the database and does not succeed, the database could be temporarily unavailable. In this case, sending too many simultaneous connection requests might waste additional database resources and increase the time needed to recover. Using exponential backoff prevents your application from sending an overwhelming number of connection requests when it can't connect to the database.

This retry only makes sense when first connecting, or when first grabbing a connection from the pool. If errors happen in the middle of a transaction, the application must do the retrying, and it must retry from the beginning of a transaction. So even if your pool is configured properly, the application might still see errors if connections are lost.

reference link is https://cloud.google.com/sql/docs/mysql/manage-connections

Comment 2

ID: 137289 User: llamaste Badges: Highly Voted Relative Date: 5 years, 7 months ago Absolute Date: Fri 17 Jul 2020 16:53 Selected Answer: - Upvotes: 12

https://cloud.google.com/sql/docs/mysql/manage-connections#backoff

Comment 3

ID: 1618520 User: 3244fd8 Badges: Most Recent Relative Date: 4 months, 3 weeks ago Absolute Date: Mon 20 Oct 2025 05:32 Selected Answer: B Upvotes: 1

Retry the query with exponential backoff, up to a cap of 15 minutes.

Comment 4

ID: 1399883 User: willyunger Badges: - Relative Date: 12 months ago Absolute Date: Tue 18 Mar 2025 00:10 Selected Answer: B Upvotes: 1

Exponential backoff avoids swamping the server. Higher retry rates may only make the problem worse. The front end should not have the option to restart the DB.

Comment 5

ID: 1339956 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 13 Jan 2025 16:34 Selected Answer: B Upvotes: 1

We should use exponential backoff because it reduces load on the failing database, optimizes retry timing, and is the industry best practice. Exponential backoff is a retry strategy where the wait time between retries increases exponentially. By gradually increasing the retry interval, the system avoids wasting resources on immediate retries when the database is likely still down.

Comment 6

ID: 1050468 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:07 Selected Answer: B Upvotes: 3

Exponential backoff is a commonly used technique to handle temporary failures, such as a database server becoming temporarily unavailable. This approach retries the query, initially with a short delay and then with increasingly longer intervals between retries. Setting a cap of 15 minutes ensures that you don't excessively burden your system with constant retries.

Option C (retrying the query every second) can be too aggressive and may lead to excessive load on the server when it comes back online.

Option D (reducing the query frequency to once every hour) would result in significantly stale data and a poor user experience, which is generally not desirable for a weather app.

Option A (issuing a command to restart the database servers) is not a suitable action for a frontend component and might not address the issue effectively. Database server restarts should be managed as a part of the infrastructure and not initiated by the frontend.

Comment 7

ID: 529920 User: samdhimal Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:06 Selected Answer: - Upvotes: 2

correct answer -> Retry the query with exponential backoff, up to a cap of 15 minutes.

If your application attempts to connect to the database and does not succeed, the database could be temporarily unavailable. In this case, sending too many simultaneous connection requests might waste additional database resources and increase the time needed to recover. Using exponential backoff prevents your application from sending an overwhelming number of connection requests when it can't connect to the database.

Comment 7.1

ID: 784802 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:56 Selected Answer: - Upvotes: 2

Exponential backoff with a cap is a common technique used to handle temporary failures, such as database outages. In this approach, the frontend will retry the query with increasing intervals (e.g., 1s, 2s, 4s, 8s, etc.) up to a maximum interval (in this case, 15 minutes), this will help to avoid overwhelming the database servers with too many requests at once, and minimize the impact of the failure on the users.

Option A, is not recommended because it's not guaranteed that restarting the database servers will fix the problem, it could be a network or a configuration problem and it could cause more downtime.

Option C is not recommended because it could cause too many requests to be sent to the server, overwhelming the database and causing more downtime.

Option D is not recommended because reducing the query frequency too much would result in stale data, and users will not receive the most up-to-date information.

Comment 8

ID: 1065065 User: RT_G Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 19:11 Selected Answer: B Upvotes: 1

Retries with exponential backoff seems like the most efficient option in this scenario

Comment 9

ID: 1061045 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 04:16 Selected Answer: B Upvotes: 1

Correct answer is B

Comment 10

ID: 999527 User: gudguy1a Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 05 Sep 2023 15:06 Selected Answer: B Upvotes: 1

good answer, good answer @radhika7983.

Comment 11

ID: 916192 User: Datardp Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 06 Jun 2023 13:42 Selected Answer: - Upvotes: 1

B is the answer

Comment 12

ID: 901965 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 19 May 2023 15:27 Selected Answer: B Upvotes: 1

I agree with the exponential backoff technique, even though I do not see why 15 minutes should be the desired choice.

Comment 12.1

ID: 901968 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 19 May 2023 15:30 Selected Answer: - Upvotes: 1

I guess that when you have failed after 15 minutes, your app must go through a serious review before being used again, since it is not able to provide the updated results as quickly as desired.

Comment 13

ID: 746054 User: yafsong Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 13:24 Selected Answer: - Upvotes: 4

Truncated exponential backoff is a standard error-handling strategy for network applications. In this approach, a client periodically retries a failed request with increasing delays between requests

Comment 14

ID: 721282 User: hiromi Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 18 Nov 2022 13:45 Selected Answer: B Upvotes: 1

B is right

Comment 15

ID: 546452 User: shiv14 Badges: - Relative Date: 4 years ago Absolute Date: Sun 13 Feb 2022 13:51 Selected Answer: B Upvotes: 1

According to the documentation

Comment 16

ID: 523529 User: deep_ROOT Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Fri 14 Jan 2022 13:45 Selected Answer: - Upvotes: 1

B is Correct; this question appeared in Cloud Architect exam also

Comment 17

ID: 474004 User: MaxNRG Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sun 07 Nov 2021 19:27 Selected Answer: - Upvotes: 2

B,
backoff is a standard error handling strategy for network applications in which a client periodically retries a failed request with increasing delays between requests. Clients should use truncated exponential backoff for all requests to Cloud Storage that return HTTP 5xx and 429 response codes, including uploads and downloads of data or metadata.

7. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 69

Sequence
57
Discussion ID
16072
Source URL
https://www.examtopics.com/discussions/google/view/16072-exam-professional-data-engineer-topic-1-question-69/
Posted By
cleroy
Posted At
March 10, 2020, 1:10 p.m.

Question

Your infrastructure includes a set of YouTube channels. You have been tasked with creating a process for sending the YouTube channel data to Google Cloud for analysis. You want to design a solution that allows your world-wide marketing teams to perform ANSI SQL and other types of analysis on up-to-date YouTube channels log data. How should you set up the log data transfer into Google Cloud?

  • A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
  • B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional bucket as a final destination.
  • C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
  • D. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Regional storage bucket as a final destination.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 35 comments Click to expand

Comment 1

ID: 143919 User: VishalB Badges: Highly Voted Relative Date: 5 years, 7 months ago Absolute Date: Sun 26 Jul 2020 10:19 Selected Answer: - Upvotes: 75

Correct Answer: A

The destination is GCS and it is multi-regional, so A is the best option available.

Even though BigQuery Data Transfer Service supports Google application sources like Google Ads, Campaign Manager, Google Ad Manager, and YouTube, it does not support any destination other than a BigQuery dataset.

Comment 1.1

ID: 1319178 User: cloud_rider Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 28 Nov 2024 11:52 Selected Answer: - Upvotes: 1

I differ, destination is not mentioned as GCS, it only says Google Cloud which can mean BigQuery too. And ANSI SQL also points towards this direction.

Comment 1.1.1

ID: 1602408 User: forepick Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 19:15 Selected Answer: - Upvotes: 1

The ANSI SQL requirement can also be met while the data rests on GCS, by querying it from BQ as an external table.
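The external-table point several commenters make can be made concrete: once the files land in the bucket, a single `CREATE EXTERNAL TABLE` statement makes them queryable with ANSI SQL, no load job needed. A small sketch that only assembles the DDL string (the dataset, table, bucket, and prefix names are hypothetical; the statement itself would be run in BigQuery):

```python
def external_table_ddl(dataset, table, bucket, prefix, fmt="NEWLINE_DELIMITED_JSON"):
    """Assemble BigQuery DDL for an external table over Cloud Storage.

    This helper only builds the string; executing it is done in BigQuery.
    """
    return (
        f"CREATE EXTERNAL TABLE `{dataset}.{table}`\n"
        "OPTIONS (\n"
        f"  format = '{fmt}',\n"
        f"  uris = ['gs://{bucket}/{prefix}/*']\n"
        ");"
    )
```

For example, `external_table_ddl("marketing", "yt_channel_logs", "yt-log-bucket", "logs")` produces a statement the worldwide teams could run against the multi-regional bucket, which is how option A still satisfies the ANSI SQL requirement.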

Comment 1.2

ID: 257176 User: henryCho Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Sat 02 Jan 2021 01:59 Selected Answer: - Upvotes: 7

What about ANSI SQL?

Comment 1.2.1

ID: 610197 User: nadavw Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 01 Jun 2022 15:49 Selected Answer: - Upvotes: 2

use external table for it

Comment 1.2.2

ID: 368364 User: Jphix Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Fri 28 May 2021 01:33 Selected Answer: - Upvotes: 8

I guess they are assuming that you will just query the data in Cloud Storage from BQ. The question specifically is, "How should you set up the log data transfer into Google Cloud?", not "How should you set up the querying." ANSI SQL is a distraction!

Comment 1.3

ID: 438974 User: tainangao Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 04 Sep 2021 09:41 Selected Answer: - Upvotes: 10

Currently, you cannot use the BigQuery Data Transfer Service to transfer data out of BigQuery.

https://cloud.google.com/bigquery-transfer/docs/introduction

Comment 1.3.1

ID: 915461 User: phidelics Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 05 Jun 2023 15:10 Selected Answer: - Upvotes: 4

You can use BQ Data Transfer Service for YouTube channels now

Comment 1.3.2

ID: 905950 User: AmmarFasih Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 15:55 Selected Answer: - Upvotes: 3

but I think now you can use BigQuery Data Transfer Service for youtube channels and many other

Comment 1.4

ID: 431129 User: asksathvik Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Wed 25 Aug 2021 06:24 Selected Answer: - Upvotes: 16

Kindly re-read the question,the question says Google Cloud not Cloud storage...once you master that you will understand the whole question and be able to pick the right answer which is C

Comment 1.4.1

ID: 447016 User: yoshik Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 18 Sep 2021 11:31 Selected Answer: - Upvotes: 4

log-like stuff goes better in buckets

Comment 1.4.2

ID: 432541 User: triipinbee Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 26 Aug 2021 21:20 Selected Answer: - Upvotes: 57

all the options clearly say "storage bucket"; once you master that, you'll realize the correct option is A

Comment 1.4.2.1

ID: 874165 User: Rodrigo4N Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 19 Apr 2023 00:50 Selected Answer: - Upvotes: 2

Gottem!

Comment 1.4.2.2

ID: 663782 User: HarshKothari21 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 18:16 Selected Answer: - Upvotes: 3

good one :)

Comment 2

ID: 76632 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Mon 20 Apr 2020 03:01 Selected Answer: - Upvotes: 21

None of the answers make any sense.
BigQuery Transfer Service is for moving data from various sources (S3, Youtube etc) into BigQuery, not Google Cloud Storage.
Further, how are we supposed to use SQL to query data if it is stored in Cloud Storage?

Comment 2.1

ID: 123531 User: dambilwa Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Tue 30 Jun 2020 15:29 Selected Answer: - Upvotes: 1

Agreed! - All Options look wrong

Comment 2.1.1

ID: 123533 User: dambilwa Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Tue 30 Jun 2020 15:33 Selected Answer: - Upvotes: 5

Option [A] is the least-bad option... for worldwide teams to perform ANSI SQL queries, it would be easier to create an external table or load from a multi-regional bucket... BQ Data Transfer Service is used to push data into BQ, hence ruling out options C & D

Comment 2.1.1.1

ID: 446435 User: StefanoG Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Fri 17 Sep 2021 09:40 Selected Answer: - Upvotes: 3

The best option would be to use "BigQuery Transfer Service" to upload data to BQ. But BQ is not present as a destination, so the only working option is Multi Regional GCS

Comment 2.2

ID: 192110 User: TNT87 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 10:25 Selected Answer: - Upvotes: 9

Kindly re-read the question,the question says Google Cloud not Cloud storage...once you master that you will understand the whole question and be able to pick the right answer which is C

Comment 2.2.1

ID: 192111 User: TNT87 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 10:26 Selected Answer: - Upvotes: 3

https://cloud.google.com/bigquery-transfer/docs/youtube-channel-transfer
this link will help to cement the answer.

Comment 3

ID: 1601770 User: 1479 Badges: Most Recent Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 20:03 Selected Answer: C Upvotes: 1

Therefore, the closest (but still flawed) answer is C or D, but with a crucial correction:
The intended solution is to use BigQuery Data Transfer Service to transfer the data directly from YouTube Analytics into BigQuery. The mention of "offsite backup files" is misleading and likely a distractor.

Comment 4

ID: 1575323 User: theRafael7 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Fri 06 Jun 2025 14:33 Selected Answer: C Upvotes: 1

I will go for C, because A says Cloud Storage is the final destination, and BQ DTS can be used for YouTube channel data, which consists of structured reports.

Comment 5

ID: 1410566 User: abhaya2608 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Wed 26 Mar 2025 21:59 Selected Answer: C Upvotes: 1

BigQuery Data Transfer Service automates data loading into BigQuery from various sources, including Google Cloud Storage. BigQuery is specifically designed for performing ANSI SQL and other types of analysis on large datasets. Cloud Storage can be used as a staging area, especially a multi-region bucket if the teams are worldwide.

Comment 6

ID: 1400764 User: oussama7 Badges: - Relative Date: 11 months, 4 weeks ago Absolute Date: Wed 19 Mar 2025 23:18 Selected Answer: C Upvotes: 1

BigQuery Data Transfer Service (C) is the best choice because it natively integrates with YouTube, automates data transfers, and enables seamless analysis in BigQuery.

Comment 7

ID: 1319177 User: cloud_rider Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 28 Nov 2024 11:51 Selected Answer: C Upvotes: 1

C is the correct answer as the requirement is to transfer data to Google Cloud, which means Big Query and not GCS.

Comment 8

ID: 1301717 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 22 Oct 2024 22:36 Selected Answer: A Upvotes: 1

First, log data is usually semi-structured or unstructured, so GCS is more suitable in this situation. Besides, you can use an external table or BigLake to run BQ queries directly on GCS.

Comment 9

ID: 1288031 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 10:21 Selected Answer: - Upvotes: 1

This is A, because it's "offsite backup files". You can transfer direct from Youtube to BigQuery: https://cloud.google.com/bigquery/docs/youtube-channel-transfer, but this isn't that. It's direct from some backup files to Cloud Storage, all answers mandate that fact, and all that remains is "should this be Storage Transfer, or BigQuery" - clearly this is storage transfer, and all the youtube/ANSI sql stuff is just distraction.

Comment 10

ID: 1259499 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 01 Aug 2024 19:52 Selected Answer: A Upvotes: 1

E. Use BigQuery Data Transfer Service to transfer the offsite backup files to BigQuery as a final destination.

Comment 11

ID: 1207699 User: ABKR1300 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 07 May 2024 05:36 Selected Answer: - Upvotes: 1

Using BQ data transfer we can only load data into the Bigquery storage, which means a table inside the bq dataset.

What is Storage Transfer Service?
Storage Transfer Service automates the transfer of data to, from, and between object and file storage systems, including Google Cloud Storage, Amazon S3, Azure Storage, on-premises data, and more. It can be used to transfer large amounts of data quickly and reliably, without the need to write any code.

As per the above lines from Google's documentation on Storage Transfer Service, we can go with option A.

For additional info, have a look at the below link.
https://cloud.google.com/storage-transfer/docs/overview

Comment 12

ID: 1196439 User: zevexWM Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 16 Apr 2024 09:41 Selected Answer: A Upvotes: 2

Correct answer: A
Solution should cater a worldwide solution which makes B and D invalid.
You don't use BigQuery Data transfer to move data to a bucket. So C is also invalid.

Comment 13

ID: 1173382 User: AshishDhamu Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 13:15 Selected Answer: A Upvotes: 2

BigQuery Data Transfer Service can only transfer into BigQuery, not into Cloud Storage. So answer A is correct.

Comment 14

ID: 1164712 User: demoro86 Badges: - Relative Date: 2 years ago Absolute Date: Sun 03 Mar 2024 11:58 Selected Answer: A Upvotes: 2

It makes no sense to use BQ Data Transfer Service to store the data in a storage bucket ... It is obviously A

Comment 15

ID: 1075177 User: rocky48 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 20 Nov 2023 07:04 Selected Answer: C Upvotes: 2

To transfer YouTube channel data to Google Cloud for analysis, you can use the BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination. This service allows you to automatically schedule and manage recurring load jobs for YouTube Channel reports. The BigQuery Data Transfer Service for YouTube Channel reports supports the following reporting option: Channel Reports (automatically loaded into BigQuery). When you transfer data from a YouTube Channel into BigQuery, the data is loaded into BigQuery tables that are partitioned by date.

Comment 16

ID: 1044560 User: ziyunxiao Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 16 Oct 2023 01:35 Selected Answer: - Upvotes: 1

The correct answer is C. https://cloud.google.com/bigquery/docs/dts-introduction

Comment 17

ID: 1013438 User: exnaniantwort Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 22 Sep 2023 01:18 Selected Answer: - Upvotes: 2

not c
The BigQuery Data Transfer Service automates data movement into [[[[ BigQuery ]]]] on a scheduled, managed basis. Your analytics team can lay the foundation for a BigQuery data warehouse without writing a single line of code.
...
After you configure a data transfer, the BigQuery Data Transfer Service automatically loads data into [[[[[ BigQuery ]]]]] on a regular basis.

8. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 256

Sequence
59
Discussion ID
130209
Source URL
https://www.examtopics.com/discussions/google/view/130209-exam-professional-data-engineer-topic-1-question-256/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 4:56 p.m.

Question

You are deploying an Apache Airflow directed acyclic graph (DAG) in a Cloud Composer 2 instance. You have incoming files in a Cloud Storage bucket that the DAG processes, one file at a time. The Cloud Composer instance is deployed in a subnetwork with no Internet access. Instead of running the DAG based on a schedule, you want to run the DAG in a reactive way every time a new file is received. What should you do?

  • A. 1. Enable Private Google Access in the subnetwork, and set up Cloud Storage notifications to a Pub/Sub topic.
    2. Create a push subscription that points to the web server URL.
  • B. 1. Enable the Cloud Composer API, and set up Cloud Storage notifications to trigger a Cloud Function.
    2. Write a Cloud Function instance to call the DAG by using the Cloud Composer API and the web server URL.
    3. Use VPC Serverless Access to reach the web server URL.
  • C. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance.
    2. Create a Private Service Connect (PSC) endpoint.
    3. Write a Cloud Function that connects to the Cloud Composer cluster through the PSC endpoint.
  • D. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance.
    2. Write a Cloud Function instance to call the DAG by using the Airflow REST API and the web server URL.
    3. Use VPC Serverless Access to reach the web server URL.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 16 comments Click to expand

Comment 1

ID: 1114552 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 15:12 Selected Answer: C Upvotes: 12

- Enable Airflow REST API: In Cloud Composer, enable the "Airflow web server" option.
- Set Up Cloud Storage Notifications: Create a notification for new files, routing to a Cloud Function.
- Create PSC Endpoint: Establish a PSC endpoint for Cloud Composer.
- Write Cloud Function: Code the function to use the Airflow REST API (via PSC endpoint) to trigger the DAG.

========
Why not Option D
- Using the web server URL directly wouldn't work without internet access or a direct path to the web server.
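The "call the DAG by using the Airflow REST API" step in options C and D boils down to a POST against the Airflow 2 stable REST API. A stdlib-only sketch of building that request (the web server URL and DAG id below are placeholders; a real Cloud Function must also attach an identity-token `Authorization` header and route the call through the PSC endpoint or connector):

```python
import json
import urllib.request


def build_dag_trigger_request(web_server_url, dag_id, conf=None):
    """Build a POST request for Airflow's stable REST API endpoint
    /api/v1/dags/{dag_id}/dagRuns, which creates a new DAG run."""
    url = f"{web_server_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The `conf` payload is how the Cloud Function would pass the name of the newly arrived Cloud Storage object into the DAG run, so the DAG processes exactly that file.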

Comment 1.1

ID: 1127089 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 20 Jan 2024 07:47 Selected Answer: - Upvotes: 5

Why not B, use Cloud Composer API

Comment 2

ID: 1262084 User: STEVE_PEGLEG Badges: Highly Voted Relative Date: 1 year, 7 months ago Absolute Date: Wed 07 Aug 2024 13:58 Selected Answer: A Upvotes: 7

This is the guidance how to use method in A:
https://cloud.google.com/composer/docs/composer-2/triggering-gcf-pubsub
"In this specific example, you create a Cloud Function and deploy two DAGs. The first DAG pulls Pub/Sub messages and triggers the second DAG according to the Pub/Sub message content."

For C & D, this guidance says it can't be done when you have Private IP or VPC Service Controls set up:
https://cloud.google.com/composer/docs/composer-2/triggering-with-gcf#check_your_environments_networking_configuration
"This solution does not work in Private IP and VPC Service Controls configurations because it is not possible to configure connectivity from Cloud Functions to the Airflow web server in these configurations."

Comment 3

ID: 1600968 User: Zek Badges: Most Recent Relative Date: 6 months, 3 weeks ago Absolute Date: Thu 21 Aug 2025 12:50 Selected Answer: C Upvotes: 1

Accessing the Airflow REST API on Cloud Composer without internet access typically involves configuring a Private IP environment and utilizing Private Service Connect.
https://cloud.google.com/composer/docs/composer-2/private-ip-environments#private_ip_environment

Comment 4

ID: 1582336 User: 56d02cd Badges: - Relative Date: 8 months, 1 week ago Absolute Date: Wed 02 Jul 2025 04:07 Selected Answer: B Upvotes: 1

I don't think the PSC endpoint is really needed. The cloud function can connect to the Airflow webserver through VPC serverless access, which is easier to configure than a PSC endpoint.

Comment 5

ID: 1566259 User: aditya_ali Badges: - Relative Date: 10 months, 1 week ago Absolute Date: Sun 04 May 2025 22:07 Selected Answer: C Upvotes: 1

PSC is the only secure way to reach the Airflow REST API privately from a serverless service in a VPC-restricted Cloud Composer environment.

Therefore, Option C provides the most secure and functional architecture.

Comment 6

ID: 1351166 User: Augustax Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 04 Feb 2025 02:52 Selected Answer: B Upvotes: 1

Option B is the only viable solution because:

It uses the Cloud Composer API, which is compatible with Private IP configurations.

It leverages VPC Serverless Access to allow Cloud Functions to securely access the Airflow web server within the subnetwork.

It avoids the limitations of the Airflow REST API in Private IP environments.

Comment 7

ID: 1294609 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 08 Oct 2024 09:01 Selected Answer: A Upvotes: 2

This is A, as steve_pegleg says, there is no way to connect the cloud function to the Airflow instance, without first enabling private access. The pubsub pattern makes sense in this context.

Comment 8

ID: 1213887 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 19 May 2024 18:57 Selected Answer: A Upvotes: 4

C is not correct because "this solution does not work in Private IP and VPC Service Controls configurations because it is not possible to configure connectivity from Cloud Functions to the Airflow web server in these configurations".
https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf
The correct answer is A using Pub/Sub https://cloud.google.com/composer/docs/composer-2/triggering-gcf-pubsub

Comment 9

ID: 1184163 User: chrissamharris Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 27 Mar 2024 15:16 Selected Answer: D Upvotes: 4

Why not Option C? C involves creating a Private Service Connect (PSC) endpoint, which, while viable for creating private connections to Google services, adds complexity and might not be required when simpler solutions like VPC Serverless Access (as in Option D) can suffice.

Comment 9.1

ID: 1184165 User: chrissamharris Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 27 Mar 2024 15:18 Selected Answer: - Upvotes: 2

https://cloud.google.com/vpc/docs/serverless-vpc-access: Serverless VPC Access makes it possible for you to connect directly to your Virtual Private Cloud (VPC) network from serverless environments such as Cloud Run, App Engine, or Cloud Functions

Comment 10

ID: 1181835 User: d11379b Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 17:09 Selected Answer: D Upvotes: 4

The answer should be D
Serverless VPC Access makes it possible for you to connect directly to your Virtual Private Cloud (VPC) network from serverless environments such as Cloud Run, App Engine, or Cloud Functions. Configuring Serverless VPC Access allows your serverless environment to send requests to your VPC network by using internal DNS and internal IP addresses (as defined by RFC 1918 and RFC 6598). The responses to these requests also use your internal network.
You can use Serverless VPC Access to access Compute Engine VM instances, Memorystore instances, and any other resources with internal DNS or internal IP address.
(Reference: https://cloud.google.com/vpc/docs/serverless-vpc-access)
When you use the Airflow REST API to trigger the job, the URL is based on the private IP address of the Cloud Composer instance, so you need to use Serverless VPC Access for it.

Comment 10.1

ID: 1181837 User: d11379b Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 17:10 Selected Answer: - Upvotes: 2

Why not C:
The reference here (https://cloud.google.com/vpc/docs/private-service-connect#published-services) limits the available use cases:
Private Service Connect supports access to the following types of managed services:
Published VPC-hosted services, which include the following:
Google published services, such as Apigee or the GKE control plane
Third-party published services provided by Private Service Connect partners
Intra-organization published services, where the consumer and producer might be two different VPC networks within the same company
Google APIs, such as Cloud Storage or BigQuery

Unfortunately your airflow Rest API is not published as a service in the list, so you can not use it
This is also one of the reasons why you should reject A

Comment 10.1.1

ID: 1181840 User: d11379b Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 17:11 Selected Answer: - Upvotes: 2

B is not appropriate: while the Cloud Composer API can indeed execute Airflow commands, it does not run a DAG via the web server URL in this case, and I doubt it is really possible

Comment 11

ID: 1121737 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 15:30 Selected Answer: C Upvotes: 1

Option C, raaad explained well why

Comment 12

ID: 1112912 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 16:56 Selected Answer: C Upvotes: 1

C.
1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance.
2. Create a Private Service Connect (PSC) endpoint.
3. Write a Cloud Function that connects to the Cloud Composer cluster through the PSC endpoint.

9. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 315

Sequence
60
Discussion ID
153020
Source URL
https://www.examtopics.com/discussions/google/view/153020-exam-professional-data-engineer-topic-1-question-315/
Posted By
FireAtMe
Posted At
Dec. 16, 2024, 3:39 a.m.

Question

You are using Workflows to call an API that returns a 1KB JSON response, apply some complex business logic on this response, wait for the logic to complete, and then perform a load from a Cloud Storage file to BigQuery. The Workflows standard library does not have sufficient capabilities to perform your complex logic, and you want to use Python's standard library instead. You want to optimize your workflow for simplicity and speed of execution. What should you do?

  • A. Create a Cloud Composer environment and run the logic in Cloud Composer.
  • B. Create a Dataproc cluster, and use PySpark to apply the logic on your JSON file.
  • C. Invoke a Cloud Function instance that uses Python to apply the logic on your JSON file.
  • D. Invoke a subworkflow in Workflows to apply the logic on your JSON file.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 3 comments Click to expand

Comment 1

ID: 1602332 User: Arvi1984 Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 15:40 Selected Answer: C Upvotes: 1

Find the GCP alternative for "does not have sufficient capabilities to perform your complex logic", which is Cloud Functions

Comment 2

ID: 1571086 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 22:09 Selected Answer: C Upvotes: 1

"A" not possible since you will be runing the same logic inside of airflow nothing else.
dataproc is unneeded.

Comment 3

ID: 1327142 User: FireAtMe Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 16 Dec 2024 03:39 Selected Answer: C Upvotes: 2

Cloud Functions is a serverless compute service ideal for executing lightweight, event-driven tasks with low latency.
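Option C keeps Workflows as the orchestrator and pushes only the complex step into a small Python function. A sketch of what that function's core might look like (the field names and the "business logic" itself are placeholder assumptions; in deployment it would sit behind an HTTP handler that Workflows invokes with `http.post`):

```python
import json
import statistics


def apply_business_logic(payload: str) -> str:
    """Parse the ~1KB JSON API response, derive summary fields using only
    the Python standard library, and return JSON for the next step."""
    record = json.loads(payload)
    values = record.get("measurements", [])
    record["summary"] = {
        "count": len(values),
        "mean": statistics.fmean(values) if values else None,
    }
    return json.dumps(record)
```

Because the payload is tiny and the logic is pure Python, a function like this cold-starts and finishes in well under a second, which is why Composer (A) and Dataproc (B) are overkill here.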

10. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 202

Sequence
115
Discussion ID
79652
Source URL
https://www.examtopics.com/discussions/google/view/79652-exam-professional-data-engineer-topic-1-question-202/
Posted By
ducc
Posted At
Sept. 3, 2022, 4:15 a.m.

Question

Your platform on your on-premises environment generates 100 GB of data daily, composed of millions of structured JSON text files. Your on-premises environment cannot be accessed from the public internet. You want to use Google Cloud products to query and explore the platform data. What should you do?

  • A. Use Cloud Scheduler to copy data daily from your on-premises environment to Cloud Storage. Use the BigQuery Data Transfer Service to import data into BigQuery.
  • B. Use a Transfer Appliance to copy data from your on-premises environment to Cloud Storage. Use the BigQuery Data Transfer Service to import data into BigQuery.
  • C. Use Transfer Service for on-premises data to copy data from your on-premises environment to Cloud Storage. Use the BigQuery Data Transfer Service to import data into BigQuery.
  • D. Use the BigQuery Data Transfer Service dataset copy to transfer all data into BigQuery.

Suggested Answer

C

Answer Description

Community Answer Votes

Comments 23 comments

Comment 1

ID: 878814 User: muhusman Badges: Highly Voted Relative Date: 2 years, 10 months ago Absolute Date: Sun 23 Apr 2023 23:18 Selected Answer: - Upvotes: 10

Therefore, the correct option is C. Use Transfer Service for on-premises data to copy data from your on-premises environment to Cloud Storage. Use the BigQuery Data Transfer Service to import data into BigQuery.

Option A is incorrect because Cloud Scheduler is not designed for data transfer, but rather for scheduling the execution of Cloud Functions, Cloud Run, or App Engine applications.

Option B is incorrect because Transfer Appliance is designed for large-scale data transfers from on-premises environments to Google Cloud and is not suitable for transferring data on a daily basis.

Option D is also incorrect because the BigQuery Data Transfer Service dataset copy feature is designed for copying datasets between BigQuery projects and not suitable for copying data from on-premises environments to BigQuery.

Comment 1.1

ID: 1123179 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 09:23 Selected Answer: - Upvotes: 1

With the BigQuery Data Transfer Service we can copy data not only from other BigQuery datasets but also from a number of cloud services listed here:
https://cloud.google.com/bigquery/docs/dts-introduction
But you are right, it won't work with on-premises sources.

Comment 2

ID: 935242 User: cetanx Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Tue 27 Jun 2023 12:32 Selected Answer: C Upvotes: 7

"Your on-premises environment cannot be accessed from the public internet" statement suggests that inbound traffic from internet is NOT allowed however, it doesn't mean that outbound internet connectivity from on-prem resources is not possible. Any on-prem system with outbound internet access can copy/transfer the CSV files.

The JSON files are located on a filesystem, therefore you cannot copy them with the BigQuery Data Transfer Service.

That leaves only one possible option:
first copy the JSON files to Cloud Storage,
then run the BigQuery Data Transfer Service.

pls refer to https://cloud.google.com/bigquery/docs/dts-introduction#supported_data_sources

Comment 3

ID: 1401726 User: desertlotus1211 Badges: Most Recent Relative Date: 11 months, 3 weeks ago Absolute Date: Sat 22 Mar 2025 00:43 Selected Answer: C Upvotes: 1

I'm torn on this question. Okay, no access from the public internet... does that mean they don't have private lines (e.g. Dedicated/Partner Interconnect)?

Poorly worded. IMO it can be either answer B or C, depending on the interpretation of "public internet".

Comment 4

ID: 1338036 User: marlon.andrei Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 08 Jan 2025 18:49 Selected Answer: B Upvotes: 1

I vote B, because "Your on-premises environment cannot be accessed from the public internet" would only allow data to be extracted internally within the company. So Transfer Appliance is the most appropriate tool.

Comment 5

ID: 1328350 User: namesgeo Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 18 Dec 2024 09:38 Selected Answer: C Upvotes: 1

Transfer Service for on-premises data is designed specifically for this scenario. It uses a private, secure agent-based approach to move data from on-premises environments to Google Cloud Storage.

Comment 5.1

ID: 1328351 User: namesgeo Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 18 Dec 2024 09:39 Selected Answer: - Upvotes: 1

https://cloud.google.com/blog/products/storage-data-transfer/introducing-storage-transfer-service-for-on-premises-data?hl=en

Comment 6

ID: 1293085 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Fri 04 Oct 2024 11:52 Selected Answer: - Upvotes: 1

They don't define "cannot be accessed from the public internet" - does this mean no incoming traffic, or no traffic of any kind regardless of the initiation point? We simply do not know, and so are left guessing. C? Probably, but it could be B, depending.

Comment 7

ID: 923034 User: Takshashila Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 14 Jun 2023 13:19 Selected Answer: C Upvotes: 2

the correct option is C

Comment 8

ID: 833884 User: wjtb Badges: - Relative Date: 3 years ago Absolute Date: Thu 09 Mar 2023 12:09 Selected Answer: - Upvotes: 3

I would say B. It is the ONLY option that is possible without the data being accessible over the public internet (unless we assume that a direct interconnect is already set up, which seems far-fetched). Also, nowhere does it say how up to date the queried data needs to be, or how often we need to query; only that the data grows by 100 GB per day (indicating that it's going to be a lot of data).

Comment 9

ID: 815162 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Mon 20 Feb 2023 12:19 Selected Answer: - Upvotes: 2

Answer C.
What is wrong with B? Key words: daily transfer, so no to Transfer Appliance.

Comment 10

ID: 725635 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 24 Nov 2022 08:43 Selected Answer: - Upvotes: 1

C
D is not the answer because the BigQuery Data Transfer Service doesn't support on-premises sources.

Comment 10.1

ID: 727063 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 25 Nov 2022 21:08 Selected Answer: - Upvotes: 1

B is not the answer because a Transfer Appliance is for one-time bulk transfers, but the question says you want to use Google Cloud products to query and explore the platform data.

"Query and explore" is the key.

Comment 11

ID: 686267 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 17:03 Selected Answer: C Upvotes: 1

Transfer Service for on-premises data is the optimal Google option here (large datasets, available bandwidth, and scheduling).
https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#transfer-options
https://cloud.google.com/blog/products/storage-data-transfer/introducing-storage-transfer-service-for-on-premises-data

The BigQuery Data Transfer Service is good for loading from Cloud Storage to BigQuery.
https://cloud.google.com/bigquery/docs/cloud-storage-transfer

Comment 11.1

ID: 686269 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 17:05 Selected Answer: - Upvotes: 1

Sorry, I was wrong:
(large files > 1 TB + bandwidth available over internal IP communication + daily scheduling)

Comment 11.2

ID: 686268 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 17:03 Selected Answer: - Upvotes: 2

Your on-premises environment cannot be accessed from the public internet.
It signifies that we can apply private connection like Cloud Interconnect https://cloud.google.com/network-connectivity/docs/interconnect/concepts/overview

Comment 12

ID: 668445 User: Wasss123 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Wed 14 Sep 2022 00:17 Selected Answer: C Upvotes: 3

I will go with C

Comment 13

ID: 665459 User: MounicaN Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 10 Sep 2022 16:10 Selected Answer: - Upvotes: 1

I will go with C.

https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#transfer-options

Comment 14

ID: 663630 User: John_Pongthorn Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 14:55 Selected Answer: - Upvotes: 2

C is correct; B is not suitable for a daily cadence.
https://cloud.google.com/transfer-appliance/docs/4.0/overview

Comment 14.1

ID: 686264 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 16:55 Selected Answer: - Upvotes: 1

C
Your on-premises environment cannot be accessed from the public internet.
It signifies that we can apply private connection like Cloud Interconnect https://cloud.google.com/network-connectivity/docs/interconnect/concepts/overview

Comment 15

ID: 663088 User: TNT87 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 06:15 Selected Answer: C Upvotes: 1

Ans C
https://cloud.google.com/storage-transfer/docs/on-prem-agent-best-practices

Comment 16

ID: 662637 User: HarshKothari21 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Wed 07 Sep 2022 17:13 Selected Answer: - Upvotes: 1

I would go with option C.
You need a service to transfer data from on-premises to Cloud Storage, so Transfer Service is the best option; additionally, you can easily configure the network so that data flows through a private network.

Cloud Scheduler, on the other hand, is used mostly for automation. You can schedule a service with it, but in my view it cannot be used on its own to transfer data.

Comment 17

ID: 659764 User: nwk Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 05 Sep 2022 07:44 Selected Answer: - Upvotes: 2

Data is generated daily. Unlikely to ship Transfer Appliance every day.

Vote for C instead. "Transfer Service for on-premises data is a free Google Cloud service that's intended to streamline the process of uploading data into Google Cloud Storage buckets"

https://cloud.google.com/blog/products/storage-data-transfer/introducing-storage-transfer-service-for-on-premises-data

11. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 89

Sequence
121
Discussion ID
79337
Source URL
https://www.examtopics.com/discussions/google/view/79337-exam-professional-data-engineer-topic-1-question-89/
Posted By
nwk
Posted At
Sept. 2, 2022, 10:26 a.m.

Question

You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency.
What should you do?

  • A. Provide latitude and longitude as input vectors to your neural net.
  • B. Create a numeric column from a feature cross of latitude and longitude.
  • C. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L1 regularization during optimization.
  • D. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.

Suggested Answer

C

Answer Description

Community Answer Votes

Comments 23 comments

Comment 1

ID: 680119 User: AHUI Badges: Highly Voted Relative Date: 3 years, 5 months ago Absolute Date: Mon 26 Sep 2022 22:06 Selected Answer: - Upvotes: 9

Ans C; use L1 regularization because we know the feature is a strong one. L2 will distribute weights more evenly.

Comment 2

ID: 724240 User: dish11dish Badges: Highly Voted Relative Date: 3 years, 3 months ago Absolute Date: Tue 22 Nov 2022 11:18 Selected Answer: C Upvotes: 8

Option C is correct

Use L1 regularization when you need to assign greater importance to more influential features; it shrinks less important features to 0.
L2 regularization performs better when all input features influence the output and the weights are of roughly equal size.

Comment 3

ID: 1398855 User: desertlotus1211 Badges: Most Recent Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:57 Selected Answer: D Upvotes: 2

L1 regularization (Option C) would encourage sparsity but may eliminate too many features, which can be detrimental when you need to capture subtle geographic differences

Comment 4

ID: 1302127 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 23 Oct 2024 18:54 Selected Answer: D Upvotes: 3

I would choose D. L1 will ignore the irrelevant features; however, we know that latitude and longitude are crucial for this model, and we can't take away their influence. L2 helps in preventing overfitting.

Comment 5

ID: 1297803 User: MohaSa1 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 14 Oct 2024 21:47 Selected Answer: A Upvotes: 1

This does not seem useful: minute-level bucketizing will create 3,600 possible buckets per square degree, which is not practical and yields a sparse feature space. Option A seems to be a better choice.

Comment 6

ID: 1251957 User: Snnnnneee Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 20 Jul 2024 21:17 Selected Answer: B Upvotes: 1

Bucketing into minutes is imprecise: properties up to 1.8 km apart are grouped together, which is way too much for real estate.
Therefore B.

Comment 7

ID: 1015518 User: uday_examtopic Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 08:09 Selected Answer: - Upvotes: 2

Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.

Like option C, we bucketize at the minute level, but this time we apply L2 regularization. L2 regularization, or Ridge Regression, discourages large values of weights in the model without forcing them to become sparse. It can help prevent overfitting, especially when we have a large number of features (as a result of bucketizing and crossing).

Given the options, D. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization seems to be the most appropriate. Bucketizing at the minute level captures localized patterns, and L2 regularization can help control the complexity of the model without enforcing sparsity.

Comment 8

ID: 1013010 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Thu 21 Sep 2023 13:09 Selected Answer: - Upvotes: 3

What does bucketizing at the minute level mean in the context of this question?

Comment 8.1

ID: 1065719 User: Surely1987 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 08 Nov 2023 16:27 Selected Answer: - Upvotes: 4

Coordinates are written with degrees, minutes, and seconds (one minute being equal to about 1.8 km). So you group your coordinates into buckets with minute precision.

Comment 9

ID: 991358 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 27 Aug 2023 11:43 Selected Answer: B Upvotes: 2

I strongly believe it's B.

Comment 10

ID: 960310 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 11:56 Selected Answer: B Upvotes: 1

The right answer is B. What does bucketizing the feature cross of latitude and longitude even mean? They are not a time feature. C and D don't make sense to me, and the L1 regularization doesn't answer anything in the question. The only validly engineered feature here is option B; A is not an engineered feature.

Create a feature cross of latitude and longitude, bucketize it at the minute level and use L1 regularization during optimization.

Comment 10.1

ID: 1288163 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 16:36 Selected Answer: - Upvotes: 1

Bucketizing means we say "anything in this one-minute (~1.8 km) square region is considered a single area". It's actually a recommended default way to deal with lat/lon, since they don't really work as separate columns (or at least we'd be hoping the fully connected net buckets them intelligently itself, which it mostly won't).
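
A minute-level feature cross can be sketched in a few lines of Python (the grid-cell naming and the sample coordinates are illustrative, not from the question):

```python
import math

def minute_bucket(degrees: float) -> int:
    """Bucketize a coordinate at the minute level (1 degree = 60 minutes).
    One minute of latitude spans roughly 111.32 km / 60, i.e. about 1.86 km."""
    return math.floor(degrees * 60)

def lat_lon_cross(lat: float, lon: float) -> str:
    """Feature cross of the two bucketized coordinates: one categorical
    value per grid cell, which the model can learn a weight for."""
    return f"{minute_bucket(lat)}_{minute_bucket(lon)}"

# Two nearby properties fall in the same cell; one across town does not.
print(lat_lon_cross(40.7128, -74.0060))   # a lower-Manhattan cell
print(lat_lon_cross(40.7130, -74.0055))   # same cell as the one above
print(lat_lon_cross(40.8610, -73.8900))   # a different cell entirely
```

In a real pipeline this categorical value would be one-hot encoded (or hashed) before feeding the network; L1 regularization then zeroes out the weights of cells that carry no signal, which is the rationale behind answer C.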

Comment 11

ID: 949122 User: Jojo9400 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 11 Jul 2023 18:01 Selected Answer: - Upvotes: 1

D

You have to use L2: since you have created a new variable from two existing ones, the risk of multicollinearity is high. L1 is good for selecting features to avoid the curse of dimensionality, not for multicollinearity.

Comment 12

ID: 886321 User: ga8our Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 15:44 Selected Answer: - Upvotes: 1

Why not L2? L2 (Ridge) uses a squared value coefficient as a penalty term to the loss function, while L1 (Lasso) uses an absolute value coefficient. Isn't a squared penalty stronger than an absolute one?
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

Comment 12.1

ID: 1013008 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Thu 21 Sep 2023 13:08 Selected Answer: - Upvotes: 2

L1 regression forces unimportant coefficients to zero. Since the location is extremely important, L1 will force less important coefficients to zero, thereby further increasing the importance of the location coefficient.
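
The sparsity difference above can be seen in a one-dimensional toy problem with closed-form minimizers (the loss `(w - a)^2` plus each penalty; a sketch for illustration, not part of the question):

```python
import math

def l1_minimizer(a: float, lam: float) -> float:
    """Closed-form minimizer of (w - a)^2 + lam * |w|: soft-thresholding.
    Setting the subgradient to zero gives w = sign(a) * max(|a| - lam/2, 0),
    so a weak coefficient is pushed EXACTLY to zero."""
    return math.copysign(max(abs(a) - lam / 2, 0.0), a)

def l2_minimizer(a: float, lam: float) -> float:
    """Closed-form minimizer of (w - a)^2 + lam * w^2: uniform shrinkage.
    Setting the derivative to zero gives w = a / (1 + lam), which shrinks
    every coefficient but never makes it exactly zero."""
    return a / (1 + lam)

# A weak feature (a = 0.3) under penalty strength lam = 1.0:
print(l1_minimizer(0.3, 1.0))  # L1 drives it all the way to 0.0
print(l2_minimizer(0.3, 1.0))  # L2 only shrinks it to 0.15
```

This is why L1 concentrates weight on the strong location feature while L2 spreads weight across all features.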

Comment 13

ID: 880267 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 13:12 Selected Answer: - Upvotes: 2

gpt: Option C and D suggest bucketizing the feature cross of latitude and longitude at the minute level and using L1 or L2 regularization during optimization. While regularization can help prevent overfitting, bucketizing at such a granular level may not be necessary and could lead to overfitting. It's also not clear how bucketizing at the minute level would capture the spatial relationship between the latitude and longitude features.

Comment 14

ID: 789538 User: PolyMoe Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 27 Jan 2023 12:39 Selected Answer: D Upvotes: 2

D. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization. This will create a new feature that captures the physical dependency of the location of the property on the price, and bucketing it at the minute level will reduce the number of unique values and prevent overfitting. L2 regularization will also help to prevent overfitting by penalizing large weights in the model.

Comment 14.1

ID: 906509 User: cetanx Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 10:11 Selected Answer: - Upvotes: 1

chat-gpt also says D
explanation:
This approach effectively creates a grid of the geographical area in your data, allowing the model to learn weights for each grid cell (bucket). This helps capture the spatial relationship between latitude and longitude, which can be crucial for real estate prices. Additionally, using L2 regularization helps prevent overfitting by discouraging complex models, which can be particularly important when working with high-dimensional crossed features.

Comment 15

ID: 666212 User: crismo04 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 16:49 Selected Answer: - Upvotes: 2

https://medium.com/riga-data-science-club/geographic-coordinate-encoding-with-tensorflow-feature-columns-e750ae338b7c#:~:text=to%20the%20rescue!-,Feature%20Crosses,-Combining%20features%20into

Comment 15.1

ID: 666214 User: crismo04 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 16:50 Selected Answer: - Upvotes: 1

Feature cross seems to be the right feature option

Comment 15.1.1

ID: 666217 User: crismo04 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 16:51 Selected Answer: - Upvotes: 4

So it's B option

Comment 16

ID: 658035 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 05:43 Selected Answer: C Upvotes: 5

C. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L1 regularization during optimization.

Comment 17

ID: 657177 User: nwk Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 10:26 Selected Answer: - Upvotes: 1

C or D?
https://medium.com/riga-data-science-club/geographic-coordinate-encoding-with-tensorflow-feature-columns-e750ae338b7c

12. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 138

Sequence
133
Discussion ID
79675
Source URL
https://www.examtopics.com/discussions/google/view/79675-exam-professional-data-engineer-topic-1-question-138/
Posted By
ducc
Posted At
Sept. 3, 2022, 6:39 a.m.

Question

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

  • A. Create a Cloud Dataproc Workflow Template
  • B. Create an initialization action to execute the jobs
  • C. Create a Directed Acyclic Graph in Cloud Composer
  • D. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster

Suggested Answer

C

Answer Description

Community Answer Votes

Comments 17 comments

Comment 1

ID: 685048 User: LP_PDE Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Sun 02 Apr 2023 22:25 Selected Answer: - Upvotes: 5

Correct answer is A. https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows

Comment 2

ID: 1350643 User: skhaire Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sun 02 Feb 2025 21:25 Selected Answer: - Upvotes: 1

A. Create a Cloud Dataproc Workflow Template
A Dataproc Workflow Template can be used to run jobs concurrently and sequentially. A DAG is overkill.
https://cloud.google.com/dataproc/docs/concepts/workflows/use-workflows

Comment 3

ID: 1099526 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 08:21 Selected Answer: C Upvotes: 3

The best option for automating your scheduled Spark jobs on Cloud Dataproc, considering sequential and concurrent execution, is:
C. Create a Directed Acyclic Graph (DAG) in Cloud Composer.

Comment 3.1

ID: 1099527 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 08:22 Selected Answer: - Upvotes: 2

Here's why:
DAG workflows: Cloud Composer excels at orchestrating complex workflows with dependencies, making it ideal for managing sequential and concurrent execution of your Spark jobs. You can define dependencies between tasks to ensure certain jobs only run after others finish.
Automation: Cloud Composer lets you schedule workflows to run automatically based on triggers like time intervals or data availability, eliminating the need for manual intervention.
Integration: Cloud Composer integrates seamlessly with Cloud Dataproc, allowing you to easily launch and manage your Spark clusters within the workflow.
Scalability: Cloud Composer scales well to handle a large number of jobs and workflows, making it suitable for managing complex data pipelines.

Comment 3.1.1

ID: 1099528 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 08:22 Selected Answer: - Upvotes: 3

While the other options have some merit, they fall short in certain aspects:
A. Cloud Dataproc Workflow Templates: While workflow templates can automate job submission on a cluster, they lack the ability to define dependencies and coordinate concurrent execution effectively.
B. Initialization action: An initialization action can only run a single script before a Dataproc cluster starts, not suitable for orchestrating multiple scheduled jobs with dependencies.
D. Bash script: A Bash script might work for simple cases, but it can be cumbersome to manage and lacks the advanced scheduling and error handling capabilities of Cloud Composer.
Therefore, utilizing a Cloud Composer DAG offers the most comprehensive and flexible solution for automating your scheduled Spark jobs with sequential and concurrent execution on Cloud Dataproc.

Comment 4

ID: 1075792 User: emmylou Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 19:34 Selected Answer: C Upvotes: 1

I thought it might be A but the templates can only run sequentially, not concurrently.

Comment 5

ID: 1015434 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 05:37 Selected Answer: C Upvotes: 1

Directed Acyclic Graph (DAG): Cloud Composer is a managed Apache Airflow service that allows you to create and manage workflows as DAGs. You can define a DAG that includes tasks for running Spark jobs in sequence or concurrently.

Scheduling: Cloud Composer provides built-in scheduling capabilities, allowing you to specify when and how often your DAGs should run. You can schedule the execution of your Spark jobs at specific times or intervals.

Dependency Management: In a DAG, you can define dependencies between tasks. This means you can set up tasks to run sequentially or concurrently based on your requirements. For example, you can specify that Job B runs after Job A has completed, or you can schedule jobs to run concurrently when there are no dependencies.

Comment 6

ID: 837728 User: midgoo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 13 Sep 2023 07:37 Selected Answer: C Upvotes: 2

I would choose A if there were one more step to schedule the template; as it is, it's like creating a DAG without scheduling it in Airflow.
So only option C is correct here.

Comment 7

ID: 762726 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 17:00 Selected Answer: - Upvotes: 3

C. Create a Directed Acyclic Graph in Cloud Composer

Comment 8

ID: 749781 User: saurabhsingh4k Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 11:50 Selected Answer: A Upvotes: 2

Why go for an expensive Composer environment when you only have to create and schedule a DAG of Dataproc jobs? A is sufficient.

Comment 8.1

ID: 765581 User: captainbu Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 11:54 Selected Answer: - Upvotes: 5

I would have gone for Workflow Templates as well, but they lack the scheduling capability, so you would need Cloud Composer (or Cloud Functions or Cloud Scheduler) anyway. Hence C seems to be the better solution.

Pls see here:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions

Comment 9

ID: 696541 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 16 Apr 2023 22:34 Selected Answer: C Upvotes: 4

C.
Composer is a better fit for scheduling Dataproc workflows; check the documentation:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions

Also, A is not enough: a Dataproc Workflow Template itself doesn't have a native scheduling option.

Comment 10

ID: 696029 User: louisgcpde Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 16 Apr 2023 08:13 Selected Answer: C Upvotes: 1

So I think the answer should be C (Composer).

Comment 11

ID: 696028 User: louisgcpde Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 16 Apr 2023 08:12 Selected Answer: - Upvotes: 2

To me, the point is to "automate" the process, so a Composer DAG is needed; it can be used together with a Dataproc Workflow Template.

Comment 12

ID: 690366 User: dmzr Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 09 Apr 2023 18:59 Selected Answer: A Upvotes: 2

Ans A makes more sense, since the question is about Dataproc jobs only.

Comment 13

ID: 663080 User: HarshKothari21 Badges: - Relative Date: 3 years ago Absolute Date: Wed 08 Mar 2023 07:01 Selected Answer: C Upvotes: 1

Option c

Comment 14

ID: 658070 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 07:39 Selected Answer: C Upvotes: 1

You have jobs running both in sequence and concurrently, so Composer is the choice for me.
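
For reference on the A-versus-C debate: a Workflow Template can express both sequential and concurrent jobs via `prerequisiteStepIds`; what it lacks is built-in scheduling, which is why the community leans toward Composer. A hypothetical template (class names, bucket, and cluster name are made up) could look like this:

```yaml
# Hypothetical Dataproc workflow template: job-a and job-b run concurrently;
# job-c starts only after both have finished.
placement:
  managedCluster:
    clusterName: spark-cluster
jobs:
- stepId: job-a
  sparkJob:
    mainClass: com.example.JobA
    jarFileUris:
    - gs://example-bucket/jobs.jar
- stepId: job-b
  sparkJob:
    mainClass: com.example.JobB
    jarFileUris:
    - gs://example-bucket/jobs.jar
- stepId: job-c
  prerequisiteStepIds:
  - job-a
  - job-b
  sparkJob:
    mainClass: com.example.JobC
    jarFileUris:
    - gs://example-bucket/jobs.jar
```

The template still needs an external trigger (Composer, Cloud Scheduler, or Cloud Functions) to run on a schedule, per the workflow-schedule-solutions documentation cited in this discussion.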

13. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 198

Sequence
139
Discussion ID
79648
Source URL
https://www.examtopics.com/discussions/google/view/79648-exam-professional-data-engineer-topic-1-question-198/
Posted By
ducc
Posted At
Sept. 3, 2022, 4 a.m.

Question

You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use a Google managed service to simplify and automate the task. You also want to accommodate Shared VPC networking considerations. What should you do?

  • A. Use Dataflow for your workflow pipelines. Use Cloud Run triggers for scheduling.
  • B. Use Dataflow for your workflow pipelines. Use shell scripts to schedule workflows.
  • C. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the host project.
  • D. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.

Suggested Answer

D

Answer Description

Community Answer Votes

Comments 15 comments

Comment 1

ID: 658057 User: AWSandeep Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 06:25 Selected Answer: D Upvotes: 17

D. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.

Shared VPC requires that you designate a host project to which networks and subnetworks belong and a service project, which is attached to the host project. When Cloud Composer participates in a Shared VPC, the Cloud Composer environment is in the service project.

Reference:
https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc

Comment 2

ID: 1346841 User: loki82 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sun 26 Jan 2025 09:05 Selected Answer: D Upvotes: 1

Using the host project to deploy services is a bad practice and should only be done if the service doesn't support Shared VPC. It may be easier networking-wise, but that's why it's the wrong answer. Cloud Composer supports Shared VPC, so it should go in its own service project.

Comment 3

ID: 1306667 User: 8284a4c Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Sun 03 Nov 2024 23:21 Selected Answer: C Upvotes: 1

Place in host project for network connectivity

Comment 4

ID: 1305782 User: ToiToi Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 01 Nov 2024 13:02 Selected Answer: C Upvotes: 2

The recommended approach is to place Cloud Composer resources in the host project of the Shared VPC. This centralizes network management, simplifies connectivity, and enhances security by adhering to the principle of least privilege.

Comment 5

ID: 1217339 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 11:21 Selected Answer: C Upvotes: 2

Placing Cloud Composer resources in the service project can lead to more complex network configurations and management overhead compared to placing them in the host project, which is designed to manage Shared VPC resources.

Comment 6

ID: 961449 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 12:09 Selected Answer: - Upvotes: 2

Please correct me if I am wrong: I think it is option C, because I feel option D is incorrect, as placing the Cloud Composer resources in the service project would not allow you to access resources in the host project.

Comment 6.1

ID: 1064317 User: spicebits Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 06 Nov 2023 23:46 Selected Answer: - Upvotes: 1

https://cloud.google.com/composer/docs/composer-2/configure-shared-vpc#shared-vpc-guidelines

Comment 7

ID: 915547 User: Ender_H Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 05 Jun 2023 17:56 Selected Answer: A Upvotes: 1

✅ Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.

- Cloud Composer is a managed Apache Airflow service. Airflow is an open-source tool to programmatically author, schedule, and monitor pipelines, which fits your needs perfectly.
- In a Shared VPC configuration, Cloud Composer resources should be placed in the service project. This provides network isolation while still allowing the Cloud Composer environment to communicate with resources in the host project.
- With Shared VPC, the host project's network (including its subnets and secondary IP ranges) is shared by other service projects, which promotes network peering, and it's compliant with the networking considerations of GKE.

Comment 7.1

ID: 1012394 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Wed 20 Sep 2023 16:11 Selected Answer: - Upvotes: 2

That's answer D though.

Comment 8

ID: 729322 User: [Removed] Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 28 Nov 2022 16:30 Selected Answer: D Upvotes: 3

D is the answer.

https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc
Shared VPC enables organizations to establish budgeting and access control boundaries at the project level while allowing for secure and efficient communication using private IPs across those boundaries. In the Shared VPC configuration, Cloud Composer can invoke services hosted in other Google Cloud projects in the same organization without exposing services to the public internet.

Comment 8.1

ID: 763424 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 02:04 Selected Answer: - Upvotes: 2

Agreed

Comment 8.1.1

ID: 885888 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 07:12 Selected Answer: - Upvotes: 1

agreed to what..

Comment 8.1.1.1

ID: 885899 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 07:41 Selected Answer: - Upvotes: 1

D it is, as per doc link, provided by users. thx

Comment 9

ID: 725632 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 24 Nov 2022 08:31 Selected Answer: - Upvotes: 2

D
Shared VPC requires that you designate a host project to which networks and subnetworks belong and a service project, which is attached to the host project.
https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc#:~:text=This%20page%20describes,the%20service%20project.

Comment 10

ID: 657974 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 04:00 Selected Answer: D Upvotes: 2

D according to documentation

Shared VPC requires that you designate a host project to which networks and subnetworks belong and a service project, which is attached to the host project. When Cloud Composer participates in a Shared VPC, the Cloud Composer environment is in the service project.

https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc#set_up_shared_vpc_and_attach_the_service_project
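For reference, the host/service relationship described in the documentation above is established with Shared VPC commands before the Composer environment is created; a minimal sketch, with project IDs as placeholders (run by a Shared VPC Admin at the organization level):

```
# Enable Shared VPC on the project that owns the network (host project)
gcloud compute shared-vpc enable HOST_PROJECT_ID

# Attach the service project; the Composer environment is then created
# in SERVICE_PROJECT_ID, using subnets shared from the host project
gcloud compute shared-vpc associated-projects add SERVICE_PROJECT_ID \
    --host-project HOST_PROJECT_ID
```

Additional IAM grants (for example, to the Composer service agent on the host project's network) are also required; see the linked configuration guide for the full list.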

14. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 114

Sequence
143
Discussion ID
16628
Source URL
https://www.examtopics.com/discussions/google/view/16628-exam-professional-data-engineer-topic-1-question-114/
Posted By
madhu1171
Posted At
March 15, 2020, 4:16 a.m.

Question

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

  • A. Cloud Dataflow
  • B. Cloud Composer
  • C. Cloud Dataprep
  • D. Cloud Dataproc

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 64129 User: madhu1171 Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Wed 15 Sep 2021 03:16 Selected Answer: - Upvotes: 30

Answer should be B

Comment 2

ID: 194785 User: Darlee Badges: Highly Voted Relative Date: 3 years, 11 months ago Absolute Date: Thu 07 Apr 2022 04:44 Selected Answer: - Upvotes: 8

How come the `Correct Answer` is so ridiculously WRONG?

Comment 2.1

ID: 463704 User: squishy_fishy Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 17 Apr 2023 22:57 Selected Answer: - Upvotes: 1

Ha ha.. I know.

Comment 3

ID: 1342790 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 23:46 Selected Answer: B Upvotes: 1

Cloud Composer is considered suitable across multiple cloud providers, as it is built on Apache Airflow, which allows for workflow orchestration across different cloud environments and even on-premises data centers, making it a good choice for multi-cloud strategies; however, its tightest integration is with Google Cloud Platform services.

Comment 4

ID: 911421 User: forepick Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 30 Nov 2024 19:31 Selected Answer: B Upvotes: 3

No other option is aimed for this purpose

Comment 5

ID: 843736 User: juliobs Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 19 Sep 2024 12:27 Selected Answer: B Upvotes: 1

Airflow

Comment 6

ID: 758147 User: PrashantGupta1616 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 27 Jun 2024 05:06 Selected Answer: B Upvotes: 2

Cloud Composer is Airflow

Comment 7

ID: 738258 User: odacir Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 07 Jun 2024 18:57 Selected Answer: B Upvotes: 3

Cloud Composer is Airflow, It's made for this job.

Comment 8

ID: 518527 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 19:31 Selected Answer: B Upvotes: 1

https://cloud.google.com/composer/

Comment 9

ID: 516252 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 06:05 Selected Answer: B Upvotes: 6

B:
Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers.
https://cloud.google.com/composer/
Cloud Composer can help create workflows that connect data, processing, and services across clouds, giving you a unified data environment.
Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.
Cloud Composer gives you the ability to connect your pipeline through a single orchestration tool whether your workflow lives on-premises, in multiple clouds, or fully within GCP. The ability to author, schedule, and monitor your workflows in a unified manner means you can break down the silos in your environment and focus less on infrastructure.

Comment 9.1

ID: 516253 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 06:06 Selected Answer: - Upvotes: 1

Option A is wrong as Google Cloud Dataflow is a fully managed service for strongly consistent, parallel data-processing pipelines. It does not support multi-cloud orchestration.
Option C is wrong as Cloud Dataprep is a serverless service for visually exploring, cleaning, and preparing data, not an orchestration tool.
Option D is wrong as Google Cloud Dataproc is a fast, easy-to-use, managed Spark and Hadoop service for distributed data processing.

Comment 10

ID: 487092 User: JG123 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 26 May 2023 05:34 Selected Answer: B Upvotes: 3

Answer: B

Comment 11

ID: 421964 User: sandipk91 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 09 Feb 2023 09:35 Selected Answer: - Upvotes: 3

Cloud composer

Comment 12

ID: 396975 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 17:55 Selected Answer: - Upvotes: 5

Vote for B

Comment 13

ID: 308886 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 13:49 Selected Answer: - Upvotes: 4

B:
Hybrid and multi-cloud
Ease your transition to the cloud or maintain a hybrid data environment by orchestrating workflows that cross between on-premises and the public cloud. Create workflows that connect data, processing, and services across clouds to give you a unified data environment.
https://cloud.google.com/composer#section-2

Comment 14

ID: 285314 User: someshsehgal Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sun 07 Aug 2022 05:53 Selected Answer: - Upvotes: 3

Correct B: without any doubt.

Comment 15

ID: 222516 User: arghya13 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Thu 19 May 2022 07:13 Selected Answer: - Upvotes: 4

B - Cloud Composer works in a multi-cloud environment

Comment 16

ID: 216390 User: Alasmindas Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 06:07 Selected Answer: - Upvotes: 5

There cannot be a simpler question than this to choose the right answer, "Cloud Composer". I really feel someone must have deliberately selected the wrong answers on ExamTopics to confuse people....

Comment 17

ID: 216026 User: Cloud_Enthusiast Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 09 May 2022 15:15 Selected Answer: - Upvotes: 4

Composer is the obvious answer. so B

15. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 108

Sequence
208
Discussion ID
17209
Source URL
https://www.examtopics.com/discussions/google/view/17209-exam-professional-data-engineer-topic-1-question-108/
Posted By
Rajokkiyam
Posted At
March 22, 2020, 6:48 a.m.

Question

You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to
BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?

  • A. Create a Direct Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
  • B. Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.
  • C. Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
  • D. Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 66819 User: Rajokkiyam Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Mon 22 Mar 2021 06:48 Selected Answer: - Upvotes: 22

Create dependency in Cloud Composer and schedule it.
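To make this concrete, the three workflows can hang off one daily DAG; a minimal sketch, assuming the two Dataflow jobs are launched from templates and the on-premises ingestion is wrapped in a script (operator choices, template paths, and the script are illustrative assumptions, not from the question):

```python
# Illustrative Cloud Composer DAG; names and paths are hypothetical.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id="three_workflows",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,  # scheduled daily, still manually triggerable from UI/CLI
) as dag:
    # Job 2: pull data from on-premises servers into Cloud Storage
    ingest_onprem = BashOperator(
        task_id="ingest_onprem",
        bash_command="/opt/scripts/upload_to_gcs.sh ",  # hypothetical script
    )

    # Job 3: Dataflow pipeline loading third-party data into Cloud Storage
    third_party_to_gcs = DataflowTemplatedJobStartOperator(
        task_id="third_party_to_gcs",
        template="gs://example-bucket/templates/third_party",  # hypothetical
    )

    # Job 1: Dataflow pipeline transforming Cloud Storage data into BigQuery
    gcs_to_bigquery = DataflowTemplatedJobStartOperator(
        task_id="gcs_to_bigquery",
        template="gs://example-bucket/templates/transform",  # hypothetical
    )

    # Transform runs only after both upload jobs have landed their data
    [ingest_onprem, third_party_to_gcs] >> gcs_to_bigquery
```

Scheduling, monitoring, and manual execution then come for free from the Airflow UI that Composer hosts.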

Comment 1.1

ID: 708468 User: MisuLava Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 31 Oct 2023 15:02 Selected Answer: - Upvotes: 1

The jobs are not interdependent, just 3 individual jobs.

Comment 2

ID: 68774 User: [Removed] Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Sun 28 Mar 2021 06:40 Selected Answer: - Upvotes: 10

Answer: A
Description: Cloud composer is used to schedule the interdependent jobs

Comment 2.1

ID: 500156 User: marioferrulli Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 12 Dec 2022 19:20 Selected Answer: - Upvotes: 1

but they are not

Comment 3

ID: 1053550 User: maxu Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Fri 25 Oct 2024 09:30 Selected Answer: - Upvotes: 1

yes answer A

Comment 4

ID: 911336 User: forepick Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 31 May 2024 15:47 Selected Answer: A Upvotes: 1

Cloud Composer. No doubt

Comment 5

ID: 762271 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 20:45 Selected Answer: - Upvotes: 1

A is correct

Comment 6

ID: 758876 User: dconesoko Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 27 Dec 2023 18:52 Selected Answer: A Upvotes: 1

Cloud composer's DAG would manage the dependencies

Comment 7

ID: 627639 User: danielfootc Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 01:55 Selected Answer: - Upvotes: 2

This should be A

Comment 8

ID: 518513 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 06 Jan 2023 20:21 Selected Answer: A Upvotes: 3

https://cloud.google.com/composer/docs/how-to/using/writing-dags

Comment 9

ID: 514688 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 22:23 Selected Answer: A Upvotes: 5

Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers.
https://cloud.google.com/composer/?hl=en

Comment 10

ID: 506252 User: kishanu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 21 Dec 2022 17:15 Selected Answer: A Upvotes: 6

A
Though the jobs are not dependent, they are data-driven. Refer to the below link:
https://cloud.google.com/blog/topics/developers-practitioners/choosing-right-orchestrator-google-cloud

Comment 10.1

ID: 514701 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 22:43 Selected Answer: - Upvotes: 1

nice article thanks!

Comment 11

ID: 486756 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 25 Nov 2022 16:35 Selected Answer: A Upvotes: 3

Cloud Composer

Comment 12

ID: 486753 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 25 Nov 2022 16:34 Selected Answer: - Upvotes: 3

Correct: A

Comment 13

ID: 421948 User: sandipk91 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 09 Aug 2022 08:03 Selected Answer: - Upvotes: 5

should be option A

Comment 14

ID: 396868 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 02 Jul 2022 14:32 Selected Answer: - Upvotes: 4

Vote for A

Comment 15

ID: 285213 User: someshsehgal Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 07 Feb 2022 04:16 Selected Answer: - Upvotes: 3

Correct A: Couldn't understand why an option with no connection to the actual problem has been given as the correct option (D).

Comment 16

ID: 222487 User: arghya13 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Fri 19 Nov 2021 07:22 Selected Answer: - Upvotes: 2

I'll go for A

Comment 17

ID: 179033 User: Tanmoyk Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 14 Sep 2021 04:48 Selected Answer: - Upvotes: 2

Should be A

16. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 196

Sequence
218
Discussion ID
79646
Source URL
https://www.examtopics.com/discussions/google/view/79646-exam-professional-data-engineer-topic-1-question-196/
Posted By
ducc
Posted At
Sept. 3, 2022, 3:57 a.m.

Question

You have 15 TB of data in your on-premises data center that you want to transfer to Google Cloud. Your data changes weekly and is stored in a POSIX-compliant source. The network operations team has granted you 500 Mbps bandwidth to the public internet. You want to follow Google-recommended practices to reliably transfer your data to Google Cloud on a weekly basis. What should you do?

  • A. Use Cloud Scheduler to trigger the gsutil command. Use the -m parameter for optimal parallelism.
  • B. Use Transfer Appliance to migrate your data into a Google Kubernetes Engine cluster, and then configure a weekly transfer job.
  • C. Install Storage Transfer Service for on-premises data in your data center, and then configure a weekly transfer job.
  • D. Install Storage Transfer Service for on-premises data on a Google Cloud virtual machine, and then configure a weekly transfer job.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 11 comments Click to expand

Comment 1

ID: 814123 User: musumusu Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Mon 19 Aug 2024 13:39 Selected Answer: - Upvotes: 7

answer C,
To avoid confusion: the Storage Transfer Service agent is always installed on the external (non-Google) server or data center that needs to connect to the Google service.

Comment 2

ID: 876656 User: Prudvi3266 Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 17:00 Selected Answer: C Upvotes: 3

C is the answer: we need a weekly run, and Storage Transfer Service has a built-in scheduling feature.

Comment 3

ID: 725895 User: NicolasN Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 14:00 Selected Answer: C Upvotes: 3

The fact that it's a POSIX source makes the setup of Storage Transfer Service agents necessary.
This detail makes [C] the correct answer, since the agent must be installed in the data center hosting the files.
--
Some excerpts:
(an older version of documentation was definite)
"The following is a high-level overview of how Transfer service for on-premises data works:
1.Install Docker and run a small piece of software, called an agent, in your private data center. "
Source: https://web.archive.org/web/20210529161414/https://cloud.google.com/storage-transfer/docs/on-prem-overview
--
"Storage Transfer Service agents are applications running inside a Docker container, that coordinate with Storage Transfer Service to read data from POSIX file system sources, and/or write data to POSIX file system sinks.
If your transfer does not involve a POSIX file system, you do not need to set up agents."
Source: https://cloud.google.com/storage-transfer/docs/managing-on-prem-agents

Comment 4

ID: 725629 User: Atnafu Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 07:19 Selected Answer: - Upvotes: 2

C
Storage Transfer Service agents are applications running inside a Docker container, that coordinate with Storage Transfer Service to read data from POSIX file system sources, and/or write data to POSIX file system sinks.

https://cloud.google.com/storage-transfer/docs/managing-on-prem-agents#:~:text=Storage%20Transfer%20Service%20agents,agents%20on%20your%20servers.

Comment 5

ID: 676100 User: namo621 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Fri 22 Mar 2024 15:18 Selected Answer: - Upvotes: 1

why can't it be D?

Comment 6

ID: 667483 User: Wasss123 Badges: - Relative Date: 2 years ago Absolute Date: Wed 13 Mar 2024 00:53 Selected Answer: C Upvotes: 2

I vote for C

Comment 7

ID: 666699 User: TNT87 Badges: - Relative Date: 2 years ago Absolute Date: Tue 12 Mar 2024 10:27 Selected Answer: - Upvotes: 2

Ans C
https://cloud.google.com/storage-transfer/docs/overview

Comment 8

ID: 665462 User: MounicaN Badges: - Relative Date: 2 years ago Absolute Date: Sun 10 Mar 2024 17:15 Selected Answer: - Upvotes: 3

can you help with difference between c and d ?

Comment 8.1

ID: 725216 User: gudiking Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 23 May 2024 14:56 Selected Answer: - Upvotes: 2

If you install the software for on-premises data on a Google Cloud VM, then it's not on-premises, it's on GCP, so it can't access your on-premises data.

Comment 9

ID: 658050 User: AWSandeep Badges: - Relative Date: 2 years ago Absolute Date: Sun 03 Mar 2024 07:15 Selected Answer: C Upvotes: 2

C. Install Storage Transfer Service for on-premises data in your data center, and then configure a weekly transfer job.

Comment 10

ID: 657971 User: ducc Badges: - Relative Date: 2 years ago Absolute Date: Sun 03 Mar 2024 04:57 Selected Answer: C Upvotes: 2

C is correct

17. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 206

Sequence
234
Discussion ID
129853
Source URL
https://www.examtopics.com/discussions/google/view/129853-exam-professional-data-engineer-topic-1-question-206/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:24 a.m.

Question

You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through AI models. What should you do?

  • A. Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets.
  • B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.
  • C. Use BigQuery to ingest, prepare, and then analyze the data, and then run queries to create views.
  • D. Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 1172576 User: Y___ash Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 13 Sep 2024 13:15 Selected Answer: B Upvotes: 2

Dataflow for Real-Time Processing: Dataflow allows you to process data in near-real time, making it well-suited for identifying longtail and outlier data points as they occur. You can use Dataflow to implement custom data cleansing and outlier detection algorithms that operate on streaming data.

BigQuery as a Sink: Using BigQuery as a sink allows you to store the cleaned and processed data efficiently for further analysis or use in AI models. Dataflow can write the cleaned data to BigQuery tables, enabling seamless integration with downstream processes.
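The "custom outlier detection" the comment mentions can be as simple as a z-score rule applied inside the pipeline's transform step; a minimal pure-Python sketch of the logic you might put in a Beam DoFn before writing clean rows to BigQuery (the threshold and sample values are illustrative assumptions):

```python
import statistics

def split_outliers(values, z_thresh=3.0):
    """Split values into (clean, outliers) using a z-score rule.

    This is the kind of per-batch cleansing logic a Beam DoFn could apply;
    the threshold choice is an assumption, not a fixed best practice.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero variance
    clean, outliers = [], []
    for v in values:
        (outliers if abs(v - mean) / sd > z_thresh else clean).append(v)
    return clean, outliers

clicks = [10, 11, 9, 10, 500]  # one longtail point
clean, outliers = split_outliers(clicks, z_thresh=1.5)
print(clean, outliers)  # → [10, 11, 9, 10] [500]
```

In a real Dataflow job the same function would run per element or per window, with the clean branch going to the BigQuery sink and the outlier branch to a side output for inspection.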

Comment 2

ID: 1151086 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 15 Aug 2024 15:31 Selected Answer: B Upvotes: 1

B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.

Comment 3

ID: 1123188 User: datapassionate Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 15 Jul 2024 08:42 Selected Answer: B Upvotes: 1

B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.

Comment 4

ID: 1121412 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 08:22 Selected Answer: B Upvotes: 1

B: Dataflow solves exactly the use case described

Comment 5

ID: 1115702 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 10:06 Selected Answer: B Upvotes: 2

B is the best option for cleansing the ads data in near real-time before running it through AI models.
The key reasons are:
• Dataflow allows for stream processing of data in near real-time. This allows you to identify and cleanse longtail and outlier data points as the data is streamed in.
• Dataflow has built-in capabilities for detecting and handling outliers and anomalies in streaming data. This makes it well-suited for programmatically identifying longtail and outlier data points.
• Using BigQuery as the output sink allows the cleansed data to be immediately available for analysis and serving to AI models. BigQuery can act as a serving layer for the models.
• Options A, C, and D either don't provide real-time processing (A and C) or don't easily integrate with BigQuery for analysis and serving (D).

Comment 5.1

ID: 1115703 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 10:06 Selected Answer: - Upvotes: 1

So B is the best architecture here to meet the needs of near real-time cleansing, identification of longtail/outlier data points, and integration with BigQuery for serving AI models.

Comment 6

ID: 1112142 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 02 Jul 2024 18:37 Selected Answer: B Upvotes: 1

- Dataflow is a fully managed service for stream and batch data processing and is well-suited for real-time data processing tasks like identifying longtail and outlier data points.
- Using BigQuery as a sink allows to efficiently store the cleansed and processed data for further analysis and serving it to AI models.

Comment 7

ID: 1109522 User: e70ea9e Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 08:24 Selected Answer: B Upvotes: 1

Real-time Data Processing: Dataflow excels at handling large-scale, streaming data with low latency, enabling near-real-time cleansing.
Scalability: Easily scales to handle growing data volumes and processing needs.
Programmatic Data Cleaning: Allows you to write custom logic in Apache Beam for identifying longtail and outlier data points accurately and efficiently.
Integration with BigQuery: Seamless integration with BigQuery enables you to store cleansed data for AI model training and historical analytics.
Cost-Effective: Dataflow's pay-as-you-go model optimizes costs for real-time data processing.

18. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 194

Sequence
242
Discussion ID
79529
Source URL
https://www.examtopics.com/discussions/google/view/79529-exam-professional-data-engineer-topic-1-question-194/
Posted By
PhuocT
Posted At
Sept. 2, 2022, 7:58 p.m.

Question

An online brokerage company requires a high volume trade processing architecture. You need to create a secure queuing system that triggers jobs. The jobs will run in Google Cloud and call the company's Python API to execute trades. You need to efficiently implement a solution. What should you do?

  • A. Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.
  • B. Write an application hosted on a Compute Engine instance that makes a push subscription to the Pub/Sub topic.
  • C. Write an application that makes a queue in a NoSQL database.
  • D. Use Cloud Composer to subscribe to a Pub/Sub topic and call the Python API.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 850941 User: lucaluca1982 Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Sun 26 Mar 2023 13:30 Selected Answer: - Upvotes: 6

A and D are both good. I go for A because we have high volume, and A is easy to scale and optimizes cost.

Comment 2

ID: 1240087 User: kajitsu Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Mon 01 Jul 2024 11:45 Selected Answer: D Upvotes: 2

D is the answer.

Comment 2.1

ID: 1272061 User: nadavw Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 25 Aug 2024 10:31 Selected Answer: - Upvotes: 2

There is no need for Composer just to call a Python API; it's overkill.

Comment 3

ID: 1052275 User: squishy_fishy Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 23 Oct 2023 23:37 Selected Answer: - Upvotes: 3

The answer is D. At work we use solution A (Cloud Functions) for a low volume of Pub/Sub messages, and solution D (Composer) for a high volume of Pub/Sub messages.

Comment 4

ID: 814117 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Sun 19 Feb 2023 14:27 Selected Answer: - Upvotes: 4

Answer A:
Assume the company wants to buy immediately, within the same second, if a stock goes up or down.
The event source is connected to Pub/Sub as a sink; the push subscription then immediately delivers to the subscriber (a Cloud Function), which calls their Python API (internal application) to make the purchase.
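Mechanically, a push subscription POSTs a JSON envelope with a base64-encoded `message.data` field to the function's HTTPS endpoint; a minimal sketch of the decoding step (the trade payload, IDs, and the `execute_trade` call are hypothetical):

```python
import base64
import json

def handle_push(envelope: dict) -> dict:
    """Decode the JSON envelope that a Pub/Sub push subscription POSTs to
    its HTTPS endpoint (here, a Cloud Function) and return the payload."""
    data = envelope["message"]["data"]  # base64-encoded by Pub/Sub
    payload = json.loads(base64.b64decode(data).decode("utf-8"))
    # Here the function would call the company's (hypothetical) trade API,
    # e.g. execute_trade(payload).
    return payload

# Simulated push delivery for a single trade message
trade = {"symbol": "GOOG", "qty": 10, "side": "BUY"}
envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(trade).encode()).decode(),
        "messageId": "1234",
    },
    "subscription": "projects/demo/subscriptions/trades-push",
}
print(handle_push(envelope))  # → {'symbol': 'GOOG', 'qty': 10, 'side': 'BUY'}
```

Because push delivery retries until the endpoint returns a 2xx status, the queue is durable and the function scales with trade volume.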

Comment 5

ID: 763420 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 01:57 Selected Answer: - Upvotes: 1

A. Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.

Comment 6

ID: 714817 User: GCPCloudArchitectUser Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 10 Nov 2022 00:50 Selected Answer: A Upvotes: 4

Because a trading platform requires secure transmission to the queuing system.
If you use Cloud Composer, then we need some other job to trigger Composer … would that be the Cloud Composer API or a Cloud Function …

Comment 7

ID: 682342 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 29 Sep 2022 06:46 Selected Answer: - Upvotes: 1

https://cloud.google.com/functions/docs/calling/pubsub

Comment 8

ID: 663139 User: TNT87 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 07:16 Selected Answer: A Upvotes: 4

Ans A
https://cloud.google.com/functions/docs/calling/pubsub#deployment

Comment 9

ID: 661144 User: YorelNation Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 06 Sep 2022 13:18 Selected Answer: A Upvotes: 3

A, because D has stupidly high latency

Comment 10

ID: 659758 User: nwk Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 05 Sep 2022 07:34 Selected Answer: - Upvotes: 1

Vote A, can't see the need for composer

Comment 11

ID: 659092 User: soichirokawa Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 04 Sep 2022 10:38 Selected Answer: - Upvotes: 3

A should be enough. Cloud Composer would be overkill.

Comment 12

ID: 658047 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 05:58 Selected Answer: - Upvotes: 4

A. Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.

Comment 13

ID: 657968 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 03:54 Selected Answer: D Upvotes: 1

D is a more recommended way by Google, IMO.

Comment 13.1

ID: 1052273 User: squishy_fishy Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 23 Oct 2023 23:36 Selected Answer: - Upvotes: 2

I agree. At work we use solution A (Cloud Functions) for a low volume of Pub/Sub messages, and Composer for a high volume of Pub/Sub messages.

Comment 14

ID: 657687 User: PhuocT Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 19:58 Selected Answer: - Upvotes: 1

A. more sense to me.

Comment 14.1

ID: 657967 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 03:53 Selected Answer: - Upvotes: 2

Composer supports exception handling and retries for complex pipelines.
D might be correct.

19. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 301

Sequence
253
Discussion ID
129913
Source URL
https://www.examtopics.com/discussions/google/view/129913-exam-professional-data-engineer-topic-1-question-301/
Posted By
chickenwingz
Posted At
Dec. 30, 2023, 9:51 p.m.

Question

You are architecting a data transformation solution for BigQuery. Your developers are proficient with SQL and want to use the ELT development technique. In addition, your developers need an intuitive coding environment and the ability to manage SQL as code. You need to identify a solution for your developers to build these pipelines. What should you do?

  • A. Use Dataform to build, manage, and schedule SQL pipelines.
  • B. Use Dataflow jobs to read data from Pub/Sub, transform the data, and load the data to BigQuery.
  • C. Use Data Fusion to build and execute ETL pipelines.
  • D. Use Cloud Composer to load data and run SQL pipelines by using the BigQuery job operators.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 5 comments Click to expand

Comment 1

ID: 1115420 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 20:36 Selected Answer: A Upvotes: 8

- Aligns with ELT Approach: Dataform is designed for ELT (Extract, Load, Transform) pipelines, directly executing SQL transformations within BigQuery, matching the developers' preference.
- SQL as Code: It enables developers to write and manage SQL transformations as code, promoting version control, collaboration, and testing.
- Intuitive Coding Environment: Dataform provides a user-friendly interface and familiar SQL syntax, making it easy for SQL-proficient developers to adopt.
- Scheduling and Orchestration: It includes built-in scheduling capabilities to automate pipeline execution, simplifying pipeline management.

Comment 2

ID: 1156207 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 07:21 Selected Answer: A Upvotes: 1

Option A

Comment 3

ID: 1121889 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 17:08 Selected Answer: A Upvotes: 2

Definitely A

Comment 4

ID: 1113690 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 12:57 Selected Answer: A Upvotes: 3

A. Use Dataform to build, manage, and schedule SQL pipelines.

Comment 5

ID: 1109968 User: chickenwingz Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 20:51 Selected Answer: A Upvotes: 1

Dataform = transformations in SQL
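
As a sketch of what this looks like in practice: a Dataform pipeline step is a SQLX file, i.e. a config block plus a SELECT statement, managed as code in Git. The schema and table names below are hypothetical:

```sqlx
config {
  type: "table",
  schema: "analytics",
  description: "Lifetime value per customer"
}

SELECT
  customer_id,
  SUM(order_total) AS lifetime_value
FROM ${ref("raw_orders")}
GROUP BY customer_id
```

The `ref()` function is what lets Dataform infer dependencies between SQL files and schedule them in the right order.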

20. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 223

Sequence
257
Discussion ID
129870
Source URL
https://www.examtopics.com/discussions/google/view/129870-exam-professional-data-engineer-topic-1-question-223/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:49 a.m.

Question

You are building an ELT solution in BigQuery by using Dataform. You need to perform uniqueness and null value checks on your final tables. What should you do to efficiently integrate these checks into your pipeline?

  • A. Build BigQuery user-defined functions (UDFs).
  • B. Create Dataplex data quality tasks.
  • C. Build Dataform assertions into your code.
  • D. Write a Spark-based stored procedure.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 6 comments Click to expand

Comment 1

ID: 1113645 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 12:04 Selected Answer: C Upvotes: 7

- Dataform provides a feature called "assertions," which are essentially SQL-based tests that you can define to verify the quality of your data.
- Assertions in Dataform are a built-in way to perform data quality checks, including checking for uniqueness and null values in your tables.

Comment 2

ID: 1152532 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 12:14 Selected Answer: C Upvotes: 2

https://docs.dataform.co/guides/assertions

Comment 3

ID: 1125744 User: tibuenoc Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 18 Jul 2024 10:43 Selected Answer: C Upvotes: 4

https://cloud.google.com/dataform/docs/assertions

Comment 4

ID: 1115900 User: Alex3551 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 14:15 Selected Answer: C Upvotes: 1

Agree with C

Comment 5

ID: 1115899 User: Alex3551 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 14:14 Selected Answer: - Upvotes: 1

agree with C

Comment 6

ID: 1109551 User: e70ea9e Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 08:49 Selected Answer: C Upvotes: 3

Native integration:
Dataform assertions are designed specifically for data quality checks within Dataform pipelines, ensuring seamless integration and compatibility. They leverage Dataform's execution model and configuration, aligning with the existing workflow.

Declarative syntax:
Assertions are defined using a simple, declarative syntax within Dataform code, making them easy to write and understand, even for users with less SQL expertise.
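
For concreteness, the uniqueness and null checks can be declared inline in a table's SQLX config block; Dataform compiles each assertion into a query that fails if it returns any rows. The table and column names here are hypothetical:

```sqlx
config {
  type: "table",
  assertions: {
    uniqueKey: ["order_id"],
    nonNull: ["order_id", "customer_id"]
  }
}

SELECT order_id, customer_id, order_total
FROM ${ref("stg_orders")}
```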

21. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 219

Sequence
263
Discussion ID
129866
Source URL
https://www.examtopics.com/discussions/google/view/129866-exam-professional-data-engineer-topic-1-question-219/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:44 a.m.

Question

You orchestrate ETL pipelines by using Cloud Composer. One of the tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-party service. You want to be notified when the task does not succeed. What should you do?

  • A. Assign a function with notification logic to the on_retry_callback parameter for the operator responsible for the task at risk.
  • B. Configure a Cloud Monitoring alert on the sla_missed metric associated with the task at risk to trigger a notification.
  • C. Assign a function with notification logic to the on_failure_callback parameter for the operator responsible for the task at risk.
  • D. Assign a function with notification logic to the sla_miss_callback parameter for the operator responsible for the task at risk.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 7 comments Click to expand

Comment 1

ID: 1266801 User: jonty4gcp Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 16 Aug 2024 05:25 Selected Answer: - Upvotes: 1

What if the task is long-running and gets stuck in between?

Comment 2

ID: 1217491 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 15:19 Selected Answer: C Upvotes: 2

https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html

Comment 3

ID: 1158021 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 24 Feb 2024 16:53 Selected Answer: C Upvotes: 2

on_failure_callback

Comment 4

ID: 1123276 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 12:20 Selected Answer: C Upvotes: 3

on_failure_callback is invoked when the task fails
https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html

Comment 5

ID: 1121508 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 11:27 Selected Answer: C Upvotes: 1

Option C to me

Comment 6

ID: 1113634 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 12:47 Selected Answer: C Upvotes: 4

- The on_failure_callback is a function that gets called when a task fails.
- Assigning a function with notification logic to this parameter is a direct way to handle task failures.
- When the task fails, this function can trigger a notification, making it an appropriate solution for the need to be alerted on task failures.

Comment 7

ID: 1109546 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:44 Selected Answer: C Upvotes: 4

Direct trigger:
The on_failure_callback parameter is specifically designed to invoke a function when a task fails, ensuring immediate notification.

Customizable logic:
You can tailor the notification function to send emails, create alerts, or integrate with other notification systems, providing flexibility.
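
The callback pattern described above can be sketched without a running Airflow deployment. The context dict and `send_alert` below are simplified stand-ins for what Airflow actually passes and for a real notification channel (email, Slack, etc.); in an actual DAG the function would be attached via `on_failure_callback=` on the operator.

```python
# Sketch of an Airflow-style failure callback. Airflow invokes the function
# with a context describing the failed task instance; the notification
# transport is up to you. send_alert is a hypothetical stand-in.

sent_alerts = []  # stand-in for a real notification channel

def send_alert(message: str) -> None:
    sent_alerts.append(message)

def notify_on_failure(context: dict) -> None:
    """Assigned to on_failure_callback; runs only when the task fails."""
    ti = context["task_instance"]
    send_alert(
        f"Task {ti['task_id']} in DAG {ti['dag_id']} failed "
        f"on {context['execution_date']}: {context.get('exception')}"
    )

# In a real DAG this would look like:
#   PythonOperator(task_id="call_third_party", python_callable=...,
#                  on_failure_callback=notify_on_failure)
# Here we simulate Airflow invoking the callback after a failure:
notify_on_failure({
    "task_instance": {"task_id": "call_third_party", "dag_id": "etl_daily"},
    "execution_date": "2024-01-04",
    "exception": TimeoutError("third-party service timed out"),
})
print(sent_alerts[0])
```

Note the contrast with the other options: `on_retry_callback` fires on retries (the task may still succeed), and the SLA mechanisms fire on lateness, not failure.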

22. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 273

Sequence
266
Discussion ID
130514
Source URL
https://www.examtopics.com/discussions/google/view/130514-exam-professional-data-engineer-topic-1-question-273/
Posted By
Smakyel79
Posted At
Jan. 7, 2024, 5:11 p.m.

Question

You are creating the CI/CD cycle for the code of the directed acyclic graphs (DAGs) running in Cloud Composer. Your team has two Cloud Composer instances: one instance for development and another instance for production. Your team is using a Git repository to maintain and develop the code of the DAGs. You want to deploy the DAGs automatically to Cloud Composer when a certain tag is pushed to the Git repository. What should you do?

  • A. 1. Use Cloud Build to copy the code of the DAG to the Cloud Storage bucket of the development instance for DAG testing.
    2. If the tests pass, use Cloud Build to copy the code to the bucket of the production instance.
  • B. 1. Use Cloud Build to build a container with the code of the DAG and the KubernetesPodOperator to deploy the code to the Google Kubernetes Engine (GKE) cluster of the development instance for testing.
    2. If the tests pass, use the KubernetesPodOperator to deploy the container to the GKE cluster of the production instance.
  • C. 1. Use Cloud Build to build a container and the KubernetesPodOperator to deploy the code of the DAG to the Google Kubernetes Engine (GKE) cluster of the development instance for testing.
    2. If the tests pass, copy the code to the Cloud Storage bucket of the production instance.
  • D. 1. Use Cloud Build to copy the code of the DAG to the Cloud Storage bucket of the development instance for DAG testing.
    2. If the tests pass, use Cloud Build to build a container with the code of the DAG and the KubernetesPodOperator to deploy the container to the Google Kubernetes Engine (GKE) cluster of the production instance.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 1119643 User: BIGQUERY_ALT_ALT Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 12:22 Selected Answer: A Upvotes: 11

The answer is A, given that two instances (development and production) are already available and the goal is to deploy DAGs to Cloud Composer, not to build out the entire Composer infrastructure.

Explanation:
- This approach leverages Cloud Build to manage the deployment process.
- It first deploys the code to the Cloud Storage bucket of the development instance for testing purposes.
- If the tests are successful in the development environment, the same Cloud Build process is used to copy the code to the Cloud Storage bucket of the production instance.

B. GKE-based approach is not standard for Cloud Composer. C. GKE used for testing is unconventional for DAG deployments. D. Involves unnecessary GKE deployment for production. Testing DAGs should use Composer instances directly, not Kubernetes containers in GKE.

Comment 2

ID: 1263391 User: meh_33 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 10:51 Selected Answer: A Upvotes: 2

A most confusing question, designed to trip us up. Why would GKE be needed? It's already mentioned that they have two Composer environments.

Comment 3

ID: 1155256 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 06:10 Selected Answer: A Upvotes: 2

Option A

Comment 4

ID: 1121815 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:47 Selected Answer: A Upvotes: 4

Option A. DAGs are routinely stored in Cloud Storage buckets, and Cloud Build acts as the trigger for both the deployment to the test environment and the tests themselves.
https://cloud.google.com/composer/docs/dag-cicd-integration-guide

Comment 5

ID: 1117562 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 15:23 Selected Answer: A Upvotes: 1

I vote for A.

Comment 6

ID: 1116005 User: GCP001 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 17:31 Selected Answer: - Upvotes: 1

C.
It looks like the correct choice: first build, test, and verify everything in the dev environment, then just copy the files to the prod bucket.
https://cloud.google.com/composer/docs/dag-cicd-integration-guide

Comment 6.1

ID: 1117560 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 15:22 Selected Answer: - Upvotes: 1

But why do we need the Google Kubernetes Engine (GKE) cluster for this?

Comment 6.1.1

ID: 1125401 User: GCP001 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 23:52 Selected Answer: - Upvotes: 1

Yea, it should be A

Comment 7

ID: 1115997 User: Smakyel79 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 17:11 Selected Answer: A Upvotes: 2

This approach is straightforward and leverages Cloud Build to automate the deployment process. It doesn't require containerization, making it simpler and possibly quicker.
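
Sketching option A as a Cloud Build configuration (triggered when a tag is pushed): the bucket paths and the test step below are placeholders; each Composer environment's actual DAG bucket can be found with `gcloud composer environments describe`.

```yaml
# cloudbuild.yaml -- runs when a release tag is pushed to the Git repository
steps:
  # 1. Copy DAG code to the development environment's DAG bucket
  - name: gcr.io/cloud-builders/gsutil
    args: ["-m", "rsync", "-r", "-c", "dags/", "gs://dev-composer-bucket/dags/"]

  # 2. Run DAG validation tests (placeholder: e.g. import checks with pytest)
  - name: python:3.11
    entrypoint: bash
    args: ["-c", "pip install -r requirements-test.txt && pytest tests/"]

  # 3. On success, copy the same code to the production DAG bucket
  - name: gcr.io/cloud-builders/gsutil
    args: ["-m", "rsync", "-r", "-c", "dags/", "gs://prod-composer-bucket/dags/"]
```

If any step fails, Cloud Build stops, so the production copy only happens after the dev tests pass.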

23. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 308

Sequence
272
Discussion ID
130319
Source URL
https://www.examtopics.com/discussions/google/view/130319-exam-professional-data-engineer-topic-1-question-308/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 1:05 p.m.

Question

You are migrating a large number of files from a public HTTPS endpoint to Cloud Storage. The files are protected from unauthorized access using signed URLs. You created a TSV file that contains the list of object URLs and started a transfer job by using Storage Transfer Service. You notice that the job has run for a long time and eventually failed. Checking the logs of the transfer job reveals that the job was running fine until one point, and then it failed due to HTTP 403 errors on the remaining files. You verified that there were no changes to the source system. You need to fix the problem to resume the migration process. What should you do?

  • A. Set up Cloud Storage FUSE, and mount the Cloud Storage bucket on a Compute Engine instance. Remove the completed files from the TSV file. Use a shell script to iterate through the TSV file and download the remaining URLs to the FUSE mount point.
  • B. Renew the TLS certificate of the HTTPS endpoint. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.
  • C. Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.
  • D. Update the file checksums in the TSV file from using MD5 to SHA256. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 5 comments Click to expand

Comment 1

ID: 1114705 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 18:35 Selected Answer: C Upvotes: 7

- It addresses the likely issue: that the signed URLs have expired or are otherwise invalid. By creating a new TSV file with freshly generated signed URLs (with a longer validity period), you're ensuring that the Storage Transfer Service has valid authorization to access the files.
- Splitting the TSV file and running parallel jobs might help in managing the workload more efficiently and overcoming any limitations related to the number of files or transfer speed.

Comment 2

ID: 1146367 User: srivastavas08 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sat 10 Feb 2024 15:42 Selected Answer: - Upvotes: 6

C. Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.

Here's why:

HTTP 403 errors: These errors indicate unauthorized access, but since you verified the source system and signed URLs, the issue likely lies with expired signed URLs. Renewing the URLs with a longer validity period prevents this issue for the remaining files.
Separate jobs: Splitting the file into smaller chunks and submitting them as separate jobs improves parallelism and potentially speeds up the transfer process.
Avoid manual intervention: Options A and D require manual intervention and complex setups, which are less efficient and might introduce risks.
Longer validity: While option B addresses expired URLs, splitting the file offers additional benefits for faster migration.

Comment 3

ID: 1260547 User: iooj Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 09:13 Selected Answer: C Upvotes: 2

got this one on the exam, aug 2024, passed

Comment 4

ID: 1121871 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 17:50 Selected Answer: C Upvotes: 3

Option C - agree with Raaad

Comment 5

ID: 1113646 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 13:05 Selected Answer: C Upvotes: 2

C. Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.
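
Mechanically, option C amounts to re-signing the remaining URLs with a longer expiry and splitting the URL list into several smaller TSV files, one per transfer job. The splitting step can be sketched with the standard library; the URLs below are placeholders, and `TsvHttpData-1.0` is the header Storage Transfer Service expects at the top of a URL list.

```python
import pathlib
import tempfile

# Remaining objects, assumed re-signed with a longer validity (placeholder URLs).
urls = [f"https://source.example.com/file_{i:03d}.csv?X-Goog-Signature=..."
        for i in range(10)]

def write_tsv_chunks(urls, chunk_size, out_dir):
    """Write Storage Transfer Service URL lists, chunk_size URLs per file."""
    out_dir = pathlib.Path(out_dir)
    paths = []
    for n, start in enumerate(range(0, len(urls), chunk_size)):
        path = out_dir / f"urllist_{n:02d}.tsv"
        lines = ["TsvHttpData-1.0"] + urls[start:start + chunk_size]
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths

chunks = write_tsv_chunks(urls, chunk_size=4, out_dir=tempfile.mkdtemp())
print([p.name for p in chunks])  # each file becomes its own transfer job
```

Each chunk is then submitted as a separate Storage Transfer Service job, so the jobs run in parallel and finish before the new signed URLs expire.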

24. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 49

Sequence
274
Discussion ID
17083
Source URL
https://www.examtopics.com/discussions/google/view/17083-exam-professional-data-engineer-topic-1-question-49/
Posted By
-
Posted At
March 21, 2020, 8:48 a.m.

Question

Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

  • A. Introduce data compression for each file to increase the rate of file transfer.
  • B. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
  • C. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
  • D. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
  • E. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.

Suggested Answer

CD

Answer Description Click to expand


Community Answer Votes

Comments 35 comments Click to expand

Comment 1

ID: 246085 User: Toto2020 Badges: Highly Voted Relative Date: 5 years, 2 months ago Absolute Date: Thu 17 Dec 2020 01:57 Selected Answer: - Upvotes: 55

E cannot be correct: Transfer Service is recommended for 300 Mbps or faster
https://cloud.google.com/storage-transfer/docs/on-prem-overview

Bandwidth is not an issue, so B is not an answer

Cloud Storage loading gets better throughput the larger the files are, so making them even smaller with compression does not seem like a solution. The -m option for parallel transfers is recommended. Therefore A is out and C is an answer.
https://medium.com/@duhroach/optimizing-google-cloud-storage-small-file-upload-performance-ad26530201dc

That leaves D as the other option. It is true you cannot use tar directly with gsutil, but you can load the tar file to Cloud Storage, move the file to a Compute Engine instance running Linux, use tar to unpack the files, and copy them back to Cloud Storage. Batching many files into a larger tar will improve Cloud Storage throughput.

So, given the alternatives, I think answer is CD
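
Option D's batching step can be sketched with Python's standard library (the file names and contents are stand-ins for the real ~4 KB CSVs). Note this is plain tar, i.e. batching rather than compression:

```python
import pathlib
import tarfile
import tempfile

src = pathlib.Path(tempfile.mkdtemp())
# Stand-in for 1,000 of the ~4 KB CSV files produced each hour.
for i in range(1000):
    (src / f"data_{i:04d}.csv").write_text("id,value\n1,42\n")

archive = src / "batch_0001.tar"
with tarfile.open(archive, "w") as tar:   # mode "w": plain tar, no compression
    for csv in sorted(src.glob("*.csv")):
        tar.add(csv, arcname=csv.name)

with tarfile.open(archive) as tar:
    member_count = len(tar.getmembers())
print(member_count)  # one object to transfer instead of 1,000 round trips
```

The win comes from paying the 200 ms round-trip cost once per archive instead of once per file; the archive is then unpacked on the cloud side as the comment above describes.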

Comment 1.1

ID: 398555 User: awssp12345 Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Sun 04 Jul 2021 19:39 Selected Answer: - Upvotes: 3

This should be the correct answer.

Comment 1.2

ID: 819826 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Thu 23 Feb 2023 23:21 Selected Answer: - Upvotes: 1

50 Mbps is so slow, why do you think bandwidth is OK? For parallel uploads you need a good connection.

Comment 1.2.1

ID: 825392 User: Booqq Badges: - Relative Date: 3 years ago Absolute Date: Wed 01 Mar 2023 00:46 Selected Answer: - Upvotes: 2

Normally the solutions are based on Google Cloud services, as it's a vendor exam.

Comment 1.2.2

ID: 880839 User: Jarek7 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 22:37 Selected Answer: - Upvotes: 2

They have 20,000 files of 4 KB each per hour, so the bandwidth needed is far below 1 Mbps. 50 Mbps is enough to upload a whole day's generated data in about 5 minutes.
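
This arithmetic is worth spelling out, because it shows why latency, not bandwidth, is the bottleneck (a rough back-of-envelope using the numbers from the question):

```python
# Back-of-envelope check that 50 Mbps is not the bottleneck.
files_per_hour = 20_000
file_size_bytes = 4 * 1024          # 4 KB
link_bytes_per_sec = 50e6 / 8       # 50 Mbps ~ 6.25 MB/s

daily_bytes = files_per_hour * file_size_bytes * 24
transfer_seconds = daily_bytes / link_bytes_per_sec
print(f"{daily_bytes / 1e9:.2f} GB/day, "
      f"{transfer_seconds / 60:.1f} min at line rate")

# The real cost is per-file round trips: 200 ms latency x 480,000 files/day,
# done serially, is more than a day of wall-clock time -- hence the need for
# parallelism (option C) or batching (option D).
serial_latency_hours = 0.2 * files_per_hour * 24 / 3600
print(f"{serial_latency_hours:.1f} hours of latency alone if sent one at a time")
```

So raising bandwidth (option B) or using a bandwidth-oriented service (option E) attacks the wrong variable.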

Comment 1.3

ID: 459264 User: vholti Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Fri 08 Oct 2021 16:09 Selected Answer: - Upvotes: 4

D is incorrect. gsutil with the -m option uses multiprocessing/multithreading, meaning it copies files in parallel. The benefit of multiprocessing/multithreading is significant when working with a large number of files, regardless of file size. The point is sending multiple files in parallel, so file size has no impact on gsutil with -m. gsutil with -m does not split a big file into multiple chunks and transfer it in parallel. So in my opinion the answer is A and C.

Comment 1.3.1

ID: 459266 User: vholti Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Fri 08 Oct 2021 16:13 Selected Answer: - Upvotes: 3

Here is the docs which support my opinion: https://cloud.google.com/storage/docs/gsutil/addlhelp/TopLevelCommandLineOptions

Comment 1.3.1.1

ID: 961413 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 11:34 Selected Answer: - Upvotes: 1

We have small files of 4KB and no issues with bandwidth. It's not an issue that -m does not split files. Our problem is with total volume.

Comment 1.3.2

ID: 958535 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 21 Jul 2023 15:20 Selected Answer: - Upvotes: 2

As far as I understand, compression is not something we want here, because bandwidth is not an issue and compressed files would need to be decompressed in the cloud. On top of that, if we later load those files into BigQuery to create the report, we know that we cannot load compressed CSV files in parallel.

gsutil makes the most sense because it will be used to load all new files in parallel.

I answered D as well because I thought that none of the others made sense and D is the only one that mentions creating the bucket on GCS and perhaps migrating data that is missed during the update in the architecture.

So D to create the bucket, C to update the process and move the data to the bucket, then D to move any lost data during the update.

Comment 1.3.2.1

ID: 961417 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 11:36 Selected Answer: - Upvotes: 1

Typo, I meant E in my post. C and E, not C and D.

Comment 2

ID: 66427 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 21 Mar 2020 08:48 Selected Answer: - Upvotes: 35

Should be AC

Comment 2.1

ID: 219379 User: GeeBeeEl Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Sat 14 Nov 2020 23:38 Selected Answer: - Upvotes: 3

support this with a link....

Comment 2.1.1

ID: 299396 User: gcppde Badges: - Relative Date: 5 years ago Absolute Date: Thu 25 Feb 2021 22:22 Selected Answer: - Upvotes: 1

Here you go: https://cloud.google.com/storage-transfer/docs/overview#gsutil

Comment 2.1.1.1

ID: 596835 User: tavva_prudhvi Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Wed 04 May 2022 15:20 Selected Answer: - Upvotes: 1

This link does support for C, but what about A? any supported links?

Comment 2.1.1.2

ID: 309312 User: BhupiSG Badges: - Relative Date: 5 years ago Absolute Date: Sat 13 Mar 2021 02:01 Selected Answer: - Upvotes: 8

Thank you! From this doc:
Follow these rules of thumb when deciding whether to use gsutil or Storage Transfer Service:

Transfer scenario → Recommendation:
- Transferring from another cloud storage provider → Use Storage Transfer Service.
- Transferring less than 1 TB from on-premises → Use gsutil.
- Transferring more than 1 TB from on-premises → Use Transfer service for on-premises data.
- Transferring less than 1 TB from another Cloud Storage region → Use gsutil.
- Transferring more than 1 TB from another Cloud Storage region → Use Storage Transfer Service.

Comment 3

ID: 1259099 User: iooj Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Wed 31 Jul 2024 22:59 Selected Answer: CD Upvotes: 2

C - because gsutil is recommended for transferring less than 1 TB from on-premises.
C excludes E.
Bandwidth is not a problem, as simple math shows, so we exclude B.
A 4 KB file is already small enough, so we exclude A.
D works fine, because even with the -m flag we can send tars in parallel.

Comment 4

ID: 1166622 User: zohra-khouy.f Badges: - Relative Date: 2 years ago Absolute Date: Tue 05 Mar 2024 17:42 Selected Answer: AC Upvotes: 1

AC is the answer

Comment 5

ID: 1063766 User: spicebits Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 06 Nov 2023 12:16 Selected Answer: - Upvotes: 2

How can C and E be the answer? They solve the same problem with different approaches. If you pick C then E cannot be an answer, and if you pick E then C cannot be an answer. This question also seems a bit dated, since the gcloud storage CLI is much more performant than gsutil. I would pick C & D, as that combination makes the most sense given the choices.

Comment 6

ID: 916672 User: Maurilio_Cardoso Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 00:17 Selected Answer: - Upvotes: 1

@hendrixlives's arguments are correct. The resources in use and the way ingestion is optimized must be balanced.

Comment 7

ID: 896453 User: Kiroo Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 13 May 2023 09:15 Selected Answer: CD Upvotes: 2

C is correct without a doubt.
I was in doubt between D and E.
A and B do not seem correct, because the question states that the bandwidth is not fully utilized.
Now, D vs. E:
If the bandwidth were higher, E would be good.
As for D: even though tar files have no compression, transmitting one file instead of 1,000 is significantly faster, so I would choose
C and D.

Comment 8

ID: 889907 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Fri 05 May 2023 10:50 Selected Answer: CD Upvotes: 1

CD, I guess. I liked the explanation in the discussion.
Changing from BC to CD.

Comment 9

ID: 882016 User: Kart87 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 26 Apr 2023 22:48 Selected Answer: - Upvotes: 2

Guys, I need help. Has anyone taken the exam very recently (Apr 2023)? Would preparing all the questions from here be enough?

Comment 10

ID: 880862 User: Jarek7 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 23:10 Selected Answer: AC Upvotes: 3

It seems that Google would like AC. A is not strictly necessary; it doesn't make a significant difference, since small files do not compress well and the bandwidth is large enough that file size is not the issue. The issue is the 0.2 s latency. The biggest benefit is that compression can simply be enabled via gsutil parameters, so it adds no implementation complexity.
For me C alone is OK, and D alone might be even better but more complex. C and D cannot be mixed; they exclude each other. C is simpler and uses a Google service, so it seems to be the desired answer. And since we have to select two actions: if we go for C we can also get some benefit from A; if we go for D there is no other answer we can select, and it is much more complex to implement than AC (which is by far good enough).

Comment 10.1

ID: 1044688 User: patiwwb Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 16 Oct 2023 07:43 Selected Answer: - Upvotes: 1

Yes, the two exclude each other. So it's AC.

Comment 11

ID: 808267 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Tue 14 Feb 2023 11:15 Selected Answer: - Upvotes: 3

Answer: B & C
A: files are 4 KB; no need for compression.
B: more files can be transmitted per unit time with 100 Mbps, or with a 5G network (~200 Mbps).
C: gsutil parallel ingestion will reduce time.
D: TAR is not good compression and is slower to transfer, even slower than the CSVs. The speed is 50 Mbps, so don't go with it.
E: Storage Transfer Service needs a good connection and is used for large amounts of data from on-premises storage; this is regular ingestion.

Comment 12

ID: 801780 User: manigcp Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Wed 08 Feb 2023 09:26 Selected Answer: - Upvotes: 1

-- From ChatGPT --

B. Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
A. Introduce data compression for each file to increase the rate of file transfer.

Reasoning:
B. Redesigning the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel will help improve the rate at which the data is transferred to the cloud. This is because gsutil allows for parallel transfers, thereby utilizing the available bandwidth more efficiently and reducing the time required to transfer the data.

A. Introducing data compression for each file will also help improve the rate of file transfer. This is because compressed data takes up less space and can be transferred faster, thereby reducing the time required to transfer the data.

Comment 12.1

ID: 801781 User: manigcp Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Wed 08 Feb 2023 09:27 Selected Answer: - Upvotes: 1

Why not option D?
Option D, which involves assembling 1000 files into a TAR file and then transmitting it, may not be an effective solution for the current situation. While TAR archives can help reduce the number of files that need to be transmitted, disassembling the TAR archive in the cloud after receiving it could increase the time required to process the data. This could make it difficult to meet the goal of making reports with data from the previous day available by 10:00 a.m. each day.

Furthermore, compressing the TAR archive could increase the time required to create the archive, and may not provide a significant improvement in terms of transfer time, as the individual CSV files are already small in size. This makes it less effective compared to the other options of parallel transfers and data compression.

Comment 12.1.1

ID: 880849 User: Jarek7 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 22:55 Selected Answer: - Upvotes: 1

I wouldn't agree; the main issue here is the 0.2 s latency and 20,000 files per hour: the transfer isn't even possible without parallelization or file merging. Compressing and sending 1,000 files at once resolves the issue, just as option C does. But they don't make any sense together. I think they exclude D because of the additional complexity: compression and then decompression is much more difficult than using gsutil. Thus we go for C. If we need one more, then only A makes some sense, but I wouldn't go for it. We have enough bandwidth for this file size; we just need to get rid of the latency, via parallelization.

Comment 12.1.1.1

ID: 880854 User: Jarek7 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 23:00 Selected Answer: - Upvotes: 1

OK, AC seems to be right, as we can simply enable compression via gsutil options.

Comment 13

ID: 727126 User: Leeeeee Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 00:02 Selected Answer: CD Upvotes: 2

https://cloud.google.com/storage/docs/parallel-composite-uploads

Comment 14

ID: 590415 User: abhineet1313 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sat 23 Apr 2022 07:27 Selected Answer: - Upvotes: 1

A is incorrect, as the rate of file transfer is not the issue: the system is not able to handle even the current load, so merely making transfers faster with compression does not address it.

Comment 15

ID: 588923 User: alecuba16 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Wed 20 Apr 2022 21:32 Selected Answer: DE Upvotes: 1

Transferring multiple small files is bad practice. You should always use an aggregation strategy, such as tar or zip, across multiple files. A is discarded because it talks about compressing a single file. B is discarded because bandwidth is not the problem.

Option C could work, but multithreading has a limit. The best option is then D, or some Google on-prem mirroring service like E.

Comment 15.1

ID: 616541 User: tavva_prudhvi Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 15 Jun 2022 07:14 Selected Answer: - Upvotes: 2

E is wrong: bandwidth is already low, so Storage Transfer Service will not help here.

Comment 16

ID: 573138 User: Jojo9400 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 22 Mar 2022 19:10 Selected Answer: - Upvotes: 1

E is wrong: Google Cloud Storage Transfer Service (online) != Transfer Appliance (on-premises).

Comment 17

ID: 568779 User: OmJanmeda Badges: - Relative Date: 3 years, 12 months ago Absolute Date: Wed 16 Mar 2022 05:32 Selected Answer: CD Upvotes: 1

CD is correct option

25. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 105

Sequence
291
Discussion ID
79319
Source URL
https://www.examtopics.com/discussions/google/view/79319-exam-professional-data-engineer-topic-1-question-105/
Posted By
damaldon
Posted At
Sept. 2, 2022, 9:41 a.m.

Question

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Dataproc and Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

  • A. cron
  • B. Cloud Composer
  • C. Cloud Scheduler
  • D. Workflow Templates on Dataproc

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 13 comments Click to expand

Comment 1

ID: 1115892 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 14:10 Selected Answer: B Upvotes: 1

Of course, it is Cloud Composer!

Comment 2

ID: 1026373 User: Nirca Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 06 Apr 2024 10:47 Selected Answer: B Upvotes: 1

B. Cloud Composer is the right answer !

Comment 3

ID: 964504 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 27 Jan 2024 11:36 Selected Answer: B Upvotes: 1

Cloud Composer is a managed service that allows you to create and run Apache Airflow workflows. Airflow is a workflow management platform that can be used to automate complex data pipelines. It is a good choice for this use case because it is a managed service, which means that Google takes care of the underlying infrastructure. It also supports multiple dependencies, so you can easily schedule a multi-step pipeline.
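Conceptually, what Composer (Airflow) automates here is a scheduled topological execution of a job dependency graph. A minimal pure-Python sketch of that ordering, using the standard library rather than the Airflow API (the job names are hypothetical placeholders for the Dataproc and Dataflow jobs in the question):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical daily pipeline: a Dataproc job feeding Dataflow jobs,
# with dependencies between them as described in the question.
# Each key maps a job to the set of jobs it depends on.
deps = {
    "dataproc_clean":     set(),                  # no upstream dependencies
    "dataflow_enrich":    {"dataproc_clean"},     # runs after the Dataproc job
    "dataflow_aggregate": {"dataflow_enrich"},
    "notify_done":        {"dataflow_aggregate"},
}

# static_order() yields every job only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
# ['dataproc_clean', 'dataflow_enrich', 'dataflow_aggregate', 'notify_done']
```

In Composer, the same graph would be expressed as an Airflow DAG with operators for each job and a daily schedule; the scheduler then handles the ordering, retries, and monitoring that this sketch only hints at.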

Comment 4

ID: 880272 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 13:16 Selected Answer: B Upvotes: 2

Airflow is the only choice that can handle dependencies and call all of the services included in the question.

Comment 5

ID: 817638 User: niketd Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 22 Aug 2023 09:02 Selected Answer: B Upvotes: 2

Multi-step sequential pipelines -> Cloud Composer

Comment 6

ID: 762264 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 19:35 Selected Answer: - Upvotes: 2

Cloud composer B is right

Comment 7

ID: 738170 User: odacir Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 17:29 Selected Answer: B Upvotes: 1

Cloud Composer (Airflow) is the answer to chain different steps from different apps...

Comment 8

ID: 704290 User: MisuLava Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 26 Apr 2023 03:03 Selected Answer: B Upvotes: 1

" multiple dependencies on each other. You want to use managed service"
= Cloud Composer

Comment 9

ID: 669141 User: John_Pongthorn Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 18:29 Selected Answer: B Upvotes: 2

If you want to schedule your workflow, there are three ways to do it; one of them is Composer:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions

Comment 10

ID: 662837 User: Remi2021 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 22:41 Selected Answer: B Upvotes: 1

composer :)

Comment 11

ID: 659814 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 09:48 Selected Answer: B Upvotes: 1

Composer

Comment 12

ID: 658414 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 15:05 Selected Answer: B Upvotes: 1

B. Cloud Composer

Comment 13

ID: 657129 User: damaldon Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 10:41 Selected Answer: - Upvotes: 1

Use composer to schedule tasks

26. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 144

Sequence
295
Discussion ID
17228
Source URL
https://www.examtopics.com/discussions/google/view/17228-exam-professional-data-engineer-topic-1-question-144/
Posted By
-
Posted At
March 22, 2020, 9:36 a.m.

Question

You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage?

  • A. Use Transfer Appliance to copy the data to Cloud Storage
  • B. Use gsutil cp -J to compress the content being uploaded to Cloud Storage
  • C. Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage
  • D. Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 13 comments Click to expand

Comment 1

ID: 398320 User: sumanshu Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 15:42 Selected Answer: - Upvotes: 9

Vote for A

A - Correct. Transfer Appliance is for moving offline data, large data sets, or data from a source with limited bandwidth.
https://cloud.google.com/storage-transfer/docs/overview
B - Eliminated (not recommended for large storage; recommended for < 1 TB).
C - It's online, but we have a bandwidth issue, so eliminated.
D - Eliminated (not recommended for large storage; recommended for < 1 TB).

Comment 2

ID: 1109735 User: patitonav Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 13:01 Selected Answer: A Upvotes: 1

A. Easy, just by the amount of data.

Comment 3

ID: 1039296 User: Nirca Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 10 Apr 2024 10:19 Selected Answer: A Upvotes: 1

In 6 months, only 0.0290304 petabytes would be uploaded. Right, compression might help, but we do not have any info to support the ratio. Let's go for A.

Comment 4

ID: 1015449 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:08 Selected Answer: A Upvotes: 1

Physical Transfer: Transfer Appliance is a physical device provided by Google Cloud that you can use to physically transfer large volumes of data to the cloud. It allows you to avoid the limitations of network bandwidth and transfer data much faster.

Capacity: Transfer Appliance can handle large volumes of data, including the 2 PB you need to migrate, without the constraints of slow network speeds.

Efficiency: It is highly efficient for large-scale data transfers and is a practical choice for transferring multi-terabyte or petabyte-scale datasets.

Comment 5

ID: 985639 User: arien_chen Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 11:18 Selected Answer: - Upvotes: 2

It would take 34 years.
Option A no doubt.

https://cloud.google.com/static/architecture/images/big-data-transfer-how-to-get-started-transfer-size-and-speed.png

Comment 6

ID: 893185 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 17:43 Selected Answer: A Upvotes: 1

2,000,000,000,000,000 bytes = 2 petabytes
20,000,000 bytes/sec = 20 megabytes per second (note the question says 20 Mb/sec, i.e. megabits, so the real rate is eight times lower still)

Once we do the math:
2 PB / 20 MB/s = 100,000,000 seconds forecast to migrate the data.

100,000,000 seconds =
1,666,666.7 minutes =
27,777.8 hours =
1,157.4 days

6 months ≈ 180 days

1,157.4 days > 180 days

Still, with such an amount Transfer Appliance is recommended.
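Redoing this arithmetic with the question's stated unit (20 Mb/sec is megabits per second, not megabytes) makes the case for the appliance even stronger:

```python
# Transfer-time estimate for 2 PB over the constrained link,
# using the question's figures: 2 PB of data, 20 Mb/sec outbound.
DATA_BYTES = 2 * 10**15        # 2 PB (decimal petabytes)
LINK_BITS_PER_S = 20 * 10**6   # 20 megabits per second

seconds = DATA_BYTES * 8 / LINK_BITS_PER_S  # bytes -> bits, then divide by rate
days = seconds / 86_400
years = days / 365

print(round(days))      # 9259 days
print(round(years, 1))  # 25.4 years -- far beyond the six-month window
```

Either way you read the unit (megabits or megabytes), the network transfer misses the six-month deadline by at least an order of magnitude, which is exactly the scenario Transfer Appliance targets.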

Comment 7

ID: 812436 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 17 Aug 2023 22:00 Selected Answer: - Upvotes: 1

Transfer Appliance is a physical device about the size of cabin luggage or slightly larger. It contains Seagate/WD hard disks (these are manufacturer names), with capacities varying from 100 to 480 TB; for our 2 PB (2,000 TB), several would be needed. Google ships the appliance to you, you transfer data onto it over a wired connection, and then the data is uploaded from the appliance to Cloud Storage and the appliance is wiped.

Comment 8

ID: 761629 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 02:38 Selected Answer: - Upvotes: 2

This is a no-brainer question; A is right.

Comment 9

ID: 705593 User: jkhong Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 27 Apr 2023 15:29 Selected Answer: A Upvotes: 3

Problem: transferring 2 PB of data to Cloud Storage
Consideration: bad network speed

Bad network = we cannot initiate the transfer from the client's end through the network, so B and C are out.
D will still be super slow; at this speed it would take about 27,777 hours to transfer the data.

Comment 10

ID: 186085 User: SteelWarrior Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Wed 24 Mar 2021 13:59 Selected Answer: - Upvotes: 3

The correct answer is A. With a little calculation we know this amount of data would require approx. 19 months to transfer on 20 Mbps bandwidth. Also, Google recommends Transfer Appliance for petabytes of data.

Comment 11

ID: 163560 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 14:31 Selected Answer: - Upvotes: 3

A is correct

Comment 12

ID: 128203 User: Rajuuu Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Wed 06 Jan 2021 22:40 Selected Answer: - Upvotes: 3

Correct - A

Comment 13

ID: 70330 User: Rajokkiyam Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Fri 02 Oct 2020 05:16 Selected Answer: - Upvotes: 3

Answer A

27. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 147

Sequence
296
Discussion ID
16675
Source URL
https://www.examtopics.com/discussions/google/view/16675-exam-professional-data-engineer-topic-1-question-147/
Posted By
madhu1171
Posted At
March 15, 2020, 5:11 p.m.

Question

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

  • A. Cloud Scheduler
  • B. Cloud Dataflow
  • C. Cloud Functions
  • D. Cloud Composer

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 317726 User: mario_ordinola Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Thu 23 Sep 2021 03:55 Selected Answer: - Upvotes: 42

If someone is not sure that D is the answer, I suggest they not take the exam.

Comment 2

ID: 64365 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 16:11 Selected Answer: - Upvotes: 23

D should be the answer

Comment 3

ID: 1109739 User: patitonav Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 13:09 Selected Answer: D Upvotes: 2

No doubt.

Comment 4

ID: 1015468 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:44 Selected Answer: D Upvotes: 2

Workflow Orchestration: Cloud Composer is a fully managed workflow orchestration service based on Apache Airflow. It allows you to define, schedule, and manage complex workflows with multiple steps, including shell scripts, Hadoop jobs, and BigQuery queries.

Dependency Management: You can define dependencies between different steps in your workflow to ensure they are executed in a specific order.

Retry Mechanism: Cloud Composer provides built-in retry mechanisms, so if any step fails, it can be retried a fixed number of times according to your configuration.

Scheduled Execution: Cloud Composer allows you to schedule the execution of your workflows on a regular basis, meeting the requirement for executing the jobs on a schedule.
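The fixed-count retry behavior described above (what Airflow exposes as the `retries` task parameter) can be sketched in plain Python; the flaky step and the retry count here are illustrative assumptions, not Airflow code:

```python
# Minimal sketch of a fixed-count retry wrapper, mimicking what
# Airflow's `retries` task parameter does for each failing step.
def run_with_retries(step, max_retries=2):
    attempts = 0
    while True:
        attempts += 1
        try:
            return step(), attempts
        except Exception:
            if attempts > max_retries:  # retries exhausted: re-raise and fail
                raise

# Hypothetical flaky step: fails twice, then succeeds.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, attempts = run_with_retries(flaky_step, max_retries=2)
print(result, attempts)  # ok 3
```

In a real Composer DAG you would simply set `retries=2` (and optionally `retry_delay`) on the task instead of writing this loop yourself; the sketch just shows the semantics the question asks for.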

Comment 5

ID: 761626 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 02:33 Selected Answer: - Upvotes: 3

D is right

Comment 6

ID: 634219 User: DataEngineer_WideOps Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 20 Jan 2023 22:25 Selected Answer: A Upvotes: 1

Cloud Composer for sure.

Comment 7

ID: 612147 User: nadavw Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Tue 06 Dec 2022 06:53 Selected Answer: - Upvotes: 1

D.
Per the documentation, Scheduler is aimed at a single service and Composer at ETL; in addition, it's not even specified that all jobs are on cloud, so only Composer can handle it.

Comment 7.1

ID: 612148 User: nadavw Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Tue 06 Dec 2022 06:53 Selected Answer: - Upvotes: 1

https://cloud.google.com/blog/topics/developers-practitioners/choosing-right-orchestrator-google-cloud

Comment 8

ID: 519565 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 14:04 Selected Answer: D Upvotes: 2

Cloud Composer

Comment 9

ID: 487257 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Thu 26 May 2022 10:40 Selected Answer: - Upvotes: 2

Why are there so many wrong answers? ExamTopics.com, are you enjoying paid subscriptions while publishing random answers from people?
Ans: D

Comment 10

ID: 294490 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 17:51 Selected Answer: - Upvotes: 4

D:
The main point is that Cloud Composer should be used when there are interdependencies between jobs, e.g. we need the output of one job to start another whenever the first finishes, using dependencies coming from the first job.

Comment 11

ID: 251435 User: ashuchip Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 24 Jun 2021 04:12 Selected Answer: - Upvotes: 3

D seems quite relevant, because with Composer you can do all the things being asked; even a retry property exists in Composer.

Comment 12

ID: 218560 User: Alasmindas Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Thu 13 May 2021 15:02 Selected Answer: - Upvotes: 5

The correct answer is Option A: Cloud Scheduler.
At first I thought it should be Cloud Composer, but after reading the question a few times I concluded Option A.

Cloud Scheduler has built-in retry handling, so you can set a fixed number of retries, and it doesn't have time limits for requests. The functionality is much simpler than Cloud Composer's. Cloud Composer is managed Apache Airflow that helps you create, schedule, monitor and manage workflows. For automating scheduled jobs, the most preferred method would be Scheduler; Composer would typically be used when we want to orchestrate many managed services and automate the workflow.

Comment 12.1

ID: 222375 User: kavs Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Wed 19 May 2021 01:35 Selected Answer: - Upvotes: 1

A seems to be right

Comment 12.1.1

ID: 242416 User: mumukshu Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sun 13 Jun 2021 09:21 Selected Answer: - Upvotes: 3

I think D. How could Scheduler handle this part: "The jobs are expected to run for many minutes up to several hours"?

Comment 12.2

ID: 506039 User: baubaumiaomiao Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 21 Jun 2022 10:48 Selected Answer: - Upvotes: 1

You forgot "These jobs have many interdependent steps", which can be handled only through Composer.

Comment 13

ID: 218464 User: Abby1356 Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Thu 13 May 2021 12:43 Selected Answer: - Upvotes: 1

should be A

Comment 14

ID: 200400 User: arghya13 Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Thu 15 Apr 2021 11:21 Selected Answer: - Upvotes: 2

The answer should be A, Cloud Scheduler. Cloud Composer is a workflow manager; it can't run Unix or BigQuery jobs.

Comment 15

ID: 182622 User: Tanmoyk Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Sat 20 Mar 2021 06:02 Selected Answer: - Upvotes: 3

D should be the best option

Comment 16

ID: 163565 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 14:41 Selected Answer: - Upvotes: 3

D is correct

Comment 17

ID: 132093 User: lgdantas Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Mon 11 Jan 2021 16:06 Selected Answer: - Upvotes: 6

D!
"Cloud Scheduler is a fully managed enterprise-grade cron job scheduler"
https://cloud.google.com/scheduler