Google Professional Data Engineer Pipelines and Processing

Use for batch or streaming ingestion and transformation with Dataflow, Pub/Sub, Dataproc, Beam, Spark, Kafka, and adjacent ETL processing services.

Exams
PROFESSIONAL-DATA-ENGINEER
Questions
109
Comments
1821

1. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 224

Sequence
1
Discussion ID
129871
Source URL
https://www.examtopics.com/discussions/google/view/129871-exam-professional-data-engineer-topic-1-question-224/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:50 a.m.

Question

A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages show no more than 1 second lag between their eventTimestamp and publishTime. What is the problem and what should you do?

  • A. The advertising department is causing delays when consuming the messages. Work with the advertising department to fix this.
  • B. Messages in your Dataflow job are taking more than 30 seconds to process. Optimize your job or increase the number of workers to fix this.
  • C. Messages in your Dataflow job are processed in less than 30 seconds, but your job cannot keep up with the backlog in the Pub/Sub subscription. Optimize your job or increase the number of workers to fix this.
  • D. The web server is not pushing messages fast enough to Pub/Sub. Work with the web server team to fix this.

Suggested Answer

C

Comments (11)

Comment 1

ID: 1109552 User: e70ea9e Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:50 Selected Answer: C Upvotes: 11

System Lag vs. Data Freshness: System lag is low (5 seconds), indicating that individual messages are processed quickly. However, data freshness is high (40 seconds), suggesting a backlog in the pipeline.
Not Advertising's Fault: The issue is upstream of their consumption, as they're already receiving delayed messages.
Not Web Server's Fault: The lag between eventTimestamp and publishTime is minimal (1 second), meaning the server is publishing messages promptly.
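
As a rough way to check the two signals this comment contrasts from the CLI, a hedged sketch (the job ID and region are placeholders, and the exact metric names in the output vary by pipeline):

```shell
# List active Dataflow jobs to find the job ID (placeholder region).
gcloud dataflow jobs list --region=us-central1 --status=active

# Dump the job's metrics; system-lag and watermark/freshness-related
# entries appear here alongside the per-stage counters.
gcloud dataflow metrics list JOB_ID --region=us-central1
```

Low system lag with high data freshness in these readings is the signature of a backlog: each message is processed quickly once picked up, but messages wait too long before being picked up.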

Comment 2

ID: 1113796 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 16:12 Selected Answer: C Upvotes: 5

- It suggests a backlog problem.
- It indicates that while individual messages might be processed quickly once they're handled, the job overall cannot keep up with the rate of incoming messages, causing a delay in processing the backlog.

Comment 2.1

ID: 1123304 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 12:51 Selected Answer: - Upvotes: 2

Why not B then?

Comment 2.1.1

ID: 1148637 User: RenePetersen Badges: - Relative Date: 2 years ago Absolute Date: Mon 12 Feb 2024 21:51 Selected Answer: - Upvotes: 2

I guess that's because it says in the text that "Your Dataflow job's system lag is about 5 seconds".

Comment 3

ID: 1718385 User: SajadAhm Badges: Most Recent Relative Date: 1 week, 3 days ago Absolute Date: Mon 02 Mar 2026 11:14 Selected Answer: B Upvotes: 1

Low System Lag means no backlog problem. The bottleneck is in the actual processing logic.

Comment 4

ID: 1401985 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sat 22 Mar 2025 18:23 Selected Answer: C Upvotes: 1

Answer is C lol

Comment 5

ID: 1280679 User: 4a8ffd7 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Mon 09 Sep 2024 03:57 Selected Answer: B Upvotes: 2

I don't know why you guys concluded the processing time is less than 30 sec. I would estimate the processing time as 40 (freshness) - 5 (system lag) = 35 sec. Even after subtracting the Pub/Sub publish latency of less than 1 sec, the processing time is still larger than 30 sec. I believe "inspecting a few messages shows no more than 1 sec lag" is about Pub/Sub publish latency, not about Dataflow. So I would choose B.

Comment 5.1

ID: 1573093 User: 22c1725 Badges: - Relative Date: 9 months, 2 weeks ago Absolute Date: Wed 28 May 2025 18:50 Selected Answer: - Upvotes: 1

Your Dataflow job's system lag is about 5 seconds,

Comment 6

ID: 1152534 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 13:15 Selected Answer: C Upvotes: 2

Option C

Comment 7

ID: 1121520 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 11:46 Selected Answer: C Upvotes: 4

Option C - low system lag (which identifies fast processing) but high data freshness (which identifies that the messages sit in the backlog a lot)

Comment 8

ID: 1115901 User: Alex3551 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 15:17 Selected Answer: C Upvotes: 1

agree correct is C

2. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 295

Sequence
3
Discussion ID
130309
Source URL
https://www.examtopics.com/discussions/google/view/130309-exam-professional-data-engineer-topic-1-question-295/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 11:50 a.m.

Question

You are designing the architecture to process your data from Cloud Storage to BigQuery by using Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network. What should you do?

  • A. Assign the compute.networkUser role to the Dataflow service agent.
  • B. Assign the compute.networkUser role to the service account that executes the Dataflow pipeline.
  • C. Assign the dataflow.admin role to the Dataflow service agent.
  • D. Assign the dataflow.admin role to the service account that executes the Dataflow pipeline.

Suggested Answer

B

Comments (20)

Comment 1

ID: 1119982 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 17:55 Selected Answer: A Upvotes: 12

- Dataflow service agent is the one responsible for setting up and managing the network resources that Dataflow requires.
- By granting the compute.networkUser role to this service agent, we are enabling it to provision the necessary network resources within the Shared VPC for your Dataflow job.

Comment 2

ID: 1145314 User: saschak94 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Fri 09 Feb 2024 08:52 Selected Answer: A Upvotes: 5

All projects that have used the resource Dataflow Job have a Dataflow Service Account, also known as the Dataflow service agent.

Make sure the Shared VPC subnetwork is shared with the Dataflow service account and has the Compute Network User role assigned on the specified subnet.

Comment 3

ID: 1715270 User: pamrona Badges: Most Recent Relative Date: 3 weeks, 2 days ago Absolute Date: Tue 17 Feb 2026 09:10 Selected Answer: B Upvotes: 1

If VMs need to attach to a subnet:
Grant compute.networkUser to the runtime service account, not the service agent.

Comment 4

ID: 1701445 User: 50336e5 Badges: - Relative Date: 2 months, 2 weeks ago Absolute Date: Wed 24 Dec 2025 00:21 Selected Answer: A Upvotes: 1

A Dataflow service agent is the one responsible for setting up and managing the network resources that Dataflow requires.

Comment 5

ID: 1700302 User: lmch Badges: - Relative Date: 2 months, 3 weeks ago Absolute Date: Thu 18 Dec 2025 13:55 Selected Answer: A Upvotes: 1

Key Technical Takeaway
When working with Dataflow and Shared VPC, always remember: Service Agent = Network Permissions (Host Project) and Worker Service Account = Data Permissions (Service Project).

Comment 6

ID: 1606036 User: judy_data Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Thu 04 Sep 2025 08:23 Selected Answer: B Upvotes: 2

Make sure the Shared VPC subnetwork is shared with the Dataflow service account and has the Compute Network User role assigned on the specified subnet. The Compute Network User role must be assigned to the Dataflow service account in the host project.
https://cloud.google.com/dataflow/docs/guides/specifying-networks?utm_source=chatgpt.com#shared

Comment 7

ID: 1563106 User: gabbferreira Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Wed 23 Apr 2025 18:10 Selected Answer: A Upvotes: 1

It’s A.

Comment 8

ID: 1411719 User: desertlotus1211 Badges: - Relative Date: 11 months, 2 weeks ago Absolute Date: Sat 29 Mar 2025 14:53 Selected Answer: - Upvotes: 2

Answer is B:

The Dataflow service agent manages Dataflow internals, but does not launch pipeline worker VMs. So Answer A is incorrect

Comment 9

ID: 1347889 User: loki82 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 28 Jan 2025 13:24 Selected Answer: A Upvotes: 1

https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#df-service-account

Comment 9.1

ID: 1410606 User: gord_nat Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Wed 26 Mar 2025 23:08 Selected Answer: - Upvotes: 1

There is no compute.networkUser role in df agent, only in IAM. Answer is B

Comment 9.2

ID: 1411718 User: desertlotus1211 Badges: - Relative Date: 11 months, 2 weeks ago Absolute Date: Sat 29 Mar 2025 14:53 Selected Answer: - Upvotes: 1

The Dataflow service agent manages Dataflow internals, but does not launch pipeline worker VMs.

Comment 10

ID: 1305969 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 01 Nov 2024 20:23 Selected Answer: A Upvotes: 2

From https://cloud.google.com/dataflow/docs/guides/specifying-networks, it says "Make sure the Shared VPC subnetwork is shared with the Dataflow service account and has the Compute Network User role assigned on the specified subnet. The Compute Network User role must be assigned to the Dataflow service account in the host project."

Comment 10.1

ID: 1307244 User: ach5 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 05 Nov 2024 09:01 Selected Answer: - Upvotes: 5

service account - it's B

Comment 11

ID: 1289232 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 26 Sep 2024 02:39 Selected Answer: B Upvotes: 4

If you look at the comments, A was the answer people gave around 8 months ago, but recent ones have answered B, citing the documentation. The GCP documentation evolves with time.

Comment 12

ID: 1289231 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 26 Sep 2024 02:37 Selected Answer: B Upvotes: 2

The service account that executes the Dataflow pipeline.
It's straightforward.

Comment 13

ID: 1288812 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 00:52 Selected Answer: B Upvotes: 2

Assign the compute.networkUser role to the service account that executes the Dataflow pipeline

Comment 14

ID: 1254125 User: Jeyaraj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 06:57 Selected Answer: - Upvotes: 3

The correct answer is B. Assign the compute.networkUser role to the service account that executes the Dataflow pipeline.

Here's why:

Shared VPC and Network Access: When using a Shared VPC, you need to grant specific permissions to service accounts in the service project (where your Dataflow pipeline runs) to access resources in the host project's network.
compute.networkUser Role: This role grants the necessary permissions for a service account to use the network resources in the Shared VPC. This includes accessing subnets, creating instances, and communicating with other services within the network.
Service Account for Pipeline Execution: The service account that executes your Dataflow pipeline is the one that needs these network permissions. This is because the Dataflow service uses this account to create and manage worker instances within the Shared VPC network.
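
Concretely, the grant described above is made on the shared subnet in the host project. A sketch, assuming answer B (all project, subnet, region, and service-account names here are placeholders):

```shell
# Grant the worker (runtime) service account permission to use the
# shared subnet. Run against the HOST project that owns the VPC.
gcloud compute networks subnets add-iam-policy-binding my-shared-subnet \
    --project=my-host-project \
    --region=us-central1 \
    --member="serviceAccount:dataflow-runner@my-service-project.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```

Granting the role at the subnet level (rather than on the whole host project) follows least privilege, since the pipeline only needs the one subnetwork the network team shared.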

Comment 15

ID: 1228103 User: extraego Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 10 Jun 2024 22:53 Selected Answer: B Upvotes: 4

Dataflow service agent is a role that is assigned to a service account. So is compute.networkUser.
https://cloud.google.com/dataflow/docs/concepts/access-control#example

Comment 16

ID: 1214007 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 00:14 Selected Answer: B Upvotes: 4

Option B https://cloud.google.com/knowledge/kb/dataflow-job-in-shared-vpc-xpn-permissions-000004261

Comment 17

ID: 1205027 User: chrissamharris Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Wed 01 May 2024 13:02 Selected Answer: B Upvotes: 3

I believe the answer is B. All authentication documentation points to Service Accounts. https://cloud.google.com/dataflow/docs/concepts/authentication#on-gcp

Dataflow service agent typically manages general interactions with the Dataflow service but does not execute the actual jobs.

3. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 242

Sequence
4
Discussion ID
130186
Source URL
https://www.examtopics.com/discussions/google/view/130186-exam-professional-data-engineer-topic-1-question-242/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 2:15 p.m.

Question

You have designed an Apache Beam processing pipeline that reads from a Pub/Sub topic, which has a message retention duration of one day, and writes to a Cloud Storage bucket. You need to select a bucket location and processing strategy to prevent data loss in case of a regional outage with an RPO of 15 minutes. What should you do?

  • A. 1. Use a dual-region Cloud Storage bucket.
    2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
    3. Seek the subscription back in time by 15 minutes to recover the acknowledged messages.
    4. Start the Dataflow job in a secondary region.
  • B. 1. Use a multi-regional Cloud Storage bucket.
    2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
    3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages.
    4. Start the Dataflow job in a secondary region.
  • C. 1. Use a regional Cloud Storage bucket.
    2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
    3. Seek the subscription back in time by one day to recover the acknowledged messages.
    4. Start the Dataflow job in a secondary region and write in a bucket in the same region.
  • D. 1. Use a dual-region Cloud Storage bucket with turbo replication enabled.
    2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
    3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages.
    4. Start the Dataflow job in a secondary region.

Suggested Answer

D

Comments (13)

Comment 1

ID: 1124081 User: datapassionate Badges: Highly Voted Relative Date: 1 year, 7 months ago Absolute Date: Tue 16 Jul 2024 10:16 Selected Answer: D Upvotes: 9

D. 1. Use a dual-region Cloud Storage bucket with turbo replication enabled.
2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages.
4. Start the Dataflow job in a secondary region.

RPO of 15 minutes is guaranteed when turbo replication is used
https://cloud.google.com/storage/docs/availability-durability
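
For reference, a hedged sketch of the two moving parts in option D (bucket name, subscription name, and timestamp are placeholders; verify the flag names against your gcloud version):

```shell
# Dual-region bucket with turbo replication (15-minute RPO target).
gcloud storage buckets create gs://my-clickstream-archive \
    --location=nam4 \
    --recovery-point-objective=ASYNC_TURBO

# After detecting the outage, rewind the subscription by 60 minutes,
# then start the Dataflow job in the secondary region.
gcloud pubsub subscriptions seek my-subscription \
    --time="2024-01-01T11:00:00Z"
```

The 60-minute seek deliberately overshoots the 15-minute RPO to re-cover any in-flight work, at the cost of some duplicate processing downstream.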

Comment 1.1

ID: 1157471 User: ashdam Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 23 Aug 2024 20:21 Selected Answer: - Upvotes: 1

Why is multi-region not correct? There is no downtime in case a region goes down.

Comment 2

ID: 1154441 User: JyoGCP Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 02:41 Selected Answer: D Upvotes: 6

Option D is correct.

Not A, because a dual-region bucket WITHOUT turbo replication takes at least 1 hour to sync data between regions. The SLA for 100% data sync is 12 hours, per Google.

Comment 3

ID: 1565495 User: 0dd4e0c Badges: Most Recent Relative Date: 10 months, 2 weeks ago Absolute Date: Thu 01 May 2025 20:56 Selected Answer: D Upvotes: 1

it's D, keyword "Turbo replication" for RPO recovery

Comment 4

ID: 1347076 User: LP_PDE Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sun 26 Jan 2025 20:27 Selected Answer: D Upvotes: 1

Could be A or D. The choice between a 15-minute seek and a 60-minute seek depends on your specific requirements and priorities. If a very low RPO is critical, a 60-minute seek might be necessary to ensure data completeness. If minimizing cost and processing time is more important, a 15-minute seek might be sufficient, especially if you're confident in the reliability of Turbo Replication.

Comment 5

ID: 1331924 User: m_a_p_s Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 26 Dec 2024 14:30 Selected Answer: A Upvotes: 1

An RPO of 15 minutes seemingly suggests using Turbo Replication. But here's the thing - why would you want to seek the subscription back in time by 60 minutes and run the Dataflow job? Thus, if turbo replication is enabled, steps 3 & 4 are completely redundant and unnecessary. Which is why option A is correct. This was a tricky one!

Comment 5.1

ID: 1713541 User: SajadAhm Badges: - Relative Date: 4 weeks, 1 day ago Absolute Date: Wed 11 Feb 2026 15:19 Selected Answer: - Upvotes: 1

It might take up to 1 day to replicate data across regions without turbo enabled. so without turbo, we should seek up to 1 day.

Comment 6

ID: 1329614 User: shangning007 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 20 Dec 2024 19:51 Selected Answer: A Upvotes: 1

I don't like answer D. If we have turbo replication can ensure that change within 15min can be replicated, why do we still need to seek the subscription back in time by 60min?

Comment 7

ID: 1210267 User: SVGoogle89 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 12 Nov 2024 18:33 Selected Answer: - Upvotes: 1

D
https://cloud.google.com/storage/docs/availability-durability#cross-region-redundancy

Comment 8

ID: 1131592 User: lipa31 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 11:13 Selected Answer: D Upvotes: 4

https://cloud.google.com/storage/docs/availability-durability#turbo-replication says : "When enabled, turbo replication is designed to replicate 100% of newly written objects to both regions that constitute the dual-region within the recovery point objective of 15 minutes, regardless of object size."
so seems D to me

Comment 9

ID: 1114073 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 21:13 Selected Answer: A Upvotes: 1

- Low RPO: Dual-region buckets offer synchronous replication, ensuring data is immediately available in both regions, aligning with the 15-minute RPO.
- Turbo Replication: enabling turbo replication can further reduce replication latency to near-real-time for even stricter RPO requirements.
- Resilient Data Storage: Dual-region buckets ensure data availability even during regional outages, protecting processed data.
- Fast Recovery: Reprocessing from the last 15 minutes of acknowledged messages minimizes data loss and downtime.

Comment 9.1

ID: 1126379 User: qq589539483084gfrgrgfr Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Fri 19 Jul 2024 03:07 Selected Answer: - Upvotes: 2

why not D then, if turbo replication improves RPO??

Comment 10

ID: 1112778 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 13:15 Selected Answer: A Upvotes: 1

A.
1. Use a dual-region Cloud Storage bucket.
2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs.
3. Seek the subscription back in time by 15 minutes to recover the acknowledged messages.
4. Start the Dataflow job in a secondary region.

4. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 228

Sequence
12
Discussion ID
129875
Source URL
https://www.examtopics.com/discussions/google/view/129875-exam-professional-data-engineer-topic-1-question-228/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:54 a.m.

Question

You have a streaming pipeline that ingests data from Pub/Sub in production. You need to update this streaming pipeline with improved business logic. You need to ensure that the updated pipeline reprocesses the previous two days of delivered Pub/Sub messages. What should you do? (Choose two.)

  • A. Use the Pub/Sub subscription clear-retry-policy flag
  • B. Use Pub/Sub Snapshot capture two days before the deployment.
  • C. Create a new Pub/Sub subscription two days before the deployment.
  • D. Use the Pub/Sub subscription retain-acked-messages flag.
  • E. Use Pub/Sub Seek with a timestamp.

Suggested Answer

D

Comments (22)

Comment 1

ID: 1125977 User: tibuenoc Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Thu 18 Jan 2024 15:51 Selected Answer: D Upvotes: 17

DE

Another way to replay messages that have been acknowledged is to seek to a timestamp. To seek to a timestamp, you must first configure the subscription to retain acknowledged messages using retain-acked-messages. If retain-acked-messages is set, Pub/Sub retains acknowledged messages for 7 days.

You only need to do this step if you intend to seek to a timestamp, not to a snapshot.

https://cloud.google.com/pubsub/docs/replay-message

Comment 1.1

ID: 1197910 User: joao_01 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 18 Apr 2024 13:30 Selected Answer: - Upvotes: 2

Its BE.

By the way, you can seek to a snapshot yes:
"Seeks an existing subscription to a point in time or to a given snapshot, whichever is provided in the request"

Link:https://cloud.google.com/pubsub/docs/reference/rest/v1/projects.subscriptions/seek

Comment 1.1.1

ID: 1332313 User: 2ad2bc7 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 09:31 Selected Answer: - Upvotes: 2

Think about it -
what you want to do - process last 2 days of messages.
What does snapshot give you - it give you what were the un-acknowledged messages in pub/sub at that point in time 2 days ago.
How will that help you process messages that were sent to pub/sub in the last 2 days (i.e. after the snapshot?)

Comment 1.2

ID: 1272470 User: nadavw Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Mon 26 Aug 2024 08:17 Selected Answer: - Upvotes: 1

This is correct, as there are 2 options (timestamp and snapshot) and for each there are 2 stages.
Snapshot - create ('B') and seek
Timestamp - configure 'retain' ('D') and seek ('E')
As shown, 'B' is missing the 'seek' operation.

Comment 2

ID: 1126012 User: GCP001 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Thu 18 Jan 2024 16:46 Selected Answer: B Upvotes: 12

B and E, already tested at cloud console.

Comment 3

ID: 1703998 User: Kalai_1 Badges: Most Recent Relative Date: 2 months ago Absolute Date: Mon 05 Jan 2026 11:12 Selected Answer: E Upvotes: 1

D & E. Make use of the Seek feature of Pub/Sub, which requires the subscription to be configured in advance.

Comment 4

ID: 1581498 User: Ben_oso Badges: - Relative Date: 8 months, 2 weeks ago Absolute Date: Sat 28 Jun 2025 21:28 Selected Answer: D Upvotes: 1

DE.
The question says "delivered messages", so these are acknowledged messages; only D, the retain-acked-messages flag, covers them.
A snapshot only saves the unacknowledged messages, so you would miss the acknowledged ones when reprocessing.

And E is mandatory to go back in time.

Comment 5

ID: 1402278 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sun 23 Mar 2025 13:45 Selected Answer: B Upvotes: 1

The only way to ensure that the updated pipeline reprocesses the previous two days of already-delivered Pub/Sub messages is B & E; for the future, D & E can work.

Comment 6

ID: 1345070 User: 71083a7 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 23 Jan 2025 04:31 Selected Answer: D Upvotes: 1

Since E (seek to a timestamp) is the only way to replay acknowledged messages, I think D is the correct answer and not B.

Comment 7

ID: 1330317 User: AWSandeep Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 22 Dec 2024 10:01 Selected Answer: - Upvotes: 1

D and E

People seem to not understand that the retain_acked_messages parameter needs to be enabled for any snapshot or seeking functionality to work.

Comment 8

ID: 1304671 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 20:31 Selected Answer: B Upvotes: 1

After reading https://cloud.google.com/pubsub/docs/replay-overview#snapshot_overview carefully, B and E are correct.

Comment 8.1

ID: 1304674 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 20:35 Selected Answer: - Upvotes: 2

change my mind. D and E should be correct.

Comment 9

ID: 1288785 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 23:54 Selected Answer: D Upvotes: 1

First, read this document: https://cloud.google.com/pubsub/docs/replay-overview.

Key Points:

Seek to a Snapshot: Reprocesses only unacknowledged messages.
Seek to a Timestamp: Reprocesses all messages (acknowledged and unacknowledged) after that time.
Since the question asks for delivering all messages, option E is correct, as it includes both acknowledged and unacknowledged messages.

Regarding Option D: Configuring a subscription with the retain_acked_messages property allows replaying previously acknowledged messages retained for up to 31 days. This satisfies the requirement to deliver all messages and retains them longer than the mentioned 2 days.

Comment 10

ID: 1266725 User: MithunDesai Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 16 Aug 2024 00:18 Selected Answer: B Upvotes: 1

B&D - based on Vertex AI feedback

Comment 11

ID: 1225575 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 06 Jun 2024 16:45 Selected Answer: E Upvotes: 2

BE
B. Use Pub/Sub Snapshot capture two days before the deployment.

Pub/Sub Snapshot: Creating a snapshot captures the state of the subscription at a specific point in time. You can then seek to this snapshot to replay messages from that point onwards.
By capturing a snapshot two days before the deployment, you can ensure that your pipeline reprocesses messages from the past two days.
E. Use Pub/Sub Seek with a timestamp.

Pub/Sub Seek: This feature allows you to reset the subscription to a specific timestamp. Messages published to the topic after this timestamp are re-delivered.
By seeking to the timestamp from two days ago, you can instruct Pub/Sub to start re-delivering messages from that point in time

Comment 12

ID: 1214299 User: virat_kohli Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 13:27 Selected Answer: D Upvotes: 2

D. Use the Pub/Sub subscription retain-acked-messages flag.
E. Use Pub/Sub Seek with a timestamp.

Comment 13

ID: 1160053 User: cuadradobertolinisebastiancami Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 21:32 Selected Answer: D Upvotes: 3

E for sure, you need to seek from a timestamp.
To accomplish that, you need to "Set the retain-acked-messages flag to true for the subscription."

From google documentation:

"Note: To seek to a previous time point, your subscription must be configured to retain acknowledged messages. You can change this setting by clicking Edit on the subscription details page, and checking the box for Retain acknowledged messages."

https://cloud.google.com/pubsub/docs/replay-message
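
The D + E combination maps to two gcloud steps, sketched below (the subscription name is a placeholder; the `date -d` invocation assumes GNU date):

```shell
# 1. Ahead of time: retain acknowledged messages on the subscription
#    (must already be enabled for the window you intend to replay).
gcloud pubsub subscriptions update my-subscription \
    --retain-acked-messages \
    --message-retention-duration=7d

# 2. After deploying the updated pipeline: seek back two days so
#    already-acknowledged messages are redelivered and reprocessed.
gcloud pubsub subscriptions seek my-subscription \
    --time="$(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ)"
```

Step 1 is the catch the doc note above describes: if retention of acked messages wasn't configured before the two-day window, seeking to a timestamp cannot recover them.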

Comment 14

ID: 1159340 User: Tryolabs Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 04:44 Selected Answer: E Upvotes: 3

D and E,
https://cloud.google.com/pubsub/docs/replay-message

Comment 15

ID: 1151998 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Fri 16 Feb 2024 13:47 Selected Answer: E Upvotes: 3

B and E: The seek feature extends subscriber capabilities by allowing you to alter the acknowledgement state of messages in bulk. For example, you can replay previously acknowledged messages or purge messages in bulk. In addition, you can copy the acknowledgement state of one subscription to another by using seek in combination with a snapshot. Source: https://cloud.google.com/pubsub/docs/replay-overview

Comment 16

ID: 1123360 User: Sofiia98 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 13:52 Selected Answer: B Upvotes: 3

BE
https://cloud.google.com/pubsub/docs/replay-overview

Comment 16.1

ID: 1125985 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 18 Jan 2024 15:56 Selected Answer: - Upvotes: 1

But there is a problem: with a snapshot you should seek to the snapshot, not to a timestamp.

Comment 17

ID: 1121533 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:08 Selected Answer: D Upvotes: 3

Option D and E

5. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 19

Sequence
15
Discussion ID
16870
Source URL
https://www.examtopics.com/discussions/google/view/16870-exam-professional-data-engineer-topic-1-question-19/
Posted By
-
Posted At
March 17, 2020, 4:29 p.m.

Question

Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for- like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

  • A. Put the data into Google Cloud Storage.
  • B. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
  • C. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
  • D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.

Suggested Answer

A

Comments (19)

Comment 1

ID: 462672 User: anji007 Badges: Highly Voted Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:30 Selected Answer: - Upvotes: 8

Ans: A
B: Wrong. Preemptible VMs won't solve the problem of large block-storage prices.
C: Maybe, but the question doesn't mention what to tune; also, this is a like-for-like migration, so tuning may not be part of it.
D: Again, this is like-for-like, so you would need to define which data is hot and which is cold; also, Persistent Disk is costlier than Cloud Storage.

Comment 2

ID: 1701388 User: 29d063d Badges: Most Recent Relative Date: 2 months, 2 weeks ago Absolute Date: Tue 23 Dec 2025 18:06 Selected Answer: A Upvotes: 1

Ans A
Best Practice: Use Cloud Storage as the primary storage layer for Dataproc, with minimal Persistent Disk only for temporary processing and shuffle data. This is the cloud-native architecture Google recommends.
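
To illustrate that architecture, a sketch of running a Dataproc job directly against Cloud Storage (the bucket, cluster, and paths are hypothetical): jobs reference gs:// URIs through the Cloud Storage connector, so the data never needs to sit on per-node Persistent Disk.

```shell
# Submit a Spark job whose input lives in Cloud Storage, not HDFS.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.JavaWordCount \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- gs://my-data-bucket/input/*.txt
```

Because storage is decoupled from the cluster, the cluster itself can be sized for compute only, or even deleted between jobs, while the data stays in the bucket.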

Comment 3

ID: 1604921 User: Bugnumber1 Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Sun 31 Aug 2025 20:34 Selected Answer: A Upvotes: 1

First I selected D, it just made sense right? It's faster to access data on the persistent disk. But then there are a couple of things to take into account:
- The question wants you to minimise cost. You only fully achieve this with "A"; "D" still costs a bit more.
- Performance, speed of access... is never mentioned. Common sense? Yes, but no one is asking you about that.
- In general Google Cloud's approach to Big Data architecture is to separate compute and storage, because of all the improvements and resilience GCS has.
- "D" is not very direct, has additional effort, and you could consider that cost.
- People mention the "like for like" section of the question. I'd argue like for like is both having a persistent disk, or having all data in one point. I'd not really consider it a point.

So all in all, A. It's just simpler and cheaper. Also, Gemini said so ;)

Comment 4

ID: 1559569 User: fassil Badges: - Relative Date: 11 months ago Absolute Date: Thu 10 Apr 2025 14:05 Selected Answer: A Upvotes: 1

A like-for-like migration to Cloud Dataproc that replicates on-premises Hadoop would require each node to have 50 TB of persistent disk, which is costly. Instead, you can minimize storage costs by leveraging Google Cloud Storage (GCS). Cloud Dataproc seamlessly integrates with GCS through the Hadoop connector, allowing you to store your data cost-effectively in Cloud Storage and run ephemeral clusters that read data directly from GCS. This approach eliminates the need for each node to carry 50 TB of expensive persistent disk storage while still supporting your Hadoop workload.

Comment 5

ID: 1398857 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:59 Selected Answer: D Upvotes: 1

D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
Google Cloud Storage (GCS) is a cost-effective alternative to Persistent Disk for storing less frequently accessed ("cold") data.
Hot data that requires fast access can remain on Persistent Disk, reducing storage costs while maintaining performance.
Cloud Dataproc supports HDFS-to-GCS integration, allowing Hadoop jobs to access data in GCS seamlessly.

Comment 6

ID: 1114870 User: Vullibabu Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 06 Jan 2024 02:26 Selected Answer: - Upvotes: 1

Most people are focusing on the like-for-like migration requiring 50 TB of persistent storage, but missing the CIO's concern about the cost of block storage. Considering that concern, the option here is Cloud Storage; moreover, that is the recommended approach as well.

Comment 7

ID: 1027040 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 05:06 Selected Answer: - Upvotes: 2

Option A: Put the data into Google Cloud Storage.

This is the best option. Google Cloud Dataproc is designed to work well with Google Cloud Storage. Using GCS instead of Persistent Disk can save money, and GCS offers advantages such as higher durability and the ability to share data across multiple clusters.

Comment 8

ID: 1024079 User: emmylou Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Tue 03 Oct 2023 18:39 Selected Answer: - Upvotes: 1

I have seen this question in other places and I believe that you store the older data in Cloud Storage and retain processing data in persistent disk. D

Comment 9

ID: 998830 User: hxy8 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 04 Sep 2023 22:31 Selected Answer: - Upvotes: 1

Answer: D
Question: A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node.
This means Persistent Disk is still required.

Comment 9.1

ID: 1008592 User: suku2 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 15 Sep 2023 18:35 Selected Answer: - Upvotes: 1

Google Cloud Storage is designed for 11 nines of durability, so it is also a kind of persistent storage. Also, it is a Google product, hence recommended.
https://cloud.google.com/storage/docs/availability-durability#key-concepts

Comment 10

ID: 982660 User: GHOST1985 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 16 Aug 2023 16:14 Selected Answer: - Upvotes: 1

The question is talking about block storage; GCS is object storage!

Comment 11

ID: 970828 User: hjava_111 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 03 Aug 2023 08:57 Selected Answer: A Upvotes: 1

GCS is cost-effective and also Google's recommendation!

Comment 12

ID: 835668 User: bha11111 Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 05:53 Selected Answer: A Upvotes: 1

Minimize cost then GCS

Comment 13

ID: 771646 User: Nirca Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 10 Jan 2023 18:37 Selected Answer: A Upvotes: 1

A - is the right answer.

Comment 14

ID: 743395 User: DGames Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 12 Dec 2022 23:57 Selected Answer: A Upvotes: 1

A - dataproc - storage - cost effective is cloud storage

Comment 15

ID: 688843 User: devaid Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Fri 07 Oct 2022 19:24 Selected Answer: A Upvotes: 1

Cloud Storage

Comment 16

ID: 609287 User: sankar_s Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Mon 30 May 2022 17:45 Selected Answer: A Upvotes: 1

Cloud Storage is google recommended one

Comment 17

ID: 390715 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Fri 25 Jun 2021 20:43 Selected Answer: - Upvotes: 2

Vote for 'A"

Comment 17.1

ID: 401776 User: sumanshu Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:31 Selected Answer: - Upvotes: 6

A is correct because Google recommends using Cloud Storage instead of HDFS as it is much more cost effective especially when jobs aren’t running.
B is not correct because this will decrease the compute cost but not the storage cost.
C is not correct because while this will reduce cost somewhat, it will not be as cost effective as using Cloud Storage.
D is not correct because while this will reduce cost somewhat, it will not be as cost effective as using Cloud Storage.

6. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 20

Sequence
16
Discussion ID
16282
Source URL
https://www.examtopics.com/discussions/google/view/16282-exam-professional-data-engineer-topic-1-question-20/
Posted By
jvg637
Posted At
March 11, 2020, 6:37 p.m.

Question

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action of these anomalous events as they occur. Your custom
HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

  • A. The message body for the sensor event is too large.
  • B. Your custom endpoint has an out-of-date SSL certificate.
  • C. The Cloud Pub/Sub topic has too many messages published to it.
  • D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 62584 User: jvg637 Badges: Highly Voted Relative Date: 6 years ago Absolute Date: Wed 11 Mar 2020 18:37 Selected Answer: - Upvotes: 91

The Answer should be D. The custom endpoint is not acknowledging the message, that is the reason for Pub/Sub to send the message again and again. Not B.

Comment 2

ID: 71346 User: MauryaSushil Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sun 05 Apr 2020 09:44 Selected Answer: - Upvotes: 11

D: The doubt should only be between B and D. But B is not possible, because if the SSL certificate is expired, the endpoint URL will not receive any messages at all, let alone duplicates. So it should be D for duplicates.

Comment 3

ID: 1701389 User: 29d063d Badges: Most Recent Relative Date: 2 months, 2 weeks ago Absolute Date: Tue 23 Dec 2025 18:09 Selected Answer: D Upvotes: 1

D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline
Why This Causes Duplicate Messages
How Pub/Sub Delivery Works:

Pub/Sub delivers a message to your HTTPS endpoint
Your endpoint must return an HTTP success status (200, 201, 202, or 204) within the acknowledgement deadline (default is 10 seconds)
If Pub/Sub doesn't receive acknowledgement in time, it assumes delivery failed
Pub/Sub redelivers the message, causing duplicates

Comment 4

ID: 161802 User: atnafu2020 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:32 Selected Answer: - Upvotes: 6

D
Why are there too many duplicate messages?

Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Pub/Sub is retrying the message delivery. This can be observed in the monitoring metrics pubsub.googleapis.com/subscription/pull_ack_message_operation_count for pull subscriptions, and pubsub.googleapis.com/subscription/push_request_count for push subscriptions. Look for elevated expired or webhook_timeout values in the /response_code. This is particularly likely if there are many small messages, since Pub/Sub may batch messages internally and a partially acknowledged batch will be fully redelivered.
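The redelivery behaviour quoted above implies two obligations for a push endpoint: return a 2xx status before the acknowledgement deadline, and tolerate the duplicates that at-least-once delivery still allows. A minimal sketch of an idempotent handler keyed on the Pub/Sub `messageId` (the handler shape and `process` function are illustrative, not Google's API):

```python
# Sketch: idempotent handling of Pub/Sub push deliveries.
# A real push endpoint receives this JSON envelope over HTTPS and must
# return a 2xx status before the acknowledgement deadline, or Pub/Sub
# redelivers the message.

processed_ids = set()  # in production, use a durable store instead

def process(data: str) -> None:
    """Placeholder for the actual anomaly-handling logic."""
    pass

def handle_push(envelope: dict) -> int:
    message = envelope["message"]
    msg_id = message["messageId"]
    if msg_id in processed_ids:
        return 204  # duplicate redelivery: acknowledge without reprocessing
    process(message.get("data", ""))  # do the work quickly
    processed_ids.add(msg_id)
    return 204  # any 2xx status acknowledges the message

# A redelivered message is acknowledged but not processed twice:
first = handle_push({"message": {"messageId": "m1", "data": "..."}})
second = handle_push({"message": {"messageId": "m1", "data": "..."}})
```

The dedup set is the safety net; the primary fix for question 20 remains acknowledging within the deadline so redelivery rarely happens at all.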

Comment 5

ID: 166842 User: ganesh2121 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:32 Selected Answer: - Upvotes: 3

D is correct.
As per the Google docs: when you do not acknowledge a message before its acknowledgement deadline has expired, Pub/Sub resends the message. As a result, Pub/Sub can send duplicate messages. Use Google Cloud's operations suite to monitor acknowledge operations with the expired response code to detect this condition.

Comment 6

ID: 218380 User: Radhika7983 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:32 Selected Answer: - Upvotes: 6

The correct answer is D. Look for the link
https://cloud.google.com/pubsub/docs/faq

Why are there too many duplicate messages?
Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Pub/Sub is retrying the message delivery. This can be observed in the monitoring metrics pubsub.googleapis.com/subscription/pull_ack_message_operation_count for pull subscriptions, and pubsub.googleapis.com/subscription/push_request_count for push subscriptions. Look for elevated expired or webhook_timeout values in the /response_code. This is particularly likely if there are many small messages, since Pub/Sub may batch messages internally and a partially acknowledged batch will be fully redelivered.

Another possibility is that the subscriber is not acknowledging some messages because the code path processing those specific messages fails, and the Acknowledge call is never made; or the push endpoint never responds or responds with an error.

Comment 7

ID: 475028 User: MaxNRG Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:32 Selected Answer: - Upvotes: 2

D as the Cloud Pub/Sub will deliver duplicate messages only if there has been no acknowledgement from the subscriber.
Refer GCP documentation - Cloud Pub/Sub FAQs - Duplicates:
https://cloud.google.com/pubsub/docs/faq#duplicates
Why are there too many duplicate messages?
Cloud Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Cloud Pub/Sub is retrying the message delivery.

Comment 8

ID: 665256 User: crismo04 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:31 Selected Answer: D Upvotes: 1

If the answer had been B, the messages would not have reached the endpoint at all, since a push subscription requires a valid SSL certificate (https://cloud.google.com/pubsub/docs/push#:~:text=Endpoint%20URL%20(required)), so the answer must be D.

I think it is due to the endpoint being overloaded and returning an incorrect status code (https://cloud.google.com/pubsub/docs/push#:~:text=VPC%20Service%20Controls.-,Receive%20messages,-When%20Pub/Sub)

Comment 9

ID: 1050521 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:31 Selected Answer: D Upvotes: 2

In Google Cloud Pub/Sub, when you use a push subscription, messages are delivered to the specified endpoint (in this case, your custom HTTPS endpoint). The acknowledgment deadline is the time given to the endpoint to acknowledge that it has received and processed the message. If the acknowledgment is not received within the deadline, Pub/Sub may consider the message as unacknowledged and may attempt redelivery, which can lead to duplicate messages.

You should ensure that your custom HTTPS endpoint acknowledges messages within the acknowledgment deadline to prevent duplicate messages from being sent. Additionally, it's essential to handle messages in an idempotent way, so even if duplicates do occur, the action taken by your endpoint doesn't have unintended consequences.

Comment 10

ID: 1238391 User: petergjohnson Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:31 Selected Answer: B Upvotes: 1

After re-reading the question, it seems to me that it is asking for a root cause. It is possible that the most common cause of this symptom is an expired certificate. Once expired, duplicates would be received for every message.

Comment 11

ID: 1217189 User: VictorBa Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 04:39 Selected Answer: D Upvotes: 1

Agree with previous explanations regarding validity of D

Comment 12

ID: 1163342 User: searching4alicense Badges: - Relative Date: 2 years ago Absolute Date: Fri 01 Mar 2024 10:05 Selected Answer: - Upvotes: 1

D - If a message has not been acknowledged within its acknowledgement deadline, Dataflow attempts to maintain the lease on the message by repeatedly extending the acknowledgement deadline to prevent redelivery from Pub/Sub. However this is best effort and there is a possibility that messages may be redelivered. This can be monitored using metrics listed here. https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow

Comment 13

ID: 1131703 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 25 Jan 2024 14:39 Selected Answer: - Upvotes: 1

D should be the answer. If acknowledgement is not received back by Pub/Sub, Pub/Sub may resend messages.

Comment 14

ID: 1027041 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 05:08 Selected Answer: - Upvotes: 2

The correct answer is:
D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Comment 15

ID: 1024080 User: emmylou Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Tue 03 Oct 2023 18:41 Selected Answer: - Upvotes: 1

if there were an out of date certificate then nothing would get through. D

Comment 16

ID: 965380 User: FP77 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 28 Jul 2023 10:19 Selected Answer: D Upvotes: 3

It should be D
https://cloud.google.com/pubsub/docs/troubleshooting#dupes

Comment 17

ID: 919240 User: dgteixeira Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 09 Jun 2023 13:36 Selected Answer: D Upvotes: 3

The correct answer is D, because it's how Pub/Sub works.
Documentation here: https://cloud.google.com/pubsub/docs/troubleshooting#dupes

7. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 282

Sequence
17
Discussion ID
130157
Source URL
https://www.examtopics.com/discussions/google/view/130157-exam-professional-data-engineer-topic-1-question-282/
Posted By
Ed_Kim
Posted At
Jan. 3, 2024, 3:06 a.m.

Question

You are using a Dataflow streaming job to read messages from a message bus that does not support exactly-once delivery. Your job then applies some transformations, and loads the result into BigQuery. You want to ensure that your data is being streamed into BigQuery with exactly-once delivery semantics. You expect your ingestion throughput into BigQuery to be about 1.5 GB per second. What should you do?

  • A. Use the BigQuery Storage Write API and ensure that your target BigQuery table is regional.
  • B. Use the BigQuery Storage Write API and ensure that your target BigQuery table is multiregional.
  • C. Use the BigQuery Streaming API and ensure that your target BigQuery table is regional.
  • D. Use the BigQuery Streaming API and ensure that your target BigQuery table is multiregional.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 1224767 User: AlizCert Badges: Highly Voted Relative Date: 1 year, 9 months ago Absolute Date: Wed 05 Jun 2024 16:12 Selected Answer: B Upvotes: 18

It should be B. The Storage Write API has "3 GB per second throughput in multi-regions; 300 MB per second in regions".

Comment 1.1

ID: 1560320 User: rajshiv Badges: - Relative Date: 11 months ago Absolute Date: Sun 13 Apr 2025 15:33 Selected Answer: - Upvotes: 1

B is incorrect. Multiregional tables are not supported by the Storage Write API for exactly-once delivery. This option is invalid.

Comment 2

ID: 1117901 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 23:56 Selected Answer: A Upvotes: 16

- BigQuery Storage Write API: This API is designed for high-throughput, low-latency writing of data into BigQuery. It also provides tools to prevent data duplication, which is essential for exactly-once delivery semantics.
- Regional Table: Choosing a regional location for the BigQuery table could potentially provide better performance and lower latency, as it would be closer to the Dataflow job if they are in the same region.
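The exactly-once behaviour described above comes from the Storage Write API's stream offsets: a retried append at an offset that was already committed is rejected rather than written twice. A toy simulation of that offset contract (this models the idea only; the real client is `google-cloud-bigquery-storage`, and its actual API differs):

```python
class CommittedStream:
    """Toy model of a Storage Write API committed stream with offsets.

    The real API rejects an AppendRows request whose offset does not
    match the next expected row, which is what lets a retrying client
    achieve exactly-once semantics.
    """

    def __init__(self):
        self.rows = []

    def append(self, offset: int, row: dict) -> bool:
        if offset < len(self.rows):
            return False  # offset already committed: retry is a no-op
        if offset > len(self.rows):
            raise ValueError("offset beyond end of stream")
        self.rows.append(row)
        return True

stream = CommittedStream()
stream.append(0, {"click_id": "a"})
stream.append(1, {"click_id": "b"})
# A network retry re-sends offset 1; the duplicate is ignored:
retried = stream.append(1, {"click_id": "b"})
```

With the legacy Streaming (insertAll) API, by contrast, dedup via `insertId` is only best-effort, which is why the Storage Write API is the right choice for exactly-once in this question regardless of the regional/multi-regional debate.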

Comment 2.1

ID: 1131348 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 25 Jan 2024 07:04 Selected Answer: - Upvotes: 4

Agree.
https://cloud.google.com/bigquery/docs/write-api#advantages

Comment 2.2

ID: 1571968 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Sat 24 May 2025 19:08 Selected Answer: - Upvotes: 1

Max throughput for regional is currently only 300 MB/s as per the docs.

Comment 3

ID: 1701494 User: Kalai_1 Badges: Most Recent Relative Date: 2 months, 2 weeks ago Absolute Date: Wed 24 Dec 2025 08:45 Selected Answer: A Upvotes: 1

BigQuery is a regional component, not multiregional. Hence option A.

Comment 4

ID: 1701320 User: 50336e5 Badges: - Relative Date: 2 months, 2 weeks ago Absolute Date: Tue 23 Dec 2025 14:38 Selected Answer: A Upvotes: 1

Use the BigQuery Storage Write API with a regional table because of lower latency.

Comment 5

ID: 1700298 User: lmch Badges: - Relative Date: 2 months, 3 weeks ago Absolute Date: Thu 18 Dec 2025 13:22 Selected Answer: B Upvotes: 1

To achieve high-scale, exactly-once ingestion in BigQuery:

API: Always use the Storage Write API.

Mode: Use Committed Streams (for exactly-once) or the Default Stream (for at-least-once).

Scalability: Monitor the "Bytes per second" quota. If you are operating at "Gbps" levels, multiregional locations are the architectural standard due to higher throughput limits.

Comment 6

ID: 1588190 User: imrane1995 Badges: - Relative Date: 7 months, 3 weeks ago Absolute Date: Fri 18 Jul 2025 08:12 Selected Answer: A Upvotes: 1

Why A is correct:
BigQuery Storage Write API:

Supports exactly-once semantics (deduplication via stream offsets).

Designed for high-throughput streaming ingestion (scales much better than the legacy Streaming Insert API).

Can handle your throughput requirement of 1.5 GB/s.

Regional table:

Required for exactly-once delivery guarantees with the Storage Write API.

Comment 7

ID: 1581309 User: Ben_oso Badges: - Relative Date: 8 months, 2 weeks ago Absolute Date: Sat 28 Jun 2025 01:35 Selected Answer: B Upvotes: 1

3 GB per second throughput in multi-regions;
300 MB per second in regions

Comment 8

ID: 1574216 User: 22c1725 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Mon 02 Jun 2025 15:42 Selected Answer: B Upvotes: 1

Go with (B), not (A).
Max throughput for regional is currently only 300 MB/s as per the docs.

Comment 9

ID: 1571638 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Fri 23 May 2025 18:05 Selected Answer: A Upvotes: 1

Honestly, I think ExamTopics should do a better job; the suggested answers only mislead more.

Comment 10

ID: 1566261 User: aditya_ali Badges: - Relative Date: 10 months, 1 week ago Absolute Date: Sun 04 May 2025 22:33 Selected Answer: A Upvotes: 1

You need a write throughput of 1.5 GB per second. Given the high throughput requirement, a regional BigQuery table (option A) is generally preferred over a multi-regional table due to potentially lower write latency. Simple.

Comment 11

ID: 1565261 User: Aungshuman Badges: - Relative Date: 10 months, 2 weeks ago Absolute Date: Thu 01 May 2025 01:57 Selected Answer: B Upvotes: 1

As per the GCP documentation, multi-region meets the throughput requirement.

Comment 12

ID: 1563063 User: gabbferreira Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Wed 23 Apr 2025 16:56 Selected Answer: A Upvotes: 1

It’s A

Comment 13

ID: 1352537 User: Siahara Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 06 Feb 2025 18:21 Selected Answer: A Upvotes: 4

A. Implement the BigQuery Storage Write API and guarantee that the target BigQuery table is regional.

Here's the breakdown:

Why Option A is Superior

Exactly-Once Delivery: The BigQuery Storage Write API intrinsically supports exactly-once delivery using stream offsets. This guarantees that each message is written to BigQuery exactly one time, even in the case of retries due to the lack of native exactly-once support in your message bus.

High Throughput: The Storage Write API is optimized for high-throughput scenarios. It can handle the expected ingestion throughput of 1.5 GB per second.

Regional Tables: Using a regional BigQuery table aligns with best practices when utilizing the Storage Write API, as it helps to minimize latency and reduce potential cross-region communication costs.

Comment 13.1

ID: 1410936 User: gord_nat Badges: - Relative Date: 11 months, 2 weeks ago Absolute Date: Thu 27 Mar 2025 15:50 Selected Answer: - Upvotes: 1

Has to be multi-regional (B)
Max throughput for regional is currently only 300 MB/s.
https://cloud.google.com/bigquery/quotas

Comment 14

ID: 1349571 User: juliorevk Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Fri 31 Jan 2025 16:11 Selected Answer: B Upvotes: 1

- BigQuery Storage Write API: This API is designed for high-throughput, low-latency writing of data into BigQuery. It also provides tools to prevent data duplication, which is essential for exactly-once delivery semantics.
- The multiregional table ensures that your data is highly available and can be streamed into BigQuery across multiple regions. It is better suited for high-throughput and low-latency workloads, as it provides distributed write capabilities that can handle large data volumes, such as the 1.5 GB per second you expect to stream.

Comment 15

ID: 1332385 User: hussain.sain Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 13:11 Selected Answer: B Upvotes: 1

B is correct.
When aiming for exactly-once delivery in a Dataflow streaming job, the key is to use the BigQuery Storage Write API, as it provides the capability to handle large-scale data ingestion with the correct semantics, including exactly-once delivery.

Comment 16

ID: 1326591 User: himadri1983 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 14 Dec 2024 19:47 Selected Answer: B Upvotes: 3

3 GB per second throughput in multi-regions; 300 MB per second in regions
https://cloud.google.com/bigquery/quotas#write-api-limits

Comment 17

ID: 1325823 User: m_a_p_s Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 20:26 Selected Answer: B Upvotes: 2

streamed into BigQuery with exactly-once delivery semantics >>> Storage Write API

ingestion throughput into BigQuery to be about 1.5 GB per second >>> multiregional (check throughput rate here >>> https://cloud.google.com/bigquery/quotas#write-api-limits)

8. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 15

Sequence
18
Discussion ID
16723
Source URL
https://www.examtopics.com/discussions/google/view/16723-exam-professional-data-engineer-topic-1-question-15/
Posted By
-
Posted At
March 16, 2020, 9:37 a.m.

Question

You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially designed the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

  • A. Re-write the application to load accumulated data every 2 minutes.
  • B. Convert the streaming insert code to batch load for individual messages.
  • C. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
  • D. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 616873 User: noob_master Badges: Highly Voted Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:26 Selected Answer: D Upvotes: 11

Answer: D. It is the only option that describes a way to resolve the problem, by buffering the data.

(The question is probably old; the best approach today would be Pub/Sub + Dataflow streaming + BigQuery for streaming data instead of near real-time.)
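Option D's workaround can be stated concretely: sample the lag between a streaming insert and the data becoming queryable, then delay aggregation queries by twice the average of those samples. A minimal sketch of that heuristic (the 2x factor is simply the rule of thumb from the answer choice, not a documented guarantee):

```python
def query_delay_seconds(observed_lags: list[float]) -> float:
    """Return how long to wait after streaming inserts before querying.

    `observed_lags` are sampled delays (in seconds) between an insert
    and the row becoming visible to queries; option D's heuristic is
    to wait twice the estimated average.
    """
    if not observed_lags:
        raise ValueError("need at least one lag sample")
    return 2 * sum(observed_lags) / len(observed_lags)

# With samples of 1 s, 2 s, and 3 s, the average is 2 s, so wait 4 s:
delay = query_delay_seconds([1.0, 2.0, 3.0])
print(delay)  # 4.0
```

As several commenters note, this only reduces the chance of missing in-flight data; lag can still exceed the estimate, which is why others prefer batch loading.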

Comment 2

ID: 474426 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 4 months ago Absolute Date: Mon 08 Nov 2021 19:09 Selected Answer: - Upvotes: 7

B. Streams data into BigQuery one record at a time without needing to run a load job: https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata.insertAll method. This approach enables querying data without the delay of running a load job:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. It combines the functionality of streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery that become available for query as they are written, or to batch process an arbitrarily large number of records and commit them in a single atomic operation.
Committed mode. Records are available for reading immediately as you write them to the stream. Use this mode for streaming workloads that need minimal read latency.
https://cloud.google.com/bigquery/docs/write-api

Comment 2.1

ID: 481244 User: Abhi16820 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Fri 19 Nov 2021 02:08 Selected Answer: - Upvotes: 1

In this approach BigQuery also has a buffer, which it slowly drains and inserts into the actual table; what you said is helpful in removing the application part.

Comment 2.1.1

ID: 502500 User: MarcoDipa Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 15 Dec 2021 23:04 Selected Answer: - Upvotes: 1

could you please argue?

Comment 3

ID: 1701385 User: 29d063d Badges: Most Recent Relative Date: 2 months, 2 weeks ago Absolute Date: Tue 23 Dec 2025 17:55 Selected Answer: A Upvotes: 1

Option A Works:

Batch loading (even frequent batches like every 2 minutes) provides strong consistency
Once a batch load completes, the data is immediately and fully available for queries

Comment 4

ID: 1611673 User: nats828 Badges: - Relative Date: 5 months, 2 weeks ago Absolute Date: Tue 23 Sep 2025 13:55 Selected Answer: A Upvotes: 2

A.
Loading data in batches (using batch loads instead of streaming inserts) ensures strong consistency in BigQuery.
Batch loads are available for querying immediately after the load job completes.
This approach is recommended when you need consistent query results right after data ingestion.

Answer D (which is most voted) is a workaround, not a solution

Comment 5

ID: 1577964 User: Annie00000 Badges: - Relative Date: 8 months, 4 weeks ago Absolute Date: Mon 16 Jun 2025 14:16 Selected Answer: A Upvotes: 1

Why this is best:
Instead of inserting each message individually using streaming (which has consistency delay), you batch the data and load it using BATCH LOADS, which are strongly consistent once complete.
This reduces the chance of missing in-flight data in queries and avoids unnecessary complexity.
Loading in intervals (e.g. every 2 minutes) also helps control costs and improves query reliability.
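The every-2-minutes approach in option A is a micro-batching pattern: buffer incoming postings and issue one batch load per interval, since a batch load is strongly consistent once the load job completes. A minimal sketch with an injected clock (`load_batch` is a stand-in for submitting a BigQuery load job; the class and names are illustrative):

```python
class MicroBatcher:
    """Accumulate messages and flush them as one batch per interval."""

    def __init__(self, interval_s: float, load_batch):
        self.interval_s = interval_s
        self.load_batch = load_batch  # e.g. submits a BigQuery load job
        self.buffer = []
        self.last_flush = 0.0

    def add(self, message, now: float) -> None:
        """Buffer a message; flush when the interval has elapsed."""
        self.buffer.append(message)
        if now - self.last_flush >= self.interval_s:
            self.load_batch(self.buffer)  # strongly consistent once done
            self.buffer = []
            self.last_flush = now

loads = []
batcher = MicroBatcher(120.0, loads.append)  # 2-minute interval
batcher.add("post-1", now=10.0)   # buffered, interval not yet elapsed
batcher.add("post-2", now=125.0)  # interval elapsed: both posts flushed
batcher.add("post-3", now=130.0)  # starts the next batch
```

A production version would also flush on a timer (so a quiet topic still drains) and on buffer size, but the trade-off is the same: up to one interval of extra latency in exchange for consistent query results.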

Comment 6

ID: 1576230 User: ajit_j Badges: - Relative Date: 9 months ago Absolute Date: Tue 10 Jun 2025 11:15 Selected Answer: B Upvotes: 1

not D because we want data in report in near real-time.

Comment 7

ID: 1331528 User: Rav761 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 25 Dec 2024 12:34 Selected Answer: A Upvotes: 4

To address the issue of strong consistency and ensure your reports do not miss in-flight data after streaming inserts, you should re-write the application to load accumulated data every 2 minutes (option A).

Here's why:

By accumulating and loading data in 2-minute intervals, you can balance between real-time data processing and ensuring data consistency.

This approach allows you to process the data in manageable batches, reducing the likelihood of inconsistencies that might occur with individual streaming inserts.

It maintains a near real-time analysis capability while allowing enough time for all in-flight data to be captured and accurately represented in your reports.

This adjustment should help improve the reliability of your data analysis and reporting.

Comment 8

ID: 1319572 User: imrane1995 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Fri 29 Nov 2024 07:38 Selected Answer: A Upvotes: 2

Accumulating data and loading it periodically (e.g., every 2 minutes) via batch inserts ensures strong consistency for queries. Batch loads in BigQuery allow you to avoid the latency issues inherent to streaming inserts and guarantee data availability for queries.

Comment 9

ID: 1301209 User: GHill1982 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 20:01 Selected Answer: A Upvotes: 1

For maintaining data consistency while handling high throughput streaming inserts and subsequent aggregations in Google BigQuery, the best approach is to re-write the application to load accumulated data every 2 minutes.

Comment 10

ID: 425837 User: fire558787 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:26 Selected Answer: - Upvotes: 5

"D" seems to use the typical approximate terminology of a wrong answer. "estimate the time" (how do you do that? do you do that over different times of the day?) and "wait twice as long" (who tells you that there are not a lot of cases when lag is twice as long?). Instead, "A" seems good. You don't need to show the exact results, but an approximation thereof, but you still want consistency. So an aggregation of the data every 2 minutes is a viable thing.

Comment 11

ID: 687001 User: Parth_P Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:26 Selected Answer: D Upvotes: 2

D is correct. The problem requirement is doing analytics on real-time data. You cannot do batch processing because the business requires it to be real-time even if it makes your job simpler, so B is incorrect. Other options are not streaming.

Comment 12

ID: 746252 User: jkhong Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:25 Selected Answer: D Upvotes: 2

There are assumptions about what data quality is acceptable. If slight variations between the analytics and the actual values can be accepted, then D would be a good choice.

Many people chose B, but this also requires some form of waiting for the late data to arrive.

I think a combination of D and B can be applied, but for an initial fix, delaying the aggregation queries with D seems to make more sense. If the variance is small and some late data leakage is acceptable, we can stay with D.

If problems arise, we can always proceed to attempt B.

Comment 13

ID: 768256 User: korntewin Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:25 Selected Answer: D Upvotes: 2

The streaming data may be in pending or buffered state, where it is not immediately available before committing or flushing. Thus, we need to wait for the data to become available, or else switch to committed mode (which is not among the choices).

Comment 14

ID: 819106 User: musumusu Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:24 Selected Answer: - Upvotes: 5

Answer: D
What to learn or look for:
1. In-flight data = real-time data, i.e. still in the streaming pipeline and not yet landed in BigQuery.
2. Assume (in the best case) a Dataflow streaming pipeline is running to send data to BigQuery.
Why not option B: changing streaming to batch upload is not the business requirement; we have to stick with streaming and real-time analysis.

Option D: run the aggregation after waiting for some time (twice the lag here). How would you do it?
- There is no setting in BigQuery itself to do this, so adjust it in your pipeline (Dataflow).
- For example, add a fixed window so the aggregation query executes after 2 minutes.
Code
```java
pipeline.apply(...)
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(BigQueryIO.writeTableRows()
        .to("my_dataset.my_table"));
```

Comment 15

ID: 1131643 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 25 Jan 2024 13:19 Selected Answer: - Upvotes: 1

Answer: D
I agree with the first part of answer D, but for the second part, I don't know how they arrived at the 2 minutes; is it from a calculation?

Comment 16

ID: 1027034 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 04:54 Selected Answer: - Upvotes: 3

A. Re-write the application to load accumulated data every 2 minutes.

By accumulating data and performing a batch load every 2 minutes, you can reduce the potential inconsistency caused by streaming inserts. While this introduces a slight delay, it provides a more consistent approach than streaming each individual message. This method can still meet the near real-time requirement, and the slight delay is often acceptable in scenarios where data consistency is paramount.

Comment 17

ID: 1018794 User: Nirca Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Wed 27 Sep 2023 14:35 Selected Answer: B Upvotes: 1

BBBBB is the only option

9. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 78

Sequence
23
Discussion ID
17115
Source URL
https://www.examtopics.com/discussions/google/view/17115-exam-professional-data-engineer-topic-1-question-78/
Posted By
-
Posted At
March 21, 2020, 6:11 p.m.

Question

You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?

  • A. PigLatin using Pig
  • B. HiveQL using Hive
  • C. Java using MapReduce
  • D. Python using MapReduce

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 180366 User: IsaB Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Wed 16 Sep 2020 15:34 Selected Answer: - Upvotes: 10

Is this really a question that could appear in the Google Cloud Professional Data Engineer exam? What does it have to do with Google Cloud? I would use Dataproc, no?

Comment 1.1

ID: 205242 User: Pupina Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Sat 24 Oct 2020 19:07 Selected Answer: - Upvotes: 1

Did you take the exam? I am ready to do it this month

Comment 1.2

ID: 507639 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 23 Dec 2021 07:46 Selected Answer: - Upvotes: 2

Seems like a very old question :)
Not sure it's still relevant.

Comment 2

ID: 128213 User: dg63 Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Mon 06 Jul 2020 21:47 Selected Answer: - Upvotes: 8

A, C and D - all are valid answers. You can do checkpointing, splitting pipelines and merging pipelines with all three options.

Comment 3

ID: 1628282 User: MD_84 Badges: Most Recent Relative Date: 3 months, 2 weeks ago Absolute Date: Tue 25 Nov 2025 11:51 Selected Answer: C Upvotes: 1

The most appropriate method for writing complex ETL pipelines that require explicit checkpointing and splitting pipelines directly on an Apache Hadoop cluster is MapReduce. Java and Python are both supported, but Java is the primary language, so option C.

Comment 4

ID: 1564071 User: oliiivier Badges: - Relative Date: 10 months, 2 weeks ago Absolute Date: Sun 27 Apr 2025 07:02 Selected Answer: C Upvotes: 1

Answer: C
Explanation:
The right answer here is C. Java using MapReduce.
Quick explanation:
The question mentions a need for "checkpointing" and "splitting pipelines".
PigLatin (Pig) and HiveQL (Hive) are declarative (higher-level) languages, not made for precise control over how jobs are sequenced, checkpointed, or chained together.
MapReduce in Java gives you full control over:
Precise scheduling.
Manual checkpointing.
Splitting and chaining of steps.
Python with MapReduce is possible but far less native; Hadoop is designed primarily for Java MapReduce, so in Python it would be more complicated, more fragile, and less performant.
Quick summary to remember: 👉 Need precise control over complex ETL logic = MapReduce in Java.

Comment 5

ID: 1398902 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 16:06 Selected Answer: C Upvotes: 1

MapReduce in Java (C) allows more control for checkpointing and splitting.

Comment 6

ID: 1398844 User: desertlotus1211 Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:24 Selected Answer: C Upvotes: 1

Writing your pipeline in Java using MapReduce allows you to implement these custom controls and fine-tune the execution, ensuring robust and manageable ETL processes on your Hadoop cluster

Comment 7

ID: 1342508 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 13:44 Selected Answer: A Upvotes: 1

Pig Latin supports both splitting pipelines and checkpointing, allowing users to create complex data processing workflows with the ability to restart from specific points in the pipeline if necessary.

Comment 8

ID: 1302103 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 23 Oct 2024 17:51 Selected Answer: A Upvotes: 3

I would go with A.
C and D are similar, so both are excluded. For B, Hive is actually a data warehouse system. I don't use Apache Pig, but B, C, and D are all wrong, so A should be correct.

Comment 9

ID: 986292 User: AnonymousPanda Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 09:38 Selected Answer: A Upvotes: 1

A as others have said

Comment 10

ID: 880124 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 10:16 Selected Answer: C Upvotes: 2


Comment 11

ID: 848636 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 23 Mar 2023 21:20 Selected Answer: A Upvotes: 2

PigLatin is the correct answer, however... the last release was 6 years ago and has lots of bugs.

Comment 12

ID: 809336 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Wed 15 Feb 2023 11:02 Selected Answer: - Upvotes: 7

This answer depends on which language you are comfortable with.
Hadoop is the framework, and MapReduce is its native programming model in Java, designed for scaling, parallel processing, restarting a pipeline from any checkpoint, etc. So if you are comfortable with Java, you can customize your checkpointing at a low level in a better way. Otherwise, choose Pig, which is another programming model that runs on top of Java, but then you need to learn that as well; failing that, choose Python, which can be deployed with Hadoop because Hadoop has been regularly updating its Python client support.
Option C is the best one.

Comment 13

ID: 788942 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 26 Jan 2023 17:53 Selected Answer: - Upvotes: 4

C. Java using MapReduce or D. Python using MapReduce

Apache Hadoop is a distributed computing framework that allows you to process large datasets using the MapReduce programming model. There are several options for writing ETL pipelines to run on a Hadoop cluster, but the most common are using Java or Python with the MapReduce programming model.

Comment 13.1

ID: 788943 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 26 Jan 2023 17:53 Selected Answer: - Upvotes: 3

A. PigLatin using Pig is a high-level data flow language that is used to create ETL pipelines. Pig is built on top of Hadoop, and it allows you to write scripts in PigLatin, a SQL-like language that is used to process data in Hadoop. Pig is a simpler option than MapReduce but it lacks some capabilities like the control over low-level data manipulation operations.

B. HiveQL using Hive is a SQL-like language for querying and managing large datasets stored in Hadoop's distributed file system. Hive is built on top of Hadoop and it provides an SQL-like interface for querying data stored in Hadoop. Hive is more suitable for querying and managing large datasets stored in Hadoop than for ETL pipelines.

Both Java and Python using MapReduce provide low-level control over data manipulation operations, and they allow you to write custom mapper and reducer functions that can be used to process data in a Hadoop cluster. The choice between Java and Python will depend on the development team's expertise and preference.
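As a concrete illustration of that low-level mapper/reducer control, here is a minimal word-count pipeline in the Hadoop Streaming style, sketched in Python. In real Hadoop Streaming the two phases run as separate processes connected by a sorted pipe over stdin/stdout; this sketch wires them together in-process for clarity.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) for every word (full control over what is emitted)."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: sum counts per key; input must arrive sorted by key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
shuffled = sorted(mapper(lines))   # stands in for Hadoop's shuffle/sort step
counts = dict(reducer(shuffled))
```

Checkpointing and pipeline splitting then amount to chaining such jobs and persisting intermediate outputs between them, which custom MapReduce code lets you control step by step.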

Comment 13.1.1

ID: 905768 User: cetanx Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 13:00 Selected Answer: - Upvotes: 2

It has to be C
because while Pig can be used to simplify the writing of complex data transformation tasks and can store intermediate results, it doesn't provide the detailed control over checkpointing and pipeline splitting in the way that is typically implied by those terms.

also, while one can write MapReduce jobs in languages other than Java (like Python) using Hadoop Streaming or other similar APIs, it may not be as efficient or as seamless as using Java due to the JVM-native nature of Hadoop.

Comment 14

ID: 664182 User: Koushik25sep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 09 Sep 2022 05:24 Selected Answer: A Upvotes: 1

Description: Pig is a scripting language that can be used for checkpointing and splitting pipelines.

Comment 15

ID: 548470 User: BigDataBB Badges: - Relative Date: 4 years ago Absolute Date: Wed 16 Feb 2022 10:54 Selected Answer: - Upvotes: 1

Why not D?

Comment 16

ID: 530786 User: rbeeraka Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 21:13 Selected Answer: A Upvotes: 1

PigLatin supports checkpoints

Comment 17

ID: 530777 User: davidqianwen Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 20:49 Selected Answer: A Upvotes: 1

Answer: A

10. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 112

Sequence
24
Discussion ID
17249
Source URL
https://www.examtopics.com/discussions/google/view/17249-exam-professional-data-engineer-topic-1-question-112/
Posted By
-
Posted At
March 22, 2020, 1:58 p.m.

Question

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?

  • A. Deploy small Kafka clusters in your data centers to buffer events.
  • B. Have the data acquisition devices publish data to Cloud Pub/Sub.
  • C. Establish a Cloud Interconnect between all remote data centers and Google.
  • D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 27 comments Click to expand

Comment 1

ID: 73991 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Mon 13 Apr 2020 10:20 Selected Answer: - Upvotes: 22

C.
This is a tricky one. The issue here is the unreliable connection between the data collection and data processing infrastructure, and the goal is to resolve it in a cost-effective manner. However, it also mentions that the company is using leased lines. I think replacing the leased lines with Cloud Interconnect would solve the problem, and hopefully not be an added expense.
https://cloud.google.com/interconnect/docs/concepts/overview

Comment 1.1

ID: 101809 User: serg3d Badges: - Relative Date: 5 years, 9 months ago Absolute Date: Wed 03 Jun 2020 19:55 Selected Answer: - Upvotes: 7

Yea, this would definitely solve the issue, but it's not "the most cost-effective way". I think PubSub is the correct answer.

Comment 1.2

ID: 222886 User: snamburi3 Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Thu 19 Nov 2020 16:22 Selected Answer: - Upvotes: 3

The question also talks about a cost-effective way...

Comment 1.3

ID: 114014 User: sh2020 Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Fri 19 Jun 2020 16:01 Selected Answer: - Upvotes: 5

I agree, C is the only choice that addresses the problem. The problem is caused by the leased line. How can the Pub/Sub service resolve it? Pub/Sub will still use the leased line.

Comment 1.4

ID: 399486 User: awssp12345 Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Tue 06 Jul 2021 00:05 Selected Answer: - Upvotes: 7

DEFINITELY NOT COST EFFECTIVE. C IS THE WORST CHOICE.

Comment 2

ID: 791408 User: ayush_1995 Badges: Highly Voted Relative Date: 3 years, 1 month ago Absolute Date: Sun 29 Jan 2023 06:10 Selected Answer: B Upvotes: 10

B. Have the data acquisition devices publish data to Cloud Pub/Sub. This would provide a reliable messaging service for your event data, allowing you to ingest and process your data in a timely manner, regardless of the reliability of the leased lines. Cloud Pub/Sub also offers automatic retries and fault-tolerance, which would further improve the reliability of your event delivery. Additionally, using Cloud Pub/Sub would allow you to easily scale up or down your event processing infrastructure as needed, which would help to minimize costs.
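The automatic-retry behaviour mentioned above can be illustrated with a small exponential-backoff sketch. The `publish` callable here is a hypothetical stand-in, not the real google-cloud-pubsub API; actual Pub/Sub client libraries implement this kind of retry internally.

```python
import time

def publish_with_retry(publish, message, max_attempts=5, base_delay_s=0.5):
    """Retry a flaky publish with exponential backoff.

    `publish` is a hypothetical callable standing in for a client-library
    publish method; real Pub/Sub clients do this retrying internally.
    """
    for attempt in range(max_attempts):
        try:
            return publish(message)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                      # give up after the last attempt
            time.sleep(base_delay_s * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Simulate a leased line that fails twice, then recovers.
calls = []
def flaky_publish(msg):
    calls.append(msg)
    if len(calls) < 3:
        raise ConnectionError("leased line down")
    return "message-id-1"

result = publish_with_retry(flaky_publish, b"click", base_delay_s=0.001)
```

This is the property that makes a managed message bus attractive over an unreliable link: the sender keeps retrying until delivery is acknowledged.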

Comment 3

ID: 1625939 User: b2aaace Badges: Most Recent Relative Date: 3 months, 3 weeks ago Absolute Date: Sat 15 Nov 2025 14:07 Selected Answer: A Upvotes: 1

Option A: Deploy small Kafka clusters in your data centers
• Kafka acts as a local buffer for events.
• Sensors write to the local Kafka cluster, ensuring no data loss even if connectivity to the central processing infrastructure is slow or unreliable.
• Once connectivity is available, Kafka can replicate or forward events to the central processing pipeline.
• This is cost-effective compared to provisioning dedicated interconnects or streaming directly to cloud services with unreliable lines.
• Kafka is designed for high-throughput, low-latency buffering, making it ideal here.
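The buffer-locally-and-forward-later behaviour described in option A can be sketched with a plain in-memory queue. This is only the concept: a real Kafka cluster additionally persists the buffer to disk and replicates it across brokers.

```python
from collections import deque

class LocalEventBuffer:
    """Accept events locally while the uplink is down; drain when it returns.

    Illustration only: Kafka additionally persists this buffer to disk and
    replicates it across brokers, which an in-memory deque does not.
    """

    def __init__(self):
        self._queue = deque()

    def record(self, event):
        self._queue.append(event)              # always accepted locally

    def drain(self, send, link_up):
        """Forward buffered events while `link_up()` reports connectivity."""
        sent = 0
        while self._queue and link_up():
            send(self._queue.popleft())
            sent += 1
        return sent

buf = LocalEventBuffer()
for e in ["gps-1", "gps-2", "gps-3"]:
    buf.record(e)                              # capture during an outage

delivered = []
buf.drain(delivered.append, lambda: False)     # link down: nothing leaves
buf.drain(delivered.append, lambda: True)      # link restored: backlog flushes
```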

Comment 4

ID: 1342784 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 23:41 Selected Answer: B Upvotes: 1

The data acquisition devices are spread across the globe. How many cloud interconnects would this require? Option C is definitely not cost effective. Option B seems to make sense as it suggests publishing from the acquisition device to Pub/Sub and not from the collection infrastructure. The problem clearly stated the connection from collection infrastructure to the processing infrastructure was unreliable.

Comment 5

ID: 1231954 User: Anudeep58 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 15:41 Selected Answer: B Upvotes: 1

Option B: Have the data acquisition devices publish data to Cloud Pub/Sub.

Rationale:

Managed Service: Cloud Pub/Sub is a fully managed service, reducing the operational overhead compared to managing Kafka clusters.
Reliability and Scalability: Cloud Pub/Sub can handle high volumes of data with low latency and provides built-in mechanisms for reliable message delivery, even in the face of intermittent connectivity.
Cost-Effective: Cloud Pub/Sub offers a pay-as-you-go pricing model, which can be more cost-effective than setting up and maintaining dedicated network infrastructure like Cloud Interconnect.
Global Availability: Cloud Pub/Sub is available globally and can handle data from multiple regions efficiently.

Comment 6

ID: 1098165 User: Nandababy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 16 Dec 2023 13:43 Selected Answer: - Upvotes: 1

Even with Cloud Pub/Sub, unpredictable latency or delays could still occur due to the unreliable leased lines connecting your event collection infrastructure and event processing infrastructure. While Cloud Pub/Sub offers reliable message delivery within its own network, the handoff to your processing infrastructure is still dependent on the leased lines.
Replacing leased lines with Cloud Interconnect could potentially resolve the overall issue of unpredictable latency in event processing pipeline but it could be unnecessary expense provided data centers distributed world wide.
Cloud Pub/Sub along with other optimization techniques like Cloud VPN or edge computing might be sufficient.

Comment 7

ID: 979916 User: FP77 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 13 Aug 2023 12:04 Selected Answer: C Upvotes: 1

I don't know why B is the most voted. The issue here is unreliable connectivity and C is the perfect use-case for that

Comment 8

ID: 973025 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 14:44 Selected Answer: - Upvotes: 1

It says "with unpredictable latency", and with Pub/Sub there is no need to worry about the connection.
So B is the right one.

Comment 9

ID: 946080 User: ZZHZZH Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 08 Jul 2023 04:09 Selected Answer: C Upvotes: 1

The question is misleading. But it should be C, since that addresses the unpredictability and latency directly.

Comment 10

ID: 810774 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 15:48 Selected Answer: - Upvotes: 2

The best answer is A: by using Kafka, you can buffer the events in the data centers until a reliable connection is established with the event processing infrastructure.
But go with B, it's Google asking :P

Comment 10.1

ID: 820752 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Fri 24 Feb 2023 18:06 Selected Answer: - Upvotes: 1

I read this question again, and now I want to answer C. Buying data acquisition devices and setting them up with the sensors doesn't seem like a practical approach. An Arduino is the cheapest IoT device available on the market, at about 15 dollars, but who will open each sensor box and install it? It's a big job. This question depends on whether the IoT devices attached to the sensors need to be reprogrammed, which is a big headache. Use Cloud Interconnect to deal with the current situation, or reprogram the IoT devices if they are already connected to the sensors.

Comment 11

ID: 781635 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 19 Jan 2023 22:37 Selected Answer: - Upvotes: 2

Are they talking about GCP in this question?
Where is the event processing infrastructure?

Answer A might be correct!

Comment 12

ID: 758145 User: PrashantGupta1616 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 27 Dec 2022 06:03 Selected Answer: B Upvotes: 1

Pub/Sub is a global service.
It's important to note that the term "global" in this context refers to the geographical scope of the service.

Comment 13

ID: 746169 User: NicolasN Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 15:16 Selected Answer: A Upvotes: 3

As usual the answer is hidden somewhere in the Google Cloud Blog:
"In the case of our automotive company, the data is already stored and processed in local data centers in different regions. This happens by streaming all sensor data from the cars via MQTT to local Kafka Clusters that leverage Confluent’s MQTT Proxy."
"This integration from devices to a local Kafka cluster typically is its own standalone project, because you need to handle IoT-specific challenges like constrained devices and unreliable networks."

🔗 https://cloud.google.com/blog/products/ai-machine-learning/enabling-connected-transformation-with-apache-kafka-and-tensorflow-on-google-cloud-platform

Comment 13.1

ID: 781630 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 19 Jan 2023 22:35 Selected Answer: - Upvotes: 2

The question is asking from the on-premise infrastructure, which already has the data, to the event processing infrastructure, which is in the GCP, is unreliable....

it not asking from the sensors to the on-premise...

Comment 13.1.1

ID: 781636 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 19 Jan 2023 22:38 Selected Answer: - Upvotes: 1

I might have to retract my answer... Are they talking about GCP in this question?
where is the event processing infrastructure?

Comment 14

ID: 723607 User: piotrpiskorski Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 21 Nov 2022 15:28 Selected Answer: - Upvotes: 1

Yeah, changing the whole architecture around the world to use Pub/Sub is so much more cost-efficient than Cloud Interconnect (which is like $3k)...

It's C.

Comment 14.1

ID: 738241 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 19:41 Selected Answer: - Upvotes: 1

It's not one Cloud Interconnect, it's many: one per data center. Pub/Sub addresses all the requirements. It's B.

Comment 14.1.1

ID: 738244 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 19:44 Selected Answer: - Upvotes: 1

ALSO, the problem isn't your connection, it's the connectivity between your event collection infrastructure and your event processing infrastructure, so Pub/Sub is perfect for this.

Comment 14.2

ID: 738494 User: jkhong Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 08 Dec 2022 01:59 Selected Answer: - Upvotes: 1

Wouldn't using Cloud Interconnect also require changes to each of the data centers around the world? I don't see why there would be a huge architecture change when using Pub/Sub; the publishers would just need to push messages directly to Pub/Sub instead of pushing to their own data center.

Also, if the script for pushing messages can be standardised, the data centers can share it around.

Comment 15

ID: 668639 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 14 Sep 2022 07:51 Selected Answer: B Upvotes: 1

Cloud Pub/Sub supports batch and streaming, with push and pull capabilities.
Answer B

Comment 16

ID: 649569 User: t11 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 21 Aug 2022 03:42 Selected Answer: - Upvotes: 1

It has to be B.

Comment 17

ID: 647279 User: rr4444 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 15 Aug 2022 18:25 Selected Answer: D Upvotes: 2

Feels like everyone is wrong.

A. Deploy small Kafka clusters in your data centers to buffer events.
- Silly in a GCP cloud-native context, plus they have messaging infra anyway
B. Have the data acquisition devices publish data to Cloud Pub/Sub.
- They have messaging infra, so why? Unless they want to replace it, but that doesn't change the issue
C. Establish a Cloud Interconnect between all remote data centers and Google.
- Wrong, because Interconnect is basically a leased line. There must be some telecoms issue with it, which we can assume is unresolvable e.g. long distance remote locations and sometimes water ingress, and the telco can't justify sorting it yet, or is slow to, or something. Leased lines usually don't come with awful internet connectivity, so sound physical connectivity issue. Sure, an Interconnect is better, more direct, but a leased line should be bullet proof.
D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.
- The only way to address dodgy/delayed data delivery

11. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 31

Sequence
29
Discussion ID
17051
Source URL
https://www.examtopics.com/discussions/google/view/17051-exam-professional-data-engineer-topic-1-question-31/
Posted By
-
Posted At
March 20, 2020, 2:13 p.m.

Question

You have Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?

  • A. Update the current pipeline and use the drain flag.
  • B. Update the current pipeline and provide the transform mapping JSON object.
  • C. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
  • D. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 34 comments Click to expand

Comment 1

ID: 143863 User: VishalB Badges: Highly Voted Relative Date: 5 years, 7 months ago Absolute Date: Sun 26 Jul 2020 08:38 Selected Answer: - Upvotes: 80

Correct option: A
Explanation: This option is correct. As the key requirement is not to lose data, the Dataflow pipeline can be stopped using the Drain option. The Drain option causes Dataflow to stop any new processing, but allows the existing processing to complete.

Comment 1.1

ID: 143865 User: VishalB Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Sun 26 Jul 2020 08:44 Selected Answer: - Upvotes: 3

Options C & D are incorrect, as the Cancel option will lead to losing data.
Option B is very close: since the new code makes the pipeline incompatible, you can handle this by providing the transform mapping JSON file.

Comment 1.1.1

ID: 470257 User: sergio6 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 30 Oct 2021 15:49 Selected Answer: - Upvotes: 2

A is incorrect because updating a pipeline does not include any drain flag.

Comment 1.1.1.1

ID: 532211 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 25 Jan 2022 15:54 Selected Answer: - Upvotes: 2

Two steps: first drain the job with the SDK or console, then update the pipeline, because it is OK to update a job while it is draining.

Comment 1.1.1.1.1

ID: 689941 User: maxdataengineer Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sun 09 Oct 2022 08:33 Selected Answer: - Upvotes: 1

Yes, but the compatibility problem will still be there; stopping the pipeline does not solve that.

Comment 1.1.2

ID: 532217 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 25 Jan 2022 16:01 Selected Answer: - Upvotes: 6

There are 5 update scenarios in the update-a-pipeline context:
a. Changing a transform name (requires mapping), or adding a new step (no mapping needed).
b. Windowing or triggering (only for minor changes; otherwise don't do it).
c. Coders (don't do it).
d. Schema (adding a field, or changing required to nullable, is possible; other scenarios are not).
e. Stateful operations.

None of them is relevant here, because there is no specific detail, and secondly the new pipeline is incompatible.

Mostly, when the pipelines are incompatible, only (a) has a fix, and not for all cases. So drain == no data loss (ingesting, buffered, and in-flight data) is the only scenario.

Comment 1.2

ID: 689940 User: maxdataengineer Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sun 09 Oct 2022 08:32 Selected Answer: - Upvotes: 3

As you said, Drain stops the pipeline but does not solve the compatibility issue. The pipeline will not be able to be updated, which is the core problem of the question.

Comment 1.2.1

ID: 722048 User: assU2 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 19 Nov 2022 15:23 Selected Answer: - Upvotes: 2

"You do not want to lose any data when making this update" is the core problem. You are making the update anyway.

Comment 1.3

ID: 437906 User: sergio6 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 02 Sep 2021 15:30 Selected Answer: - Upvotes: 7

C and D are incorrect because canceling the old pipeline can cause data loss
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
A is incorrect because updating a pipeline does not include any drain flag.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline

Comment 1.3.1

ID: 470255 User: sergio6 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 30 Oct 2021 15:47 Selected Answer: - Upvotes: 5

B is correct: Update the current pipeline and provide the transform mapping JSON object.
Dataflow always performs a compatibility check between the old and new job, and without the mapping (necessary since the old and new jobs are incompatible) it would give an error and the old job would continue to run.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#Mapping
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#CCheck

Comment 1.3.1.1

ID: 532208 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 25 Jan 2022 15:50 Selected Answer: - Upvotes: 2

"New pipeline is incompatible" means the compatibility check will fail, so you will not be able to update to the new pipeline.

That's why B cannot be a valid answer in this context.

Comment 1.3.1.1.1

ID: 689943 User: maxdataengineer Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sun 09 Oct 2022 08:34 Selected Answer: - Upvotes: 2

B is a way to solve compatibility issues

Comment 1.3.2

ID: 532205 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 25 Jan 2022 15:48 Selected Answer: - Upvotes: 2

Drain is in the stopping-a-pipeline guide; the updating-a-pipeline guide alone is not enough to evaluate this question.

That's why draining is not a flag in a pipeline update; it is a process for stopping a pipeline without data loss!

Data in Dataflow is in 3 stages: ingestion data, buffered data, and in-flight data being processed by the old pipeline.

Comment 1.4

ID: 494419 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sun 05 Dec 2021 15:28 Selected Answer: - Upvotes: 33

To all the new guys here: please don't get confused by all the arguments above. Just search for the question and you will find the correct answer on many websites; still, I recommend this site for questions. For this particular problem the answer is A. The reason is here --> https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#python
Take the time to read the full page on when to use Update with a JSON mapping and when to use Drain (you will get a follow-up question on the Drain option though).
The thumb rule is this:
# For any major change to a windowing transformation in Beam/Dataflow (like completely changing the window fn from fixed to sliding), or when you want to stop the pipeline but keep in-flight data --> use the Drain option.
# For all other use cases and minor changes to the windowing fn (like just changing the window duration of a sliding window) --> use Update with a JSON mapping.

In this case it is a code change to a new version, so: Update with JSON mapping. Simple as that.

All the Best Guys.
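The "Update with JSON mapping" flow above can be made concrete. A minimal stdlib sketch of assembling the replacement-job flags, where the transform names and job name are hypothetical (the real old -> new names come from your own pipeline code):

```python
import json

# Hypothetical old -> new transform names; real names come from your pipeline code.
transform_mapping = {
    "FormatClicks": "FormatClickEvents",
    "WriteToOutput": "WriteToAdsTopic",
}

# A replacement job is launched with --update plus the mapping, so Dataflow can
# carry state from the old job's transforms over to the renamed ones.
update_args = [
    "--update",
    "--job_name=clickstream-ingest",  # must match the running job's name
    "--transform_name_mapping=" + json.dumps(transform_mapping),
]
print(update_args[2])
```

If the mapping is omitted for renamed transforms, the compatibility check fails and the old job keeps running, which is the behavior sergio6 describes above.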

Comment 1.4.1

ID: 494916 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Mon 06 Dec 2021 06:23 Selected Answer: - Upvotes: 6

SORRY I MEANT TO SAY ANS IS 'B'. In this case it is Code change to new version. so, Update with Json mapping.

Comment 1.4.1.1

ID: 572405 User: anji007 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Mon 21 Mar 2022 18:44 Selected Answer: - Upvotes: 7

It's clearly mentioned in the question that the pipeline is incompatible; if so, you cannot update with a JSON mapping. The only way is to stop the pipeline with Drain and replace it with a new one. So the closest answer is A only.

Comment 1.4.1.1.1

ID: 689944 User: maxdataengineer Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sun 09 Oct 2022 08:35 Selected Answer: - Upvotes: 1

JSON Mapping is a way to solve compatibility issues when updating

Comment 2

ID: 234368 User: IKGx1iGetOWGSjAQDD2x3 Badges: Highly Voted Relative Date: 5 years, 3 months ago Absolute Date: Thu 03 Dec 2020 22:04 Selected Answer: - Upvotes: 8

Answer is D.
* A and B are not possible, since the new job is not compatible.
* C might lead to lost data.
* D might lead to data being processed twice, but no data will be lost.
Better would usually be to drain and start a new pipeline.

Comment 3

ID: 1622395 User: b2aaace Badges: Most Recent Relative Date: 4 months, 1 week ago Absolute Date: Sun 02 Nov 2025 21:14 Selected Answer: B Upvotes: 1

B is correct: Update the current pipeline and provide the transform mapping JSON object.

Comment 4

ID: 1578880 User: leticiaarj Badges: - Relative Date: 8 months, 3 weeks ago Absolute Date: Thu 19 Jun 2025 14:29 Selected Answer: D Upvotes: 1

This is the Google Cloud recommended way to transition between streaming pipeline versions without data loss:
1. Create a new subscription on the same Pub/Sub topic.
2. New subscription starts receiving messages from the moment of creation.
3. Create a new pipeline with this new subscription.
4. After validation, cancel the old pipeline and subscribe if necessary.

Comment 5

ID: 1410161 User: abhaya2608 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Tue 25 Mar 2025 21:46 Selected Answer: B Upvotes: 1

Please refer the google doc link below,
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain

Comment 6

ID: 1398874 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 15:21 Selected Answer: D Upvotes: 1

Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline

Comment 7

ID: 1326167 User: dans_puts Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 13 Dec 2024 16:03 Selected Answer: D Upvotes: 1

D. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline:

By creating a new subscription, the new pipeline will consume messages independently of the old pipeline. This ensures no data is lost as messages published to Pub/Sub are delivered to all subscriptions.
Once the new pipeline is verified to be running as expected, the old pipeline can be safely canceled.

Comment 8

ID: 1315389 User: Smakyel79 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 20 Nov 2024 17:33 Selected Answer: - Upvotes: 1

Why option D is better for this case - In this scenario: the pipeline is incompatible with the old one; running the two pipelines concurrently ensures no data loss and allows for easier debugging of the new pipeline; a new subscription ensures the old pipeline can finish processing its messages while the new pipeline starts fresh

Comment 9

ID: 1315379 User: Smakyel79 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 20 Nov 2024 17:25 Selected Answer: - Upvotes: 1

Draining a pipeline stops it in an orderly manner but does not address incompatibility issues. Once the pipeline is drained, no more data is processed, and the new pipeline starts fresh. This can lead to data loss if there are messages in Pub/Sub that the drained pipeline didn't process.

Comment 10

ID: 1132471 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 12:28 Selected Answer: - Upvotes: 2

Option: C
Draining stops consumption from the subscription entirely while allowing the existing data to finish processing; while the pipeline is stopped, we would lose streaming data. The best option is to create a new pipeline connected to the same subscription, then drain the old pipeline and end it. That way we capture all the streaming data.

Comment 11

ID: 1098051 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 16 Dec 2023 11:06 Selected Answer: A Upvotes: 1

Drain flag: This flag allows the pipeline to finish processing all existing data in the Pub/Sub subscription before shutting down. This ensures no data is lost during the update.
Current pipeline: Updating the current pipeline minimizes disruption and avoids setting up entirely new infrastructure.
Incompatible changes: Even with incompatible changes, the drain flag ensures existing data is processed correctly.

Comment 11.1

ID: 1098052 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 16 Dec 2023 11:06 Selected Answer: - Upvotes: 1

While other options might work in some cases, they have drawbacks:

B. Transform mapping JSON: This option is mainly for schema changes and doesn't guarantee data completion before shutdown.
C. New pipeline, same subscription: This risks duplicate processing of data if both pipelines run concurrently.
D. New pipeline, new subscription: This loses the current pipeline's state and potentially data, making it impractical for incompatible changes.
Therefore, the most reliable and data-safe approach is to update the current pipeline with the drain flag for seamless transition and data integrity.

Remember, always test updates in a staging environment before deploying to production.

Comment 12

ID: 1096406 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 14 Dec 2023 12:36 Selected Answer: C Upvotes: 2

Same Cloud Pub/Sub Subscription: By using the same Cloud Pub/Sub subscription for the new pipeline, you ensure that no messages are lost during the transition. Pub/Sub manages message delivery, ensuring that unacknowledged messages (those that haven't been processed by your old pipeline) will be available for the new pipeline to process.

Creating a New Pipeline: Since the update makes the new pipeline incompatible with the current version, it's necessary to create a new pipeline. Attempting to update the current pipeline in place (options A and B) would not be feasible due to compatibility issues.

Cancel the Old Pipeline: Once the new pipeline is up and running and processing messages, you can safely cancel the old pipeline. This ensures a smooth transition with no data loss.

Comment 13

ID: 1087725 User: JOKKUNO Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 15:51 Selected Answer: - Upvotes: 1

In order to make an update to a Google Cloud Dataflow streaming pipeline without losing any data, the recommended approach is:

A. Update the current pipeline and use the drain flag.

Explanation:

The drain flag is designed to allow the current pipeline to finish processing any remaining data before shutting down. This helps ensure that no data is lost during the update process.
By updating the current pipeline and using the drain flag, you allow the pipeline to complete its current processing before the update takes effect, minimizing the risk of data loss.
This approach is a safe way to transition from the old version to the new version without interrupting data processing.

Comment 14

ID: 1076383 User: axantroff Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 15:46 Selected Answer: - Upvotes: 1

I would vote for A because of the structure of the exam, but there are other options worth considering as well

Comment 15

ID: 1065203 User: RT_G Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 23:42 Selected Answer: C Upvotes: 1

My answer is C. Chatted with ChatGPT and narrowed down on this option. Let me know your thoughts on this perspective.
Option C - By using the existing subscription, you can ensure that the data flow remains uninterrupted, and there is no loss of data during the transition from the old pipeline to the new one.

Creating a new pipeline that uses the same Cloud Pub/Sub subscription allows for a seamless transition without any interruptions to the data flow. This approach ensures that the new pipeline can continue to consume data from the same subscription as the old pipeline, thereby maintaining data continuity throughout the update process.

Comment 16

ID: 1064932 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 16:24 Selected Answer: A Upvotes: 1

Correct option: A
Explanation: This option is correct as the key requirement is not to lose the data; the Dataflow pipeline can be stopped using the Drain option.

Comment 17

ID: 1053171 User: mk_choudhary Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 24 Oct 2023 23:10 Selected Answer: - Upvotes: 1

It should be B
Drain will only stop the existing job; it does not apply the updated schema.
To bring the updated schema into effect, the updated JSON mapping needs to be applied.

12. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 5

Sequence
32
Discussion ID
16637
Source URL
https://www.examtopics.com/discussions/google/view/16637-exam-professional-data-engineer-topic-1-question-5/
Posted By
-
Posted At
March 15, 2020, 8:14 a.m.

Question

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values
(CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

  • A. Use federated data sources, and check data in the SQL query.
  • B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
  • C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
  • D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 213167 User: Radhika7983 Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Thu 05 Nov 2020 04:13 Selected Answer: - Upvotes: 17

The answer is D. An ETL pipeline will be implemented for this scenario. Check out handling invalid inputs in cloud data flow

https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow

ParDos . . . and don’ts: handling invalid inputs in Dataflow using Side Outputs as a “Dead Letter” file

Comment 1.1

ID: 734899 User: jkhong Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 04 Dec 2022 08:11 Selected Answer: - Upvotes: 5

The sources you've provided cannot be accessed. Here is an updated best practice. https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#use_dead_letter_queues

Comment 1.1.1

ID: 1263850 User: nadavw Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 11 Aug 2024 09:13 Selected Answer: - Upvotes: 1

https://cloud.google.com/dataflow/docs/guides/write-to-bigquery

It's a good practice to send the errors to a dead-letter queue or table, for later processing. For more information about this pattern, see BigQueryIO dead letter pattern.

Comment 2

ID: 1060874 User: rocky48 Badges: Highly Voted Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:05 Selected Answer: D Upvotes: 7

Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities and you'll have to do it on the SQL query, which could slow down the performance.

Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline, but it can't handle corrupted or incorrectly formatted data.

Option C is incorrect because using gcloud CLI and setting max_bad_records to 0 will ignore the corrupted or incorrectly formatted data and continue the load process, this will lead to incorrect analysis.

Answer D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
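To illustrate why C behaves badly: a load of this kind is typically issued with the bq tool, and with --max_bad_records=0 the whole load job fails on the first malformed row instead of capturing it anywhere for analysis. A CLI sketch with made-up bucket and table names:

```shell
# Fails the entire load as soon as one row is malformed; nothing is kept
# for later inspection, unlike the Dataflow dead-letter approach in D.
bq load --source_format=CSV --max_bad_records=0 \
    mydataset.daily_dump gs://example-bucket/dump-*.csv
```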

Comment 3

ID: 1618136 User: 3244fd8 Badges: Most Recent Relative Date: 4 months, 3 weeks ago Absolute Date: Sun 19 Oct 2025 10:52 Selected Answer: D Upvotes: 1

D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Comment 4

ID: 1399880 User: willyunger Badges: - Relative Date: 12 months ago Absolute Date: Tue 18 Mar 2025 00:06 Selected Answer: D Upvotes: 1

"you want to keep the data"

Comment 5

ID: 1362315 User: Ahamada Badges: - Relative Date: 1 year ago Absolute Date: Wed 26 Feb 2025 22:45 Selected Answer: D Upvotes: 1

You should transform the raw data and eliminate the errors before analysis.

Comment 6

ID: 1050467 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:05 Selected Answer: D Upvotes: 2

Google Cloud Dataflow allows you to create a data pipeline that can preprocess and transform data before loading it into BigQuery. This approach will enable you to handle problematic rows, push them to a dead-letter table for later analysis, and load the valid data into BigQuery.

Option A (using federated data sources and checking data in the SQL query) can be used but doesn't directly address the issue of handling corrupted or incorrectly formatted rows.

Options B and C are not the best choices for handling data quality and error issues. Enabling monitoring and setting max_bad_records to 0 in BigQuery may help identify errors but won't store the problematic rows for further analysis, and it might prevent loading any data with issues, which may not be ideal.

Comment 7

ID: 784800 User: samdhimal Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:05 Selected Answer: - Upvotes: 2

D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

By running a Cloud Dataflow pipeline to import the data, you can perform data validation, cleaning and transformation before it gets loaded into BigQuery. Dataflow allows you to handle corrupted or incorrectly formatted rows by pushing them to another dead-letter table for analysis. This way, you can ensure that only clean and correctly formatted data is loaded into BigQuery for analysis.
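The dead-letter routing described above can be sketched without Beam. A minimal stdlib simulation of the side-output idea, assuming a simple three-column CSV (in a real pipeline a ParDo would emit bad rows to a tagged output feeding the dead-letter table):

```python
import csv
import io

def route_rows(csv_text, expected_cols=3):
    """Split CSV rows into a clean batch and a dead-letter batch, a simplified
    stand-in for Dataflow's side-output / dead-letter pattern."""
    good, dead = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == expected_cols and all(field.strip() for field in row):
            good.append(row)   # would be written to the main BigQuery table
        else:
            dead.append(row)   # would be written to the dead-letter table
    return good, dead

good, dead = route_rows("1,alice,10\n2,bob\n3,carol,30\n")
```

Here the short row lands in the dead-letter batch for later analysis while the valid rows proceed, which is exactly the behavior options A-C cannot provide.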

Comment 7.1

ID: 784801 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:54 Selected Answer: - Upvotes: 5

Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities and you'll have to do it on the SQL query, which could slow down the performance.

Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline, but it can't handle corrupted or incorrectly formatted data.

Option C is incorrect because using gcloud CLI and setting max_bad_records to 0 will ignore the corrupted or incorrectly formatted data and continue the load process, this will lead to incorrect analysis.

Comment 7.1.1

ID: 962828 User: hamza101 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 25 Jul 2023 16:49 Selected Answer: - Upvotes: 2

For option C, I think setting max_bad_records to 0 will prevent the load from completing, since the load is aborted if there is even one corrupted row.

Comment 8

ID: 1065062 User: RT_G Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:04 Selected Answer: D Upvotes: 1

All other options only alert or error out bad data. As the question requires, option D sends bad data to the dead letter table for further analysis while valid data is loaded to the table

Comment 9

ID: 901961 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 19 May 2023 15:19 Selected Answer: D Upvotes: 1

Agreed: D

Comment 10

ID: 849552 User: odiez3 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 24 Mar 2023 19:40 Selected Answer: - Upvotes: 1

D, because you need to transform the data.

Comment 11

ID: 810164 User: Morock Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 02:33 Selected Answer: D Upvotes: 3

D. The question is asking pipeline, then let’s build a pipeline.

Comment 12

ID: 696382 User: Besss Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 16 Oct 2022 18:33 Selected Answer: D Upvotes: 1

Agreed: D

Comment 13

ID: 641546 User: Dip1994 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Wed 03 Aug 2022 05:35 Selected Answer: - Upvotes: 1

The correct answer is D

Comment 14

ID: 559271 User: Arkon88 Badges: - Relative Date: 4 years ago Absolute Date: Wed 02 Mar 2022 09:33 Selected Answer: D Upvotes: 1

Correct - D, as we need to create a pipeline, which is possible via D.

Comment 15

ID: 473992 User: MaxNRG Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sun 07 Nov 2021 18:51 Selected Answer: - Upvotes: 3

Looks like D: with C you will not import anything, Stackdriver alerts will not help you with this, and with federated sources you won't know what happened to those bad records. D is the most complete one.
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow

Comment 16

ID: 462016 User: anji007 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Thu 14 Oct 2021 14:40 Selected Answer: - Upvotes: 1

Ans: D

Comment 17

ID: 451707 User: nickozz Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sun 26 Sep 2021 09:31 Selected Answer: - Upvotes: 1

D seems to be correct. explained here how combined wth Pub/Sub, this can be achieved. https://cloud.google.com/pubsub/docs/handling-failures

13. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 268

Sequence
37
Discussion ID
130219
Source URL
https://www.examtopics.com/discussions/google/view/130219-exam-professional-data-engineer-topic-1-question-268/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 6:31 p.m.

Question

You created a new version of a Dataflow streaming data ingestion pipeline that reads from Pub/Sub and writes to BigQuery. The previous version of the pipeline that runs in production uses a 5-minute window for processing. You need to deploy the new version of the pipeline without losing any data, creating inconsistencies, or increasing the processing latency by more than 10 minutes. What should you do?

  • A. Update the old pipeline with the new pipeline code.
  • B. Snapshot the old pipeline, stop the old pipeline, and then start the new pipeline from the snapshot.
  • C. Drain the old pipeline, then start the new pipeline.
  • D. Cancel the old pipeline, then start the new pipeline.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 1114680 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 18:04 Selected Answer: C Upvotes: 8

- Graceful Data Transition: Draining the old pipeline ensures it processes all existing data in its buffers and watermarks before shutting down, preventing data loss or inconsistencies.
- Minimal Latency Increase: The latency increase will be limited to the amount of time it takes to drain the old pipeline, typically within the acceptable 10-minute threshold.
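For reference, the drain described above is requested against the running job from the CLI; a sketch with a placeholder job ID and region (this is a cloud-side operation, not runnable locally):

```shell
# Ask Dataflow to stop pulling new data and finish processing in-flight data.
gcloud dataflow jobs drain 2024-01-01_12_00_00-1234567890 \
    --region=us-central1
```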

Comment 2

ID: 1147610 User: AlizCert Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sun 11 Feb 2024 20:32 Selected Answer: - Upvotes: 8

I don't think C is correct, as it will immediately fire the window:
"Draining can result in partially filled windows. In that case, if you restart the drained pipeline, the same window might fire a second time, which can cause issues with your data. "
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects

Maybe "A" means launching a replacement job?
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#Launching

Comment 2.1

ID: 1305178 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 22:02 Selected Answer: - Upvotes: 1

we don't restart the drained pipeline.

Comment 2.2

ID: 1181875 User: d11379b Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 18:06 Selected Answer: - Upvotes: 3

So why not B? It is the better choice to save intermediate state and is easy to use.

Comment 3

ID: 1614485 User: Kadhem Badges: Most Recent Relative Date: 5 months, 1 week ago Absolute Date: Fri 03 Oct 2025 09:29 Selected Answer: B Upvotes: 1

Draining can result in partially filled windows. In that case, if you restart the drained pipeline, the same window might fire a second time, which can cause issues with your data. For example, in the following scenario, files might have conflicting names, and data might be overwritten.
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline

Comment 4

ID: 1613155 User: nimo9977 Badges: - Relative Date: 5 months, 2 weeks ago Absolute Date: Sun 28 Sep 2025 15:51 Selected Answer: B Upvotes: 1

Drain (C)

Avoids in-flight loss, but can create inconsistent results due to early window closure.

This violates “no inconsistencies.”

Snapshot (B)

Captures state + offsets, lets new job continue without gaps or duplicates.

Meets “no data loss / no inconsistencies.”

Latency bump stays well under 10 minutes in practice.

Comment 5

ID: 1606086 User: judy_data Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Thu 04 Sep 2025 13:07 Selected Answer: B Upvotes: 1

A snapshot captures the exact ack state of the subscription. Starting the new pipeline from that snapshot guarantees no data loss and no duplicates, preserving window consistency even if your windowing/logic changed.
(C) Draining waits for in-flight windows/watermarks to complete; with a 5-minute window this can easily exceed the allowed +10 minutes latency and still risks a gap while you switch pipelines.

Comment 6

ID: 1316803 User: petulda Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 23 Nov 2024 22:09 Selected Answer: - Upvotes: 2

Why not B?
https://cloud.google.com/dataflow/docs/guides/upgrade-guide#stop-and-replace

Comment 7

ID: 1261692 User: STEVE_PEGLEG Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 06 Aug 2024 15:58 Selected Answer: C Upvotes: 1

There is requirement to avoid data loss.

https://cloud.google.com/dataflow/docs/guides/upgrade-guide#stop-and-replace
"To avoid data loss, in most cases, draining is the preferred action."

Comment 8

ID: 1228910 User: Ouss_123 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 12 Jun 2024 10:58 Selected Answer: C Upvotes: 2

- Draining the old pipeline ensures that it finishes processing all in-flight data before stopping, which prevents data loss and inconsistencies.
- After draining, you can start the new pipeline, which will begin processing new data from where the old pipeline left off.
- This approach maintains a smooth transition between the old and new versions, minimizing latency increases and avoiding data gaps or overlaps.

==> Other options, such as updating, snapshotting, or canceling, might not provide the same level of consistency and could lead to data loss or increased latency beyond the acceptable 10-minute window. Draining is the safest method to ensure a seamless transition.

Comment 9

ID: 1181876 User: d11379b Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 18:09 Selected Answer: B Upvotes: 2

I would choose B as mentioned by Alizcert, a simple drain may cause problem
Dataflow snapshots save the state of a streaming pipeline, which lets you start a new version of your Dataflow job without losing state. Snapshots are useful for backup and recovery, testing and rolling back updates to streaming pipelines, and other similar scenarios.

Comment 10

ID: 1174782 User: hanoverquay Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 07:30 Selected Answer: C Upvotes: 1

C option

Comment 11

ID: 1121779 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:09 Selected Answer: C Upvotes: 1

Option C, draining the old pipeline solves all requests

Comment 12

ID: 1112973 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 18:31 Selected Answer: C Upvotes: 2

C. Drain the old pipeline, then start the new pipeline.

14. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 26

Sequence
40
Discussion ID
16288
Source URL
https://www.examtopics.com/discussions/google/view/16288-exam-professional-data-engineer-topic-1-question-26/
Posted By
jvg637
Posted At
March 11, 2020, 7:18 p.m.

Question

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?

  • A. Grant the consultant the Viewer role on the project.
  • B. Grant the consultant the Cloud Dataflow Developer role on the project.
  • C. Create a service account and allow the consultant to log on with it.
  • D. Create an anonymized sample of the data for the consultant to work with in a different project.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 25 comments Click to expand

Comment 1

ID: 62606 User: jvg637 Badges: Highly Voted Relative Date: 6 years ago Absolute Date: Wed 11 Mar 2020 19:18 Selected Answer: - Upvotes: 77

The Answer should be B. The Dataflow developer role will not provide access to the underlying data.

Comment 1.1

ID: 65170 User: cleroy Badges: - Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 12:53 Selected Answer: - Upvotes: 5

Remember, he's an external consultant. You need to create a service account for him; you can't grant roles before that... I think C is correct in this case.

Comment 1.1.1

ID: 130963 User: Rajuuu Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Thu 09 Jul 2020 22:16 Selected Answer: - Upvotes: 36

Service accounts are meant for applications and non-human access.

Comment 1.1.1.1

ID: 530735 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 19:16 Selected Answer: - Upvotes: 1

You can enable a service account as a user so that externals can use it to log in.

Comment 1.1.1.2

ID: 530740 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 19:26 Selected Answer: - Upvotes: 3

You can enable a service account as a user so that externals can use it to log in.

But the problem is that a service account is about login, not about the minimum permissions needed to do the Dataflow-related tasks, so C is not enough!

So the answer should be B.

If the question were about "doing the 1st thing", then yes, creating a service account could be the first step.

Comment 1.2

ID: 968943 User: VincentMenzel Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 01 Aug 2023 12:22 Selected Answer: - Upvotes: 4

I'm not sure how you expect the consultant to implement a pipeline without having access to any of the data being processed. Having test data is a prerequisite.

Comment 1.3

ID: 867107 User: ThorstenStaerk Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 11 Apr 2023 10:50 Selected Answer: - Upvotes: 5

and now? For seeing test data, (D) would be right. And the system tells me (C) is the right answer. What shall I click in the exam?

Comment 1.4

ID: 614288 User: willymac2 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Fri 10 Jun 2022 03:35 Selected Answer: - Upvotes: 4

The answer should be D.
You do not need any DataFlow permission to implement a pipeline.
If needed, you can test using the DirectRunner which runs locally:

ttps://cloud.google.com/dataflow/docs/concepts/access-control#example_role_assignment

Comment 1.4.1

ID: 614291 User: willymac2 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Fri 10 Jun 2022 03:40 Selected Answer: - Upvotes: 3

Sorry I did a wrong copy/paste on the link, I wanted to send:

https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#security_and_permissions_for_local_pipelines

https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options#LocalExecution

Comment 2

ID: 458175 User: Anirkent Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Wed 06 Oct 2021 12:15 Selected Answer: - Upvotes: 8

Not sure how it could be anything apart from D. If I put myself in the developer's shoes, then without seeing the data how can I develop any logic, let alone a complex one? And if I have access through any means (e.g. a service account), then I can just print the data to the logs and see it anyway. So option D appears to be the only option.

Comment 3

ID: 1607585 User: brokeTechBro Badges: Most Recent Relative Date: 6 months ago Absolute Date: Tue 09 Sep 2025 21:03 Selected Answer: D Upvotes: 1

A. Viewer role: Still exposes all resources and metadata in the project, too broad.

B. Dataflow Developer role: Lets them run pipelines on the real project (and potentially access private data).

C. Service account sharing: A security anti-pattern; violates identity/accountability.

D. Anonymized sample in a separate project: Safest and compliant option—consultant can code and test transformations without ever seeing sensitive data.

Comment 4

ID: 1602256 User: brokeTechBro Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 11:51 Selected Answer: D Upvotes: 1

I think only D really covers users' privacy

Comment 5

ID: 1578266 User: Annie00000 Badges: - Relative Date: 8 months, 4 weeks ago Absolute Date: Tue 17 Jun 2025 12:28 Selected Answer: D Upvotes: 1

You’re dealing with sensitive user data (possibly PII or other regulated content).
Principle of least privilege and data minimization are key.
External consultants should never be given access to production data unless absolutely necessary — and in this case, it's not.
Creating an anonymized or synthetic dataset lets them do their work (e.g., test code or build transformations) without compromising real user data.
Using a separate project also provides better audit boundaries and isolation.
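As a rough sketch of what such an anonymized sample could look like in practice (field names like user_id and email are hypothetical; on GCP you would more likely use Cloud DLP / Sensitive Data Protection de-identification), a salted hash keeps identifiers consistent without exposing them:

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # kept out of the consultant's project

def pseudonymize(row, pii_fields=("user_id", "email")):
    """Return a copy of the row with PII fields replaced by salted hashes."""
    out = dict(row)
    for field in pii_fields:
        if field in out:
            digest = hashlib.sha256((SALT + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]  # truncated hash; still consistent across rows
    return out

row = {"user_id": "u123", "email": "a@example.com", "amount": 42}
anon = pseudonymize(row)  # non-PII fields such as amount pass through unchanged
```

Because the hash is deterministic, the consultant can still join tables on the pseudonymized keys while never seeing the real identifiers.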

Comment 6

ID: 1575247 User: theRafael7 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Fri 06 Jun 2025 10:30 Selected Answer: D Upvotes: 1

D because the data is anonymized and that satisfies the privacy part of the question. It also isolates the consultant's work from your actual work. This is standard for privacy and the consultant's work will not impact your actual project.

Comment 7

ID: 1364883 User: Abizi Badges: - Relative Date: 1 year ago Absolute Date: Tue 04 Mar 2025 13:02 Selected Answer: D Upvotes: 1

logical answer for me

Comment 8

ID: 1287244 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 21 Sep 2024 10:53 Selected Answer: D Upvotes: 5

The answer cannot be B, because B is too restrictive: it can only create and manage Dataflow jobs but cannot view data. I acknowledge that is secure, but no consultant can do the job without seeing representative test data. D is the only option that provides enough to do the job while still remaining totally private.

Comment 9

ID: 1285978 User: mouthwash Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 18 Sep 2024 21:59 Selected Answer: - Upvotes: 3

D cannot be the answer because the question clearly states the developer has to work in your project. Creating another project is not in scope and is a waste of time. Correct answer is B. Developer role has developer rights only, no view rights.

Comment 10

ID: 1259009 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 31 Jul 2024 19:20 Selected Answer: D Upvotes: 4

A. Grant the consultant the Viewer role on the project.
This role provides read-only access to all resources in the project, which could expose sensitive data to the consultant, violating privacy principles.

B. Grant the consultant the Cloud Dataflow Developer role on the project.
This role allows the consultant to create and manage Dataflow jobs but does not give them access to the underlying data. It is not sufficient, because the developer still needs data.

C. Create a service account and allow the consultant to log on with it.
Allowing the consultant to log on with a service account could grant them access to sensitive data if the service account has broad permissions. This approach does not address the need to limit data exposure.

D. Create an anonymized sample of the data for the consultant to work with in a different project.
This fits the requirements.

Comment 11

ID: 1172220 User: Shash_88 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Wed 13 Mar 2024 04:49 Selected Answer: - Upvotes: 1

D.
B is a good option for maintaining the privacy of sensitive data, but the consultant also needs some test data to validate the transformation logic, so creating sample data and allowing him to test in another project seems good.

Comment 12

ID: 1147766 User: hamzad_basha Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 12 Feb 2024 02:27 Selected Answer: B Upvotes: 1

Dataflow data privacy rules can't allow the developer to see the data. He/she just designs the pipelines and the flow of interdependent tasks.

Comment 13

ID: 1098032 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 16 Dec 2023 10:41 Selected Answer: B Upvotes: 3

B, as the Dataflow Developer role would give the third-party consultant access to create and work on the Dataflow pipeline. However, it does not provide access to view the data, thus maintaining users' privacy.
Refer GCP documentation - Dataflow roles:
https://cloud.google.com/dataflow/docs/concepts/access-control#roles
Option A is wrong as it would not allow the consultant to work on the pipeline.
Option C is wrong as the consultant cannot use the service account to login.
Option D is wrong as it does not enable collaboration.

Comment 14

ID: 1087900 User: Jconnor Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 20:29 Selected Answer: - Upvotes: 1

C and A will not maintain users' privacy, so they're out. B without data will not be enough. D will give good sample data, maintain privacy, and the consultant will help create the Dataflow pipeline for the project as requested. So D.

Comment 15

ID: 1076362 User: axantroff Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 15:22 Selected Answer: D Upvotes: 1

I follow this logic when choosing between B and D:

Yes, with the Dataflow Developer role it is possible to execute and manipulate Dataflow jobs, but do we need them to execute it? Based on my understanding, we only need their help writing it. Is that possible without having access to test data? I don't think so. At the same time, we need to anonymize that data. So answer D is more appropriate for me.

Comment 16

ID: 1064171 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 06 Nov 2023 19:53 Selected Answer: D Upvotes: 1

By creating an anonymized sample of the data, you can provide the consultant with a realistic dataset that doesn't contain sensitive or private information. This way, the consultant can work on the project without direct access to sensitive data, reducing privacy risks.

Options A and B involve granting the consultant access to the project, which may expose sensitive data, even if they have limited permissions.

Option C involves creating a service account, but it doesn't address the need to anonymize the data or provide a separate, safe environment for the consultant to work with.

Option D provides a controlled environment that allows the consultant to work effectively while maintaining data privacy.

Comment 17

ID: 1050533 User: rtcpost Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 22 Oct 2023 14:06 Selected Answer: D Upvotes: 2

D. Create an anonymized sample of the data for the consultant to work within a different project.

By creating an anonymized sample of the data, you can provide the consultant with a realistic dataset that doesn't contain sensitive or private information. This way, the consultant can work on the project without direct access to sensitive data, reducing privacy risks.

Options A and B involve granting the consultant access to the project, which may expose sensitive data, even if they have limited permissions.

Option C involves creating a service account, but it doesn't address the need to anonymize the data or provide a separate, safe environment for the consultant to work with.

Option D provides a controlled environment that allows the consultant to work effectively while maintaining data privacy.

15. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 254

Sequence
41
Discussion ID
130205
Source URL
https://www.examtopics.com/discussions/google/view/130205-exam-professional-data-engineer-topic-1-question-254/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 4:34 p.m.

Question

You are running a Dataflow streaming pipeline, with Streaming Engine and Horizontal Autoscaling enabled. You have set the maximum number of workers to 1000. The input of your pipeline is Pub/Sub messages with notifications from Cloud Storage. One of the pipeline transforms reads CSV files and emits an element for every CSV line. The job performance is low, the pipeline is using only 10 workers, and you notice that the autoscaler is not spinning up additional workers. What should you do to improve performance?

  • A. Enable Vertical Autoscaling to let the pipeline use larger workers.
  • B. Change the pipeline code, and introduce a Reshuffle step to prevent fusion.
  • C. Update the job to increase the maximum number of workers.
  • D. Use Dataflow Prime, and enable Right Fitting to increase the worker resources.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 1114145 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 01:39 Selected Answer: B Upvotes: 16

- Fusion optimization in Dataflow can lead to steps being "fused" together, which can sometimes hinder parallelization.
- Introducing a Reshuffle step can prevent fusion and force the distribution of work across more workers.
- This can be an effective way to improve parallelism and potentially trigger the autoscaler to increase the number of workers.

Comment 2

ID: 1263441 User: meh_33 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 13:02 Selected Answer: B Upvotes: 1

https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion

Comment 3

ID: 1226654 User: Lestrang Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 08 Jun 2024 11:15 Selected Answer: C Upvotes: 1

Right Fitting is for resource declaration; declaring the correct resources will not help. A Reshuffle step is what can prevent fusion, which otherwise leads to unused workers.

Comment 4

ID: 1152540 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 13:45 Selected Answer: B Upvotes: 3

Fusion occurs when multiple transformations are fused into a single stage, which can limit parallelism and hinder performance, especially in streaming pipelines. By introducing a Reshuffle step, you break fusion and allow for better parallelism.

Comment 5

ID: 1146402 User: srivastavas08 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 10 Feb 2024 16:20 Selected Answer: - Upvotes: 2

https://cloud.google.com/dataflow/docs/guides/right-fitting

Comment 6

ID: 1117080 User: GCP001 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Mon 08 Jan 2024 23:30 Selected Answer: B Upvotes: 3

The problem is performance and not using all workers properly: https://cloud.google.com/dataflow/docs/pipeline-lifecycle#fusion_optimization

Comment 7

ID: 1112897 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 16:34 Selected Answer: D Upvotes: 1

D. Use Dataflow Prime, and enable Right Fitting to increase the worker resources.

Comment 7.1

ID: 1607443 User: judy_data Badges: - Relative Date: 6 months ago Absolute Date: Tue 09 Sep 2025 08:29 Selected Answer: - Upvotes: 1

Right fitting is used in batch pipelines and not streaming https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime?hl=en#features

16. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 2

Sequence
43
Discussion ID
15911
Source URL
https://www.examtopics.com/discussions/google/view/15911-exam-professional-data-engineer-topic-1-question-2/
Posted By
RP123
Posted At
March 9, 2020, 9:40 a.m.

Question

You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

  • A. Continuously retrain the model on just the new data.
  • B. Continuously retrain the model on a combination of existing data and the new data.
  • C. Train on the existing data while using the new data as your test set.
  • D. Train on the new data while using the existing data as your test set.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 98567 User: serg3d Badges: Highly Voted Relative Date: 5 years, 9 months ago Absolute Date: Sat 30 May 2020 02:30 Selected Answer: - Upvotes: 39

I think it should be B because we have to use a combination of old and new test data as well as training data

Comment 1.1

ID: 115950 User: dambilwa Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Mon 22 Jun 2020 03:32 Selected Answer: - Upvotes: 5

Yes - The training set should be shuffled well to represent data across all scenarios

Comment 2

ID: 121871 User: jagadamba Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Sun 28 Jun 2020 14:59 Selected Answer: - Upvotes: 12

B, as we need to train the model with new data so that it keeps learning, as well as use the new data for testing.

Comment 3

ID: 1606795 User: gunnerski Badges: Most Recent Relative Date: 6 months, 1 week ago Absolute Date: Sat 06 Sep 2025 22:50 Selected Answer: B Upvotes: 1

You always need to retrain on such a combination

Comment 4

ID: 1605910 User: israndroid Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Wed 03 Sep 2025 18:18 Selected Answer: B Upvotes: 1

Answer is B:
For new scenarios, we need to use both old and new data for training, for continuous learning of patterns, etc.

Comment 5

ID: 1362311 User: Ahamada Badges: - Relative Date: 1 year ago Absolute Date: Wed 26 Feb 2025 22:35 Selected Answer: B Upvotes: 2

Answer is B; we need to take both datasets into account for training.

Comment 6

ID: 1339948 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 13 Jan 2025 16:22 Selected Answer: B Upvotes: 1

We should continuously retrain existing data with latest data to balance the need to adapt to new changes while not overriding historical knowledge. This is called Continuous Retraining where the model is periodically updated with latest data so that recommendations remain accurate over time.

Comment 7

ID: 1330890 User: jaimecalderon Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 23 Dec 2024 18:24 Selected Answer: B Upvotes: 2

Continuously retrain the model on a combination of existing data and the new data.

Comment 8

ID: 1300753 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 06:35 Selected Answer: B Upvotes: 1

From my point of view, we should take both datasets.

Comment 9

ID: 1060861 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 02 Nov 2023 21:18 Selected Answer: B Upvotes: 2

Option A is not recommended because retraining the model on just new data will cause the model to lose the information it has learned from the historical data.

Option C and D are not recommended because they are using the new data as test set and this approach will lead to a model that is overfitting and not generalize well to new users.

So answer is B

Comment 10

ID: 1057993 User: rajkinz Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 30 Oct 2023 16:50 Selected Answer: - Upvotes: 2

Answer is C. It is time-sensitive data, so the latest data should be used for testing.
Reference: https://cloud.google.com/automl-tables/docs/prepare#ml-use

Comment 11

ID: 1050460 User: rtcpost Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 22 Oct 2023 12:47 Selected Answer: B Upvotes: 1

This approach allows the model to benefit from both the historical data (existing data) and the new data, ensuring that it adapts to changing preferences while retaining knowledge from the past. By combining both types of data, the model can learn to make recommendations that are up-to-date and relevant to users' evolving preferences.

Comment 12

ID: 975503 User: Websurfer Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 08 Aug 2023 12:49 Selected Answer: B Upvotes: 1

train on old and new data

Comment 13

ID: 904813 User: AmmarFasih Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 12:26 Selected Answer: B Upvotes: 1

Option B is the right answer. Since the questions states the models needs to be updated since the clothing preference changes. Hence we need the new data to be utilized for training/ updating model.

Comment 14

ID: 835650 User: bha11111 Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 05:13 Selected Answer: B Upvotes: 1

Have verified this

Comment 15

ID: 816823 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Tue 21 Feb 2023 17:20 Selected Answer: - Upvotes: 1

There are two points: first, when to retrain; second, with what data. I think retraining should occur when the model cannot predict well, in which case a monitoring metric is needed first, but no one mentions that. Second, what data? In this case I think the answer is A, because when the model cannot predict well it means the data variance and bias have changed, so it makes no sense to combine new data with old data; the data that has not changed is not necessary anymore.

Comment 15.1

ID: 816839 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Tue 21 Feb 2023 17:28 Selected Answer: - Upvotes: 1

And the question should explain in more detail whether it's a deep learning or tree-based machine learning model, and how large the new dataset is, I think.

Comment 16

ID: 810159 User: Morock Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 02:24 Selected Answer: C Upvotes: 1

Trends keep changing, so you must mix new and old data...

Comment 17

ID: 784781 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:38 Selected Answer: B Upvotes: 7

B. Continuously retrain the model on a combination of existing data and the new data.

This approach will help to ensure that the model remains up-to-date with the latest fashion preferences of the users, while also leveraging the historical data to provide context and improve the accuracy of the recommendations. Retraining the model on a combination of existing and new data will help to prevent the model from being overly influenced by the new data and losing its ability to generalize to users with different preferences.

Option A is not recommended because retraining the model on just new data will cause the model to lose the information it has learned from the historical data.

Option C and D are not recommended because they are using the new data as test set and this approach will lead to a model that is overfitting and not generalize well to new users.
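The "combination of existing data and new data" loop can be sketched in a few lines; FrequencyRecommender here is a toy stand-in for whatever model is actually used:

```python
from collections import Counter

class FrequencyRecommender:
    """Toy stand-in for a real model: recommends the most common item."""
    def fit(self, events):
        self.counts = Counter(e["item"] for e in events)
        return self

    def recommend(self):
        return self.counts.most_common(1)[0][0]

def retrain(model, existing_data, new_data):
    # Option B: retrain on the combination, so the model adapts to new
    # preferences without forgetting the historical ones.
    return model.fit(existing_data + new_data)

existing = [{"item": "jeans"}, {"item": "jeans"}, {"item": "hat"}]
new = [{"item": "hat"}, {"item": "hat"}]
model = retrain(FrequencyRecommender(), existing, new)
```

Retraining on `new` alone would throw away the two historical "jeans" events; the combination lets recent preference shifts win while the history still counts.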

Comment 17.1

ID: 1060859 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 02 Nov 2023 21:16 Selected Answer: - Upvotes: 2

Nice explanation bro.

17. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 271

Sequence
45
Discussion ID
130427
Source URL
https://www.examtopics.com/discussions/google/view/130427-exam-professional-data-engineer-topic-1-question-271/
Posted By
raaad
Posted At
Jan. 5, 2024, 6:29 p.m.

Question

You are monitoring your organization’s data lake hosted on BigQuery. The ingestion pipelines read data from Pub/Sub and write the data into tables on BigQuery. After a new version of the ingestion pipelines is deployed, the daily stored data increased by 50%. The volumes of data in Pub/Sub remained the same and only some tables had their daily partition data size doubled. You need to investigate and fix the cause of the data increase. What should you do?

  • A. 1. Check for duplicate rows in the BigQuery tables that have the daily partition data size doubled.
    2. Schedule daily SQL jobs to deduplicate the affected tables.
    3. Share the deduplication script with the other operational teams to reuse if this occurs to other tables.
  • B. 1. Check for code errors in the deployed pipelines.
    2. Check for multiple writing to pipeline BigQuery sink.
    3. Check for errors in Cloud Logging during the day of the release of the new pipelines.
    4. If no errors, restore the BigQuery tables to their content before the last release by using time travel.
  • C. 1. Check for duplicate rows in the BigQuery tables that have the daily partition data size doubled.
    2. Check the BigQuery Audit logs to find job IDs.
    3. Use Cloud Monitoring to determine when the identified Dataflow jobs started and the pipeline code version.
    4. When more than one pipeline ingests data into a table, stop all versions except the latest one.
  • D. 1. Roll back the last deployment.
    2. Restore the BigQuery tables to their content before the last release by using time travel.
    3. Restart the Dataflow jobs and replay the messages by seeking the subscription to the timestamp of the release.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 1114700 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 18:29 Selected Answer: C Upvotes: 12

- Detailed investigation of logs and jobs: checking for duplicate rows targets the potential immediate cause of the issue.
- Checking the BigQuery Audit logs helps identify which jobs might be contributing to the increased data volume.
- Using Cloud Monitoring to correlate job starts with pipeline versions helps identify if a specific version of the pipeline is responsible.
- Managing multiple versions of pipelines ensures that only the intended version is active, addressing any versioning errors that might have occurred during deployment.


Why not B? While it addresses the symptom (excess data), it doesn't necessarily stop the problem from recurring. (The question asks to investigate and fix.)
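Step 1 of option C amounts to counting rows per business key. In BigQuery you would run the equivalent GROUP BY ... HAVING COUNT(*) > 1 query; here is the same check in plain Python, with a hypothetical event_id key field:

```python
from collections import Counter

def find_duplicate_keys(rows, key_fields=("event_id",)):
    """Return the business keys that appear more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return {key for key, n in counts.items() if n > 1}

rows = [
    {"event_id": "e1", "payload": "x"},
    {"event_id": "e1", "payload": "x"},  # ingested twice, e.g. by two pipeline versions
    {"event_id": "e2", "payload": "y"},
]
dupes = find_duplicate_keys(rows)
```

If the duplicate count roughly matches the size increase, two pipeline versions writing to the same table is a likely culprit, which is what the audit-log and Cloud Monitoring steps then confirm.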

Comment 2

ID: 1305290 User: mi_yulai Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Thu 31 Oct 2024 07:10 Selected Answer: - Upvotes: 1

Why not D?

Comment 2.1

ID: 1606452 User: judy_data Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Fri 05 Sep 2025 12:52 Selected Answer: - Upvotes: 1

Because option D doesn't investigate the root cause of the issue and doesn't even check the code.

Comment 3

ID: 1305191 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 22:30 Selected Answer: B Upvotes: 2

No idea which one to choose. Option C misses a step: restoring the tables.

Comment 3.1

ID: 1345970 User: Ryannn23 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Fri 24 Jan 2025 09:14 Selected Answer: - Upvotes: 1

" You need to investigate and fix the cause of the data increase. " - fixing the target tables was not required.

Comment 4

ID: 1121789 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:22 Selected Answer: C Upvotes: 1

Option C - agree with Raaad on the reasons

Comment 5

ID: 1120504 User: task_7 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 12 Jan 2024 08:04 Selected Answer: B Upvotes: 2

B. Check for code errors in the deployed pipelines, multiple writes to the pipeline's BigQuery sink, and errors in Cloud Logging; if necessary, restore tables using time travel.
1. Check for code errors
2. Check for multiple writes
3. Check Cloud Logging
4. Restore tables if necessary

Comment 5.1

ID: 1150129 User: RenePetersen Badges: - Relative Date: 2 years ago Absolute Date: Wed 14 Feb 2024 12:14 Selected Answer: - Upvotes: 3

This does not fix the error, it basically assumes that the error is not really there.

18. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 30

Sequence
49
Discussion ID
16655
Source URL
https://www.examtopics.com/discussions/google/view/16655-exam-professional-data-engineer-topic-1-question-30/
Posted By
jvg637
Posted At
March 15, 2020, 12:56 p.m.

Question

Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations.
The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

  • A. Add a node to the MySQL cluster and build an OLAP cube there.
  • B. Use an ETL tool to load the data from MySQL into Google BigQuery.
  • C. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
  • D. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 237801 User: HectorLeon2099 Badges: Highly Voted Relative Date: 5 years, 3 months ago Absolute Date: Tue 08 Dec 2020 03:14 Selected Answer: - Upvotes: 116

It is a GOOGLE exam. The answer won't be on-premise or OLAP cubes even if it is the easiest. The answer is B

Comment 1.1

ID: 1155144 User: Preetmehta1234 Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 02:42 Selected Answer: - Upvotes: 2

That’s so true! This should be the first logic for elimination

Comment 1.2

ID: 531431 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 24 Jan 2022 16:56 Selected Answer: - Upvotes: 11

Choose Dataproc over a Hadoop cluster; choose BigQuery over everything else.

There is no special customer requirement here that would drive us to Hadoop or Dataproc.

Comment 1.2.1

ID: 786684 User: cetanx Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 17:14 Selected Answer: - Upvotes: 2

Answer - B
mysqldump: this utility creates a logical backup, a flat file containing the SQL statements that can be run again to bring the database back to the state it was in when the file was created. So this file can easily be processed by an ETL tool and loaded into BQ.
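As a concrete sketch of that ETL step: a `mysqldump --tab` export produces tab-separated data files, which can be reshaped into newline-delimited JSON that BigQuery loads natively (the column names below are hypothetical):

```python
import csv
import io
import json

def tsv_to_ndjson(tsv_text, columns):
    """Convert a mysqldump --tab style TSV export to NDJSON for a BigQuery load."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # all values stay strings here; a real ETL step would also cast types
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in reader)

export = "1\talice\t2020-01-01\n2\tbob\t2020-01-02"
ndjson = tsv_to_ndjson(export, ["order_id", "customer", "order_date"])
```

The resulting file can then be loaded with `bq load --source_format=NEWLINE_DELIMITED_JSON`, keeping the whole analytics path off the operational MySQL cluster.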

Comment 2

ID: 68560 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Fri 27 Mar 2020 11:02 Selected Answer: - Upvotes: 42

Answer: D
Description: Easy and it won’t affect processing

Comment 2.1

ID: 258428 User: Alexej_123 Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Sun 03 Jan 2021 12:45 Selected Answer: - Upvotes: 15

I think it is B and not D:
1) There is no info regarding the data freshness required for analytics, so the nightly backup might not be enough as a source because it only provides data once a day.
2) Dataproc is recommended as the easiest way to migrate Hadoop processes, so there is no reason to use Dataproc when designing new analytics processes.
3) The solution is very limited if you extend it in the future and add new data sources or create new aggregate tables. Where should they be created?
4) There is no info on which version the on-prem MySQL database is (I am not an expert in MySQL), but I can imagine there might be compatibility issues for backup/restore between different releases.

Comment 2.1.1

ID: 459632 User: hellofrnds Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 09 Oct 2021 14:56 Selected Answer: - Upvotes: 1

" Dataproc makes open source data and analytics processing fast, easy, and more secure in the cloud ". Please refer this google link.
https://cloud.google.com/blog/products/data-analytics/genomics-data-analytics-with-cloud-pt2

Comment 2.1.1.1

ID: 470142 User: sergio6 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 30 Oct 2021 10:10 Selected Answer: - Upvotes: 1

The link is titled "Genomics analysis with Hail, BigQuery, and Dataproc"; the solution it describes uses BigQuery to do the analytics.

Comment 3

ID: 1604938 User: Bugnumber1 Badges: Most Recent Relative Date: 6 months, 1 week ago Absolute Date: Sun 31 Aug 2025 21:11 Selected Answer: B Upvotes: 1

Keyword is analysis, so Bigquery. What are you analysing with Dataproc?

Comment 4

ID: 1570311 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Mon 19 May 2025 17:56 Selected Answer: D Upvotes: 1

It points out that it shouldn't affect processing.

Comment 5

ID: 1562093 User: vosang5299 Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Sun 20 Apr 2025 03:44 Selected Answer: B Upvotes: 1

B is correct

Comment 6

ID: 1400184 User: willyunger Badges: - Relative Date: 11 months, 4 weeks ago Absolute Date: Tue 18 Mar 2025 16:03 Selected Answer: D Upvotes: 2

Option D has no impact on operations, uses backups which are already there. Option B with ETL could impact MySQL performance.

Comment 7

ID: 1346968 User: Juanesdelacruz97 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sun 26 Jan 2025 16:13 Selected Answer: - Upvotes: 1

I think it's B. Today BigQuery has multiple connectors that allow an easy connection to external data sources without impacting the database itself; even if the database were in a Cloud SQL MySQL instance, federated queries could be used. In my opinion it's B.

Comment 8

ID: 1340094 User: Augustax Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 14 Jan 2025 01:10 Selected Answer: D Upvotes: 2

Since the question mentions the nightly backup, why can't we use it? ETL reduces the impact on the source system, but there is still some impact. D doesn't add any additional impact.

Comment 9

ID: 1212705 User: mark1223jkh Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 17 May 2024 07:08 Selected Answer: - Upvotes: 1

Answer B:
I don't know why people are choosing D. It is two steps, first Cloud SQL and then Dataproc: a lot of overhead. BigQuery is just the perfect fit.

Comment 10

ID: 1166975 User: 0725f1f Badges: - Relative Date: 2 years ago Absolute Date: Wed 06 Mar 2024 08:16 Selected Answer: D Upvotes: 1

This won’t affect processing

Comment 11

ID: 1078305 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 23 Nov 2023 10:52 Selected Answer: B Upvotes: 2

Based on these considerations, option B is likely the best approach. By using an ETL tool to load data from MySQL into Google BigQuery, you're leveraging BigQuery's strengths in handling large-scale analytics workloads without impacting the performance of the operational databases. This option provides a clear separation of operational and analytical workloads and takes advantage of BigQuery's fast analytics capabilities.

Comment 12

ID: 1076376 User: axantroff Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 15:37 Selected Answer: B Upvotes: 1

Do not spend much time on in - just B

Comment 13

ID: 1064181 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 06 Nov 2023 20:07 Selected Answer: B Upvotes: 2

Answer is B - Use an ETL tool to load the data from MySQL into Google BigQuery.

* Google BigQuery is a serverless, highly scalable data warehouse that can handle large-scale analytics workloads without impacting your MySQL cluster's performance.
* Using an ETL (Extract, Transform, Load) tool to transfer data from MySQL to BigQuery allows you to maintain a separate analytics environment, ensuring that your operational database remains unaffected.

Option C (connecting an on-premises Apache Hadoop cluster to MySQL and performing ETL) introduces complexity and may not be as scalable as a cloud-based solution.

Option D (mounting backups to Google Cloud SQL and processing the data using Google Cloud Dataproc) could be an option for historical data analysis but might not be the best choice for real-time analytics while the MySQL cluster is under heavy load. Additionally, the backups need to be restored and processed, which might introduce some delay.

Comment 14

ID: 1053164 User: mk_choudhary Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 24 Oct 2023 23:02 Selected Answer: - Upvotes: 2

It's a GOOGLE exam, where choosing the GCP service should be the first preference.
Now notice the problem statement: "perform analytics with minimal impact on operations".
BigQuery is the right option for analytics, and Cloud SQL provides an easy export to GCS, where we can query from BigQuery without loading into BQ, saving storage cost.

Comment 15

ID: 1050541 User: rtcpost Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 22 Oct 2023 14:17 Selected Answer: B Upvotes: 3

B. Use an ETL tool to load the data from MySQL into Google BigQuery.
* Google BigQuery is a serverless, highly scalable data warehouse that can handle large-scale analytics workloads without impacting your MySQL cluster's performance.
* Using an ETL (Extract, Transform, Load) tool to transfer data from MySQL to BigQuery allows you to maintain a separate analytics environment, ensuring that your operational database remains unaffected.

Option C (connecting an on-premises Apache Hadoop cluster to MySQL and performing ETL) introduces complexity and may not be as scalable as a cloud-based solution.

Option D (mounting backups to Google Cloud SQL and processing the data using Google Cloud Dataproc) could be an option for historical data analysis but might not be the best choice for real-time analytics while the MySQL cluster is under heavy load. Additionally, the backups need to be restored and processed, which might introduce some delay.

Comment 16

ID: 1049205 User: melligeri Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 21 Oct 2023 03:33 Selected Answer: B Upvotes: 1

The question clearly says there is already load on MySQL, so doing analytics on it is a bad idea. Running analytics directly on MySQL is bad, but running ETL against it to load the data into BigQuery is still the better option.

Comment 17

ID: 1027196 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 10:23 Selected Answer: - Upvotes: 2

B. Use an ETL tool to load the data from MySQL into Google BigQuery. This way, analytics is entirely separated from the operational database, and BigQuery is well-suited for large-scale analytics.

19. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 54

Sequence
50
Discussion ID
16670
Source URL
https://www.examtopics.com/discussions/google/view/16670-exam-professional-data-engineer-topic-1-question-54/
Posted By
jvg637
Posted At
March 15, 2020, 4:17 p.m.

Question

Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first. What should you do?

  • A. Create a file on a shared file and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
  • B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
  • C. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
  • D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 27 comments Click to expand

Comment 1

ID: 64343 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 16:17 Selected Answer: - Upvotes: 66

I'd go with B: real-time is requested, and the only scenario for real time (in the 4 presented) is the use of pub/sub with push.

Comment 1.1

ID: 546971 User: Tanzu Badges: - Relative Date: 4 years ago Absolute Date: Mon 14 Feb 2022 09:17 Selected Answer: - Upvotes: 6

B.
- for real time, Pub/Sub push is critical; pull adds latency. (eliminates D)
- process by event-time, not by process-time (eliminates D)

Comment 1.1.1

ID: 575145 User: godot Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 25 Mar 2022 16:47 Selected Answer: - Upvotes: 1

no push available: https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#streaming-pull-migration

Comment 1.1.2

ID: 819103 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Thu 23 Feb 2023 12:14 Selected Answer: - Upvotes: 2

Dataflow is designed for real-time processing, and this case needs Dataflow because there is no way to order the data otherwise. So D is the answer, I think.

Comment 1.2

ID: 765696 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 14:35 Selected Answer: - Upvotes: 1

Agree with B

Comment 1.3

ID: 797811 User: donbigi Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 04 Feb 2023 11:15 Selected Answer: - Upvotes: 3

This approach is not ideal because it requires a custom endpoint to write the bid event information into Cloud SQL. This adds additional complexity and potential points of failure to the architecture, as well as adding latency to the processing of bid events, since the data must be written to both Pub/Sub and Cloud SQL. Additionally, it can be more challenging to ensure that bid events are processed in the order they were received, since the data is being written to multiple databases. Finally, using a single database to store bid events could limit scalability and availability, and can also result in slow query performance.

Comment 1.4

ID: 395451 User: ralf_cc Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 05:43 Selected Answer: - Upvotes: 7

Yep, Pub/Sub doesn't have FIFO yet, B is the one that keeps the right order

Comment 1.4.1

ID: 546972 User: Tanzu Badges: - Relative Date: 4 years ago Absolute Date: Mon 14 Feb 2022 09:18 Selected Answer: - Upvotes: 3

it is not a queue, and that is not an issue :)

Comment 1.4.1.1

ID: 546974 User: Tanzu Badges: - Relative Date: 4 years ago Absolute Date: Mon 14 Feb 2022 09:19 Selected Answer: - Upvotes: 3

in a distributed environment, you cannot handle this problem with a queue anyway!

Comment 2

ID: 73139 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 11 Apr 2020 02:19 Selected Answer: - Upvotes: 35

D
The need is to collate the messages in real time. We need to de-dupe the messages based on the timestamp of when the event occurred. This can be done by publishing to Pub/Sub and consuming via Dataflow.
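The de-dupe idea described here (per item, keep the event with the earliest timestamp) can be sketched in plain Python, outside of any Beam pipeline; the field names are illustrative, not from the question:

```python
def first_bid_per_item(bid_events):
    """Return, for each item, the user whose bid has the earliest
    event timestamp, regardless of the order events arrive in."""
    winners = {}  # item -> earliest bid event seen so far
    for event in bid_events:
        item, ts = event["item"], event["timestamp"]
        if item not in winners or ts < winners[item]["timestamp"]:
            winners[item] = event
    return {item: e["user"] for item, e in winners.items()}

# Events arrive out of order, as they might from Pub/Sub:
events = [
    {"item": "vase", "amount": 100, "user": "alice", "timestamp": 2.0},
    {"item": "vase", "amount": 100, "user": "bob", "timestamp": 1.5},
    {"item": "clock", "amount": 50, "user": "carol", "timestamp": 3.0},
]
print(first_bid_per_item(events))  # {'vase': 'bob', 'clock': 'carol'}
```

In a real Dataflow job the same comparison would run inside a windowed combine keyed by item; the point is that the decision uses event time, not processing (arrival) time.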

Comment 2.1

ID: 546975 User: Tanzu Badges: - Relative Date: 4 years ago Absolute Date: Mon 14 Feb 2022 09:23 Selected Answer: - Upvotes: 2

Yep, that's why B is the right one: it uses Pub/Sub push, which is more real-time than Pub/Sub pull. Be aware that at some point something has to be pulled, which adds latency.

Comment 2.2

ID: 846601 User: unnamed12355 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 03:31 Selected Answer: - Upvotes: 4

D isn't correct: Pub/Sub can deliver messages out of order, so there is no guarantee that the event with the lowest timestamp will be processed first.
B is correct

Comment 3

ID: 1603600 User: 58606d0 Badges: Most Recent Relative Date: 6 months, 2 weeks ago Absolute Date: Thu 28 Aug 2025 17:06 Selected Answer: B Upvotes: 1

Messages might arrive out of order, so the one that's processed first is not necessarily the first one which was sent.

Comment 4

ID: 1602341 User: forepick Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 16:14 Selected Answer: D Upvotes: 1

D - Classical use case for Dataflow session windows with OrderBy timestamp

Comment 5

ID: 1575308 User: theRafael7 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Fri 06 Jun 2025 13:41 Selected Answer: D Upvotes: 1

I would have chosen B because Pub/Sub push meets the real-time requirement. However, writing to Cloud SQL is what makes B wrong: Cloud SQL is usually not global and cannot handle the ordering the way Dataflow can. Therefore the answer is D.

Comment 6

ID: 1362572 User: dcruzado Badges: - Relative Date: 1 year ago Absolute Date: Thu 27 Feb 2025 16:03 Selected Answer: B Upvotes: 2

Answer is B
This sentence invalidates D
"Give the bid for each item to the user in the bid event that is processed first."

Comment 7

ID: 1337511 User: manikolbe Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 07 Jan 2025 10:48 Selected Answer: B Upvotes: 4

process by event-time, not by process-time (eliminates D)

Comment 8

ID: 1336410 User: Ronn27 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 04 Jan 2025 15:56 Selected Answer: D Upvotes: 1

Writing directly to Cloud SQL in real time can cause bottlenecks, as Cloud SQL is not designed for high-frequency, low-latency writes from multiple sources.

Answer D is right, as Dataflow and Pub/Sub have real-time capability.

Comment 9

ID: 1328890 User: DGames Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 19 Dec 2024 08:42 Selected Answer: D Upvotes: 1

The bid event time and the pull subscription are the important parts.

Comment 10

ID: 1287738 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sun 22 Sep 2024 15:40 Selected Answer: D Upvotes: 1

It feels like it depends on what's actually in the Dataflow pipeline. D, I believe, is the answer they intend, even if messages are pulled out of order.

Comment 11

ID: 1249925 User: manel_bhs Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 17 Jul 2024 21:02 Selected Answer: D Upvotes: 1

While using Cloud Pub/Sub for real-time event streaming is a good choice, pushing events to a custom endpoint that writes to Cloud SQL introduces additional complexity.
Custom endpoints need to be maintained, and the process of writing to Cloud SQL might not be as efficient as using a purpose-built data processing service.

Comment 12

ID: 1241345 User: Snnnnneee Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 12:53 Selected Answer: B Upvotes: 2

In D, the bid goes to the user whose event is processed first, not necessarily the user who bid first. That can be wrong for a global auction solution.

Comment 13

ID: 1207260 User: yassoraa88 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 06 May 2024 10:58 Selected Answer: D Upvotes: 2

This is the most suitable solution for the requirements. Google Cloud Pub/Sub can handle high throughput and low-latency data ingestion. Coupled with Google Cloud Dataflow, which can process data streams in real time, this setup allows for immediate processing of bid events. Dataflow can also handle ordering and timestamp extraction, crucial for determining which bid came first. This architecture supports scalability and real-time analytics, which are essential for a global auction system.

Comment 14

ID: 1207134 User: teka112233 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 06 May 2024 04:44 Selected Answer: D Upvotes: 3

The answer should be D, for the following reasons:
Real-time processing
Centralized processing
Winner determination
Also, B is unsuitable: while Pub/Sub can ingest data, Cloud SQL is a relational database not designed for real-time processing at this scale, and maintaining a custom endpoint adds complexity.

Comment 15

ID: 1173244 User: I__SHA1234567 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 09:37 Selected Answer: D Upvotes: 3

Google Cloud Pub/Sub is a scalable and reliable messaging service that can handle high volumes of data and deliver messages in real-time. By having each application server publish bid events to Cloud Pub/Sub, you ensure that all bid events are collected centrally.

Using Cloud Dataflow with a pull subscription allows you to process the bid events in real-time. Cloud Dataflow provides a managed service for stream and batch processing, and it can handle the real-time processing requirements efficiently.

By processing the bid events with Cloud Dataflow, you can determine which user bid first by applying the appropriate logic within your Dataflow pipeline. This approach ensures scalability, reliability, and real-time processing capabilities, making it suitable for handling bid events from multiple application servers.

Comment 16

ID: 1135962 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 17:48 Selected Answer: - Upvotes: 1

B should be the answer, because it writes the bids from the distributed servers into Cloud SQL. This way the customer knows immediately whether they got the bid or not.
Also, push requests are faster than pull requests, so they are better for a real-time experience.

Comment 17

ID: 1104557 User: arpana_naa Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 24 Dec 2023 12:13 Selected Answer: D Upvotes: 1

Pub/Sub for the entry timestamp + event time;
Dataflow for processing, and Dataflow is better for real time.

20. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 109

Sequence
53
Discussion ID
79780
Source URL
https://www.examtopics.com/discussions/google/view/79780-exam-professional-data-engineer-topic-1-question-109/
Posted By
AWSandeep
Posted At
Sept. 3, 2022, 2:09 p.m.

Question

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Cloud Logging. What are the two most likely causes of this problem? (Choose two.)

  • A. Publisher throughput quota is too small.
  • B. Total outstanding messages exceed the 10-MB maximum.
  • C. Error handling in the subscriber code is not handling run-time errors properly.
  • D. The subscriber code cannot keep up with the messages.
  • E. The subscriber code does not acknowledge the messages that it pulls.

Suggested Answer

CE

Answer Description Click to expand


Community Answer Votes

Comments 16 comments Click to expand

Comment 1

ID: 668637 User: TNT87 Badges: Highly Voted Relative Date: 3 years, 5 months ago Absolute Date: Wed 14 Sep 2022 07:45 Selected Answer: - Upvotes: 13

Answer C E
By not acknowledging a pulled message, the message gets put back in Cloud Pub/Sub, meaning the messages accumulate instead of being consumed and removed from Pub/Sub. The same thing can happen if the subscriber maintains the lease on a message it receives in case of an error. This reduces the overall rate of processing because messages get stuck on the first subscriber. Also, errors in a Cloud Function do not show up in the Stackdriver Log Viewer if they are not correctly handled.

Comment 2

ID: 1602633 User: forepick Badges: Most Recent Relative Date: 6 months, 2 weeks ago Absolute Date: Tue 26 Aug 2025 09:37 Selected Answer: CE Upvotes: 1

D won't fit here, as the problem is that TOO MANY messages are being processed. This is a common result of unacked messages.

Two common causes - C and E

Comment 3

ID: 1327041 User: clouditis Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 15 Dec 2024 21:14 Selected Answer: DE Upvotes: 1

Not C; it's talking about the unknown!

Comment 4

ID: 1302166 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 23 Oct 2024 21:11 Selected Answer: DE Upvotes: 1

The issue is that the acknowledgment is not sent back to the subscription properly. D and E should be correct.

Comment 5

ID: 1288315 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 23:18 Selected Answer: DE Upvotes: 2

D. The subscriber code cannot keep up with the messages.

If the processing rate of the subscriber (Cloud Functions) is lower than the incoming message rate, it can lead to a backlog of messages. This would result in higher-than-expected message rates, as messages accumulate while waiting to be processed.

E. The subscriber code does not acknowledge the messages that it pulls.

If messages are not acknowledged properly, Pub/Sub will keep retrying to deliver them, which can lead to the same messages being sent repeatedly. This could also contribute to the perception that the message processing rate is very high.

Both of these issues can lead to unanticipated behavior in your message processing pipeline without generating errors that would be logged in Cloud Logging.
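The redelivery effect described above can be shown with a toy model of at-least-once delivery (this is a simulation, not the Pub/Sub client API):

```python
from collections import deque

def simulate_deliveries(messages, ack_fn, max_rounds=3):
    """Toy model of at-least-once delivery: any message the subscriber
    does not acknowledge is redelivered on the next round, so observed
    delivery attempts exceed the number of published messages."""
    pending = deque(messages)
    deliveries = 0
    for _ in range(max_rounds):
        if not pending:
            break
        redeliver = deque()
        while pending:
            msg = pending.popleft()
            deliveries += 1            # every attempt counts as "processing"
            if not ack_fn(msg):        # ack never sent (e.g. error swallowed)
                redeliver.append(msg)  # queued for redelivery
        pending = redeliver
    return deliveries

print(simulate_deliveries(["m1", "m2"], lambda m: True))   # 2
print(simulate_deliveries(["m1", "m2"], lambda m: False))  # 6
```

With a subscriber that always acks, two messages yield two deliveries; one that never acks yields six delivery attempts over three rounds, which is how the observed "processing rate" can end up orders of magnitude above the publish rate.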

Comment 6

ID: 1264419 User: JamesKarianis Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 12 Aug 2024 04:14 Selected Answer: CD Upvotes: 1

The code in the Cloud Function can't keep up with the volume of messages, so C and D are a better fit.

Comment 7

ID: 881067 User: mialll Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 26 Apr 2023 05:39 Selected Answer: DE Upvotes: 1

Ref chatgpt
Option C, "Error handling in the subscriber code is not handling run-time errors properly," suggests that the subscriber code may not be correctly handling errors that occur during message processing. If the subscriber code encounters an error that it cannot handle, such as a syntax error or a network issue, it may stop processing messages, leading to a slowdown in message processing.

However, the lack of error logs in Cloud Logging suggests that there are no errors being logged, which makes it less likely that this is the primary cause of the observed behavior. Additionally, while incorrect error handling could contribute to the issue, it may not be the primary reason why the message processing rate is much higher than anticipated.

Comment 7.1

ID: 1067882 User: GCPete Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 11 Nov 2023 15:08 Selected Answer: - Upvotes: 1

Chat says about Option C: "it may stop processing messages, leading to a slowdown in message processing" - but the question doesn't say there's a slowdown. It says the rate increased.

I would replace C with D. If the Cloud Function isn't capable of processing messages as quickly as they arrive, the backlog will grow, leading to higher processing rates as the function continuously tries to catch up. This scenario might not generate errors in Cloud Logging if the function is simply falling behind.

Comment 8

ID: 834636 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Fri 10 Mar 2023 05:29 Selected Answer: CE Upvotes: 1

C - as no error is shown in Cloud Logging.
Between D & E, both could lead to the problem. I have worked with a lot of Pub/Sub issues; most of them are due to a bottleneck in the code, where one message takes too long to process and causes a backlog. E could lead to a backlog too, but it is too obvious and unlikely to happen in reality.
However, when I asked an AI the same question, it said C and E.

Comment 8.1

ID: 927507 User: cetanx Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 15:00 Selected Answer: - Upvotes: 1

C. Error handling in the subscriber (Cloud Functions) code is not handling run-time errors properly.

This would mean error logs would appear in Cloud Logging, since Cloud Functions logs there by default.

Comment 9

ID: 810712 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 14:49 Selected Answer: - Upvotes: 3

Answer D & E
I am not in favour of C; error handling is a side factor, not the primary cause.
First check the configuration and access:
Does the subscriber acknowledge messages properly? (option E)
Can the subscriber keep up with the messages (enough network, CPU, and capable code)? (option D)
Option C is just a part of option D, another sign of incapable handling.

Comment 10

ID: 781619 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 19 Jan 2023 22:23 Selected Answer: - Upvotes: 3

My question is: 'What is the actual problem?'
- That there are no logs in Cloud Logging?
- That Pub/Sub is having a problem?
- Or is there an actual problem?
- Is there an actual error?

So the Pub/Sub message processing rate is high... does that mean there is a problem?

Thoughts?

Comment 10.1

ID: 985466 User: squishy_fishy Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 20 Aug 2023 01:18 Selected Answer: - Upvotes: 1

As TNT87 mentioned, the high message processing rate means "the messages accumulate instead of being consumed and removed from Pub/Sub."

Comment 11

ID: 762275 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 30 Dec 2022 20:52 Selected Answer: - Upvotes: 1

C, E seems correct

Comment 12

ID: 666534 User: MounicaN Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 03:48 Selected Answer: - Upvotes: 3

D might also be right?
Subscriber might not be provisioned enough

Comment 13

ID: 658419 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 14:09 Selected Answer: CE Upvotes: 3

C. Error handling in the subscriber code is not handling run-time errors properly.
E. The subscriber code does not acknowledge the messages that it pulls.

21. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 87

Sequence
58
Discussion ID
16572
Source URL
https://www.examtopics.com/discussions/google/view/16572-exam-professional-data-engineer-topic-1-question-87/
Posted By
madhu1171
Posted At
March 14, 2020, 2:44 p.m.

Question

You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

  • A. Increase the size of your parquet files to ensure them to be 1 GB minimum.
  • B. Switch to TFRecords formats (appr. 200MB per file) instead of parquet files.
  • C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
  • D. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 30 comments Click to expand

Comment 1

ID: 65087 User: rickywck Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 09:14 Selected Answer: - Upvotes: 70

Should be A:

https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files
https://www.dremio.com/tuning-parquet/

C & D will improve performance but need to pay more $$

Comment 1.1

ID: 458353 User: diluvio Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Wed 06 Oct 2021 18:16 Selected Answer: - Upvotes: 5

It is A. Please read the links above.

Comment 1.2

ID: 737845 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 13:26 Selected Answer: - Upvotes: 1

https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance

Comment 1.3

ID: 416201 User: raf2121 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Wed 28 Jul 2021 14:55 Selected Answer: - Upvotes: 8

Point for discussion - Another reason why it can't be C or D.
SSDs are not available on preemptible worker nodes (the answers didn't say whether they wanted to switch from HDD to SSD for the master nodes).
https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs

Comment 1.3.1

ID: 624757 User: rr4444 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 17:46 Selected Answer: - Upvotes: 1

You can have local SSDs for the dataproc normal or preemptible VMs https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-pd-ssd

Comment 1.3.2

ID: 416203 User: raf2121 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Wed 28 Jul 2021 15:01 Selected Answer: - Upvotes: 1

Also, for shuffling operations, one needs to override the preemptible VMs' configuration to increase the boot disk size.
(The second half of answer D is correct, but the first half is wrong.)

Comment 1.4

ID: 825629 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Wed 01 Mar 2023 09:05 Selected Answer: - Upvotes: 3

You are right, C & D will cost more $. The point of this question is shuffling, I think: to reduce shuffling overhead between tasks, make the file size larger.

Comment 2

ID: 63867 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sat 14 Mar 2020 14:44 Selected Answer: - Upvotes: 12

Answer should be D

Comment 2.1

ID: 66249 User: jvg637 Badges: - Relative Date: 5 years, 11 months ago Absolute Date: Fri 20 Mar 2020 12:18 Selected Answer: - Upvotes: 15

D: By default, preemptible node disk sizes are limited to 100 GB or the size of the non-preemptible node disks, whichever is smaller; however, you can override the default preemptible disk size to any requested size. Since the majority of the cluster uses preemptible nodes, the disk used for caching operations will see a noticeable performance improvement with a larger disk. Also, SSDs perform better than HDDs. This increases costs slightly, but is the best option available while keeping costs contained.
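A hedged sketch of what that override might look like with the gcloud CLI; the flag names follow the current `gcloud dataproc clusters create` surface (older releases used `--preemptible-worker-boot-disk-size`) and should be checked against the reference docs, and the cluster/region values are placeholders:

```shell
gcloud dataproc clusters create shuffle-heavy-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=10 \
    --master-boot-disk-type=pd-ssd \
    --worker-boot-disk-type=pd-ssd \
    --secondary-worker-boot-disk-size=500GB  # override the 100 GB default
```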

Comment 2.1.1

ID: 115455 User: ch3n6 Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Sun 21 Jun 2020 12:13 Selected Answer: - Upvotes: 3

C is correct; D is wrong. They are using Dataproc and GCS, so the boot disk is not relevant at all.

Comment 2.1.1.1

ID: 139639 User: VishalB Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Mon 20 Jul 2020 16:52 Selected Answer: - Upvotes: 1

C is recommended only when you have many small files: copy the files to local HDFS for processing, then copy the results back.

Comment 2.1.1.1.1

ID: 159104 User: FARR Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Sun 16 Aug 2020 10:53 Selected Answer: - Upvotes: 3

File sizes are already within the expected range for GCS (128MB-1GB) so not C.
D seems most feasible

Comment 3

ID: 1571241 User: AdriHubert Badges: Most Recent Relative Date: 9 months, 3 weeks ago Absolute Date: Thu 22 May 2025 09:12 Selected Answer: A Upvotes: 1

Why this is the better choice:
Larger Parquet files reduce the number of splits and metadata overhead.
This leads to fewer tasks, less shuffle, and better parallelism.
It does not increase infrastructure cost—just improves how data is structured.
It aligns with Google’s best practices for Spark on Dataproc.
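A quick back-of-the-envelope for the file-size point: given a total input size, compute how many output files to aim for so each lands near 1 GB (the input numbers below are hypothetical); in Spark you would then coalesce to that count before writing Parquet:

```python
import math

def target_file_count(total_bytes, target_file_bytes=1 << 30):
    """Number of output files so each is roughly target_file_bytes (~1 GB)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g. 300 Parquet files of ~300 MB each:
total = 300 * 300 * 1024 * 1024
print(target_file_count(total))  # 88
```

In PySpark this would translate to something like `df.coalesce(88).write.parquet(path)`, trading many small tasks for fewer, larger, better-packed files.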

Comment 4

ID: 1561782 User: rajshiv Badges: - Relative Date: 10 months, 4 weeks ago Absolute Date: Sat 19 Apr 2025 01:43 Selected Answer: C Upvotes: 1

C is correct. It cannot be D as increasing boot disk size does not impact shuffle performance much. We need local SSDs specifically attached for shuffle storage (temporary fast storage), not just a bigger persistent boot disk.
C is the correct answer because SSD + HDFS shuffle layer = fastest for Spark shuffle-heavy jobs, while still using preemptibles and keeping costs down. The job as mentioned is shuffle-intensive. In Spark, shuffling (moving data between nodes) is heavily disk I/O intensive. Faster local storage (i.e., SSDs) can dramatically speed up shuffle operations compared to using standard HDDs. GCS is great for object storage, not shuffle storage.

Comment 5

ID: 1561781 User: rajshiv Badges: - Relative Date: 10 months, 4 weeks ago Absolute Date: Sat 19 Apr 2025 01:40 Selected Answer: C Upvotes: 1

C is the correct answer

Comment 6

ID: 1400883 User: oussama7 Badges: - Relative Date: 11 months, 4 weeks ago Absolute Date: Thu 20 Mar 2025 02:29 Selected Answer: C Upvotes: 2

Improves shuffle management by using HDFS instead of GCS.
SSDs speed up access to temporary data.
Compatible with Dataproc's preemptible cost model, without requiring more non-preemptible workers.

Comment 7

ID: 1398911 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 16:23 Selected Answer: C Upvotes: 2

Preemptible Cost Considerations

Using preemptibles (with 2 non-preemptible workers) is cost-effective, but shuffle operations still need fast local storage.
SSDs improve reliability without increasing instance costs significantly

Comment 8

ID: 1337485 User: f74ca0c Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 07 Jan 2025 09:03 Selected Answer: A Upvotes: 1

A.
Not D, because it doesn't make sense to move to SSDs when cost-sensitive.

Comment 8.1

ID: 1602422 User: forepick Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 20:22 Selected Answer: - Upvotes: 1

Maybe that's the reason they mentioned that only TWO servers are not spot VMs. Spot VMs are stateless, while the others are stateful, i.e. have disks for shuffling.

Comment 9

ID: 1306439 User: Javakidson Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Sun 03 Nov 2024 07:00 Selected Answer: - Upvotes: 1

A is the answer

Comment 10

ID: 1302126 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 23 Oct 2024 18:39 Selected Answer: A Upvotes: 1

I think either A or C. The problem is caused by I/O performance. Option A is feasible: it reduces the number of files, leading to better parallel processing. Option C tries to address the I/O performance issue directly.
Taking into account other factors, like budget and the lack of any mention of HDD/SSD, option A is probably the correct answer.

Comment 11

ID: 1288158 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 16:23 Selected Answer: A Upvotes: 1

There's no mention of a drive type used, only GCS. That means A is the only sensible option.

Comment 12

ID: 1255092 User: 987af6b Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 18:49 Selected Answer: A Upvotes: 2

The question doesn't actually say they are using HDDs in the scenario; for that reason I choose A.

Comment 13

ID: 1145420 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 09 Feb 2024 12:49 Selected Answer: - Upvotes: 2

A
We don't know if HDDs were used, so we can't act on that; but we know that the Parquet files are small and numerous, and we can act on that by increasing their size to reduce their number.

Comment 14

ID: 1086069 User: rocky48 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 02 Dec 2023 11:43 Selected Answer: A Upvotes: 1

Should be A:
https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files

Comment 14.1

ID: 1087630 User: rocky48 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 13:32 Selected Answer: - Upvotes: 1

Given the scenario and the cost-sensitive nature of your organization, the best option would be:

C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job, and copy results back to GCS.

Option C allows you to leverage the benefits of SSDs and HDFS while minimizing costs by continuing to use Dataproc on preemptible VMs. This approach optimizes both performance and cost-effectiveness for your analytical workload on Google Cloud.

Comment 15

ID: 960306 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 11:51 Selected Answer: A Upvotes: 1

https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files

Cost effective is the key in the question.

Comment 16

ID: 847057 User: Nandhu95 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 13:35 Selected Answer: D Upvotes: 1

Preemptible VMs can't be used for HDFS storage.
As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads.

Comment 17

ID: 826584 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 08:15 Selected Answer: D Upvotes: 1

Should NOT be A because:
1. The file size is already in the optimal range.
2. If the current file size works well in the existing Hadoop cluster, similar performance is expected in Dataproc.

The only difference between the current setup and Dataproc is that Dataproc uses preemptible nodes. So yes, SSDs may incur a bit more cost, but since the preemptibles already save most of it, trading a little of that saving for performance is worthwhile.

Comment 17.1

ID: 961827 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 18:43 Selected Answer: - Upvotes: 1

Optimal size is 1GB

22. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 255

Sequence
63
Discussion ID
130206
Source URL
https://www.examtopics.com/discussions/google/view/130206-exam-professional-data-engineer-topic-1-question-255/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 4:38 p.m.

Question

You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

  • A. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
  • B. Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink to the Pub/Sub topic.
  • C. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.
  • D. Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 4 comments Click to expand

Comment 1

ID: 1114150 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 00:49 Selected Answer: D Upvotes: 11

- Datastream is a serverless and easy-to-use change data capture (CDC) and replication service.
- You would create a Datastream service that sources from your Oracle database and targets BigQuery, with private connectivity configuration to the same VPC.
- This option is designed to minimize the need to manage infrastructure and is a fully managed service.

Comment 2

ID: 1598528 User: Zek Badges: Most Recent Relative Date: 6 months, 4 weeks ago Absolute Date: Sat 16 Aug 2025 15:17 Selected Answer: D Upvotes: 1

https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/oracle-to-bigquery

This tutorial shows you how to deploy a job that continuously replicates changed data from an Oracle database to a BigQuery dataset, using Cloud Data Fusion Replication. This feature is powered by Datastream.

Comment 3

ID: 1154489 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 04:36 Selected Answer: D Upvotes: 2

D. Datastream

Comment 4

ID: 1112898 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 15:38 Selected Answer: D Upvotes: 2

D. Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.

23. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 57

Sequence
65
Discussion ID
16673
Source URL
https://www.examtopics.com/discussions/google/view/16673-exam-professional-data-engineer-topic-1-question-57/
Posted By
jvg637
Posted At
March 15, 2020, 5:05 p.m.

Question

Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub streaming data, one of the important business requirements is to be able to periodically identify the inputs and their timings during their campaign. Engineers have decided to use windowing and transformation in Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud Dataflow job fails for the all streaming insert. What is the most likely cause of this problem?

  • A. They have not assigned the timestamp, which causes the job to fail
  • B. They have not set the triggers to accommodate the data coming in late, which causes the job to fail
  • C. They have not applied a global windowing function, which causes the job to fail when the pipeline is created
  • D. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 24 comments Click to expand

Comment 1

ID: 68712 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 28 Mar 2020 03:42 Selected Answer: - Upvotes: 67

Answer: D
Description: Caution: Beam's default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:
- Set a non-global windowing function. See Setting your PCollection's windowing function.
- Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.
If you don't set a non-global windowing function or a non-default trigger for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your job will fail.
So it looks like D
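The quoted Beam behavior can be illustrated with a pure-Python sketch of what a non-global (tumbling) window does conceptually. This is an illustration of the idea, not Beam API code; the click events and the 60-second window size are made up:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed (tumbling) windows.

    This mimics what Beam's FixedWindows plus a grouping transform does
    conceptually: each element is assigned to the window containing its
    event timestamp, so aggregation can complete per window instead of
    waiting forever on the single global window of an unbounded stream.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_size)  # e.g. 60s tumbling windows
        windows[window_start].append(value)
    return dict(windows)

# Hypothetical click events: (epoch seconds, click id)
events = [(10, "a"), (65, "b"), (70, "c"), (130, "d")]
print(assign_fixed_windows(events, 60))
# {0: ['a'], 60: ['b', 'c'], 120: ['d']}
```

With the global (default) window there is only one bucket for the entire unbounded stream, which is why Beam rejects a bare GroupByKey on it at pipeline construction time.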

Comment 1.1

ID: 784881 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 03:39 Selected Answer: - Upvotes: 1

Why not C?
Because I think that the most likely cause of the problem is C. They have not applied a global windowing function, which causes the job to fail when the pipeline is created.

In Dataflow, windowing is used to divide the input data into smaller time intervals, called windows. Without a windowing function, all the data may be treated as part of the same window and the pipeline may not be able to process the data correctly. In this specific scenario, the engineers are trying to use windowing and transformation in Google Cloud Dataflow to periodically identify the inputs and their timings during the campaign, so it's likely that they need to use a windowing function to divide the data into smaller time intervals in order to process it correctly. Not applying a windowing function, or applying the wrong one can cause the job to fail.

Someone Clarify? Am I missing an important point?

Comment 1.1.1

ID: 959526 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 15:01 Selected Answer: - Upvotes: 4

You are missing that the global window is the default window, typically used for batch processing. The global window by default waits until all data is available before processing it, so to use it with streaming you need to set a custom trigger so that the pipeline doesn't wait indefinitely before aggregating. All in all, C doesn't sound right.

https://www.youtube.com/watch?v=oJ-LueBvOcM
https://www.youtube.com/watch?v=MuFA6CSti6M

Comment 2

ID: 64363 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 17:05 Selected Answer: - Upvotes: 15

Global windowing is the default behavior, so I don't think C is right.
An error can occur if a non-global window or a non-default trigger is not set.
I would say D.
(https://beam.apache.org/documentation/programming-guide/#windowing)

Comment 3

ID: 1585821 User: imrane1995 Badges: Most Recent Relative Date: 8 months ago Absolute Date: Sat 12 Jul 2025 14:24 Selected Answer: A Upvotes: 1

Timestamps are required for streaming + windowing. Missing timestamps can crash the job.

Comment 4

ID: 1398894 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 15:50 Selected Answer: A Upvotes: 2

Google Cloud Dataflow requires event timestamps when using windowing in streaming mode.
By default, Pub/Sub messages do not have timestamps; they need to be assigned manually using withTimestampFn()
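The timestamp-assignment idea from the comment above can be sketched in plain Python. The message shape and the eventTimestamp attribute name are assumptions for illustration; in an actual Beam pipeline this selection is typically done with a timestamp-attribute option on the Pub/Sub read:

```python
def event_time(message):
    """Pick the event-time timestamp (epoch seconds) for a message.

    Prefers an explicit 'eventTimestamp' attribute set by the producer,
    falling back to the broker's publish time. This mirrors what a
    timestamp-attribute setting does for a Pub/Sub source: without some
    such assignment, windowing has no event time to work with.
    """
    attrs = message.get("attributes", {})
    if "eventTimestamp" in attrs:
        return int(attrs["eventTimestamp"])
    return int(message["publishTime"])

msg = {
    "data": b"click",
    "attributes": {"eventTimestamp": "1700000001"},
    "publishTime": "1700000002",
}
print(event_time(msg))  # 1700000001
```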

Comment 5

ID: 1345294 User: Yad_datatonic Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 23 Jan 2025 11:38 Selected Answer: B Upvotes: 1

The job fails because triggers are not set to handle late-arriving data, causing the pipeline to mishandle or drop delayed records.

Comment 6

ID: 1331730 User: Rav761 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 26 Dec 2024 00:43 Selected Answer: A Upvotes: 2

A. They have not assigned the timestamp, which causes the job to fail

Analysis: Cloud Dataflow relies on timestamps to perform windowing operations. Without proper event-time timestamps, windowing cannot be applied correctly, and the job may fail or behave unpredictably. This is a common issue when processing streaming data from Google Cloud Pub/Sub, as timestamps must be explicitly assigned if not already embedded in the data.
This is the most likely cause.

Comment 7

ID: 1306749 User: Erg_de Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 04 Nov 2024 04:03 Selected Answer: A Upvotes: 3

This option is very likely, as without timestamps assigned to streaming data, the system cannot properly process time windows. Timestamps are crucial for the correct time handling in Dataflow pipelines

Comment 8

ID: 1214365 User: 39405bb Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 15:50 Selected Answer: - Upvotes: 6

The most likely cause of this problem is A. They have not assigned the timestamp, which causes the job to fail.

Here's why:

Importance of Timestamps in Windowing: Windowing in Dataflow relies on timestamps to group elements into windows. If timestamps are not explicitly assigned or extracted from the data, Dataflow cannot determine which elements belong to which windows, leading to failures in the job.
Let's look at the other options:

B. They have not set the triggers to accommodate the data coming in late: While triggers are important for managing late data, not setting them would not cause the job to fail for all streaming inserts. It might affect the accuracy of the results, but the job would still run.
C & D. Global vs. Non-global Windowing: The choice between global and non-global windowing depends on the specific requirements of the analysis. While incorrect windowing choices can lead to unexpected results, they would not typically cause the job to fail completely.

Comment 9

ID: 1136760 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 31 Jan 2024 14:05 Selected Answer: - Upvotes: 1

D
You have to apply a non-global windowing function because the global windowing function is a default windowing function for every pub/sub stream or batch data.

Comment 10

ID: 1017059 User: MikkelRev Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 25 Sep 2023 19:27 Selected Answer: - Upvotes: 2

option B: They have not set the triggers to accommodate the data coming in late, which causes the job to fail.

In a streaming data processing pipeline, it's common to encounter data that arrives late, meaning it arrives after the event time has passed for the associated window. If you don't handle late data appropriately by setting triggers, it can cause issues in your pipeline, including job failures.

Comment 11

ID: 879912 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 05:58 Selected Answer: A Upvotes: 2

gpt: The most likely cause of the problem is A, that they have not assigned the timestamp.
In streaming data processing, timestamps are essential for proper windowing and triggering of data. Without timestamps, the system cannot correctly determine which window a particular piece of data belongs to, or when it is safe to trigger processing of a window. If the engineers did not assign timestamps to the data, the Cloud Dataflow job would not be able to process the data correctly, and it would fail.

Option B, not setting triggers to accommodate late data, is also an important consideration for streaming data processing. However, it is less likely to cause the job to fail outright than missing timestamps.

Option C, not applying a global windowing function, and Option D, not applying a non-global windowing function, are also important considerations for windowing in Cloud Dataflow. However, neither of these would cause the job to fail when the pipeline is created. Instead, they would affect the performance and correctness of the data processing.

Comment 11.1

ID: 880059 User: muhusman Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 09:01 Selected Answer: - Upvotes: 4

Without a correct timestamp, the pipeline still runs fine with the default timestamp. The result may be incorrect, but the job will not fail.

Comment 11.1.1

ID: 882514 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 27 Apr 2023 12:28 Selected Answer: - Upvotes: 2

okay, so D maybe

Comment 11.1.2

ID: 882521 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 27 Apr 2023 12:35 Selected Answer: - Upvotes: 1

gpt pt2: For example, if your use case requires you to calculate a running average of values over a fixed time interval, you would likely use a non-global windowing function with a fixed time interval. On the other hand, if you need to perform a computation on the entire stream of data at once, a global windowing function might be more appropriate.

So, the choice of windowing function should be based on the specific requirements of the data processing task at hand, and it may or may not be important to apply a non-global windowing function when the pipeline is created.
------
The question says we need to identify streaming inputs and their timings, so a non-global window is needed. Let it be D...

Comment 11.1.3

ID: 882520 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 27 Apr 2023 12:35 Selected Answer: - Upvotes: 1

- is it important to apply a non-global windowing function when the pipeline is created?
GPT: It is important to choose the appropriate windowing function for your data processing needs, but whether it should be a global or non-global windowing function depends on the requirements of your specific use case.
A global windowing function considers all data elements within a bounded time interval as a single window, whereas a non-global windowing function divides the data stream into smaller windows based on specified criteria (such as a fixed time interval or a number of elements).

Comment 11.2

ID: 1073610 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 17 Nov 2023 21:15 Selected Answer: - Upvotes: 1

Which is the moment I decided that AI was nothing to fear

Comment 12

ID: 843506 User: lucaluca1982 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 19 Mar 2023 08:01 Selected Answer: - Upvotes: 2

what about A? This can cause the job to fail

Comment 13

ID: 823400 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Mon 27 Feb 2023 09:32 Selected Answer: D Upvotes: 1

A: note that without a correct timestamp, the pipeline still runs fine with the default timestamp. The result may be incorrect, but the job will not fail.
D: For an unbounded collection, this will fail if any aggregation function is applied.

Comment 14

ID: 808334 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Tue 14 Feb 2023 12:55 Selected Answer: - Upvotes: 2

Answer: A
All streaming inserts failed because no timestamp was added; otherwise there is already a default global windowing function, and the job can execute without assigning any windowing function.
I mean, first there should be a timestamp in the data; then, depending on the desired aggregation, either full-time (global) or batch/chunk-time (non-global) aggregation is performed.

Comment 15

ID: 747156 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:15 Selected Answer: D Upvotes: 1

https://beam.apache.org/documentation/programming-guide/#windowing
Beam’s default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:

Set a non-global windowing function. See Setting your PCollection’s windowing function.
Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.

Comment 16

ID: 669980 User: Ray0506 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 15 Sep 2022 15:42 Selected Answer: D Upvotes: 1

Answer is D

Comment 17

ID: 665287 User: TOXICcharlie Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 10 Sep 2022 11:55 Selected Answer: D Upvotes: 2

Correct answer is D. C does not make sense because for unbounded source like Pub/Sub, the global functions are applied by default. The reason for failure would be they are using specific aggregations that require non-global window functions, e.g. tumbling or hopping windows.

24. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 176

Sequence
74
Discussion ID
79501
Source URL
https://www.examtopics.com/discussions/google/view/79501-exam-professional-data-engineer-topic-1-question-176/
Posted By
PhuocT
Posted At
Sept. 2, 2022, 7:15 p.m.

Question

You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons. What should you do?

  • A. Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
  • B. Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
  • C. Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
  • D. Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 7 comments Click to expand

Comment 1

ID: 657711 User: AWSandeep Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 21:25 Selected Answer: C Upvotes: 9

C. Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.

You can't filter out data using BQ load commands. You must embed the logic to filter out data (i.e. time ranges) in another, decoupled way (i.e. Dataflow, Cloud Functions, etc.). Therefore, A and B add additional complexity and deviate from the data lake design paradigm. D is wrong, as the question strictly implies that the existing dataset needs to be retained for compliance.

Comment 2

ID: 982970 User: FP77 Badges: Highly Voted Relative Date: 2 years ago Absolute Date: Fri 16 Feb 2024 23:04 Selected Answer: - Upvotes: 6

Strange answers... Since when does Cloud Storage have datasets? Lol
Keeping this in mind, the answer must be C, but none is really correct.

Comment 3

ID: 1580992 User: Ben_oso Badges: Most Recent Relative Date: 8 months, 2 weeks ago Absolute Date: Fri 27 Jun 2025 05:05 Selected Answer: C Upvotes: 1

With C the user doesn't see the data with errors; it's all clean.

Comment 4

ID: 1147996 User: ea2023 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 12 Aug 2024 10:53 Selected Answer: - Upvotes: 1

Why not D, if versioning is activated when creating the bucket?

Comment 5

ID: 1101928 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 20:43 Selected Answer: C Upvotes: 3

Option C is the best approach in this situation. Here is why:

Option A would remove data which may be needed for compliance reasons. Keeping the original data is preferred.
Option B makes a copy of the data but still removes potentially useful records. Additional storage costs would be incurred as well.
Option C uses Dataflow to clean the data by setting out of range values while keeping the original data intact. The fixed records are written to a new location for further analysis. This meets the requirements.
Option D writes the fixed data back to the original location, overwriting the original data. This would violate the compliance needs to keep the original data untouched.
So option C leverages Dataflow to properly clean the data while preserving the original data for compliance, at reasonable operational costs. This best achieves the stated requirements.
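The option C transform described above can be sketched as a plain-Python function. The field name, expected range, and default value are invented for illustration; in a real pipeline this logic would live inside a Beam DoFn reading from and writing to Cloud Storage:

```python
def normalize_record(record, field="value", low=0.0, high=100.0, default=0.0):
    """Return a new record with out-of-range values replaced by a default.

    The input record is never mutated, matching the requirement to keep
    the original data intact and write cleaned output to a new location.
    """
    fixed = dict(record)  # copy, so the original stays untouched
    v = fixed.get(field)
    if v is None or not (low <= v <= high):
        fixed[field] = default
    return fixed

raw = [{"value": 42.0}, {"value": 250.0}, {"value": -3.0}]
clean = [normalize_record(r) for r in raw]
print(clean)
# [{'value': 42.0}, {'value': 0.0}, {'value': 0.0}]
```

Because the function returns a copy, the raw records (the compliance copy) are preserved and the job can be re-run at any time.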

Comment 6

ID: 763270 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 01 Jul 2023 17:04 Selected Answer: - Upvotes: 2

C. Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.

Comment 7

ID: 657641 User: PhuocT Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 20:15 Selected Answer: C Upvotes: 2

C is correct

25. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 216

Sequence
75
Discussion ID
129863
Source URL
https://www.examtopics.com/discussions/google/view/129863-exam-professional-data-engineer-topic-1-question-216/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:41 a.m.

Question

You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbcIO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running in Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to connection failure. You verified that VPC Service Controls and shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?

  • A. Set up VPC Network Peering between Project A and Project B. Add a firewall rule to allow the peered subnet range to access all instances on the network.
  • B. Turn off the external IP addresses on the Dataflow worker. Enable Cloud NAT in Project A.
  • C. Add the external IP addresses of the Dataflow worker as authorized networks in the Cloud SQL instance.
  • D. Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 1268972 User: aoifneofi_ef Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 02:13 Selected Answer: D Upvotes: 7

It is a tie between A and D.
Option A will definitely provide necessary connectivity but is less secure as access is enabled to "all instances". Which i feel is unnecessary considering industry best practices.

Option D provides the necessary connectivity but brings in the unnecessary overhead of managing an extra VM and introduces a bit of extra complexity.

Since the question emphasizes that data must not go through the public internet (which both options satisfy), I would prioritize security over simplicity and choose option D in this case.

Comment 2

ID: 1326651 User: clouditis Badges: Highly Voted Relative Date: 1 year, 2 months ago Absolute Date: Sun 15 Dec 2024 02:41 Selected Answer: A Upvotes: 5

A looks to be the best of the 4. D is complicated: involving Compute Engine is unnecessary and makes addressing the problem cumbersome.

Comment 3

ID: 1581013 User: Ben_oso Badges: Most Recent Relative Date: 8 months, 2 weeks ago Absolute Date: Fri 27 Jun 2025 07:45 Selected Answer: A Upvotes: 1

VPC Peering allows two private networks to communicate with each other.

Comment 4

ID: 1345058 User: 71083a7 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 23 Jan 2025 03:44 Selected Answer: D Upvotes: 2

"all instances" freaks me out

Comment 5

ID: 1324451 User: julydev82 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Tue 10 Dec 2024 09:48 Selected Answer: A Upvotes: 2

VPC peering creates a route between Cloud SQL and the Dataflow workers.
https://cloud.google.com/sql/docs/mysql/private-ip#multiple_vpc_connectivity

Comment 6

ID: 1283781 User: fadlkhafdofpew Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 14 Sep 2024 23:30 Selected Answer: A Upvotes: 1

The answer is A. While D might work, it adds unnecessary complexity. Setting up a proxy is an extra layer of infrastructure that isn’t required

Comment 7

ID: 1267377 User: Saaaurabh Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 02:21 Selected Answer: A Upvotes: 1

If properly implemented with the right routing and firewall rules, Option A can be the correct and most straightforward solution, as it leverages VPC Peering to maintain internal traffic without going through the public internet.

Comment 8

ID: 1263553 User: meh_33 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 16:43 Selected Answer: - Upvotes: 1

A is correct

Comment 9

ID: 1246293 User: kk1211 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 21:48 Selected Answer: - Upvotes: 1

still confused between A and D

Comment 10

ID: 1243085 User: Lenifia Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 01:45 Selected Answer: A Upvotes: 2

A is correct

Comment 11

ID: 1240731 User: kajitsu Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 02 Jul 2024 13:31 Selected Answer: A Upvotes: 1

no proxy needed

Comment 12

ID: 1226605 User: Lestrang Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 08 Jun 2024 09:14 Selected Answer: A Upvotes: 1

People referencing "VPC Network Peering does not provide transitive routing. For example, if VPC networks net-a and net-b are connected using VPC Network Peering, and VPC networks net-a and net-c are also connected using VPC Network Peering, VPC Network Peering does not provide connectivity between net-b and net-c."

The question states that Cloud SQL is running in project B.
That means the instance is already part of the VPC in project B, so with Network Peering, workers from A can definitely access data in B. No proxy is needed.

Comment 13

ID: 1217262 User: fabiogoma Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 09:01 Selected Answer: A Upvotes: 2

Why so many people are voting for D? There's no need for a proxy, the peering is enough to allow network traffic between subnets.

Comment 13.1

ID: 1217270 User: fabiogoma Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 09:14 Selected Answer: - Upvotes: 2

Now I see why, I put this on ChatGPT and it thinks the right answer is D. I'm pretty sure that's a hallucination.

Comment 13.1.1

ID: 1576514 User: Positron75 Badges: - Relative Date: 9 months ago Absolute Date: Wed 11 Jun 2025 12:30 Selected Answer: - Upvotes: 1

A says to allow access to "all instances in the network", which is excessive and not good practice security-wise.

This is yet another question where no answer is fully correct and it's up to you to choose which one is less wrong.

Comment 14

ID: 1214840 User: ccpmad Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 11:38 Selected Answer: - Upvotes: 2

Proxy? no, it is not necessary..

A

Comment 15

ID: 1213420 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 20:12 Selected Answer: D Upvotes: 2

https://cloud.google.com/sql/docs/mysql/connect-multiple-vpcs

Comment 16

ID: 1160007 User: chrissamharris Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:05 Selected Answer: A Upvotes: 4

A - The requirement for a proxy is un-necessary:
https://cloud.google.com/sql/docs/mysql/private-ip#multiple_vpc_connectivity

Comment 17

ID: 1153955 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Mon 19 Feb 2024 14:50 Selected Answer: - Upvotes: 2

Option D. Source: https://cloud.google.com/sql/docs/mysql/private-ip#multiple_vpc_connectivity

26. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 289

Sequence
78
Discussion ID
130292
Source URL
https://www.examtopics.com/discussions/google/view/130292-exam-professional-data-engineer-topic-1-question-289/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 11:12 a.m.

Question

You have data located in BigQuery that is used to generate reports for your company. You have noticed some weekly executive report fields do not correspond to format according to company standards. For example, report errors include different telephone formats and different country code identifiers. This is a frequent issue, so you need to create a recurring job to normalize the data. You want a quick solution that requires no coding. What should you do?

  • A. Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.
  • B. Use Dataflow SQL to create a job that normalizes the data, and that after the first run of the job, schedule the pipeline to execute recurrently.
  • C. Create a Spark job and submit it to Dataproc Serverless.
  • D. Use BigQuery and GoogleSQL to normalize the data, and schedule recurring queries in BigQuery.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 15 comments Click to expand

Comment 1

ID: 1121903 User: Matt_108 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 18:23 Selected Answer: A Upvotes: 8

Definitely A, cloud data fusion and wrangler to setup the clean up pipeline with no coding required

Comment 2

ID: 1571661 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 3 weeks ago Absolute Date: Fri 23 May 2025 18:47 Selected Answer: A Upvotes: 1

A. Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.

Comment 3

ID: 1337358 User: marlon.andrei Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 06 Jan 2025 21:48 Selected Answer: D Upvotes: 2

The question says "You want a quick solution that requires no coding." The data is already in BQ, so the easiest approach is to normalize the data there and schedule recurring queries in BigQuery.

Comment 3.1

ID: 1574218 User: 22c1725 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Mon 02 Jun 2025 15:49 Selected Answer: - Upvotes: 2

No code, SQL sure is coding.

Comment 4

ID: 1252611 User: 987af6b Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 21 Jul 2024 18:59 Selected Answer: A Upvotes: 3

A. Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.

Explanation
No Coding Required: Cloud Data Fusion's Wrangler offers a no-code interface for data transformation tasks. You can visually design data normalization workflows without writing any code.
Recurring Jobs: Cloud Data Fusion allows you to schedule these data normalization tasks to run on a recurring basis, meeting your need for automation.

Comment 5

ID: 1249158 User: carmltekai Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 16 Jul 2024 20:41 Selected Answer: D Upvotes: 2

The best solution here is D. Use BigQuery and GoogleSQL to normalize the data, and schedule recurring queries in BigQuery.

Here's why:

* No-code solution: BigQuery's built-in capabilities and GoogleSQL offer a no-code way to transform and standardize data. You can leverage functions like REGEXP_REPLACE to normalize phone numbers and FORMAT to ensure consistent formatting across fields.
* Recurring jobs: BigQuery allows you to schedule queries to run regularly, which is perfect for maintaining data consistency over time.
* Quick and efficient: BigQuery is designed for large-scale data processing, making it fast and efficient for normalization tasks.
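To make the REGEXP_REPLACE idea above concrete, here is a minimal Python sketch of the same normalization logic; the function name and the assumption of 10-digit US numbers are illustrative, not from the question:

```python
import re

def normalize_phone(raw: str) -> str:
    """Mirror of a BigQuery REGEXP_REPLACE(raw, r'[^0-9]', '') step:
    strip all formatting, then drop a leading US country code so every
    number ends up as 10 digits. Purely illustrative."""
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

print(normalize_phone("(555) 123-4567"))   # -> 5551234567
print(normalize_phone("+1 555.123.4567"))  # -> 5551234567
```

In BigQuery itself this would be a scheduled query applying REGEXP_REPLACE; in Wrangler the same cleanup is configured visually with no SQL at all, which is the crux of the A-versus-D debate.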

Comment 5.1

ID: 1249160 User: carmltekai Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 16 Jul 2024 20:42 Selected Answer: - Upvotes: 1

Why other options aren't as suitable:

A. Cloud Data Fusion and Wrangler: While powerful, these tools might be overkill for a simple normalization task and could involve a steeper learning curve.
B. Dataflow SQL: Dataflow is primarily for stream processing and might not be the most efficient for batch transformations on data already in BigQuery.
C. Dataproc Serverless: This involves using a Spark job, which requires coding and might be more complex than necessary for this task.

Comment 6

ID: 1231876 User: fitri001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 12:33 Selected Answer: A Upvotes: 2

https://cloud.google.com/data-fusion/docs

Comment 7

ID: 1158648 User: SohiniV Badges: - Relative Date: 2 years ago Absolute Date: Sun 25 Feb 2024 12:12 Selected Answer: - Upvotes: 1

As per chatGPT, Option D allows you to utilize BigQuery's SQL capabilities to write queries that normalize the data according to company standards.
You can then schedule these queries to run on a recurring basis using BigQuery's scheduled queries feature. This feature allows you to specify a schedule (e.g., weekly) for executing SQL queries automatically.
This approach requires no additional setup or coding outside of BigQuery, making it a quick and straightforward solution to address the issue of data normalization.

Comment 7.1

ID: 1158649 User: SohiniV Badges: - Relative Date: 2 years ago Absolute Date: Sun 25 Feb 2024 12:13 Selected Answer: - Upvotes: 1

Any views on this ?

Comment 7.1.1

ID: 1159597 User: RenePetersen Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 11:29 Selected Answer: - Upvotes: 6

Wouldn't writing the SQL transformation be considered coding? The question specifically states that a solution requiring no coding is needed.

Comment 7.1.1.1

ID: 1177023 User: jreale64 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 19 Mar 2024 07:13 Selected Answer: - Upvotes: 1

While Cloud Data Fusion with Wrangler offers a visual interface for data wrangling, it requires setting up the environment and potentially writing code for transformations, so it is not appropriate. I think D.

Comment 8

ID: 1155704 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 17:35 Selected Answer: A Upvotes: 1

Option A

Comment 9

ID: 1118387 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 10:04 Selected Answer: A Upvotes: 2

Cloud Data Fusion and Wrangler

Comment 10

ID: 1113523 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 11:12 Selected Answer: A Upvotes: 2

A. Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.

27. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 313

Sequence
80
Discussion ID
152659
Source URL
https://www.examtopics.com/discussions/google/view/152659-exam-professional-data-engineer-topic-1-question-313/
Posted By
mcdaley
Posted At
Dec. 7, 2024, 2:45 p.m.

Question

You want to migrate an Apache Spark 3 batch job from on-premises to Google Cloud. You need to minimally change the job so that the job reads from Cloud Storage and writes the result to BigQuery. Your job is optimized for Spark, where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings. You want to minimize installation and management effort to run your job. What should you do?

  • A. Execute the job as part of a deployment in a new Google Kubernetes Engine cluster.
  • B. Execute the job from a new Compute Engine VM.
  • C. Execute the job in a new Dataproc cluster.
  • D. Execute as a Dataproc Serverless job.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 10 comments Click to expand

Comment 1

ID: 1324614 User: chicity_de Badges: Highly Voted Relative Date: 1 year, 3 months ago Absolute Date: Tue 10 Dec 2024 16:06 Selected Answer: D Upvotes: 11

Priority is "minimize installation and management effort" which is done via Dataproc Serverless. Furthermore, with Dataproc serverless you can still specify resource settings for your job, such as the number of vCPUs and memory per executor (https://cloud.google.com/dataproc-serverless/docs/concepts/properties)
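To make the sizing debate concrete: the question's 8 vCPU / 16 GB executor shape works out to 2048 MB per core, inside the range the linked properties page quotes for the Standard tier. A small Python sanity check; the exact bounds are taken as assumptions from that page and should be re-verified against current docs:

```python
def fits_serverless_executor(cores: int, memory_mb: int) -> bool:
    """Sanity-check an executor shape against the constraints quoted from
    the Dataproc Serverless properties page: spark.executor.cores accepts
    4, 8, or 16, and memory per core must fall between 1024m and 7424m on
    the Standard compute tier. Treat these bounds as assumptions."""
    if cores not in (4, 8, 16):
        return False
    per_core = memory_mb / cores
    return 1024 <= per_core <= 7424

# The on-premises shape from the question: 8 vCPU and 16 GB per executor.
print(fits_serverless_executor(8, 16 * 1024))  # 2048 MB per core -> True
```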

Comment 2

ID: 1571090 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 22:16 Selected Answer: C Upvotes: 3

I would go with C: "and you want to be able to choose similar settings" is not applicable to D.

Comment 2.1

ID: 1574988 User: Positron75 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Thu 05 Jun 2025 10:03 Selected Answer: - Upvotes: 1

Dataproc Serverless allows configuring those parameters: https://cloud.google.com/dataproc-serverless/docs/concepts/properties

Comment 3

ID: 1563194 User: gabbferreira Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Wed 23 Apr 2025 23:35 Selected Answer: C Upvotes: 1

" where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings."

Minimize effort by using Dataproc, not GKE or VMs.

Comment 4

ID: 1560284 User: rajshiv Badges: - Relative Date: 11 months ago Absolute Date: Sun 13 Apr 2025 13:15 Selected Answer: D Upvotes: 1

D is the best answer: Dataproc Serverless allows you to specify executor configurations such as vCPU and memory settings (e.g., executor cores and memory) to match the current setup, as specified.
C is a valid but sub-optimal choice: while we can specify vCPUs and memory similar to our on-prem setup, it requires provisioning and managing clusters, which we want to avoid, and therefore slightly more effort than Dataproc Serverless.

Comment 5

ID: 1361118 User: gabazzzo Badges: - Relative Date: 1 year ago Absolute Date: Mon 24 Feb 2025 17:52 Selected Answer: - Upvotes: 1

I agree that minimizing installation and management means using Dataproc Serverless.
Also, Serverless can be configured with up to 16 vCPU and up to 29696m of memory for the Premium tier. https://cloud.google.com/dataproc-serverless/docs/concepts/properties#:~:text=Total%20driver%20memory%20per%20driver%20core%2C%20including%20driver%20memory%20overhead%2C%20which%20must%20be%20between%201024m%20and%207424m%20for%20the%20Standard%20compute%20tier%20(24576m%20for%20the%20Premium%20compute%20tier).%20For%20example%2C%20if%20spark.driver.cores%20%3D%204%2C%20then%204096m%20%3C%3D%20spark.driver.memory%20%2B%20spark.driver.memoryOverhead%20%3C%3D%2029696m.

Comment 6

ID: 1356130 User: a494e30 Badges: - Relative Date: 1 year ago Absolute Date: Thu 13 Feb 2025 14:39 Selected Answer: C Upvotes: 2

Needs to be able to configure "similar settings"

Comment 7

ID: 1351289 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 04 Feb 2025 11:23 Selected Answer: C Upvotes: 2

It's not possible to specify a machine type using Dataproc Serverless

Comment 8

ID: 1341263 User: marlon.andrei Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Wed 15 Jan 2025 22:50 Selected Answer: C Upvotes: 2

I choose C, just because of: "where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings"

Comment 9

ID: 1323120 User: mcdaley Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 07 Dec 2024 14:45 Selected Answer: C Upvotes: 1

Dataproc supports Spark 3, ensuring compatibility with your existing job.

It also allows you to customize the cluster configuration, including the number of executors, vCPUs, and memory per executor, to match your on-premises setup (8 vCPU and 16 GB memory)

28. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 173

Sequence
82
Discussion ID
79524
Source URL
https://www.examtopics.com/discussions/google/view/79524-exam-professional-data-engineer-topic-1-question-173/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 7:51 p.m.

Question

You are designing a pipeline that publishes application events to a Pub/Sub topic. Although message ordering is not important, you need to be able to aggregate events across disjoint hourly intervals before loading the results to BigQuery for analysis. What technology should you use to process and load this data to
BigQuery while ensuring that it will scale with large volumes of events?

  • A. Create a Cloud Function to perform the necessary data processing that executes using the Pub/Sub trigger every time a new message is published to the topic.
  • B. Schedule a Cloud Function to run hourly, pulling all available messages from the Pub/Sub topic and performing the necessary aggregations.
  • C. Schedule a batch Dataflow job to run hourly, pulling all available messages from the Pub/Sub topic and performing the necessary aggregations.
  • D. Create a streaming Dataflow job that reads continually from the Pub/Sub topic and performs the necessary aggregations using tumbling windows.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 747583 User: Atnafu Badges: Highly Voted Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 21:50 Selected Answer: - Upvotes: 11

D

TUMBLE=> fixed windows.
HOP=> sliding windows.
SESSION=> session windows.
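The tumbling (fixed) window semantics behind answer D can be shown in plain Python: each event timestamp maps to exactly one disjoint hourly interval, which is how Beam's FixedWindows assigns elements. The helper below is illustrative only:

```python
WINDOW_SECONDS = 3600  # one-hour tumbling (fixed) windows

def window_start(event_ts: int) -> int:
    """Map an event timestamp (epoch seconds) to the start of its tumbling
    window. Each event falls into exactly one window, which is what makes
    the hourly intervals disjoint."""
    return event_ts - (event_ts % WINDOW_SECONDS)

events = [3599, 3600, 7199, 7200]
print([window_start(t) for t in events])  # -> [0, 3600, 3600, 7200]
```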

Comment 2

ID: 820990 User: musumusu Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Fri 24 Feb 2023 23:06 Selected Answer: - Upvotes: 7

Why not C? If data is arriving hourly, why can't we use batch processing rather than streaming with a 1-hour fixed window?

Comment 2.1

ID: 870112 User: MrMone Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 14 Apr 2023 12:11 Selected Answer: - Upvotes: 2

"you need to be able to aggregate events across disjoint hourly intervals" does not mean data is arriving hourly. However, it's tricky! Answer D

Comment 2.2

ID: 1056759 User: ga8our Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 29 Oct 2023 11:44 Selected Answer: - Upvotes: 2

I second your question. No one who suggests Dataflow streaming (D) has explained why an hourly batch job is insufficient.

Comment 2.3

ID: 1056755 User: ga8our Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 29 Oct 2023 11:42 Selected Answer: - Upvotes: 2

I second your question. No one who suggests C has explained why an hourly batch job is insufficient.

Comment 3

ID: 1573521 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 2 weeks ago Absolute Date: Fri 30 May 2025 14:06 Selected Answer: D Upvotes: 1

I would go with D, not C, because of "while ensuring that it will scale with large volumes of events". The main reason to choose a streaming job is the "4 Vs" consideration (Volume, Variety, etc.).

Comment 4

ID: 1292216 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 02 Oct 2024 07:07 Selected Answer: D Upvotes: 4

Just to provide clarity to people asking "why not C" - the source is a pub/sub. Pub/Sub has a limit of 10 MB or 1000 messages for a single batch publish request, which means that batch dataflow will not necessarily be able to retrieve all messages. If the question had said "there will always be less than 1000 messages and less than 10mb", only then would batch be acceptable.

Comment 5

ID: 1278661 User: mayankazyour Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 05 Sep 2024 06:51 Selected Answer: D Upvotes: 1

The question asks about future scalability for large volumes of events, so it's better to go with a streaming Dataflow job.

Comment 6

ID: 1076576 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 20:20 Selected Answer: - Upvotes: 2

I just do not understand why this needs to be streamed. I understand that there might be a slight delay using batch processing but there is no indication this is critical data. Can someone please provide your thinking?

Comment 7

ID: 961965 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 21:20 Selected Answer: - Upvotes: 1

We can use TUMBLE(1 HOUR) to create hourly windows, where each window contains events from a specific hour.

Comment 8

ID: 961961 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 21:17 Selected Answer: D Upvotes: 1

Option D : A streaming Dataflow job is the best way to process and load data from Pub/Sub to BigQuery in real time. This is because streaming Dataflow jobs can scale to handle large volumes of data, and they can perform aggregations using tumbling windows.

Comment 9

ID: 696706 User: devaid Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Mon 17 Oct 2022 02:43 Selected Answer: D Upvotes: 2

Answer D
Tumbling Windows = Fixed Windows

Comment 10

ID: 675950 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 22 Sep 2022 11:46 Selected Answer: D Upvotes: 2

Answer D

Comment 11

ID: 657679 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 19:51 Selected Answer: D Upvotes: 2

D. Create a streaming Dataflow job that reads continually from the Pub/Sub topic and performs the necessary aggregations using tumbling windows.

A tumbling window represents a consistent, disjoint time interval in the data stream.

Reference:
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#tumbling-windows

29. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 225

Sequence
84
Discussion ID
129872
Source URL
https://www.examtopics.com/discussions/google/view/129872-exam-professional-data-engineer-topic-1-question-225/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:51 a.m.

Question

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

  • A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • B. Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • C. Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
  • D. Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 1113799 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 16:19 Selected Answer: A Upvotes: 7

- This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs.
- Using Dataproc Metastore to manage metadata allows us to keep Hadoop ecosystem's structural information.
- Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters.
- Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis.

Comment 2

ID: 1352109 User: skhaire Badges: Highly Voted Relative Date: 1 year, 1 month ago Absolute Date: Wed 05 Feb 2025 23:28 Selected Answer: B Upvotes: 5

BigQuery Integration: The requirement is to make data available in BigQuery. Dataplex has built-in integration with BigQuery. It can automatically discover data in Cloud Storage and create external tables in BigQuery, making the data readily queryable. DPMS doesn't have this direct integration with BigQuery.

Comment 3

ID: 1573096 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 2 weeks ago Absolute Date: Wed 28 May 2025 18:53 Selected Answer: A Upvotes: 1

Would Go with A

Comment 4

ID: 1357582 User: 380e3c6 Badges: - Relative Date: 1 year ago Absolute Date: Mon 17 Feb 2025 06:08 Selected Answer: A Upvotes: 1

A is correct because it minimizes ETL changes, keeps Parquet data in Cloud Storage (cost-effective and Spark-compatible), and integrates with BigQuery via external tables. C is flawed since moving directly to BigQuery requires refactoring Spark jobs, increasing complexity and costs. B adds unnecessary governance overhead, and D focuses on infrastructure instead of pipeline efficiency.

Comment 5

ID: 1354328 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 10 Feb 2025 08:41 Selected Answer: D Upvotes: 1

The requirement:
"You want to use managed services"
excludes Dataproc Serverless.
Dataproc on Compute Engine remains.
Next requirement:
"BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery" -> BigLake
Next requirement:
"while minimizing ETL data processing changes and overhead costs" -> Refactor Spark pipelines to write and read data on Cloud Storage

Notes
1. Dataproc Metastore (DPMS) could be used on Dataproc to read data from BQ but not the other way round.

Comment 5.1

ID: 1570153 User: Positron75 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Mon 19 May 2025 11:21 Selected Answer: - Upvotes: 2

How is Dataproc Serverless not a managed service, but running Dataproc on Compute Engine is? D is the first answer to rule out.

Comment 6

ID: 1346689 User: LP_PDE Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 25 Jan 2025 23:30 Selected Answer: A Upvotes: 1

Both Spark and BigQuery can directly access data in Cloud Storage.

Comment 7

ID: 1313140 User: hrishi19 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 16 Nov 2024 18:03 Selected Answer: C Upvotes: 4

The question states that the data should be available on BigQuery and only option C meets this requirement.

Comment 8

ID: 1265115 User: JamesKarianis Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 13 Aug 2024 12:59 Selected Answer: A Upvotes: 1

A is correct

Comment 9

ID: 1225349 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 06 Jun 2024 11:55 Selected Answer: A Upvotes: 3

Option B: Registering the bucket as a Dataplex asset adds an additional layer of data governance and management. While useful, it may not be necessary for your immediate migration needs and can introduce additional complexity.
Option C: Migrating data directly to BigQuery would require significant changes to your Spark pipelines since they would need to be refactored to read from and write to BigQuery instead of Parquet files. This approach could introduce higher costs due to BigQuery storage and querying.
Option D: Using BigLake and Dataproc on Compute Engine is more complex and requires more management compared to Dataproc Serverless. Additionally, it might not be as cost-effective as leveraging Cloud Storage and Dataproc Serverless.

Comment 9.1

ID: 1269437 User: aoifneofi_ef Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 14:09 Selected Answer: - Upvotes: 2

Just adding further commentary on why A is correct; why the other options are incorrect is explained above.
Parquet files have the schema ingrained in them, so the Spark pipelines on the Hadoop cluster may not have needed tables at all. Hence the simplest solution would be to move the data to Cloud Storage instead of BigQuery; this way there would be minimal changes to the ETL pipelines: just change the HDFS file system pointer to the GCS file system for reads and writes, with no need for any additional tables.

Comment 10

ID: 1213456 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 21:58 Selected Answer: A Upvotes: 1

The question says "You want to use managed services, while minimizing ETL data processing changes and overhead costs". Dataproc is a managed service that doesn't require refactoring the data transformation Spark code you already have (you will have to refactor only the write and read code), and it has a BigQuery connector for future use. https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery

Comment 11

ID: 1167829 User: 52ed0e5 Badges: - Relative Date: 2 years ago Absolute Date: Thu 07 Mar 2024 10:22 Selected Answer: C Upvotes: 3

Migrate your data directly to BigQuery.
Refactor Spark pipelines to read from and write to BigQuery.
Run the Spark jobs on Dataproc Serverless.
The best choice for ensuring data availability in BigQuery. It allows seamless integration with BigQuery and minimizes ETL changes.

Comment 12

ID: 1159924 User: Ramon98 Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 17:21 Selected Answer: C Upvotes: 4

A tricky one, because of "you need to ensure that your data is available in BigQuery". The easiest and most straightforward migration seems answer A to me, and then you can use external tables to make the Parquet data directly available in BigQuery.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet

However, creating the external tables is an extra step, so maybe C is the answer?
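For what it's worth, the "extra step" under discussion is a single DDL statement over the migrated Parquet files; the dataset, table, and bucket names below are placeholders:

```sql
-- Illustrative only: dataset, table, and bucket names are placeholders.
CREATE EXTERNAL TABLE mydataset.customer_orders
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-migrated-bucket/customer_orders/*.parquet']
);
```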

Comment 13

ID: 1159300 User: Moss2011 Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 03:11 Selected Answer: C Upvotes: 1

I think the key phrase here is "you need to ensure that your data is available in BigQuery"; that's why I think C is the best option.

Comment 14

ID: 1152651 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 16:31 Selected Answer: C Upvotes: 3

I think it's C.

Dataproc can use BigQuery to read and write data.
Dataproc's BigQuery connector is a library that allows Spark and Hadoop applications to process and write data from BigQuery.

Here's how Dataproc can be used with BigQuery:
Process large datasets: Use Spark to process data stored in BigQuery.
Write results: Write the results back to BigQuery or other data storage for further analysis.
Read data: The BigQuery connector can read data from BigQuery into a Spark DataFrame.
Write data: The connector writes data to BigQuery by buffering all the data into a Cloud Storage temporary table.

Comment 14.1

ID: 1152653 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 16:33 Selected Answer: - Upvotes: 3

As per question.. "BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services (DATAPROC), while minimizing ETL data processing changes and overhead costs."

Comment 15

ID: 1150351 User: matiijax Badges: - Relative Date: 2 years ago Absolute Date: Wed 14 Feb 2024 18:03 Selected Answer: B Upvotes: 4

I think it's B, and the reason is that registering the data as a Dataplex asset enables seamless integration with BigQuery later on. Dataplex simplifies data discovery and lineage tracking, making it easier to prepare your data for BigQuery transformations.

Comment 16

ID: 1144372 User: saschak94 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 12:02 Selected Answer: - Upvotes: 3

Why would I select A here? Why not move the data to BigQuery and run Dataproc Serverless jobs accessing the data in BigQuery?

Comment 17

ID: 1109553 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:51 Selected Answer: A Upvotes: 3

Managed Services: Leverages Dataproc Serverless for a fully managed Spark environment, reducing overhead and administrative tasks.
Minimal Data Processing Changes: Keeps Spark pipelines largely intact by working with Parquet files on Cloud Storage, minimizing refactoring efforts.
BigQuery Integration: Dataproc Serverless can directly access BigQuery, enabling future transformation pipelines without additional data movement.
Cost-Effective: Serverless model scales resources only when needed, optimizing costs for intermittent workloads.

30. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 226

Sequence
85
Discussion ID
129873
Source URL
https://www.examtopics.com/discussions/google/view/129873-exam-professional-data-engineer-topic-1-question-226/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:52 a.m.

Question

Your organization has two Google Cloud projects, project A and project B. In project A, you have a Pub/Sub topic that receives data from confidential sources. Only the resources in project A should be able to access the data in that topic. You want to ensure that project B and any future project cannot access data in the project A topic. What should you do?

  • A. Add firewall rules in project A so only traffic from the VPC in project A is permitted.
  • B. Configure VPC Service Controls in the organization with a perimeter around project A.
  • C. Use Identity and Access Management conditions to ensure that only users and service accounts in project A can access resources in project A.
  • D. Configure VPC Service Controls in the organization with a perimeter around the VPC of project A.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 13 comments Click to expand

Comment 1

ID: 1123599 User: datapassionate Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 20:36 Selected Answer: C Upvotes: 11

And I would agree with GPT. The question is about who can do what within the GCP environment. It's all about permissions and access management, not about networking.

Comment 2

ID: 1113819 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 16:40 Selected Answer: B Upvotes: 8

Option B:
-It allows us to create a secure boundary around all resources in Project A, including the Pub/Sub topic.
- It prevents data exfiltration to other projects and ensures that only resources within the perimeter (Project A) can access the sensitive data.
- VPC Service Controls are specifically designed for scenarios where you need to secure sensitive data within a specific context or boundary in Google Cloud.

Comment 3

ID: 1573097 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 2 weeks ago Absolute Date: Wed 28 May 2025 18:56 Selected Answer: B Upvotes: 1

Since such a case would not be as straightforward as C.

Comment 4

ID: 1265090 User: MithunDesai Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 13 Aug 2024 11:59 Selected Answer: B Upvotes: 3

The best solution to prevent project B and any future projects from accessing data in project A's Pub/Sub topic is B. Configure VPC Service Controls in the organization with a perimeter around project A.

Comment 5

ID: 1263531 User: meh_33 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 15:56 Selected Answer: B Upvotes: 3

B is correct. Raaad is always right.

Comment 6

ID: 1217307 User: fabiogoma Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 24 May 2024 10:17 Selected Answer: B Upvotes: 4

Setting up a perimeter around project A is future proof, the question asks to "ensure that project B and any future project cannot access data in the project A topic", IAM is not future proof.

Reference: https://cloud.google.com/vpc-service-controls/docs/overview#isolate

P.S.: VPC Service Controls is not the same thing as a VPC; instead, it's a security layer on top of a VPC, and it should be used together with IAM, not one or the other (https://cloud.google.com/vpc-service-controls/docs/overview#how-vpc-service-controls-works)

Comment 7

ID: 1214279 User: virat_kohli Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 12:19 Selected Answer: C Upvotes: 2

C. Use Identity and Access Management conditions to ensure that only users and service accounts in project A can access resources in project A. [SIMPLE!!!]

Comment 8

ID: 1152685 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 17:24 Selected Answer: B Upvotes: 2

I'll go with "B. Configure VPC Service Controls in the organization with a perimeter around project A."

Comment 9

ID: 1123586 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 20:21 Selected Answer: - Upvotes: 1

GPT:
C. Use Identity and Access Management conditions to ensure that only users and service accounts in project A can access resources in project A.

Analysis: This is the most appropriate option. IAM allows you to define who (which users or service accounts) has what access to your GCP resources. By setting IAM policies with conditions specific to Project A, you can ensure that only designated entities within Project A have access to its resources, including the Pub/Sub topic.
D. Configure VPC Service Controls in the organization with a perimeter around the VPC of project A.

Comment 9.1

ID: 1123587 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 20:22 Selected Answer: - Upvotes: 1

A. Add firewall rules in project A so only traffic from the VPC in project A is permitted.

Analysis: Firewall rules in GCP are used to control traffic to and from instances within Google Cloud Virtual Private Clouds (VPCs). However, they don't specifically control access to Pub/Sub resources. Pub/Sub access is managed through IAM, not VPC firewall rules.

Comment 9.1.1

ID: 1123588 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 20:23 Selected Answer: - Upvotes: 1

B. Configure VPC Service Controls in the organization with a perimeter around project A.

Analysis: VPC Service Controls provide a security perimeter for your data, but they are more focused on preventing data exfiltration; this might be more complex and broader than necessary for the specific requirement of restricting access to a Pub/Sub topic.

Comment 9.1.1.1

ID: 1123589 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 14:42 Selected Answer: - Upvotes: 1

D. Configure VPC Service Controls in the organization with a perimeter around the VPC of project A.

Analysis: Similar to option B, this is focused on securing network boundaries rather than specific resource access within GCP. While it could provide an additional layer of security, it's not the most direct way to control access to a specific Pub/Sub topic.

Comment 10

ID: 1109554 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:52 Selected Answer: B Upvotes: 4

VPC Service Controls enforce a security perimeter around entire projects, ensuring that resources within project A (including the Pub/Sub topic) are inaccessible from any other project, including project B and future projects.
This aligns with the requirement to prevent cross-project access.
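For reference, a project-level perimeter of the kind answer B describes is created with Access Context Manager; the perimeter name, policy ID, and project number below are placeholders, and the exact flags should be checked against the current gcloud reference:

```shell
# Illustrative sketch only -- all names and IDs are placeholders.
gcloud access-context-manager perimeters create projectA_perimeter \
  --policy=POLICY_ID \
  --title="Project A perimeter" \
  --resources=projects/PROJECT_A_NUMBER \
  --restricted-services=pubsub.googleapis.com
```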

31. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 311

Sequence
89
Discussion ID
132198
Source URL
https://www.examtopics.com/discussions/google/view/132198-exam-professional-data-engineer-topic-1-question-311/
Posted By
AllenChen123
Posted At
Jan. 26, 2024, 5:50 a.m.

Question

Your chemical company needs to manually check documentation for customer orders. You use a pull subscription in Pub/Sub so that sales agents get the details of each order. You must ensure that you do not process orders twice with different sales agents and that you do not add more complexity to this workflow. What should you do?

  • A. Use a Deduplicate PTransform in Dataflow before sending the messages to the sales agents.
  • B. Create a transactional database that monitors the pending messages.
  • C. Use Pub/Sub exactly-once delivery in your pull subscription.
  • D. Create a new Pub/Sub push subscription to monitor the orders processed in the agent's system.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 1134931 User: JimmyBK Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Mon 29 Jan 2024 13:25 Selected Answer: C Upvotes: 7

I remember seeing this in the exam.

Comment 1.1

ID: 1136241 User: Jordan18 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 31 Jan 2024 00:15 Selected Answer: - Upvotes: 3

how many questions were from here?

Comment 1.1.1

ID: 1260549 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 09:15 Selected Answer: - Upvotes: 3

also got this one. about 70%

Comment 2

ID: 1571094 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 22:25 Selected Answer: C Upvotes: 1

"that you do not add more complexity to this workflow".
I would go with C

Comment 3

ID: 1571092 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 22:23 Selected Answer: C Upvotes: 1

I would go with Pub/Sub, even though a missed ack might cause a message to be sent twice.

Comment 4

ID: 1254088 User: cien91 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 05:57 Selected Answer: A Upvotes: 2

Why not C - Exactly-once delivery in Pub/Sub guarantees that a message is delivered to a subscriber exactly once. However, it doesn't prevent multiple subscribers from processing the same message.

Comment 5

ID: 1156327 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Thu 22 Feb 2024 12:32 Selected Answer: C Upvotes: 1

Option C

Comment 6

ID: 1138529 User: Sofiia98 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 02 Feb 2024 13:44 Selected Answer: C Upvotes: 1

C, of course

Comment 7

ID: 1132258 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 05:50 Selected Answer: C Upvotes: 4

Straightforward.
https://cloud.google.com/pubsub/docs/exactly-once-delivery
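The thread's point about duplicate deliveries can be made concrete with a small local simulation (plain Python, not the Pub/Sub client API; all names and order IDs are illustrative). With exactly-once delivery enabled on a pull subscription, Pub/Sub suppresses redeliveries of already-acknowledged messages, so two agents never handle the same order:

```python
from collections import deque

# Local illustration only -- not the Pub/Sub client API. Two sales agents pull
# from a shared backlog in which "order-1" was redelivered (e.g. a missed ack).
# Tracking already-acknowledged message IDs mimics what exactly-once delivery
# gives you server-side: no agent processes the same order twice.
backlog = deque(["order-1", "order-2", "order-1", "order-3"])  # order-1 redelivered
agents = ["agent-a", "agent-b"]

acked = set()
processed = []  # (agent, order) pairs actually handled

turn = 0
while backlog:
    msg = backlog.popleft()
    if msg in acked:        # duplicate delivery: drop it
        continue
    acked.add(msg)          # acknowledge exactly once
    processed.append((agents[turn % 2], msg))
    turn += 1

print(processed)
# -> [('agent-a', 'order-1'), ('agent-b', 'order-2'), ('agent-a', 'order-3')]
```

With exactly-once delivery, the dedup happens on the service side instead of in your own code, which is exactly why option C adds no extra workflow complexity.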

32. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 296

Sequence
101
Discussion ID
129909
Source URL
https://www.examtopics.com/discussions/google/view/129909-exam-professional-data-engineer-topic-1-question-296/
Posted By
chickenwingz
Posted At
Dec. 30, 2023, 9:07 p.m.

Question

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with minimal latency. What should you do?

  • A. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.
  • B. Use a proxy host in the VPC in Google Cloud connecting to Kafka. Write a Dataflow pipeline, read data from the proxy host, and write the data to BigQuery.
  • C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
  • D. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline, read the data from Pub/Sub, and write the data to BigQuery.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 1560340 User: rajshiv Badges: - Relative Date: 11 months ago Absolute Date: Sun 13 Apr 2025 16:38 Selected Answer: A Upvotes: 2

I think it is A and not C. While Dataflow can read from Kafka directly, that works best for Kafka clusters hosted in Google Cloud. Reading from an on-prem Kafka over Interconnect directly from Dataflow is not recommended due to latency, firewall/NAT issues, and network complexity. Most importantly, this option is not optimal for performance and reliability across hybrid environments.

Comment 2

ID: 1263326 User: meh_33 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 07:39 Selected Answer: - Upvotes: 1

Going with C

Comment 3

ID: 1243137 User: Anudeep58 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 04:54 Selected Answer: C Upvotes: 2

Latency: Option C, with direct integration between Kafka and Dataflow, offers lower latency by eliminating intermediate steps.
Flexibility: Custom Dataflow pipelines (Option C) provide more control over data processing and optimization compared to using a pre-built template.

Comment 4

ID: 1193996 User: anushree09 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 22:19 Selected Answer: - Upvotes: 1

per the text below at https://cloud.google.com/dataflow/docs/kafka-dataflow -

"Alternatively, you might have an existing Kafka cluster that resides outside of Google Cloud. For example, you might have an existing workload that is deployed on-premises or in another public cloud."

Comment 5

ID: 1163135 User: Moss2011 Badges: - Relative Date: 2 years ago Absolute Date: Fri 01 Mar 2024 12:47 Selected Answer: C Upvotes: 2

From my point of view, the best option is C taking into account this doc: https://cloud.google.com/dataflow/docs/kafka-dataflow

Comment 6

ID: 1160025 User: MaxNRG Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:45 Selected Answer: D Upvotes: 2

Based on the key requirements highlighted:
• Interconnect link between GCP and on-prem Kafka
• High throughput streaming pipeline
• Minimal latency
• Data to be stored in BigQuery
D - The key reasons this meets the requirements:
• Kafka connect provides a reliable bridge to Pub/Sub over the interconnect
• Reading from Pub/Sub minimizes latency vs reading directly from Kafka
• Dataflow provides a high throughput streaming engine
• Writing to BigQuery gives scalable data storage
By leveraging these fully managed GCP services over the dedicated interconnect, a low latency streaming pipeline from on-prem Kafka into BigQuery can be implemented rapidly.
Options A/B/C have higher latencies or custom code requirements, so do not meet the minimal latency criteria as well as option D.

Comment 6.1

ID: 1160029 User: MaxNRG Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:47 Selected Answer: - Upvotes: 1

Why choose option D over A?
The key advantage with option D is that by writing a custom Dataflow pipeline rather than using a Google provided template, there is more flexibility to customize performance tuning and optimization for lowest latency.
• Some potential optimizations:
• Fine tuning number of workers, machine types to meet specific throughput targets
• Custom data parsing/processing logic if applicable
• Experimenting with autoscaling parameters or triggers

Comment 6.1.1

ID: 1160030 User: MaxNRG Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:47 Selected Answer: - Upvotes: 1

The Google template may be easier to set up initially, but a custom pipeline provides more control over optimizations specifically for low latency requirements stated in the question.
That being said, option A would still work reasonably well - but option D allows squeezing out that extra bit of performance if low millisecond latency is absolutely critical in the pipeline through precise tuning.
So in summary, option A is easier to implement but option D provides more optimization flexibility for ultra low latency streaming requirements.

Comment 6.2

ID: 1160026 User: MaxNRG Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:47 Selected Answer: - Upvotes: 2

Why not C:
At first option C (using a Dataflow pipeline to directly read from Kafka and write to BigQuery) seems reasonable.
However, the key requirement stated in the question is to have minimal latency for the streaming pipeline.
By reading directly from Kafka within Dataflow, there can be additional latency and processing overhead compared to reading from Pub/Sub, for a few reasons:
1. Pub/Sub acts as a buffer and handles scaling/reliability of streaming data automatically. This reduces processing burden on the pipeline.
2. Network latency can be lower by leveraging Pub/Sub instead of making constant pull requests for data from Kafka within the streaming pipeline.
3. Any failures have to be handled within the pipeline code itself when reading directly from Kafka. With Pub/Sub, reliability is built-in.

Comment 6.2.1

ID: 1160028 User: MaxNRG Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 20:47 Selected Answer: - Upvotes: 2

So in summary, while option C is technically possible, option D introduces Pub/Sub as a streaming buffer which reduces overall latency for the pipeline, allowing the key requirement of minimal latency to be better satisfied.

Comment 6.2.2

ID: 1183491 User: SanjeevRoy91 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 26 Mar 2024 18:38 Selected Answer: - Upvotes: 3

You are adding an intermediate hop (Pub/Sub) between on-prem Kafka and Dataflow. Why wouldn't this add additional latency?

Comment 7

ID: 1155765 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 19:08 Selected Answer: - Upvotes: 2

A Vs C -- Not sure which one would have low latency.

Points related to option C:
"Yes, Dataflow can read events from Kafka. Dataflow is a fully-managed, serverless streaming analytics service that supports both batch and stream processing. It can read events from Kafka, process them, and write the results to a BigQuery table for further analysis. "

"Dataflow supports Kafka, support for which was added to Apache Beam in 2016. Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

Comment 7.1

ID: 1156049 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Thu 22 Feb 2024 03:59 Selected Answer: - Upvotes: 2

Going with C

Comment 7.1.1

ID: 1156493 User: DarkLord2104 Badges: - Relative Date: 2 years ago Absolute Date: Thu 22 Feb 2024 16:42 Selected Answer: - Upvotes: 2

Final???

Comment 8

ID: 1138703 User: T2Clubber Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 02 Feb 2024 17:12 Selected Answer: C Upvotes: 3

Option C makes more sense to me because of the "minimal latency as possible".
I would have chosen option A if it were "less CODING as possible".

Comment 9

ID: 1121922 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 18:39 Selected Answer: A Upvotes: 4

Option A, leverage dataflow template for Kafka https://cloud.google.com/dataflow/docs/kafka-dataflow

Comment 9.1

ID: 1127700 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 21 Jan 2024 09:02 Selected Answer: - Upvotes: 1

Agree. "Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

Comment 9.2

ID: 1153507 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 20:02 Selected Answer: - Upvotes: 1

But it includes setting up a Kafka Connect bridge while an interconnect link has already been set up. https://cloud.google.com/dataflow/docs/kafka-dataflow#connect_to_an_external_cluster

Comment 10

ID: 1113611 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 12:10 Selected Answer: C Upvotes: 4

C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.

Comment 11

ID: 1109949 User: chickenwingz Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 21:07 Selected Answer: C Upvotes: 3

Dataflow has templates to read from Kafka. Other options are too complicated
https://cloud.google.com/dataflow/docs/kafka-dataflow

Comment 11.1

ID: 1118808 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 17:16 Selected Answer: - Upvotes: 2

So is the answer A? Why C?

Comment 11.1.1

ID: 1121921 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 18:39 Selected Answer: - Upvotes: 1

Yeah, the answer is A. C requires you to develop the pipeline yourself and ensure minimal latency, which means you would have to perform better than a pre-built template from Google... not really the case most of the time :)

Comment 11.1.1.1

ID: 1135060 User: saschak94 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 29 Jan 2024 15:55 Selected Answer: - Upvotes: 7

but Option A introduces additional replication into Pub/Sub and the question states with minimal latency. In my opinion subscribing to Kafka via Dataflow has a lower latency than replicating the messages first to Pub/Sub and subscribing with Dataflow to it.
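The latency argument running through this thread can be sketched with a toy hop-count model. Every millisecond figure below is a made-up assumption for illustration, not a measurement; the point is only that each extra stage between Kafka and BigQuery adds its own delay, which is why the direct read (option C) tends to win on the "minimal latency" criterion:

```python
# Toy latency budget for the two architectures debated above. All numbers are
# illustrative assumptions -- the structural point is that the Pub/Sub route
# has more stages, each contributing its own delay.
pipelines = {
    "C: Kafka -> Dataflow -> BigQuery": {
        "kafka read over interconnect": 50,
        "dataflow processing": 20,
        "bigquery write": 30,
    },
    "A/D: Kafka -> Connect -> Pub/Sub -> Dataflow -> BigQuery": {
        "kafka read over interconnect": 50,
        "kafka connect publish to pub/sub": 25,
        "pub/sub delivery": 15,
        "dataflow processing": 20,
        "bigquery write": 30,
    },
}

totals = {name: sum(stages.values()) for name, stages in pipelines.items()}
for name, total in totals.items():
    print(f"{name}: ~{total} ms end-to-end")
```

Under these (assumed) numbers the direct pipeline is ~40 ms faster end-to-end; what Pub/Sub buys you instead is buffering and managed reliability, which is the trade-off MaxNRG and saschak94 are debating.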

33. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 65

Sequence
107
Discussion ID
16476
Source URL
https://www.examtopics.com/discussions/google/view/16476-exam-professional-data-engineer-topic-1-question-65/
Posted By
madhu1171
Posted At
March 13, 2020, 1:45 p.m.

Question

You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?

  • A. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
  • B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
  • C. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
  • D. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 64738 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Mon 16 Mar 2020 15:31 Selected Answer: - Upvotes: 40

Real-valued fields cannot be null, N/A, or empty; they have to be "0", so it has to be B.

Comment 2

ID: 65900 User: Snobid Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Thu 19 Mar 2020 07:43 Selected Answer: - Upvotes: 8

Instead of having to write the custom script from scratch (option D), dataprep already has preconfigured tools for your use to perform the necessary data wrangling. As mentioned by jvg637, real-values have to be "0". Considering both points above, answer should be 'B'

Comment 3

ID: 1410086 User: monyu Badges: Most Recent Relative Date: 11 months, 3 weeks ago Absolute Date: Tue 25 Mar 2025 16:21 Selected Answer: B Upvotes: 1

Usually, null values are converted to 0 in the data cleaning and preparation process. The key point here is that we don't need any tool other than Dataprep to identify and modify the values.

Comment 4

ID: 1398897 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 15:56 Selected Answer: D Upvotes: 1

Cloud Dataflow is ideal for scalable data processing and allows for real-time transformations.
Logistic regression requires numerical (real-valued) inputs, and null values cannot remain as they are.

Comment 5

ID: 1306761 User: Erg_de Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 04 Nov 2024 04:59 Selected Answer: D Upvotes: 2

Option D:
Using null value conversion to 0 is the most correct practice for this case. Accompanying it with a script allows us to implement the necessary logic to handle null cases properly, adapting to the model while maintaining data integrity.

Comment 5.1

ID: 1322430 User: certs4pk Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 05 Dec 2024 17:08 Selected Answer: - Upvotes: 1

Why use a Dataflow job when it can be done via Dataprep? (Much simpler and more straightforward, and less resource intensive.)

Comment 6

ID: 1189406 User: AjoeT Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 04 Apr 2024 16:35 Selected Answer: B Upvotes: 2

B. Dataprep has the feature to convert it into 0.

Comment 7

ID: 1163235 User: niru12376 Badges: - Relative Date: 2 years ago Absolute Date: Fri 01 Mar 2024 06:31 Selected Answer: - Upvotes: 1

0 is still a value, which can add bias; the model will take it into account while making predictions, so 'none'.

Comment 8

ID: 1092512 User: Nandababy Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sun 10 Dec 2023 14:32 Selected Answer: - Upvotes: 1

Why not D? The keyword is "monitor". B would replace all empty fields and could also cause unintended bias.

Comment 8.1

ID: 1092516 User: Nandababy Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sun 10 Dec 2023 14:38 Selected Answer: - Upvotes: 1

However, sergiomujica is right. If we need to prepare data using a casual method, then it's B, Dataprep.

Comment 9

ID: 998937 User: sergiomujica Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 05 Sep 2023 03:30 Selected Answer: - Upvotes: 1

The question says "You need to prepare data using a casual method" -- that's Dataprep -- and the values should be 0, so the right answer is B.

Comment 10

ID: 959630 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 16:39 Selected Answer: B Upvotes: 2

No brainer. We need a real value and Dataprep is made for this. Dataflow is mainly for pre-processing before BigQuery ingests the data.

Comment 11

ID: 954555 User: theseawillclaim Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 21:09 Selected Answer: B Upvotes: 2

Dataprep is made for this kind of stuff, no reason to use a streaming service such as Dataflow.

Comment 12

ID: 879966 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 07:19 Selected Answer: B Upvotes: 1

gpt:Cloud Dataprep is a data preparation service that can be used to transform, clean and shape data in a visually interactive way. It provides an easy-to-use interface to find and replace null values.

Cloud Dataflow is a fully-managed service for executing data processing pipelines, which allows for parallel execution of data processing tasks. However, it requires more expertise to set up and operate than Cloud Dataprep, and is usually used for more complex data processing needs.

Therefore, option B is the most suitable approach for the given requirements.

Comment 13

ID: 784939 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 04:24 Selected Answer: - Upvotes: 2

Seems to me like Answers are both B and D.
B is faster to implement while D takes time.
That doesn't mean it's wrong, though. I'm not sure why everyone has picked just B. Why not D? D works and does the same job. A custom script also provides more flexibility and control over the data processing tasks, and it allows you to handle missing values in a more flexible and efficient way.

Comment 13.1

ID: 898965 User: rajm893 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 16 May 2023 09:43 Selected Answer: - Upvotes: 2

The "casual way", or easy way, to convert to 0 is using a Dataprep job rather than a custom script.

Comment 13.2

ID: 905930 User: AmmarFasih Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 15:36 Selected Answer: - Upvotes: 1

A simple rule: whenever a GCP service is available for a task, recommend using that service over any other.

Comment 14

ID: 781914 User: GCPpro Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 20 Jan 2023 07:00 Selected Answer: - Upvotes: 1

B is the correct answer.

Comment 15

ID: 766094 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 22:06 Selected Answer: - Upvotes: 3

Answer is Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.

Key phrases are "casual method", "need to replace null with real values", "logistic regression". Logistic regression works on numbers. Null need to be replaced with a number. And Cloud dataprep is best casual tool out of given options.

Comment 16

ID: 745530 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 00:41 Selected Answer: B Upvotes: 1

real value 0

Comment 17

ID: 531339 User: byash1 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 24 Jan 2022 14:52 Selected Answer: B Upvotes: 2

It is B
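The transformation the winning answer describes -- replace nulls with 0 while keeping the column real-valued -- can be sketched in plain Python (column names and sample rows are illustrative; in practice a Dataprep recipe step does this without code):

```python
import math

# What Dataprep's "convert nulls to 0" step amounts to. Logistic regression
# needs real-valued features, so None/NaN must become a number -- here 0 --
# rather than the string 'none' or a dropped row.
rows = [
    {"clicks": 12, "spend": 3.5},
    {"clicks": None, "spend": 1.0},      # null that must stay real-valued
    {"clicks": 7, "spend": float("nan")},
]

def to_real(value, default=0):
    """Replace None/NaN with a real number; leave other values untouched."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return value

cleaned = [{k: to_real(v) for k, v in row.items()} for row in rows]
print(cleaned)
# -> [{'clicks': 12, 'spend': 3.5}, {'clicks': 0, 'spend': 1.0}, {'clicks': 7, 'spend': 0}]
```

Note that converting to the string 'none' (options A and C) would break the logistic regression model, which is why only the 0-conversion answers are viable.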

34. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 43

Sequence
112
Discussion ID
16819
Source URL
https://www.examtopics.com/discussions/google/view/16819-exam-professional-data-engineer-topic-1-question-43/
Posted By
rickywck
Posted At
March 17, 2020, 4:39 a.m.

Question

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

  • A. Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
  • B. Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
  • C. Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
  • D. Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 37 comments Click to expand

Comment 1

ID: 67019 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sun 22 Mar 2020 17:45 Selected Answer: - Upvotes: 68

The answer will be A, because a view takes no extra storage -- it is a logical representation. For the rest of the options you would need to write a lot of code and do extra processing with Dataflow/Dataproc.
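The view this comment argues for is a one-statement piece of BigQuery SQL. A minimal sketch, assuming a dataset named `hr` (the dataset and view names are hypothetical):

```sql
-- Logical view only: no data is copied and no extra storage is billed.
CREATE VIEW hr.UsersFullName AS
SELECT
  FirstName,
  LastName,
  CONCAT(FirstName, ' ', LastName) AS FullName
FROM hr.Users;
```

The application queries `hr.UsersFullName` exactly as it would a table, and new employees get a `FullName` automatically because the concatenation runs at query time.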

Comment 1.1

ID: 711962 User: beowulf_kat Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 05 Nov 2022 19:58 Selected Answer: - Upvotes: 2

I agree that A is correct. Also, I think B is wrong as the UPDATE statement is used to update values in existing columns, not to create a new column.

Comment 1.1.1

ID: 721708 User: ovokpus Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 19 Nov 2022 02:56 Selected Answer: - Upvotes: 2

Of course, you use UPDATE after creating the new column, that is what the option said

Comment 1.2

ID: 294489 User: funtoosh Badges: - Relative Date: 5 years ago Absolute Date: Fri 19 Feb 2021 18:48 Selected Answer: - Upvotes: 18

It cannot be A, as it clearly says that you need to change the schema and data.

Comment 1.2.1

ID: 530206 User: exnaniantwort Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 03:13 Selected Answer: - Upvotes: 15

Your primary task is to "make data available".
Changing the schema is just the request from the member "A member of IT is building an application and ***asks you to modify the schema and data*** in BigQuery". You don't have to follow it if it does not make sense.

Comment 1.2.1.1

ID: 530211 User: exnaniantwort Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 03:19 Selected Answer: - Upvotes: 6

There will always be a different application requirement for a different format. That way you would just create more and more redundant columns in different formats. That is tedious.

Comment 1.2.1.2

ID: 656315 User: YorelNation Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 01 Sep 2022 15:47 Selected Answer: - Upvotes: 4

A, yes. That makes a lot of sense. Also, if you only run UPDATE once, the column will not stay up to date when a new employee is added; if the app uses a view, it is up to date every time it queries.
But in that case the cost would not be minimized.

Comment 1.3

ID: 67428 User: [Removed] Badges: - Relative Date: 5 years, 11 months ago Absolute Date: Tue 24 Mar 2020 04:00 Selected Answer: - Upvotes: 12

Because views are not materialized, the query that defines the view is run each time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query

Comment 1.3.1

ID: 152459 User: lgdantas Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Fri 07 Aug 2020 11:25 Selected Answer: - Upvotes: 3

Wouldn't "total amount of data in all table fields referenced directly or indirectly by the top-level query" be FirstName and LastName?

Comment 1.3.1.1

ID: 363749 User: lollo1234 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sat 22 May 2021 16:55 Selected Answer: - Upvotes: 4

You're right, BigQuery bills on number of bytes processed, regardless of them being materialized. If you don't create a new column and use a view instead, you will probably have a small performance hit but query costs would be the same and storage cost wouldn't increase (unlike storing a new column)

Comment 1.3.1.1.1

ID: 443566 User: yoshik Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sun 12 Sep 2021 19:12 Selected Answer: - Upvotes: 12

You are asked to modify the schema and data. By using a view, the underlying table remains intact.

Comment 1.3.1.1.1.1

ID: 663709 User: HarshKothari21 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 16:48 Selected Answer: - Upvotes: 1

good catch, yoshik.

Comment 1.3.2

ID: 599586 User: alecuba16 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 14:34 Selected Answer: - Upvotes: 3

Views are cached the same way regular tables are, so I don't get the billing point. It will cost the same as a query against a regular table.

Comment 1.3.2.1

ID: 729459 User: ovokpus Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 28 Nov 2022 18:13 Selected Answer: - Upvotes: 3

the point of billing is extra storage costs for a new concatenated column

Comment 1.3.3

ID: 1318859 User: cloud_rider Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 27 Nov 2024 20:29 Selected Answer: - Upvotes: 1

This was true in the Oracle era. BQ prunes the query before running, so having a view as an intermediate layer does not have any impact, unless there is heavy filtering happening within the view definition.

Comment 2

ID: 309281 User: BhupiSG Badges: Highly Voted Relative Date: 5 years ago Absolute Date: Sat 13 Mar 2021 00:37 Selected Answer: - Upvotes: 47

Correct: B
BigQuery has no quota on the DML statements. (Search Google - does bigquery have quota for update).
Why not C: This is a one time activity and SQL is the easiest way to program it. DataFlow is way overkill for this. You will need to find an engineer who can develop DataFlow pipelines. Whereas, SQL is so much more widely known and easier. One of the great features about BigQuery is its SQL interface. Even for BigQueryML services.

Comment 2.1

ID: 744487 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 13 Dec 2022 23:04 Selected Answer: - Upvotes: 2

But you need to maintain table means regularly you have to execute the update query whenever new data comes.

Comment 2.2

ID: 363754 User: lollo1234 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sat 22 May 2021 17:01 Selected Answer: - Upvotes: 8

I will also add that B would imply changing upstream workloads to write the new field every time a records gets added

Comment 2.3

ID: 363750 User: lollo1234 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sat 22 May 2021 16:56 Selected Answer: - Upvotes: 5

DML statements don't increase costs, but storing a new column does. I see A is correct (also see my comment above)

Comment 2.3.1

ID: 530207 User: exnaniantwort Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 03:15 Selected Answer: - Upvotes: 3

Exactly. Cost is the reason to reject B.
How come so many people vote for this wrong option?

Comment 2.3.1.1

ID: 765324 User: ler_mp Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 07:50 Selected Answer: - Upvotes: 8

Storage is cheap compared to computation

Comment 3

ID: 1400620 User: willyunger Badges: Most Recent Relative Date: 11 months, 4 weeks ago Absolute Date: Wed 19 Mar 2025 18:53 Selected Answer: A Upvotes: 1

Minimal cost: no extra space, no cost to set up, no need to write code, rest of applications see no change, no need to offload/reprocess/reload (although batch load is free).

Comment 4

ID: 1346652 User: LP_PDE Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 25 Jan 2025 21:52 Selected Answer: B Upvotes: 1

I would say A but since it specifically says "modify" then the answer is B.

Comment 5

ID: 1259069 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 31 Jul 2024 21:31 Selected Answer: - Upvotes: 2

E. Tell the IT specialist to take care of it on the app side...
B would work for historical data, but only if an underlying change were also made to automate the concatenation for new records. That is not clear, so I would say A is the quick solution.

Comment 6

ID: 1204882 User: Ramanaiah Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Wed 01 May 2024 08:16 Selected Answer: B Upvotes: 1

The requirement is to be able to filter on full name, so you would be querying all the data unless you have a materialized full-name column.

Comment 7

ID: 1134098 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 28 Jan 2024 14:25 Selected Answer: - Upvotes: 1

Definitely A

Comment 8

ID: 1076522 User: axantroff Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 18:54 Selected Answer: A Upvotes: 2

The question might be outdated, but I would like to offer my perspective:

1. Ideally, I would opt for a materialized view to avoid updating pipelines
2. In 2023, I see no concerns regarding the costs involved in storing denormalized data for analytical needs
3. Regarding this question I would choose option A, although the concern about extra costs due to recalculations is valid for me

Comment 8.1

ID: 1100468 User: LaxmanTiwari Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 19 Dec 2023 10:26 Selected Answer: - Upvotes: 2

Did u pass the exam ?

Comment 9

ID: 1066235 User: steghe Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 09:11 Selected Answer: - Upvotes: 1

Answer should be A 'cos the First request is: make that data available.

Comment 10

ID: 973158 User: alihabib Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 18:18 Selected Answer: - Upvotes: 1

It's A. "Asked to change the schema" is a trick to test your skills. Better to make use of MVs if the application is going to run the query repeatedly; an MV rebuilds itself when a query invalidates the cached results.

Comment 11

ID: 967605 User: nescafe7 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 31 Jul 2023 02:05 Selected Answer: A Upvotes: 2

In the case of B, the data pipeline that adds new employee information must also be modified, which is not the correct answer in terms of cost minimization.

Comment 12

ID: 961381 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 11:19 Selected Answer: A Upvotes: 1

It's A. If you add a column to the table, you will be billed every time you query that new column. The same way you would be billed with the view created by A.

B,C and D create a new column. A does not create a new column. It just provides the interface for the application to access the data. B,C and D will have to be rerun to compute the column value of new customers.

A is done only once, costs 0 for storage, and is charged about the same as all the others when it comes to compute because even if you choose B C and D you would have to query the data in the end anyway.
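The one-time view the commenters favor can be sketched as a single DDL statement. The project, dataset, table, and column names below are assumptions for illustration; a small helper that builds the statement:

```python
def build_full_name_view_ddl(project: str, dataset: str, table: str) -> str:
    """Build a BigQuery DDL statement that exposes a concatenated
    full_name column without storing any extra data.

    All identifiers here are illustrative assumptions, not names
    from the question.
    """
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}.{table}_v` AS\n"
        f"SELECT *, CONCAT(first_name, ' ', last_name) AS full_name\n"
        f"FROM `{project}.{dataset}.{table}`"
    )

# The statement is run once; new employees appear in the view automatically.
print(build_full_name_view_ddl("my-project", "hr", "employees"))
```

Because the view is computed at query time, no pipeline that loads new employee rows has to change, which is the cost argument made above.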

Comment 13

ID: 959404 User: autumn2005 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 12:02 Selected Answer: C Upvotes: 1

modify the schema

Comment 14

ID: 954502 User: theseawillclaim Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 20:12 Selected Answer: - Upvotes: 1

Can you code a script for a BigQuery column? I don't think it's "B"; it is pretty tricky.

Comment 15

ID: 935272 User: KC_go_reply Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 27 Jun 2023 13:07 Selected Answer: A Upvotes: 1

Everything but A) new view is wrong.

B) sounds okay, but introduces a new column which means more storage, thus increasing cost.
C) Dataflow is obvious overkill for a simple task such as concatenating two strings.
D) Starting up a Dataproc cluster just for string concatenation is super overkill.

Comment 16

ID: 903287 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 21 May 2023 15:42 Selected Answer: A Upvotes: 1

if a new field is only necessary for one project, and it is only the concatenation of two existing fields, it is ok to create a view that gets used for a specific task.

Comment 17

ID: 880757 User: Jarek7 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 20:39 Selected Answer: A Upvotes: 2

I'd go for A.
The main issue with answers B, C, and D is that they are just a temporary solution. Whenever a new employee comes in (there are 400,000 of them at the moment, so we can expect a few new hires every day) we need to update the full-name table/field again. Additionally, each of these answers needs twice as much capacity (BigQuery stores data in a columnar format, so optimizing this away is not possible), although the price for the needed capacity will be far below $0.01/month.
The main argument against A is that compute power costs more than storage. Please look at how BQ is priced: https://cloud.google.com/bigquery/pricing#query_pricing
Under the default on-demand compute pricing you are charged for "the number of bytes processed by each query", so there will be no difference in compute costs between any of the options.
Yeah, there is also the argument about modifying the schema in the requirements. Let's be professional: it is not a requirement for OUR schema. If you can resolve the issue with zero change to YOUR schema, then you are more than OK. And anyway, from the requestor's point of view, the schema HE uses in his app will be modified as he needed.
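The on-demand pricing point above can be checked with rough arithmetic. The rate and row sizes below are assumptions (on-demand pricing was roughly $5-6.25 per TiB depending on date and region; check current pricing):

```python
# Rough on-demand cost comparison: view vs. stored full_name column.
# The $/TiB rate and per-row byte estimate are illustrative assumptions.
RATE_PER_TIB = 6.25
TIB = 1024 ** 4

def scan_cost(bytes_scanned: int) -> float:
    """On-demand query cost for a given number of bytes scanned."""
    return bytes_scanned / TIB * RATE_PER_TIB

employees = 400_000
avg_name_bytes = 20  # assumed combined size of first_name + last_name per row

# The view scans the two source columns; a stored full_name column scans
# roughly the same number of bytes -- so compute cost is a wash.
view_bytes = employees * avg_name_bytes
stored_bytes = employees * avg_name_bytes

print(scan_cost(view_bytes), scan_cost(stored_bytes))
```

Either way the per-query scan cost on names alone is a tiny fraction of a cent, which supports the argument that the pricing difference between the options is negligible.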

35. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 120

Sequence
117
Discussion ID
17245
Source URL
https://www.examtopics.com/discussions/google/view/17245-exam-professional-data-engineer-topic-1-question-120/
Posted By
-
Posted At
March 22, 2020, 12:46 p.m.

Question

You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor an alert on behavior of the pipeline with Cloud
Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?

  • A. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
  • B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
  • C. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
  • D. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 27 comments Click to expand

Comment 1

ID: 120548 User: dambilwa Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Sun 26 Dec 2021 15:58 Selected Answer: - Upvotes: 28

You would want to get alerted only if the pipeline fails, not if it is running fine. I think option [B] is correct, because in the event of a pipeline failure:
1) subscription/num_undelivered_messages would pile up at a constant rate, as the source has consistent throughput
2) the growth of instance/storage/used_bytes would get closer to zero, hence the need to monitor its rate of change

Comment 1.1

ID: 124771 User: Barniyah Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 02 Jan 2022 09:46 Selected Answer: - Upvotes: 5

Yes, you are right, it should be B:
Thank you

Comment 1.2

ID: 500464 User: marioferrulli Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 13 Jun 2023 08:35 Selected Answer: - Upvotes: 1

Why would the instance/storage/used_bytes get closer to zero? If there's an error at a certain point, wouldn't we just see that the used_bytes remain constant while the num_undelivered_messages increases? I don't get why the destination's used bytes should decrease.

Comment 1.2.1

ID: 504975 User: baubaumiaomiao Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 15:58 Selected Answer: - Upvotes: 2

"If there's an error at a certain point, wouldn't we just see that the used_bytes remain constant while the num_undelivered_messages increases?"
It's the rate of change, not the absolute value

Comment 1.2.2

ID: 504187 User: szefco Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 12:06 Selected Answer: - Upvotes: 6

"rate of change decrease of instance/storage/ used_bytes" - if rate of instance/storage/ used_bytes decreases that means less data is written - so something is wrong with the pipeline.
It's not used bytes that decreases - it's rate of change decreases.
Example: if everything works fine your pipeline writes 5MB/s to the sink. If it decreases to 0.1MB/s it means something is wrong

Comment 2

ID: 837541 User: midgoo Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Fri 13 Sep 2024 02:04 Selected Answer: B Upvotes: 16

For those who may get confused at the start by the term 'subscription/num_undelivered_messages': it is not a division. It is the full path of the metric, so we should just read it as 'num_undelivered_messages'. The same goes for 'used_bytes'.

So if we see the source build up more backlog (more num_undelivered_messages), or the destination utilization going down, that is the indicator of something going wrong.

Comment 2.1

ID: 1399491 User: desertlotus1211 Badges: - Relative Date: 12 months ago Absolute Date: Mon 17 Mar 2025 02:18 Selected Answer: - Upvotes: 1

the Answer is A. You want to see it working.

Comment 2.1.1

ID: 1399492 User: desertlotus1211 Badges: - Relative Date: 12 months ago Absolute Date: Mon 17 Mar 2025 02:19 Selected Answer: - Upvotes: 1

you're looking for evidence that it's working:
' that it is processing data....'

Comment 2.2

ID: 919342 User: kryzo Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Mon 09 Dec 2024 16:14 Selected Answer: - Upvotes: 2

great explanation thanks !

Comment 3

ID: 1399490 User: desertlotus1211 Badges: Most Recent Relative Date: 12 months ago Absolute Date: Mon 17 Mar 2025 02:17 Selected Answer: A Upvotes: 1

It should be answer A. You want to see it being processed, versus looking for a bottleneck.

Comment 4

ID: 820782 User: musumusu Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 17:33 Selected Answer: - Upvotes: 3

Answer B:
Trick: in Stackdriver, always pair a subscriber alert with a resource alert:
Subscriber: num_undelivered_messages INCREASE alert.
Instance/storage: used_bytes rate-of-change DECREASE alert.
Makes sense, right?

Comment 5

ID: 770242 User: atlan Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 10:28 Selected Answer: - Upvotes: 1

Nobody seems to pay attention to instance/storage/used_bytes. I only find this metric for Spanner.
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-spanner

While Dataflow processes and stores everything in Cloud Storage, Spanner could only be the source.
https://cloud.google.com/spanner/docs/change-streams

Also, if it is either A or B, the instance/storage/used_bytes metric does not make sense for the destination, which is Cloud Storage.

Can anyone help me understand?

Comment 5.1

ID: 781796 User: desertlotus1211 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 20 Jul 2024 02:23 Selected Answer: - Upvotes: 2

look here: https://cloud.google.com/monitoring/api/metrics_gcp

instance/storage/used_bytes GA
Storage used.

Comment 6

ID: 764009 User: AzureDP900 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 02 Jul 2024 21:24 Selected Answer: - Upvotes: 1

B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/ used_bytes for the destination

Comment 7

ID: 762445 User: AzureDP900 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 02:50 Selected Answer: - Upvotes: 1

B is right

Comment 8

ID: 760799 User: Catweazle1983 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 29 Jun 2024 09:00 Selected Answer: A Upvotes: 1

An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/ used_bytes for the destination

10 subscriptions / 1 undelivered messages = 10
10 subscriptions / 5 undelivered messages = 2
You clearly want to be alerted when the number of undelivered messages increases. The ratio then decreases. In my example from 10 to 2.

Comment 8.1

ID: 844237 User: squishy_fishy Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 19 Sep 2024 20:47 Selected Answer: - Upvotes: 1

subscription/num_undelivered_messages is a path, not a division.

Comment 9

ID: 649065 User: A1000 Badges: - Relative Date: 2 years ago Absolute Date: Mon 19 Feb 2024 20:45 Selected Answer: B Upvotes: 1

Increase in subscription/num_undelivered_messages
Decrease in (the rate of change of) instance/storage/used_bytes

Comment 10

ID: 487104 User: JG123 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 26 May 2023 06:01 Selected Answer: - Upvotes: 2

Correct: B

Comment 11

ID: 476198 User: Abhi16820 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 11 May 2023 12:28 Selected Answer: - Upvotes: 2

Aren't B and C the same?

Comment 12

ID: 475154 User: JayZeeLee Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 10 May 2023 01:27 Selected Answer: - Upvotes: 3

B.
It's useful to monitor the source that keeps sending data while the destination that doesn't take anything in.

Comment 13

ID: 458737 User: squishy_fishy Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 07 Apr 2023 14:29 Selected Answer: - Upvotes: 2

The answer is B.
subscription/num_undelivered_messages: the number of messages that subscribers haven't processed https://cloud.google.com/pubsub/docs/monitoring#monitoring_forwarded_undeliverable_messages

Comment 14

ID: 443512 User: squishy_fishy Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Mar 2023 17:07 Selected Answer: - Upvotes: 2

Silly question: what is subscription/num_undelivered_messages? Is something being divided by something, or is it per subscription, per num_undelivered_messages?

Comment 14.1

ID: 579934 User: 910 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 02 Oct 2023 18:17 Selected Answer: - Upvotes: 3

Yes, it is misleading:
the metric "subscription/num_undelivered_messages" is just the path of the API URL

actions.googleapis.com/...subscription/num_undelivered_messages

ref: https://cloud.google.com/monitoring/api/metrics_gcp#pubsub/subscription/num_undelivered_messages

Comment 15

ID: 397161 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 23:38 Selected Answer: - Upvotes: 3

Looks B

Comment 16

ID: 301558 User: gcper Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 01 Sep 2022 19:08 Selected Answer: - Upvotes: 5

A

"An alert based on a decrease of subscription/num_undelivered_messages for the source"
The more we have undelivered messages, the worse. Thus we want to be alerted when the ratio goes down as the denominator goes up.

"A rate of change increase of instance/storage/used_bytes for the destination"
An increase in the rate of change of how much we are storing per instance storage.

Comment 17

ID: 216556 User: Alasmindas Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 11:32 Selected Answer: - Upvotes: 5

The correct answer is Option B: an increase of subscription/num_undelivered_messages and a decrease in the rate of change of instance/storage/used_bytes. Reasoning as follows:
- The first question we should ask is why we want to monitor things at all. This is very subjective; one can say we monitor to check that everything is running "OK", or we monitor to check whether something is "NOT OK".

Generally, we would go with the second point: we monitor to catch what is NOT OK.

Going with that logic, Option B stands out: more undelivered messages at the subscriber and less data arriving in the sink (Cloud Storage) means things are not OK, and that is why we want to monitor it.

As mentioned, this approach is subjective, and different people may take a different approach in deciding why we monitor.
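The alert logic the B voters describe (backlog growing at the source while the sink's write rate collapses, as in szefco's 5 MB/s vs 0.1 MB/s example) can be sketched in plain Python. The thresholds are illustrative assumptions, not Stackdriver defaults:

```python
def rate_of_change(samples):
    """Per-interval deltas for a metric time series (uniform sampling assumed)."""
    return [b - a for a, b in zip(samples, samples[1:])]

def pipeline_stalled(undelivered, used_bytes,
                     backlog_growth_threshold=0, write_rate_threshold=1):
    """Option B's condition as a sketch: alert when the Pub/Sub backlog
    keeps growing while the rate of change of the sink's used_bytes drops.
    Both inputs are lists of metric samples over consecutive intervals."""
    backlog_growing = all(d > backlog_growth_threshold
                          for d in rate_of_change(undelivered))
    writes_stalled = rate_of_change(used_bytes)[-1] < write_rate_threshold
    return backlog_growing and writes_stalled

# Healthy: backlog flat, sink grows ~5 MB per interval.
print(pipeline_stalled([10, 10, 10], [0, 5_000_000, 10_000_000]))
# Stalled: backlog piles up at a constant rate, sink stops growing.
print(pipeline_stalled([10, 50, 90], [10_000_000, 10_000_000, 10_000_000]))
```

Note the check is on the *rate of change* of used_bytes, not its absolute value; the bucket's size never decreases, it just stops growing when the pipeline stalls.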

36. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 292

Sequence
123
Discussion ID
130298
Source URL
https://www.examtopics.com/discussions/google/view/130298-exam-professional-data-engineer-topic-1-question-292/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 11:33 a.m.

Question

You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers’ information, such as their preferences, is hosted on a Cloud SQL for MySQL database. Your CRM database is hosted on a Cloud SQL for PostgreSQL instance. The marketing team wants to use your customers’ information from the two databases and the customer behavioral data to create marketing campaigns for yearly active customers. You need to ensure that the marketing team can run the campaigns over 100 times a day on typical days and up to 300 during sales. At the same time, you want to keep the load on the Cloud SQL databases to a minimum. What should you do?

  • A. Create BigQuery connections to both Cloud SQL databases. Use BigQuery federated queries on the two databases and the Google Analytics data on BigQuery to run these queries.
  • B. Create a job on Apache Spark with Dataproc Serverless to query both Cloud SQL databases and the Google Analytics data on BigQuery for these queries.
  • C. Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.
  • D. Create a Dataproc cluster with Trino to establish connections to both Cloud SQL databases and BigQuery, to execute the queries.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 10 comments Click to expand

Comment 1

ID: 1119679 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 13:02 Selected Answer: C Upvotes: 11

- Datastream: It's a fully managed, serverless service for real-time data replication. It allows to stream data from various sources, including Cloud SQL, into BigQuery.
- Reduced Load on Cloud SQL: By replicating the required tables from both Cloud SQL databases into BigQuery, you minimize the load on the Cloud SQL instances. The marketing team's queries will be run against BigQuery, which is designed to handle high-volume analytics workloads.
- Frequency of Queries: BigQuery can easily handle the high frequency of queries (100 times daily, up to 300 during sales events) due to its powerful data processing capabilities.
- Combining Data Sources: Once the data is in BigQuery, you can efficiently combine it with the Google Analytics data for comprehensive analysis and campaign planning.

Comment 1.1

ID: 1178437 User: SanjeevRoy91 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 20 Mar 2024 16:21 Selected Answer: - Upvotes: 1

Why not A? Would federated queries degrade Cloud SQL performance?

Comment 2

ID: 1382931 User: Blackstile Badges: Most Recent Relative Date: 1 year ago Absolute Date: Mon 10 Mar 2025 16:14 Selected Answer: C Upvotes: 1

To replicate data, use Datastream.

Comment 3

ID: 1252616 User: 987af6b Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 21 Jul 2024 19:11 Selected Answer: C Upvotes: 3

Initially I said A, but this question was how I learned about Datastream, which I think would be the better solution in this scenario. So my answer is C

Comment 4

ID: 1224774 User: AlizCert Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 05 Jun 2024 16:33 Selected Answer: C Upvotes: 1

C, noting that federated queries on read replicas would be the ideal solution

Comment 5

ID: 1193510 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 09:03 Selected Answer: - Upvotes: 1

Its option C.

"Performance. A federated query is likely to not be as fast as querying only BigQuery storage. BigQuery needs to wait for the source database to execute the external query and temporarily move data from the external data source to BigQuery. Also, the source database might not be optimized for complex analytical queries."

So, it will load the Cloud SQL external sources with the queries, impacting performance on those.

Link: https://cloud.google.com/bigquery/docs/federated-queries-intro

Comment 6

ID: 1170867 User: datasmg Badges: - Relative Date: 2 years ago Absolute Date: Mon 11 Mar 2024 08:38 Selected Answer: C Upvotes: 1

C makes sense.

Comment 7

ID: 1155713 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 17:51 Selected Answer: C Upvotes: 1

Option C

Comment 8

ID: 1113544 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 11:33 Selected Answer: C Upvotes: 3

C. Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.

Comment 8.1

ID: 1116026 User: Smakyel79 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 17:57 Selected Answer: - Upvotes: 3

Datastream is a serverless, easy-to-use change data capture (CDC) and replication service. By replicating the necessary tables from the Cloud SQL databases to BigQuery, you can offload the query load from the Cloud SQL databases. The marketing team can then run their queries directly on BigQuery, which is designed for large-scale data analytics. This approach seems to balance both efficiency and performance, minimizing load on the Cloud SQL instances.
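The load difference the commenters describe is visible in the query shape itself. Below, option A's federated form pushes a query to Cloud SQL via EXTERNAL_QUERY on every run, while option C's form joins against a table Datastream has already replicated into BigQuery. Connection, dataset, and table names are assumptions:

```python
# Option A: every execution forwards work to Cloud SQL.
FEDERATED = """
SELECT b.customer_id, p.preferences
FROM `analytics.behavior` AS b
JOIN EXTERNAL_QUERY(
  'my-project.us.mysql-conn',
  'SELECT customer_id, preferences FROM customers') AS p
USING (customer_id)
"""

# Option C: the join target is a BigQuery table kept fresh by Datastream CDC.
REPLICATED = """
SELECT b.customer_id, p.preferences
FROM `analytics.behavior` AS b
JOIN `analytics.customers_replica` AS p
USING (customer_id)
"""

# Run 300 times a day, the federated form hits Cloud SQL 300 times;
# the replicated form never touches it.
print("EXTERNAL_QUERY" in FEDERATED, "EXTERNAL_QUERY" in REPLICATED)
```

That per-execution hit on the operational databases is exactly what the question asks to avoid, which is why C wins over A here.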

37. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 263

Sequence
125
Discussion ID
130215
Source URL
https://www.examtopics.com/discussions/google/view/130215-exam-professional-data-engineer-topic-1-question-263/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 6:09 p.m.

Question

You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

  • A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
  • B. Insert output sinks after each key processing step, and observe the writing throughput of each block.
  • C. Log debug information in each ParDo function, and analyze the logs at execution time.
  • D. Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 10 comments Click to expand

Comment 1

ID: 1114596 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 16:21 Selected Answer: A Upvotes: 9

- The Reshuffle operation is used in Dataflow pipelines to break fusion and redistribute elements, which can sometimes help improve parallelization and identify bottlenecks.
- By inserting Reshuffle after each processing step and observing the pipeline's performance in the Dataflow console, you can potentially identify stages that are disproportionately slow or stalled.
- This can help in pinpointing the step where the bottleneck might be occurring.

Comment 1.1

ID: 1147095 User: srivastavas08 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 11 Feb 2024 10:13 Selected Answer: - Upvotes: 1

Since we don't know for sure if fusion is the culprit, detailed debug logging is still the top choice to find the precise slow operation(s).

Comment 2

ID: 1381950 User: Blackstile Badges: Most Recent Relative Date: 1 year ago Absolute Date: Mon 10 Mar 2025 14:49 Selected Answer: A Upvotes: 1

Reshuffle is the key.

Comment 3

ID: 1325893 User: m_a_p_s Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 23:33 Selected Answer: A Upvotes: 1

Looks like A. However, this option does not provide any option of identifying the underlying cause. https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion

Comment 4

ID: 1293757 User: f6bc4a0 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sun 06 Oct 2024 08:14 Selected Answer: B Upvotes: 1

B identifies where the problem lies.

Comment 5

ID: 1154760 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 14:53 Selected Answer: A Upvotes: 1

Option A

Comment 6

ID: 1146450 User: srivastavas08 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 10 Feb 2024 18:23 Selected Answer: - Upvotes: 2

It should be C

Comment 7

ID: 1135906 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 16:06 Selected Answer: B Upvotes: 1

The best option is B.
Creating an additional output after each key processing step lets you observe the writing throughput of each block, which can help identify the specific processing steps causing bottlenecks.

Option A is also valid, but it cannot directly pinpoint all bottlenecks, especially since the graph was merged.

Comment 8

ID: 1117475 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 13:37 Selected Answer: A Upvotes: 4

From the Dataflow documentation: "There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. These are cases in which the Dataflow service might incorrectly guess the optimal way to fuse operations in the pipeline, which could limit the Dataflow service's ability to make use of all available workers.
You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation."

Comment 9

ID: 1112953 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 18:09 Selected Answer: A Upvotes: 2

A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
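The diagnostic idea behind breaking fusion can be illustrated without Beam at all: when steps are fused, only their combined latency is visible; measuring each step separately (which is what inserting Reshuffle and reading the per-stage execution details in the console achieves) exposes the slow one. A toy sketch with made-up step names and a simulated slow stage:

```python
import time

def parse(x):
    return x

def enrich(x):
    time.sleep(0.002)  # simulated slow step: the hidden bottleneck
    return x

def fmt(x):
    return x

def find_bottleneck(steps, records):
    """Time each step in isolation and name the slowest one.
    With fused stages only the total would be observable."""
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        for r in records:
            fn(r)
        timings[name] = time.perf_counter() - start
    return max(timings, key=timings.get)

steps = [("parse", parse), ("enrich", enrich), ("format", fmt)]
print(find_bottleneck(steps, range(50)))  # enrich
```

In an actual Beam pipeline the per-step measurement comes for free once fusion is broken, since each stage then reports its own wall time and throughput in the Dataflow console.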

38. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 23

Sequence
128
Discussion ID
16931
Source URL
https://www.examtopics.com/discussions/google/view/16931-exam-professional-data-engineer-topic-1-question-23/
Posted By
-
Posted At
March 18, 2020, 4:38 p.m.

Question

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time. What should you do?

  • A. Send the data to Google Cloud Datastore and then export to BigQuery.
  • B. Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
  • C. Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
  • D. Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 249268 User: DataExpert Badges: Highly Voted Relative Date: 4 years, 8 months ago Absolute Date: Mon 21 Jun 2021 11:51 Selected Answer: - Upvotes: 9

B is more correct than the other options, so B is the answer. But if this were an actual use case you had to deal with, you would use Cloud Bigtable instead of BigQuery. The pipeline would then look like this: IoT devices -> Cloud Pub/Sub -> Cloud Bigtable -> Cloud Data Studio (for real-time analytics).

Comment 2

ID: 1364871 User: Abizi Badges: Most Recent Relative Date: 1 year ago Absolute Date: Tue 04 Mar 2025 12:15 Selected Answer: B Upvotes: 1

B is the correct answer

Comment 3

ID: 1195530 User: regal_2010 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 14 Oct 2024 15:20 Selected Answer: B Upvotes: 1

Answer is B

Comment 4

ID: 1076336 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 13:57 Selected Answer: B Upvotes: 1

In short, B is less complex and more in line with recommendations than D.

Comment 5

ID: 1050527 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 14:00 Selected Answer: B Upvotes: 1

B. Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

Here's why this approach is preferred:

Google Cloud Pub/Sub allows for efficient ingestion and real-time data streaming.
Google Cloud Dataflow can process and transform the streaming data in real-time.
Google BigQuery is a fully managed, highly scalable data warehouse that is well-suited for real-time analysis and querying of large datasets.

Comment 6

ID: 975408 User: GCP_PDE_AG Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 12:47 Selected Answer: - Upvotes: 1

Obviously B.

Comment 7

ID: 909752 User: Maurilio_Cardoso Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 30 Nov 2023 01:05 Selected Answer: B Upvotes: 2

PubSub for queue in real time, Dataflow for processing (pipeline) and Bigquery for analyses.

Comment 8

ID: 835672 User: bha11111 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 11 Sep 2023 05:04 Selected Answer: B Upvotes: 1

B is correct

Comment 9

ID: 743416 User: DGames Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 12 Jun 2023 23:30 Selected Answer: B Upvotes: 2

GCP recommend best practice for streaming data pipeline as option B - pub/sub, dataflow & Bigquery

Comment 10

ID: 741909 User: Nirca Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 11 Jun 2023 16:54 Selected Answer: B Upvotes: 1

B. Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

Comment 11

ID: 731543 User: gitaexams Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 13:48 Selected Answer: - Upvotes: 1

B is the answer.

Comment 12

ID: 688845 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 07 Apr 2023 19:35 Selected Answer: B Upvotes: 1

B of course

Comment 13

ID: 641759 User: Dip1994 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 03 Feb 2023 13:15 Selected Answer: - Upvotes: 1

B is the correct answer

Comment 14

ID: 616937 User: noob_master Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 22:58 Selected Answer: B Upvotes: 1

Answer: B

Default ETL streaming process: Pub/Sub + Dataflow + BigQuery.

Comment 15

ID: 615438 User: nexus1_ Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 12 Dec 2022 20:09 Selected Answer: - Upvotes: 1

Definitely B

Comment 16

ID: 586419 User: vw13 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 15 Oct 2022 18:55 Selected Answer: B Upvotes: 1

B is the only option for real time process & analysis

Comment 17

ID: 580912 User: devric Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 22:37 Selected Answer: - Upvotes: 1

The most appropriate is B, though BigQuery on its own doesn't solve analyzing data in real time.
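The Pub/Sub -> Dataflow -> BigQuery pattern the commenters recommend boils down to a per-message transform between ingestion and storage. A sketch of what that transform might look like for the temperature messages in this question; the field names are assumptions:

```python
import json
from datetime import datetime, timezone

def to_bigquery_row(message: bytes) -> dict:
    """Sketch of the per-message transform a Dataflow job would apply
    between Pub/Sub and BigQuery. The message schema is assumed."""
    event = json.loads(message)
    return {
        "device_id": event["device_id"],
        "temperature_c": float(event["temperature"]),
        "event_time": event["timestamp"],
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

msg = b'{"device_id": "wh-042", "temperature": "21.5", "timestamp": "2020-03-18T16:38:00Z"}'
row = to_bigquery_row(msg)
print(row["device_id"], row["temperature_c"])  # wh-042 21.5
```

Pub/Sub absorbs the burst traffic from 10,000 devices, Dataflow applies this transform at scale, and BigQuery's streaming inserts make the rows queryable within seconds.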

39. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 68

Sequence
129
Discussion ID
16478
Source URL
https://www.examtopics.com/discussions/google/view/16478-exam-professional-data-engineer-topic-1-question-68/
Posted By
madhu1171
Posted At
March 13, 2020, 2:01 p.m.

Question

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

  • A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
  • B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
  • C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
  • D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 63491 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 6 months ago Absolute Date: Sun 13 Sep 2020 13:01 Selected Answer: - Upvotes: 36

Answer should be C

Comment 2

ID: 139531 User: VishalB Badges: Highly Voted Relative Date: 5 years, 1 month ago Absolute Date: Wed 20 Jan 2021 16:09 Selected Answer: - Upvotes: 11

Correct Answer: C
Explanation: This option is correct because Dataflow provides a cost-effective solution for performing transformations on streaming data, and autoscaling provides scaling without any intervention. Monitoring system lag with Stackdriver provides monitoring for the streaming data. With autoscaling enabled, the Cloud Dataflow service automatically chooses the appropriate number of worker instances required to run your job.

Comment 3

ID: 1365392 User: Abizi Badges: Most Recent Relative Date: 1 year ago Absolute Date: Wed 05 Mar 2025 13:46 Selected Answer: C Upvotes: 1

C for me

Comment 4

ID: 1208786 User: yassoraa88 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Sat 09 Nov 2024 12:51 Selected Answer: C Upvotes: 2

Using Cloud Dataflow for transformations, with monitoring via Stackdriver and its default autoscaling settings, is the best choice. Cloud Dataflow is purpose-built for this type of workload, providing seamless scalability and efficient processing capabilities for streaming data. Its autoscaling feature minimizes manual intervention and helps manage costs by dynamically adjusting resources based on actual processing needs, which is crucial for handling fluctuating data volumes efficiently and cost-effectively.

Comment 5

ID: 1021628 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 30 Mar 2024 18:39 Selected Answer: D Upvotes: 1

Option C suggests using Cloud Dataflow to run the transformations and monitoring the job system lag with Stackdriver while using the default autoscaling setting for worker instances.

While using Cloud Dataflow is a suitable choice for processing data from Cloud Pub/Sub to BigQuery, and monitoring with Stackdriver provides valuable insights, the specific emphasis on configuring non-default Compute Engine machine types (as mentioned in option D) gives you more control over cost optimization and performance tuning.

By configuring non-default machine types, you can precisely tailor the computational resources to match the specific requirements of your workload. This fine-grained control allows you to optimize costs further by avoiding over-provisioning of resources, especially if your workload is memory-intensive, CPU-bound, or requires specific configurations that are not met by the default settings.

Comment 5.1

ID: 1021629 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 30 Mar 2024 18:39 Selected Answer: - Upvotes: 2

Additionally, having the flexibility to adjust machine types based on workload characteristics ensures that you can achieve the desired performance levels without overspending on unnecessary resources. This level of customization is not provided by simply relying on the default autoscaling settings, making option D a more comprehensive and cost-effective solution for managing varying data volumes.

Comment 6

ID: 959641 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 22 Jan 2024 18:01 Selected Answer: B Upvotes: 1

At first I answered C. However, Dataproc is indeed cheaper than Dataflow, and both of them can scale automatically horizontally.

Dataproc horizontal scaling applies to both primary and secondary workers. Scaling secondary workers scales up CPU/compute, and scaling primary workers scales up both memory and CPU/compute.

I don't quite understand the second part of answer B where it says I should allocate resources accordingly. I guess I could do that, but auto-scaling should be enough.

Comment 7

ID: 820124 User: AbdullahAnwar Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 05:07 Selected Answer: - Upvotes: 2

Answer should be C

Comment 8

ID: 785743 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 19:10 Selected Answer: - Upvotes: 3

C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.

Cloud Dataflow is a managed service that allows you to write and execute data transformations in a highly scalable and fault-tolerant way. By default, it will automatically scale the number of worker instances based on the input data volume and job performance, which can help minimize costs. Monitoring the job system lag with Stackdriver can help you identify any issues that may be impacting performance and take action as needed. Additionally, using the default autoscaling setting for worker instances can help you minimize manual intervention and ensure that resources are used efficiently.
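The autoscaling behavior the comments above describe can be sketched in plain Python. This is a hypothetical mental model, not the Dataflow API; the function and parameter names (`desired_workers`, `backlog_units`, `max_num_workers`) are invented for illustration.

```python
import math

# Hypothetical model of throughput-based autoscaling: pick a worker count
# large enough to drain the current backlog, clamped between a minimum and
# the configured maximum (the role Dataflow's max_num_workers plays).
def desired_workers(backlog_units: int, units_per_worker: int,
                    max_num_workers: int, min_workers: int = 1) -> int:
    needed = math.ceil(backlog_units / units_per_worker)
    return max(min_workers, min(needed, max_num_workers))

# Light load uses few workers; a load spike is capped at the maximum.
print(desired_workers(30, 10, max_num_workers=20))   # -> 3
print(desired_workers(900, 10, max_num_workers=20))  # -> 20
```

The real service also weighs CPU utilization and throughput when making this decision, as the comment notes; the sketch only shows why the cap matters for cost control.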

Comment 9

ID: 737728 User: odacir Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 10:21 Selected Answer: C Upvotes: 11

@admin, why are all the answers wrong? I paid 30 euros for this site and it's garbage.
Dataproc makes no sense in this scenario, because you want minimal intervention/operations. D is not a good practice; the answer is C.

Comment 10

ID: 517606 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 05 Jul 2022 15:13 Selected Answer: C Upvotes: 4

C only as referred by MaxNRG

Comment 11

ID: 506284 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 21 Jun 2022 16:52 Selected Answer: C Upvotes: 9

C.
Dataproc does not seem to be a good solution in this case, as it always requires manual intervention to adjust resources.
Autoscaling with Dataflow will automatically handle changing data volumes with no manual intervention, and monitoring through Stackdriver can be used to spot bottlenecks. Total execution time is not useful here, as it does not provide a precise view of potential bottlenecks.

Comment 12

ID: 489813 User: StefanoG Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Sun 29 May 2022 10:59 Selected Answer: C Upvotes: 3

Dataflow, Stackdriver and autoscaling

Comment 13

ID: 476770 User: victorlie Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Thu 12 May 2022 07:53 Selected Answer: - Upvotes: 4

Admin, please take a look at the comments. Almost all the answers are wrong.

Comment 14

ID: 445007 User: nguyenmoon Badges: - Relative Date: 3 years, 12 months ago Absolute Date: Tue 15 Mar 2022 08:30 Selected Answer: - Upvotes: 4

Answer should be C, as Dataflow handles input of unpredictable size (input that will vary in size), while Dataproc suits workloads of known size.

Comment 14.1

ID: 548812 User: Tanzu Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 16 Aug 2022 17:58 Selected Answer: - Upvotes: 1

Dataflow over Dataproc is always the preferred way in GCP. Use Dataproc only when there are specific client requirements, such as existing Hadoop workloads, etc.

Comment 15

ID: 421715 User: sandipk91 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 08 Feb 2022 18:53 Selected Answer: - Upvotes: 3

Option C is the answer

Comment 16

ID: 393302 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 29 Dec 2021 01:56 Selected Answer: - Upvotes: 1

Vote for C

Comment 17

ID: 255474 User: apnu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Wed 30 Jun 2021 09:18 Selected Answer: - Upvotes: 2

B is correct, as the question asks for minimum service cost and Dataflow is more expensive than Dataproc.

Comment 17.1

ID: 307817 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 11 Sep 2021 11:40 Selected Answer: - Upvotes: 1

But it said "with minimal manual intervention", and for Dataproc you need to manage the cluster manually, so C is the best option.

Comment 17.2

ID: 332563 User: Believerath Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sun 10 Oct 2021 15:19 Selected Answer: - Upvotes: 1

You have to transform the JSON messages. Hence, you need to use Dataflow.

40. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 274

Sequence
132
Discussion ID
130515
Source URL
https://www.examtopics.com/discussions/google/view/130515-exam-professional-data-engineer-topic-1-question-274/
Posted By
Smakyel79
Posted At
Jan. 7, 2024, 5:14 p.m.

Question

You have a BigQuery table that ingests data directly from a Pub/Sub subscription. The ingested data is encrypted with a Google-managed encryption key. You need to meet a new organization policy that requires you to use keys from a centralized Cloud Key Management Service (Cloud KMS) project to encrypt data at rest. What should you do?

  • A. Use Cloud KMS encryption key with Dataflow to ingest the existing Pub/Sub subscription to the existing BigQuery table.
  • B. Create a new BigQuery table by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.
  • C. Create a new Pub/Sub topic with CMEK and use the existing BigQuery table by using Google-managed encryption key.
  • D. Create a new BigQuery table and Pub/Sub topic by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 26 comments Click to expand

Comment 1

ID: 1117775 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 13:30 Selected Answer: B Upvotes: 11

- New BigQuery Table with CMEK: This option involves creating a new BigQuery table configured to use a CMEK from Cloud KMS. It directly addresses the need to use a CMEK for data at rest in BigQuery.
- Migrate Data: Migrating data from the old table (encrypted with a Google-managed key) to the new table (encrypted with CMEK) ensures that all existing data complies with the new policy.

Comment 1.1

ID: 1121820 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:54 Selected Answer: - Upvotes: 6

But Pub/Sub also has some data at rest, e.g. messages within the retention period.
To comply with the organization policy, we need to adapt Pub/Sub as well.

Comment 1.1.1

ID: 1127639 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 21 Jan 2024 06:27 Selected Answer: - Upvotes: 2

No: the question says "The ingested data is encrypted with a Google-managed encryption key", so the target is the ingested data in BigQuery.

Comment 1.1.1.1

ID: 1153234 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 12:45 Selected Answer: - Upvotes: 3

Correct, but the question states 'use keys from a centralized Cloud KMS project', so only D is correct.

Comment 1.1.1.1.1

ID: 1319681 User: cloud_rider Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Fri 29 Nov 2024 12:15 Selected Answer: - Upvotes: 1

Pub/Sub is an application that holds data in flight; that does not count as data at rest. Only the ingested data (in BigQuery) counts as data at rest, so B is the right answer.

Comment 1.1.1.1.1.1

ID: 1361292 User: Blackstile Badges: - Relative Date: 1 year ago Absolute Date: Tue 25 Feb 2025 03:24 Selected Answer: - Upvotes: 1

The question did not say anything about the retention policy. Therefore, the correct answer is B.
A tip for the exam: never answer what was not asked.

Comment 2

ID: 1121819 User: Matt_108 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:54 Selected Answer: D Upvotes: 9

Option D. I get the discussion about B and D, but Pub/Sub also has some data at rest, e.g. messages within the retention period.
To comply with the organization policy, we need to adapt Pub/Sub as well.

Comment 3

ID: 1361291 User: Blackstile Badges: Most Recent Relative Date: 1 year ago Absolute Date: Tue 25 Feb 2025 03:23 Selected Answer: - Upvotes: 1

The question did not say anything about the retention policy. Therefore, the correct answer is B.
A tip for the exam: never answer what was not asked.

Comment 4

ID: 1348923 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 30 Jan 2025 07:54 Selected Answer: D Upvotes: 2

There is data at rest in Pub/Sub, which is stated here in the docs: https://cloud.google.com/pubsub/docs/encryption
At rest data -> Application layer -> CMEK encryption

Comment 5

ID: 1325850 User: m_a_p_s Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 21:42 Selected Answer: B Upvotes: 1

B. You don't need to create a new topic in order to use the new CMEK. Existing topic can be updated to use the new key: https://cloud.google.com/pubsub/docs/encryption#update_cmek_for_a_topic

Comment 6

ID: 1305202 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 22:52 Selected Answer: B Upvotes: 2

should be B. Pub/Sub is not designed for storing data at rest.

Comment 7

ID: 1302745 User: gr3yWind Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 25 Oct 2024 05:26 Selected Answer: B Upvotes: 2

Agree with raaad

Comment 8

ID: 1271755 User: shanks_t Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 18:54 Selected Answer: D Upvotes: 1

Requirement for Cloud KMS keys: The new organization policy requires using keys from a centralized Cloud KMS project for encrypting data at rest. This necessitates the use of customer-managed encryption keys (CMEK).
BigQuery table encryption: The existing BigQuery table is encrypted with a Google-managed key. To meet the new policy, a new table needs to be created with CMEK.
Pub/Sub topic encryption: Since the data is ingested directly from a Pub/Sub subscription, the Pub/Sub topic also needs to use CMEK to ensure end-to-end encryption with customer-managed keys.
Data migration: The existing data in the old BigQuery table needs to be migrated to the new CMEK-encrypted table to ensure all data complies with the new policy

Comment 9

ID: 1248043 User: carmltekai Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 15 Jul 2024 05:05 Selected Answer: B Upvotes: 2

"The best solution here is B. Create a new BigQuery table by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.

Here's why:

Customer-Managed Encryption Keys (CMEK): CMEKs allow you to have granular control over your encryption keys, complying with the organization's policy to use keys from a centralized Cloud KMS project.
Data Migration: Since the data in the existing table is already encrypted with a Google-managed key, you cannot retroactively change the encryption key for that table. Migrating the data to a new table with the correct encryption is the most efficient way to meet compliance.

Comment 9.1

ID: 1248044 User: carmltekai Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 15 Jul 2024 05:06 Selected Answer: - Upvotes: 1

Why other options aren't suitable:

A: Dataflow can't retroactively change the encryption of data that's already in BigQuery.
C: Creating a new Pub/Sub topic with CMEK wouldn't address the data that's already in BigQuery.
D: While creating a new Pub/Sub topic might be useful in the long run, it's not necessary for solving the immediate compliance issue with the existing data."

Comment 9.1.1

ID: 1260032 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Fri 02 Aug 2024 22:50 Selected Answer: - Upvotes: 1

You have some data in Pub/Sub at rest as well which is immediate compliance issue.

Comment 10

ID: 1230884 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 15 Jun 2024 12:06 Selected Answer: D Upvotes: 3

D. Create a new BigQuery table and Pub/Sub topic by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.

This approach comprehensively addresses the requirement to use CMEK from a centralized Cloud KMS project for encrypting data at rest:

Create a new Pub/Sub topic configured to use CMEK from the centralized Cloud KMS project.
Create a new BigQuery table with CMEK enabled, using the same centralized Cloud KMS project.
Update the ingestion process to use the new Pub/Sub topic to feed data into the new BigQuery table.
Migrate existing data from the old BigQuery table to the new BigQuery table to ensure all data complies with the new encryption policy.
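The steps above can be sketched with the CLI. This is a hedged fragment, not a tested runbook: all project, dataset, table, topic, and key names below are placeholders, and the flags (`bq cp --destination_kms_key`, `gcloud pubsub topics create --topic-encryption-key`) should be verified against current documentation.

```shell
# Placeholder key from the centralized Cloud KMS project.
KEY="projects/central-kms-proj/locations/us/keyRings/my-ring/cryptoKeys/my-key"

# New Pub/Sub topic encrypted with the centralized CMEK (option D's extra step).
gcloud pubsub topics create clicks-cmek --topic-encryption-key="$KEY"

# Copy the old table into a new CMEK-protected BigQuery table.
bq cp --destination_kms_key="$KEY" mydataset.clicks mydataset.clicks_cmek
```

After the copy, the ingestion subscription would be repointed at the new topic/table and the old resources retired.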

Comment 11

ID: 1224738 User: AlizCert Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 05 Jun 2024 15:23 Selected Answer: B Upvotes: 2

B, been there, done that...

Comment 11.1

ID: 1224739 User: AlizCert Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 05 Jun 2024 15:24 Selected Answer: - Upvotes: 2

sry, I mean D

Comment 12

ID: 1219040 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 26 May 2024 18:13 Selected Answer: D Upvotes: 3

BigQuery and Pub/Sub shall be encrypted using CMEK using new versions of each one.
https://cloud.google.com/pubsub/docs/encryption#using-cmek

Comment 13

ID: 1213240 User: chrissamharris Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 11:22 Selected Answer: B Upvotes: 2

Data at rest in requirement = Big Query ONLY.

Pub/Sub is data in movement - overkill for the solution

Comment 14

ID: 1213151 User: f74ca0c Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 07:59 Selected Answer: D Upvotes: 1

D. BigQuery and Pub/Sub are automatically encrypted by default, but here we need to apply a stricter policy by using CMEK, so we need to use it for both BigQuery and Pub/Sub to meet this policy.

Comment 15

ID: 1201879 User: LaxmanTiwari Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 25 Apr 2024 11:26 Selected Answer: B Upvotes: 2

B. Create a new BigQuery table by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table. Most Voted

Comment 15.1

ID: 1201881 User: LaxmanTiwari Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 25 Apr 2024 11:29 Selected Answer: - Upvotes: 1

It should be B, as the data in Pub/Sub is already encrypted. Please read it carefully and use Copilot or ChatGPT for confirmation.

Comment 16

ID: 1189107 User: amanbawa96 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 04 Apr 2024 07:30 Selected Answer: B Upvotes: 1

BigQuery allows you to encrypt data at rest using either Google-managed encryption keys or customer-managed encryption keys (CMEK) from Cloud KMS.
Since the new policy requires using keys from a centralized Cloud KMS project, you need to create a new BigQuery table that is configured to use CMEK for encryption.
After creating the new table with CMEK, you can migrate the data from the old table (encrypted with Google-managed keys) to the new table (encrypted with CMEK).
This approach ensures that the data in the BigQuery table is encrypted using the required CMEK while preserving the existing data.

Creating a new BigQuery table and Pub/Sub topic with CMEK is not necessary because the focus is on encrypting the data at rest in BigQuery. The existing Pub/Sub subscription can still be used to ingest data into the new BigQuery table.

Comment 17

ID: 1180961 User: Izzyt99 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 23 Mar 2024 15:58 Selected Answer: - Upvotes: 3

D - 'a new organization policy that requires you to use keys from a centralized Cloud Key Management Service (Cloud KMS) project to encrypt data at rest.' Therefore, the Pub/Sub default Google-managed encryption key is not sufficient, as the organization requires its own CMEK generated from a centralized Cloud KMS project.

41. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 38

Sequence
135
Discussion ID
16657
Source URL
https://www.examtopics.com/discussions/google/view/16657-exam-professional-data-engineer-topic-1-question-38/
Posted By
jvg637
Posted At
March 15, 2020, 1:16 p.m.

Question

MJTelco Case Study -

Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Technical Requirements -
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.

CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

  • A. The zone
  • B. The number of workers
  • C. The disk size per worker
  • D. The maximum number of workers

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 221620 User: Radhika7983 Badges: Highly Voted Relative Date: 4 years, 9 months ago Absolute Date: Tue 18 May 2021 06:25 Selected Answer: - Upvotes: 27

The correct answer is D. Please look for the details in below
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
We need to specify and set execution parameters for Cloud Dataflow.

Also, to enable autoscaling, set the following execution parameters when you start your pipeline:

--autoscaling_algorithm=THROUGHPUT_BASED
--max_num_workers=N
The objective of autoscaling streaming pipelines is to minimize backlog while maximizing worker utilization and throughput, and quickly react to spikes in load. By enabling autoscaling, you don't have to choose between provisioning for peak load and fresh results. Workers are added as CPU utilization and backlog increase and are removed as these metrics come down. This way, you’re paying only for what you need, and the job is processed as efficiently as possible.
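As a concrete companion to the flags above, a templated Dataflow job can be launched with an autoscaling cap from the CLI. This is a sketch only: the job name, bucket path, and region are placeholders, and the flags should be checked against the current `gcloud dataflow jobs run` reference.

```shell
# Run a templated Dataflow job with autoscaling capped at 50 workers
# (job, bucket, and region names are placeholders).
gcloud dataflow jobs run my-streaming-job \
  --gcs-location=gs://my-bucket/templates/my-template \
  --region=us-central1 \
  --max-workers=50
```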

Comment 2

ID: 64252 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 12:16 Selected Answer: - Upvotes: 26

D. The maximum number of workers answers to the scale question

Comment 3

ID: 1349801 User: cqrm3n Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 01 Feb 2025 10:22 Selected Answer: D Upvotes: 1

The answer is D because Google Dataflow is serverless and auto scales based on demand. To allow it to scale up compute power dynamically, we need to set the maximum number of workers.

Comment 4

ID: 1173148 User: I__SHA1234567 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 14 Sep 2024 05:50 Selected Answer: D Upvotes: 2

Cloud Dataflow dynamically scales the number of workers based on the amount of data being processed and the processing requirements. By updating the maximum number of workers, you allow Dataflow to scale up the compute power as needed to handle the workload efficiently. This ensures that the pipeline can adapt to changes in data volume and processing demands.

Comment 5

ID: 1050798 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 17:26 Selected Answer: D Upvotes: 2

D. The maximum number of workers

By increasing the maximum number of workers, you ensure that Cloud Dataflow can scale its compute power to handle the increased data processing load efficiently.

Comment 6

ID: 903228 User: vaga1 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 15:51 Selected Answer: D Upvotes: 2

Dataflow autoscales, so if it is not scaling, it is because it has reached the maximum number of workers that has been set.

Comment 7

ID: 876123 User: abi01a Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 21 Oct 2023 03:48 Selected Answer: - Upvotes: 2

A is the correct answer. Dataflow is serverless. Specify your region; autoscaling and other 'knob-turning' activities 'under the hood' will be taken care of for you. Remember the company cannot afford to staff an operations team to monitor data feeds, so they rely on ...

Comment 8

ID: 835705 User: bha11111 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 11 Sep 2023 06:37 Selected Answer: D Upvotes: 2

this is correct

Comment 9

ID: 779638 User: GCPpro Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 18 Jul 2023 05:38 Selected Answer: - Upvotes: 1

D is the correct answer.

Comment 10

ID: 778771 User: jkh_goh Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 10:02 Selected Answer: - Upvotes: 1

Answer A provided is definitely wrong. Who comes up with these answers?

Comment 11

ID: 681047 User: Ender_H Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 27 Mar 2023 20:09 Selected Answer: D Upvotes: 2

Correct Answer: D

❌ A: The zone has nothing to do with scaling compute power.

❌ B: The key phrase here is "scale its compute power up AS REQUIRED"; with this answer, the number of workers would be fixed rather than scaling on demand.

❌ C: We need to scale compute power, not storage.

✅ D: is the correct answer; raising the maximum number of workers allows Dataflow to add up to that number of workers if required.

https://cloud.google.com/dataflow/docs/reference/pipeline-options#resource_utilization

Comment 12

ID: 523218 User: sraakesh95 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Thu 14 Jul 2022 00:36 Selected Answer: D Upvotes: 2

@Radhika7983

Comment 13

ID: 516558 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 04 Jul 2022 12:14 Selected Answer: D Upvotes: 2

The correct answer is D.
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
We need to specify and set execution parameters for Cloud Dataflow.

Also, to enable autoscaling, set the following execution parameters when you start your pipeline:

--autoscaling_algorithm=THROUGHPUT_BASED
--max_num_workers=N

Comment 14

ID: 489933 User: maurodipa Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Sun 29 May 2022 13:58 Selected Answer: - Upvotes: 5

Answer is A: Dataflow is serverless, so there is no need to specify either the number of workers or the max number of workers. https://cloud.google.com/dataflow

Comment 14.1

ID: 887775 User: Jarek7 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 02 Nov 2023 21:02 Selected Answer: - Upvotes: 1

Have you ever used it? You pay for worker processing, so you specify the max number of workers. Here is the doc: https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/run

Comment 15

ID: 461177 User: anji007 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 12 Apr 2022 18:29 Selected Answer: - Upvotes: 1

Ans: D

Comment 16

ID: 401952 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 08 Jan 2022 16:27 Selected Answer: - Upvotes: 3

Vote for D

Comment 17

ID: 313486 User: Lodu_Lalit Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Fri 17 Sep 2021 17:48 Selected Answer: - Upvotes: 3

D, that's because scalability is directly correlated to the max number of workers; size determines the speed of functioning.

42. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 117

Sequence
144
Discussion ID
16629
Source URL
https://www.examtopics.com/discussions/google/view/16629-exam-professional-data-engineer-topic-1-question-117/
Posted By
madhu1171
Posted At
March 15, 2020, 4:23 a.m.

Question

You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?

  • A. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
  • B. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
  • C. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
  • D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 64133 User: madhu1171 Badges: Highly Voted Relative Date: 4 years, 12 months ago Absolute Date: Mon 15 Mar 2021 04:23 Selected Answer: - Upvotes: 27

Answer should be D

Comment 2

ID: 455046 User: Chelseajcole Badges: Highly Voted Relative Date: 3 years, 5 months ago Absolute Date: Fri 30 Sep 2022 19:04 Selected Answer: - Upvotes: 8

rule of thumb: If you see Kafka and Pub/Sub, always go with Pub/Sub in Google exam

Comment 2.1

ID: 505220 User: hendrixlives Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 20 Dec 2022 06:18 Selected Answer: - Upvotes: 7

Careful doing that: I got a question where you had to choose between Kafka and Pub/Sub... and the solution required to be able to replay all messages without time limit. So no Pub/Sub there.
This being a Google cert does not mean that they always force Google solutions.

Comment 3

ID: 1342793 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sun 19 Jan 2025 00:01 Selected Answer: D Upvotes: 1

Kafka is recommended over Pub/Sub only when there is a requirement for high throughput, complex streaming, or more flexibility with customization and fine-tuning of configurations, or when the application spans multiple cloud providers and requires more flexibility in deployment across different platforms.

Comment 4

ID: 973042 User: NeoNitin Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 05 Aug 2024 15:02 Selected Answer: - Upvotes: 1

Dataproc is server-based.
Dataflow is serverless and is used to run pipelines; it uses the Apache Beam framework in the background. You just need to specify the number of workers needed.

The question says we need to scale automatically, so Dataproc is eliminated.
Now, Dataflow is correct, and Pub/Sub is recommended for this scenario. D

Comment 5

ID: 760075 User: dconesoko Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 28 Dec 2023 17:20 Selected Answer: D Upvotes: 2

google's preferred choice

Comment 6

ID: 592507 User: VictorBa Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 26 Apr 2023 16:30 Selected Answer: D Upvotes: 1

It cannot be C because Dataproc is more suitable for Hadoop jobs.

Comment 7

ID: 518530 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 06 Jan 2023 20:34 Selected Answer: D Upvotes: 1

Pub/Sub + Dataflow

Comment 8

ID: 517699 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 05 Jan 2023 18:45 Selected Answer: D Upvotes: 4

D: Pub/Sub + Dataflow
https://cloud.google.com/solutions/stream-analytics/
https://cloud.google.com/blog/products/data-analytics/streaming-analytics-now-simpler-more-cost-effective-cloud-dataflow

Comment 9

ID: 504568 User: hendrixlives Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 19 Dec 2022 02:38 Selected Answer: D Upvotes: 3

D: "at least once and must be ordered within windows" means Pub/Sub (at least once) with Dataflow (windows).

Comment 10

ID: 487096 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 06:49 Selected Answer: - Upvotes: 3

Correct: D

Comment 11

ID: 421967 User: sandipk91 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 09 Aug 2022 08:43 Selected Answer: - Upvotes: 2

Answer is D

Comment 12

ID: 399495 User: awssp12345 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 06 Jul 2022 00:31 Selected Answer: - Upvotes: 2

https://cloud.google.com/architecture/migrating-from-kafka-to-pubsub#comparing_features

Comment 13

ID: 397140 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 02 Jul 2022 21:51 Selected Answer: - Upvotes: 3

Vote for D

Scaling: Dataflow.
Confirmed at-least-once delivery: Pub/Sub.
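A minimal pure-Python sketch (not actual Beam code) of what "ordered within windows of 1 hour" means: events are bucketed into fixed 1-hour windows, and ordering is guaranteed only inside each window. The function name and event tuples are illustrative.

```python
from collections import defaultdict

def order_within_fixed_windows(events, window_secs=3600):
    """Group (timestamp, payload) events into fixed windows and sort each
    window by timestamp, mimicking per-window ordering in a streaming job."""
    windows = defaultdict(list)
    for ts, payload in events:
        windows[ts // window_secs].append((ts, payload))
    # Emit windows in time order; within each window, events are time-ordered.
    return [sorted(windows[k]) for k in sorted(windows)]

events = [(3700, "b"), (10, "a"), (3650, "c"), (20, "d")]
print(order_within_fixed_windows(events))
# → [[(10, 'a'), (20, 'd')], [(3650, 'c'), (3700, 'b')]]
```

Note that events "b" and "c" arrived out of order but end up correctly ordered inside their 1-hour window, which is the guarantee Dataflow windowing provides on top of Pub/Sub's at-least-once delivery.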

Comment 14

ID: 288478 User: Sush12 Badges: - Relative Date: 4 years ago Absolute Date: Fri 11 Feb 2022 19:30 Selected Answer: - Upvotes: 2

Answer is D

Comment 15

ID: 216471 User: Alasmindas Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Wed 10 Nov 2021 10:03 Selected Answer: - Upvotes: 5

Indeed the correct answer is Option D.
Again, not sure why the ExamTopics answer is deliberately set to a wrong answer for such a simple question.

Comment 15.1

ID: 472307 User: szefco Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 03 Nov 2022 23:04 Selected Answer: - Upvotes: 1

To make us think about each question while studying, not just memorize answers :) It looks like the "correct" answers are chosen randomly :)

Comment 16

ID: 216033 User: Cloud_Enthusiast Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Tue 09 Nov 2021 16:20 Selected Answer: - Upvotes: 3

D for GCP native solution

Comment 17

ID: 163100 User: haroldbenites Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 21 Aug 2021 19:44 Selected Answer: - Upvotes: 3

D is correct

43. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 11

Sequence
146
Discussion ID
79682
Source URL
https://www.examtopics.com/discussions/google/view/79682-exam-professional-data-engineer-topic-1-question-11/
Posted By
AWSandeep
Posted At
Sept. 3, 2022, 6:50 a.m.

Question

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
✑ No interaction by the user on the site for 1 hour
✑ Has added more than $30 worth of products to the basket
✑ Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

  • A. Use a fixed-time window with a duration of 60 minutes.
  • B. Use a sliding time window with a duration of 60 minutes.
  • C. Use a session window with a gap time duration of 60 minutes.
  • D. Use a global window with a time based trigger with a delay of 60 minutes.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 12 comments Click to expand

Comment 1

ID: 712084 User: vetaal Badges: Highly Voted Relative Date: 3 years, 4 months ago Absolute Date: Sun 06 Nov 2022 01:06 Selected Answer: - Upvotes: 35

There are 3 windowing concepts in Dataflow, and each fits the use cases below:
1) Fixed window
2) Sliding window
3) Session window

Fixed window = any aggregation use cases, any batch analysis of data, relatively simple use cases.

Sliding window = moving averages of data.
Session window = user session data, click data, and real-time gaming analysis.

The question here is about user session data and hence session window.

Reference:
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines
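As a rough pure-Python sketch (not Beam itself) of how a session window groups click events by a 60-minute inactivity gap — a new session starts whenever the gap since the previous event exceeds the gap duration. The function name and timestamps are illustrative.

```python
def sessionize(timestamps, gap_secs=3600):
    """Group event timestamps (seconds) into sessions: a new session
    starts whenever the gap since the previous event exceeds gap_secs."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_secs:
            sessions[-1].append(ts)  # still within the inactivity gap
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions

clicks = [0, 600, 1200, 9000, 9300]  # 9000 - 1200 > 3600, so a new session
print(sessionize(clicks))
# → [[0, 600, 1200], [9000, 9300]]
```

When a session closes (no event for 60 minutes), the pipeline can evaluate the basket rules for that user, which is exactly the behavior option C describes.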

Comment 2

ID: 1342412 User: cqrm3n Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 04:38 Selected Answer: C Upvotes: 4

The answer is C because session window is specifically designed to handle use cases where activity is grouped by gaps.

A. Fixed-time windows divide data into non-overlapping, equally sized intervals but do not track gaps in user activity.
B. Sliding-time windows process overlapping intervals and are better suited for periodic aggregation.
D. Global windows process all data over the pipeline's lifetime and rely on custom triggers to handle time-based logic. It is technically possible but unnecessarily complex, so no.

Comment 3

ID: 1050481 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:14 Selected Answer: C Upvotes: 3

C. Use a session window with a gap time duration of 60 minutes.

A session window with a gap time duration of 60 minutes is appropriate for capturing user sessions where there has been no interaction on the site for 1 hour. It allows you to group user activity within a session, and when the session becomes inactive for the defined gap time, you can evaluate whether the user added more than $30 worth of products to the basket and has not completed a transaction.

Options A and B (fixed-time window and sliding time window) might not capture the specific session-based criteria of inactivity and user interaction effectively.

Option D (global window with a time-based trigger) is not suitable for capturing user sessions and checking inactivity based on a specific time duration. It's more appropriate for cases where you need a single global view of the data.

Comment 4

ID: 1065083 User: RT_G Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 19:39 Selected Answer: C Upvotes: 1

Session window since the question specifically talks about a specific user for a fixed duration.

Comment 5

ID: 1061825 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 04 Nov 2023 00:30 Selected Answer: C Upvotes: 1

Session window = user session data, click data and real time gaming analysis.

Comment 6

ID: 1027020 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 04:37 Selected Answer: - Upvotes: 2

The basket abandonment system needs to determine if a user hasn't interacted with the site for 1 hour, has added products worth more than $30, and hasn't completed a transaction. Therefore, the pipeline should account for periods of user activity and inactivity. A session-based windowing approach is appropriate here.

The right choice is:

C. Use a session window with a gap time duration of 60 minutes.

Session windows group data based on periods of activity and inactivity. If there's no interaction for the duration of the gap time (in this case, 60 minutes), a new window is started. This would help identify users who haven't interacted with the site for the specified duration, fulfilling the requirement for the basket abandonment system.

Comment 7

ID: 1016893 User: MikkelRev Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 25 Sep 2023 16:34 Selected Answer: C Upvotes: 1

session windows can divide a data stream representing user activity
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#session-windows

Comment 8

ID: 839225 User: Chesternut999 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 21:05 Selected Answer: C Upvotes: 2

C - The best option for this use case.

Comment 9

ID: 835652 User: bha11111 Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 05:25 Selected Answer: C Upvotes: 2

Session window is used for these type of scenario

Comment 10

ID: 799143 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 05 Feb 2023 21:25 Selected Answer: - Upvotes: 2

C. Use a session window with a gap time duration of 60 minutes.

A session window would be the most appropriate option to use in this case, as it would allow you to group events into sessions based on time gaps. In this case, the gap time of 60 minutes could be used to define a session, and if there is no interaction from the user for 60 minutes, a new session would be created. By using a session window, you can track the behavior of the user during each session, including the products added to the basket, and determine if the conditions for sending a message have been met (i.e., the user has added more than $30 worth of products to the basket and has not completed a transaction).

Comment 11

ID: 699632 User: kennyloo Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 09:20 Selected Answer: - Upvotes: 1

Only C is feasible for this question

Comment 12

ID: 658078 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 06:50 Selected Answer: C Upvotes: 1

C. Use a session window with a gap time duration of 60 minutes.

44. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 45

Sequence
148
Discussion ID
17080
Source URL
https://www.examtopics.com/discussions/google/view/17080-exam-professional-data-engineer-topic-1-question-45/
Posted By
-
Posted At
March 21, 2020, 7:34 a.m.

Question

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud
Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

  • A. Change the processing job to use Google Cloud Dataproc instead.
  • B. Manually start the Cloud Dataflow job each morning when you get into the office.
  • C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
  • D. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 770374 User: captainbu Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Sun 09 Jul 2023 12:52 Selected Answer: C Upvotes: 6

C was correct but nowadays you'd schedule a Dataflow job with Cloud Scheduler: https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler

Comment 2

ID: 222897 User: Radhika7983 Badges: Highly Voted Relative Date: 4 years, 9 months ago Absolute Date: Wed 19 May 2021 15:47 Selected Answer: - Upvotes: 5

Answer is C. https://cloud.google.com/appengine/docs/flexible/nodejs/scheduling-jobs-with-cron-yaml
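For reference, a minimal cron.yaml along the lines of the linked doc; the /launch-dataflow handler is a hypothetical App Engine endpoint that starts the Dataflow job:

```yaml
# cron.yaml — App Engine Cron Service (legacy approach; Cloud Scheduler
# is the current recommendation for new work).
cron:
- description: "daily log-file processing"
  url: /launch-dataflow
  schedule: every day 02:05
  timezone: America/Chicago
```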

Comment 3

ID: 1342433 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 07:23 Selected Answer: C Upvotes: 1

App Engine Cron is limited to scheduling tasks within your App Engine application, whereas Cloud Scheduler can trigger actions on various Google Cloud services like Cloud Functions, Pub/Sub topics, or external HTTP endpoints.

Comment 4

ID: 1077180 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 22 May 2024 09:32 Selected Answer: C Upvotes: 1

Service was renamed, but the answer is still - C

Comment 5

ID: 1027684 User: imran79 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 08 Apr 2024 04:41 Selected Answer: - Upvotes: 1

C. Using the Google App Engine Cron Service to run the Cloud Dataflow job allows you to automate the execution of the job. By creating a cron job, you can ensure that the Dataflow job is triggered exactly once per day at a specified time. This approach is automated, reliable, and fits the requirement of processing the log file once per day.

Comment 6

ID: 948112 User: itsmynickname Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 15:40 Selected Answer: - Upvotes: 5

C. For a modern solution, Cloud Scheduler

Comment 7

ID: 909773 User: Maurilio_Cardoso Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 30 Nov 2023 02:11 Selected Answer: C Upvotes: 2

Currently, Cloud Scheduler takes over the scheduling functions.

Comment 8

ID: 817810 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 22 Aug 2023 12:27 Selected Answer: - Upvotes: 2

I don't understand why Dataflow is used for the processing, even though it runs only once per day. Wouldn't Dataproc be more suitable for this instead?

Comment 8.1

ID: 1212812 User: mark1223jkh Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sun 17 Nov 2024 12:15 Selected Answer: - Upvotes: 2

Actually, google recommends Dataflow over Dataproc for both batch and streaming. Dataproc is only recommended if you are coming from hadoop, spark, ....

Comment 9

ID: 681890 User: Ender_H Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 28 Mar 2023 17:55 Selected Answer: C Upvotes: 2

Correct Answer: C.

❌ A: Dataproc is a managed Apache Spark and Apache Hadoop service, makes no sense to use it

❌ B: This might sound like the cheapest, but it is highly error-prone; besides, anyone in charge of this has a salary, and I doubt it is a low one.

✅ C: This is the easiest/fastest/cheapest way to trigger job runs; you can even set retry attempts.
source: https://cloud.google.com/appengine/docs/flexible/nodejs/scheduling-jobs-with-cron-yaml.

❌ D: Setting this would be much more expensive than the cron-job

Comment 10

ID: 619274 User: noob_master Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 20 Dec 2022 15:34 Selected Answer: C Upvotes: 1

Answer: C

Comment 11

ID: 462738 User: anji007 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 15 Apr 2022 19:53 Selected Answer: - Upvotes: 2

Ans: C

Comment 12

ID: 461330 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 13 Apr 2022 03:46 Selected Answer: - Upvotes: 3

I know this question is probably testing whether you know cron.yaml and its function in App Engine. But why would B be more expensive? Human capital cost? If hiring a person to click the button were cheaper than launching App Engine, should we reconsider B?

Comment 12.1

ID: 612381 User: AmirN Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Tue 06 Dec 2022 16:44 Selected Answer: - Upvotes: 3

Would you rather pay someone $100,000 a year to click 'run' on jobs all day, or have them automate it and do more cutting edge work? This would be opportunity cost.

Comment 13

ID: 454304 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 29 Mar 2022 18:21 Selected Answer: - Upvotes: 3

Scheduling Jobs with cron.yaml

Free applications can have up to 20 scheduled tasks. Paid applications can have up to 250 scheduled tasks.

Comment 14

ID: 392072 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Mon 27 Dec 2021 16:07 Selected Answer: - Upvotes: 2

Vote for 'C'

Comment 15

ID: 285654 User: naga Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Sat 07 Aug 2021 17:28 Selected Answer: - Upvotes: 3

Correct C

Comment 16

ID: 161148 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Fri 19 Feb 2021 02:20 Selected Answer: - Upvotes: 4

C Correct

45. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 66

Sequence
152
Discussion ID
17109
Source URL
https://www.examtopics.com/discussions/google/view/17109-exam-professional-data-engineer-topic-1-question-66/
Posted By
-
Posted At
March 21, 2020, 4:16 p.m.

Question

You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?

  • A. Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.
  • B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
  • C. Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
  • D. Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 469979 User: SonuKhan1 Badges: Highly Voted Relative Date: 3 years, 10 months ago Absolute Date: Fri 29 Apr 2022 22:19 Selected Answer: - Upvotes: 57

Dear Admin, almost every answer is incorrect. Please check the comments and update your website.

Comment 2

ID: 76619 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Tue 20 Oct 2020 02:19 Selected Answer: - Upvotes: 8

B.
https://cloud.google.com/compute/docs/disks/customer-managed-encryption

Comment 3

ID: 1336397 User: and88x Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Sat 04 Jan 2025 14:45 Selected Answer: B Upvotes: 1

D is incorrect because referencing keys in API service calls doesn't meet the requirements for encrypting data at rest. This approach is more related to accessing data at runtime, not storing it securely.

Comment 4

ID: 1326401 User: AmitK121981 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 14 Dec 2024 10:25 Selected Answer: C Upvotes: 1

CMEK is where the customer manages keys, but they are still created by Google (this is KMS). CSEK is where keys are created outside GCP and used via API calls. So if the customer has to create keys, it has to be outside KMS.

Comment 5

ID: 1325429 User: jatinbhatia2055 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 07:22 Selected Answer: D Upvotes: 2

Best Option: This is the most accurate approach. Cloud KMS provides the ability to create, manage, and rotate encryption keys. You can use the KMS API to reference the keys when encrypting and decrypting your data. In this case, you would integrate the KMS keys into your application logic (e.g., Kafka producers/consumers, Redis clients) to encrypt and decrypt data as it is stored or processed. This approach leverages the full functionality of Cloud KMS, including the ability to rotate and destroy keys as needed.

Comment 6

ID: 1196431 User: zevexWM Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 16 Oct 2024 09:18 Selected Answer: - Upvotes: 1

But KMS doesn't create keys. It only stores them, right?

Comment 7

ID: 1096578 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 15:00 Selected Answer: B Upvotes: 1

Google Cloud Key Management Service (KMS) provides a centralized cloud service for managing cryptographic keys. By creating encryption keys in Cloud KMS, you can easily manage the lifecycle of these keys, including creation, rotation, and destruction.
Why not create keys locally and upload them to Cloud KMS?
While it's possible to create keys locally and then upload them to Cloud KMS, it's generally simpler and more secure to create the keys directly in Cloud KMS. This reduces the risk associated with transferring keys and leverages the security and compliance features of Cloud KMS.

Comment 8

ID: 1060006 User: emmylou Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Wed 01 May 2024 19:07 Selected Answer: - Upvotes: 2

Help!
I chose "C" because of the statement, "encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed" and read that as needing to generate the keys locally. Can you please explain where I went wrong?

Comment 9

ID: 960615 User: odiez3 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 19:11 Selected Answer: - Upvotes: 1

The answer is C. Read the full statement:

" You need to encrypt data at rest with encryption keys that you can create "

Comment 10

ID: 954560 User: theseawillclaim Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 22:22 Selected Answer: B Upvotes: 1

B!
C is useless overhead and you cannot rotate that easily!

Comment 11

ID: 898700 User: Kiroo Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Wed 15 Nov 2023 23:32 Selected Answer: - Upvotes: 1

Well, from what I remember from cloud arch and what I found in https://cloud.google.com/compute/docs/disks/customer-managed-encryption

There are two options: either the customer manages keys entirely, or they use the service to generate the keys. Based on that, it's B.

Comment 12

ID: 784942 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 03:27 Selected Answer: - Upvotes: 2

B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

Cloud Key Management Service (KMS) is a fully managed service that allows you to create, rotate, and destroy encryption keys as needed. By creating encryption keys in Cloud KMS, you can use them to encrypt your data at rest in the Compute Engine cluster instances, which are running your Redis and Kafka clusters. This ensures that your data is protected even when it is stored on disk.

Comment 12.1

ID: 784943 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 03:27 Selected Answer: - Upvotes: 3

Option A: Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls is not the best option as it does not provide encryption at rest.

Option C: Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances, is not the best option as it does not provide a way to manage the encryption keys centrally.

Option D: Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances, is not the best option, as it does not provide encryption at rest; it only secures the data in transit.

Comment 13

ID: 745537 User: DGames Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 14 Jun 2023 23:58 Selected Answer: B Upvotes: 1

B is the correct answer: generate the key using KMS. Why locally? Again, it is overhead to upload it and use it everywhere.

Comment 14

ID: 726509 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 08:29 Selected Answer: - Upvotes: 1

B
If you use Google Cloud, Cloud Key Management Service lets you create your own encryption keys that you can use to add envelope encryption to your data. Using Cloud KMS, you can create, rotate, track, and delete keys.
https://cloud.google.com/docs/security/encryption/default-encryption#:~:text=If%20you%20use%20Google%20Cloud%2C%20Cloud%20Key%20Management%20Service%20lets%20you%20create%20your%20own%20encryption%20keys%20that%20you%20can%20use%20to%20add%20envelope%20encryption%20to%20your%20data.%20Using%20Cloud%20KMS%2C%20you%20can%20create%2C%20rotate%2C%20track%2C%20and%20delete%20keys.

Comment 15

ID: 517604 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 05 Jul 2022 15:08 Selected Answer: B Upvotes: 1

https://cloud.google.com/security/encryption-at-rest/

Comment 16

ID: 506277 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 21 Jun 2022 16:44 Selected Answer: B Upvotes: 8

A makes no sense; you need to use your own keys.
You don't create keys locally and just "upload" them; you would have to import them to make that work, using the KMS public key. C is also out.
It's between B and D.
Cloud KMS is a cloud-hosted key management service that lets you manage cryptographic keys for your cloud services the same way you do on-premises. You can generate, use, rotate, and destroy cryptographic keys from there.
Since you want to encrypt data at rest, it's B; you don't use the keys in any API calls.
https://cloud.google.com/compute/docs/disks/customer-managed-encryption
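A hedged sketch of option B's flow with gcloud: create a rotatable key in Cloud KMS, then attach it to a Compute Engine disk as a customer-managed encryption key (CMEK). Names, project, location, and rotation settings are placeholders.

```shell
# Create a key ring and a key with automatic rotation in Cloud KMS.
gcloud kms keyrings create my-ring --location=us-central1
gcloud kms keys create my-key \
  --keyring=my-ring --location=us-central1 \
  --purpose=encryption \
  --rotation-period=90d --next-rotation-time=2025-01-01T00:00:00Z

# Attach the key to a new disk used by a cluster instance (CMEK at rest).
gcloud compute disks create kafka-disk \
  --zone=us-central1-a \
  --kms-key=projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key
```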

Comment 17

ID: 467613 User: lg1234 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 25 Apr 2022 19:38 Selected Answer: - Upvotes: 2

I believe you cannot upload custom keys to KMS for Compute Engine. Only via API Calls. See: https://cloud.google.com/security/encryption/customer-supplied-encryption-keys
With that said, option B

46. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 170

Sequence
156
Discussion ID
79515
Source URL
https://www.examtopics.com/discussions/google/view/79515-exam-professional-data-engineer-topic-1-question-170/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 7:38 p.m.

Question

You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages. What should you do to ensure that you can recover from errors after deployment?

  • A. Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
  • B. Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
  • C. Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
  • D. Enable dead-lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 12 comments Click to expand

Comment 1

ID: 657667 User: AWSandeep Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 20:38 Selected Answer: B Upvotes: 13

B. Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.

According to the second reference in the list below, a concern with deploying new subscriber code is that the new executable may erroneously acknowledge messages, leading to message loss. Incorporating snapshots into your deployment process gives you a way to recover from bugs in new subscriber code.

The answer cannot be C because, to seek to a timestamp, you must first configure the subscription to retain acknowledged messages using retain-acked-messages. If retain-acked-messages is set, Pub/Sub retains acknowledged messages for 7 days.

References:
https://cloud.google.com/pubsub/docs/replay-message
https://cloud.google.com/pubsub/docs/replay-overview#seek_use_cases

Comment 1.1

ID: 738065 User: jkhong Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 15:44 Selected Answer: - Upvotes: 1

I don't think we need to configure the subscription to retain acked messages. It defaults to retaining for 7 days.

Comment 2

ID: 1335224 User: f74ca0c Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Wed 01 Jan 2025 17:08 Selected Answer: C Upvotes: 1

C. Use a BigQuery view to define your preprocessing logic. When creating your model, use the view as your model training data. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.

Explanation:
Preventing Data Skew:

Training-serving skew occurs when the transformations applied to training data are not identically applied to prediction data. Using a BigQuery view ensures consistent preprocessing for both training and prediction.
Advantages of BigQuery Views:

Views encapsulate preprocessing logic, ensuring that the same transformations are applied whenever the view is queried.
By referencing the view during both training and prediction, you eliminate the need for manual transformations and the risk of discrepancies.

Comment 3

ID: 1100954 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 19:53 Selected Answer: B Upvotes: 2

Taking a snapshot allows redelivering messages that were published while any faulty subscriber logic was running.
The seek timestamp would come after deployment so even erroneously acknowledged messages could be recovered.
https://cloud.google.com/pubsub/docs/replay-overview#seek_use_cases
By creating a snapshot of the subscription before deploying new code, you can preserve the state of unacknowledged messages. If after deployment you find that the new subscriber code is erroneously acknowledging messages, you can use the Seek operation with the snapshot to reset the subscription's acknowledgment state to the time the snapshot was created. This would effectively re-deliver messages available since the snapshot, ensuring you can recover from errors. This approach does not require setting up a local emulator and directly addresses the concern of message loss due to erroneous acknowledgments.
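A sketch of the snapshot-and-seek flow with gcloud (the subscription and snapshot names are placeholders):

```shell
# Before deploying new subscriber code, capture the subscription's
# acknowledgment state in a snapshot.
gcloud pubsub snapshots create pre-deploy-snap --subscription=my-sub

# ... deploy the new subscriber; suppose it erroneously acks messages ...

# Roll the subscription back so those messages are re-delivered.
gcloud pubsub subscriptions seek my-sub --snapshot=pre-deploy-snap
```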

Comment 4

ID: 962501 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 25 Jan 2024 11:14 Selected Answer: - Upvotes: 2

Please correct me if I am wrong: option B only allows you to re-deliver messages that were available before the snapshot was created. If an error occurs after the snapshot was created, you will not be able to re-deliver those messages.

Comment 5

ID: 943771 User: cetanx Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 16:09 Selected Answer: A Upvotes: 3

Q: You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss.
-> So the message is mistakenly acked and removed from the topic/subscription. This means that even if you have a pre-deployment snapshot, you don't have a backup or copy of post-deployment messages.

Q: Your subscriber is not set up to retain acknowledged messages.
-> To seek to a time in the past and replay previously-acknowledged messages, "you must first configure message retention on the topic" or "configure the subscription to retain acknowledged messages" (ref: https://cloud.google.com/pubsub/docs/replay-overview#configuring_message_retention)

So B, C, D do not solve the problem of erroneously acked messages as long as you don't have message retention configured on topic/subscription.

Comment 6

ID: 850854 User: lucaluca1982 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Tue 26 Sep 2023 11:22 Selected Answer: D Upvotes: 1

You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages. What should you do to ensure that you can recover from errors after deployment?
A. Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
B. Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
C. Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
D. Enable dead-lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.

Comment 7

ID: 813416 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 19:06 Selected Answer: - Upvotes: 2

Option D:
The dead-letter option allows you to recover messages from errors after deployment by re-delivering any messages captured by the dead-letter queue.
https://cloud.google.com/pubsub/docs/handling-failures#dead_letter_topic
Why not B:
because taking a snapshot is a time-consuming process, and if messages were erroneously acknowledged, it will not bring them back. A snapshot is useful when you want to secure the current state before making changes.

Comment 7.1

ID: 833053 User: wjtb Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 08 Sep 2023 14:24 Selected Answer: - Upvotes: 7

The dead-letter queue would help if the messages were not getting acknowledged; however, here they are talking about messages being erroneously acknowledged. Pub/Sub would interpret those messages as successfully processed -> they would not end up in the dead-letter queue -> D is wrong.

Comment 8

ID: 725268 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 16:36 Selected Answer: - Upvotes: 1

B
The Seek feature extends subscriber functionality by allowing you to alter the acknowledgement state of messages in bulk. For example, you can replay previously acknowledged messages or purge messages in bulk. In addition, you can copy the state of one subscription to another by using seek in combination with a Snapshot.

https://cloud.google.com/pubsub/docs/replay-overview

Comment 9

ID: 664257 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Thu 09 Mar 2023 08:47 Selected Answer: B Upvotes: 1

Answer B

Comment 10

ID: 658473 User: PhuocT Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 15:46 Selected Answer: B Upvotes: 1

should be B.

47. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 208

Sequence
158
Discussion ID
129855
Source URL
https://www.examtopics.com/discussions/google/view/129855-exam-professional-data-engineer-topic-1-question-208/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:30 a.m.

Question

A live TV show asks viewers to cast votes using their mobile phones. The event generates a large volume of data during a 3-minute period. You are in charge of the "Voting infrastructure" and must ensure that the platform can handle the load and that all votes are processed. You must display partial results while voting is open. After voting closes, you need to count the votes exactly once while optimizing cost. What should you do?

  • A. Create a Memorystore instance with a high availability (HA) configuration.
  • B. Create a Cloud SQL for PostgreSQL database with high availability (HA) configuration and multiple read replicas.
  • C. Write votes to a Pub/Sub topic and have Cloud Functions subscribe to it and write votes to BigQuery.
  • D. Write votes to a Pub/Sub topic and load into both Bigtable and BigQuery via a Dataflow pipeline. Query Bigtable for real-time results and BigQuery for later analysis. Shut down the Bigtable instance when voting concludes.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 7 comments Click to expand

Comment 1

ID: 1115710 User: MaxNRG Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 10:17 Selected Answer: D Upvotes: 7

Since cost optimization and minimal latency are key requirements, option D is likely the best choice to meet all the needs:

The key reasons option D works well:

Using Pub/Sub to ingest votes provides scalable, reliable transport.

Loading into Bigtable and BigQuery provides both:

Low latency reads from Bigtable for real-time results.
Cost effective storage in BigQuery for longer term analysis.
Shutting down Bigtable after voting concludes reduces costs.

BigQuery remains available for cost-optimized storage and analysis.

So you are correct that option D combines the best of real-time performance for queries using Bigtable, with cost-optimized storage in BigQuery.

The only additional consideration may be if 3 minutes of Bigtable usage still incurs higher charges than ingesting directly into BigQuery. But for minimizing latency while optimizing cost, option D is likely the right architectural choice given the requirements.

Comment 2

ID: 1333708 User: f74ca0c Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Sun 29 Dec 2024 21:29 Selected Answer: C Upvotes: 2

Modern Capabilities: BigQuery’s advancements make it suitable for both real-time and historical querying.
Cost Efficiency: No need to spin up and shut down a Bigtable instance.
Simplified Workflow: Real-time and post-event data are stored in the same system, reducing the need to synchronize or transfer data between systems.

Comment 3

ID: 1151093 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 15 Aug 2024 15:34 Selected Answer: D Upvotes: 1

D. Write votes to a Pub/Sub topic and load into both Bigtable and BigQuery via a Dataflow pipeline. Query Bigtable for real-time results and BigQuery for later analysis. Shut down the Bigtable instance when voting concludes.

Comment 4

ID: 1121416 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 08:26 Selected Answer: D Upvotes: 1

D, i do agree with everything MaxNRG said.

Comment 5

ID: 1115662 User: Smakyel79 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 09:06 Selected Answer: C Upvotes: 1

Pub/Sub for sure, and Cloud Functions + BigQuery streaming seems a good solution. I wouldn't use Bigtable, as it needs at least 100 GB of data to be worthwhile (I don't think a voting system would reach that amount of data) and needs to "warm up" for more than 10 minutes to perform well... and it would be $$$ compared to solution C.

Comment 6

ID: 1112159 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 02 Jul 2024 18:54 Selected Answer: D Upvotes: 2

Answer is D:
- Google Cloud Pub/Sub can manage the high-volume data ingestion.
- Google Cloud Dataflow can efficiently process and route data to both Bigtable and BigQuery.
- Bigtable is excellent for handling high-throughput writes and reads, making it suitable for real-time vote tallying.
- BigQuery is ideal for exact vote counting and deeper analysis once voting concludes.

Comment 7

ID: 1109529 User: e70ea9e Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 08:30 Selected Answer: D Upvotes: 3

Handling High-Volume Data Ingestion:

Pub/Sub: Decouples vote collection from processing, ensuring scalability and resilience under high load.
Dataflow: Efficiently ingests and processes large data streams, scaling as needed.
Real-Time Results with Exactly-Once Processing:

Bigtable: Optimized for low-latency, high-throughput reads and writes, ideal for real-time partial results.
Exactly-Once Semantics: Dataflow guarantees each vote is processed only once, ensuring accurate counts.
Cost Optimization:

Temporary Bigtable Instance: Running Bigtable only during voting minimizes costs.
BigQuery Storage: Cost-effective for long-term storage and analysis.
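
The exactly-once requirement listed above can be illustrated conceptually (this toy stand-in is not Dataflow's actual mechanism, and all names are hypothetical): since Pub/Sub delivery is at-least-once, counting each vote exactly once amounts to deduplicating on a unique vote ID before tallying.

```python
# Toy sketch of exactly-once vote counting via deduplication by vote ID.
# Dataflow handles this internally; this only shows the principle.
from collections import Counter

def count_votes_exactly_once(events):
    seen = set()
    tally = Counter()
    for vote_id, candidate in events:
        if vote_id in seen:        # duplicate delivery: skip it
            continue
        seen.add(vote_id)
        tally[candidate] += 1
    return tally

# Pub/Sub is at-least-once, so the same vote may arrive more than once.
events = [(1, "A"), (2, "B"), (1, "A"), (3, "A")]
assert count_votes_exactly_once(events) == Counter({"A": 2, "B": 1})
```

Without the dedup step, the duplicate delivery of vote 1 would inflate candidate A's count, which is exactly what the question's "count the votes exactly once" constraint rules out.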

48. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 286

Sequence
160
Discussion ID
130287
Source URL
https://www.examtopics.com/discussions/google/view/130287-exam-professional-data-engineer-topic-1-question-286/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 10:41 a.m.

Question

You have thousands of Apache Spark jobs running in your on-premises Apache Hadoop cluster. You want to migrate the jobs to Google Cloud. You want to use managed services to run your jobs instead of maintaining a long-lived Hadoop cluster yourself. You have a tight timeline and want to keep code changes to a minimum. What should you do?

  • A. Move your data to BigQuery. Convert your Spark scripts to a SQL-based processing approach.
  • B. Rewrite your jobs in Apache Beam. Run your jobs in Dataflow.
  • C. Copy your data to Compute Engine disks. Manage and run your jobs directly on those instances.
  • D. Move your data to Cloud Storage. Run your jobs on Dataproc.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 1332388 User: hussain.sain Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 13:23 Selected Answer: D Upvotes: 2

D is correct.
Dataproc is the most suitable choice for migrating your existing Apache Spark jobs to Google Cloud because it is a fully managed service that supports Apache Spark and Hadoop workloads with minimal changes to your existing code. Moving your data to Cloud Storage and running jobs on Dataproc offers a fast, efficient, and scalable solution for your needs.

Comment 2

ID: 1263366 User: meh_33 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 09:37 Selected Answer: D Upvotes: 1

option D, minimum code changes

Comment 3

ID: 1174325 User: hanoverquay Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Fri 15 Mar 2024 15:56 Selected Answer: D Upvotes: 2

option D, minimum code changes

Comment 4

ID: 1155436 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 11:22 Selected Answer: D Upvotes: 2

Option D

Comment 5

ID: 1153442 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 17:59 Selected Answer: D Upvotes: 3

D) That is what Dataproc is made for. It is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, etc.

Comment 6

ID: 1121892 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 18:11 Selected Answer: D Upvotes: 2

Clearly D

Comment 7

ID: 1118373 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 09:27 Selected Answer: D Upvotes: 2

of course D

Comment 8

ID: 1115965 User: GCP001 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 16:29 Selected Answer: - Upvotes: 3

D. Move your data to Cloud Storage. Run your jobs on Dataproc.
Dataproc is managed service and not needed much code changes.

Comment 9

ID: 1113498 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 10:41 Selected Answer: D Upvotes: 3

D. Move your data to Cloud Storage. Run your jobs on Dataproc.

49. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 294

Sequence
161
Discussion ID
130307
Source URL
https://www.examtopics.com/discussions/google/view/130307-exam-professional-data-engineer-topic-1-question-294/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 11:45 a.m.

Question

You work for a large ecommerce company. You are using Pub/Sub to ingest the clickstream data to Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to subscribe to older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?

  • A. Create a new topic, and publish the last 30 days of data each time a new subscriber connects to an existing topic.
  • B. Set the topic retention policy to 30 days.
  • C. Set the subscriber retention policy to 30 days.
  • D. Ask the source system to re-push the data to Pub/Sub, and subscribe to it.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 1119948 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 17:13 Selected Answer: B Upvotes: 12

- Topic Retention Policy: This policy determines how long messages are retained by Pub/Sub after they are published, even if they have not been acknowledged (consumed) by any subscriber.
- 30 Days Retention: By setting the retention policy of the topic to 30 days, all messages published to this topic will be available for consumption for 30 days. This means any new subscriber connecting to the topic can access and analyze data from the past 30 days.
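
The retention behavior described above can be modeled with a short sketch (a toy, not the Pub/Sub API; the names are invented): any message published within the topic's retention window remains replayable by a subscriber created later, regardless of earlier acknowledgments.

```python
# Toy model of topic-level message retention (answer B). With a 30-day
# retention policy on the topic, a subscriber attached at any point can
# read every message published within the last 30 days.
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)   # topic retention maxes out at 31 days

def readable_by_new_subscriber(messages, now):
    """Messages a brand-new subscriber can still replay."""
    return [m for m, published in messages if now - published <= RETENTION]

now = datetime(2024, 1, 31)
messages = [
    ("old_click", datetime(2023, 12, 1)),     # 61 days old: expired
    ("recent_click", datetime(2024, 1, 15)),  # 16 days old: retained
]
assert readable_by_new_subscriber(messages, now) == ["recent_click"]
```

Note the limits discussed further down in this thread: a 30-day window fits under the 31-day topic maximum, but not under the 7-day subscription maximum, which is why topic retention (B), not subscription retention (C), satisfies the requirement.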

Comment 2

ID: 1332405 User: hussain.sain Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 13:51 Selected Answer: B Upvotes: 1

B is correct.
By setting the topic retention policy to 30 days, any new subscriber will be able to access the data for the past 30 days, regardless of when they connect. This solution is both cost-effective and efficient for your use case.

Comment 3

ID: 1304931 User: romain773 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 11:01 Selected Answer: - Upvotes: 1

Option B is wrong, I think (topic retention), because it only makes unconsumed messages available for 30 days. I propose option A.

Option A (creating a new topic and republishing the last 30 days of data for each new subscriber) is actually a better solution to ensure that new subscribers have access to the full 30-day history.

Comment 4

ID: 1302831 User: romain773 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 25 Oct 2024 12:51 Selected Answer: - Upvotes: 1

Option B is wrong (topic retention) because it only makes unconsumed messages available for 30 days.

Option A (creating a new topic and republishing the last 30 days of data for each new subscriber) is actually a better solution to ensure that new subscribers have access to the full 30-day history.

Comment 5

ID: 1193554 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 09:54 Selected Answer: - Upvotes: 1

It's B. It could have been C as well, because a subscription also has message retention; however, for a subscription there is a maximum value of 7 days.

Link: https://cloud.google.com/pubsub/docs/subscription-properties

Comment 5.1

ID: 1193556 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 09:55 Selected Answer: - Upvotes: 1

In a topic the maximum value is 31 days.

Link: https://cloud.google.com/pubsub/docs/topic-properties

Comment 6

ID: 1121912 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 18:32 Selected Answer: B Upvotes: 2

Definitely B

Comment 7

ID: 1118805 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 17:13 Selected Answer: B Upvotes: 4

https://cloud.google.com/blog/products/data-analytics/pubsub-gains-topic-retention-feature

Comment 8

ID: 1113573 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 11:45 Selected Answer: B Upvotes: 1

B. Set the topic retention policy to 30 days.

50. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 314

Sequence
163
Discussion ID
153398
Source URL
https://www.examtopics.com/discussions/google/view/153398-exam-professional-data-engineer-topic-1-question-314/
Posted By
m_a_p_s
Posted At
Dec. 24, 2024, 8:29 p.m.

Question

You are configuring networking for a Dataflow job. The data pipeline uses custom container images with the libraries that are required for the transformation logic preinstalled. The data pipeline reads the data from Cloud Storage and writes the data to BigQuery. You need to ensure cost-effective and secure communication between the pipeline and Google APIs and services. What should you do?

  • A. Disable external IP addresses from worker VMs and enable Private Google Access.
  • B. Leave external IP addresses assigned to worker VMs while enforcing firewall rules.
  • C. Disable external IP addresses and establish a Private Service Connect endpoint IP address.
  • D. Enable Cloud NAT to provide outbound internet connectivity while enforcing firewall rules.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 1 comment Click to expand

Comment 1

ID: 1331220 User: m_a_p_s Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 24 Dec 2024 20:29 Selected Answer: A Upvotes: 2

While option C is technically implementable, option A is a more straightforward and simpler solution.

51. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 146

Sequence
167
Discussion ID
17224
Source URL
https://www.examtopics.com/discussions/google/view/17224-exam-professional-data-engineer-topic-1-question-146/
Posted By
-
Posted At
March 22, 2020, 8:31 a.m.

Question

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC).
All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System
(HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

  • A. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
  • B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
  • C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
  • D. Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.
  • E. Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.

Suggested Answer

AD

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 501308 User: Sid19 Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Tue 14 Dec 2021 12:09 Selected Answer: - Upvotes: 26

Answer is C and D, 100%.
I know it says to transfer all the files, but with the options provided, C is the best choice.
Explanation:
A and B cannot be true: gsutil can copy data only as far as the master node, and a second step is needed to get it from the master node into HDFS.
C -> works.
D -> works; recommended by Google.
E -> would work, but since the question asks to maximize performance, this is not the best choice. The BigQuery Hadoop connector stages all the BQ data in GCS as temporary files and then processes it into HDFS. Since the data is already in GCS, we do not need to load it into BQ only to have a connector unload it back to GCS and process it again.

Comment 1.1

ID: 536547 User: Deepakd Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 31 Jan 2022 00:39 Selected Answer: - Upvotes: 2

How can the master node store the data? C is wrong.

Comment 1.2

ID: 1047429 User: KLei Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 19 Oct 2023 03:37 Selected Answer: - Upvotes: 1

must go to the master node first...

Comment 1.3

ID: 917539 User: WillemHendr Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 21:05 Selected Answer: - Upvotes: 3

I feel this question is indeed testing whether you understand that gsutil cannot transfer to HDFS directly (eliminating A and B) and that an intermediate step is needed (making C doable, with a good result). D is found in the official Google docs. E doesn't have a good end result.

Comment 1.4

ID: 921458 User: rohan0411 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 12 Jun 2023 14:47 Selected Answer: - Upvotes: 1

You can copy to the worker nodes directly too by specifying the appropriate flag.
hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>

So the correct answer is A & D

Comment 2

ID: 163564 User: haroldbenites Badges: Highly Voted Relative Date: 5 years, 6 months ago Absolute Date: Sat 22 Aug 2020 13:41 Selected Answer: - Upvotes: 7

D , E is correct

Comment 3

ID: 1328336 User: shangning007 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Wed 18 Dec 2024 09:04 Selected Answer: AD Upvotes: 1

https://stackoverflow.com/questions/54429642/how-to-copy-a-file-from-a-gcs-bucket-in-dataproc-to-hdfs-using-google-cloud
Based on here, you can copy a single file from Google Cloud Storage (GCS) to HDFS using the HDFS copy command. There is no need to copy to the master node first.

Comment 4

ID: 1303301 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Sat 26 Oct 2024 16:22 Selected Answer: CD Upvotes: 1

Actually I think A is correct as well.

Comment 5

ID: 1109738 User: patitonav Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 14:07 Selected Answer: DE Upvotes: 2

I think D and E are the best and easiest way to go. D for sure, but I think E can work too: the data can be loaded into BQ as an external table, so in the end the data will always remain in GCS.

Comment 6

ID: 1015467 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 05:43 Selected Answer: DE Upvotes: 1

D. Cloud Storage Connector for Hadoop: You can use the Cloud Storage connector for Hadoop to mount the ORC files stored in Cloud Storage as external Hive tables. This allows you to query the data without copying it to HDFS. You can replicate these external Hive tables to native Hive tables in Cloud Dataproc if needed.

E. Load ORC Files into BigQuery: Another approach is to load the ORC files into BigQuery, Google Cloud's data warehouse. Once the data is in BigQuery, you can use the BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables in Cloud Dataproc. This leverages the power of BigQuery for analytics and allows you to replicate external Hive tables to native ones in Cloud Dataproc.

Comment 8

ID: 963292 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Wed 26 Jul 2023 04:55 Selected Answer: AD Upvotes: 2

A is the most straightforward way to start using Hive in Cloud Dataproc. You can use the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Then, you can mount the Hive tables locally.

D is another option that you can use to start using Hive in Cloud Dataproc. You can leverage the Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Then, you can replicate the external Hive tables to the native ones.

Comment 9

ID: 947892 User: Qix Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 10 Jul 2023 10:12 Selected Answer: BC Upvotes: 3

Answers are;
B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them do HDFS. Mount the Hive tables from HDFS.

You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. HDFS lives on the data nodes, so data on the master node needs to be copied to the data nodes.
B for the managed Hive table option, C for the external Hive table.

Comment 10

ID: 892015 User: izekc Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 08 May 2023 12:31 Selected Answer: AD Upvotes: 1

AD is correct

Comment 11

ID: 888347 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 03 May 2023 11:47 Selected Answer: AD Upvotes: 1

I choose AD.
I searched other sites, read the discussions here, and AD seems the better answer.

Comment 11.1

ID: 888352 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 03 May 2023 11:49 Selected Answer: - Upvotes: 1

gpt: Yes, that is correct. Option A is a valid way to transfer the ORC files to HDFS, and then mount the Hive tables locally. Option D is also valid, as it suggests using the Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables and then replicating those external Hive tables to native ones.

ChatGPT agreed after I inserted the question and options, and said that AD are the correct answers. That adds some confidence that these are good, but GPT can make mistakes.

Comment 12

ID: 867018 User: streeeber Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 11 Apr 2023 08:46 Selected Answer: AD Upvotes: 1

A will copy to HDFS and so will D

Comment 13

ID: 734225 User: hauhau Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 03 Dec 2022 06:40 Selected Answer: AD Upvotes: 3

C: master node doesn't make sense

Comment 13.1

ID: 734226 User: hauhau Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 03 Dec 2022 06:41 Selected Answer: - Upvotes: 1

B: from the Cloud Storage bucket to any node of the Dataproc cluster
-> the data lands on a node's local disk, still not in HDFS, so it does not maximize the speed

Comment 14

ID: 707935 User: tikki_boy Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 30 Oct 2022 18:12 Selected Answer: - Upvotes: 2

I'll go with DE

Comment 15

ID: 658074 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 06:45 Selected Answer: CD Upvotes: 2

CD is correct

Comment 16

ID: 542960 User: BigDataBB Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 08 Feb 2022 10:22 Selected Answer: BC Upvotes: 1

You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. HDFS lives on the data nodes, so data on the master node needs to be copied to the data nodes.
B for the managed Hive table option, C for the external Hive table.

Comment 17

ID: 519562 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 08 Jan 2022 15:01 Selected Answer: CD Upvotes: 3

as explained by Sid19

52. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 182

Sequence
169
Discussion ID
79560
Source URL
https://www.examtopics.com/discussions/google/view/79560-exam-professional-data-engineer-topic-1-question-182/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 9:55 p.m.

Question

You are migrating your data warehouse to Google Cloud and decommissioning your on-premises data center. Because this is a priority for your company, you know that bandwidth will be made available for the initial data load to the cloud. The files being transferred are not large in number, but each file is 90 GB.
Additionally, you want your transactional systems to continually update the warehouse on Google Cloud in real time. What tools should you use to migrate the data and ensure that it continues to write to your warehouse?

  • A. Storage Transfer Service for the migration; Pub/Sub and Cloud Data Fusion for the real-time updates
  • B. BigQuery Data Transfer Service for the migration; Pub/Sub and Dataproc for the real-time updates
  • C. gsutil for the migration; Pub/Sub and Dataflow for the real-time updates
  • D. gsutil for both the migration and the real-time updates

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 657775 User: AWSandeep Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 22:55 Selected Answer: C Upvotes: 8

C. gsutil for the migration; Pub/Sub and Dataflow for the real-time updates

Use Gsutil when there is enough bandwidth to meet your project deadline for less than 1 TB of data. Storage Transfer Service is for much larger volumes for migration. Moreover, Cloud Data Fusion and Dataproc are not ideal for real-time updates. BigQuery Data Transfer Service does not support all on-prem sources.

Comment 2

ID: 1328897 User: shangning007 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Thu 19 Dec 2024 09:07 Selected Answer: A Upvotes: 3

According to the latest documentation, "Generally, you should use gcloud storage commands instead of gsutil commands. The gsutil tool is a legacy Cloud Storage CLI and minimally maintained."
The questions should be updated to replace gsutil accordingly.

Comment 3

ID: 1104545 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 24 Jun 2024 10:31 Selected Answer: C Upvotes: 1

Considering the requirement for handling large files and the need for real-time data integration, Option C (gsutil for the migration; Pub/Sub and Dataflow for the real-time updates) seems to be the most appropriate. gsutil will effectively handle the large file transfers, while Pub/Sub and Dataflow provide a robust solution for real-time data capture and processing, ensuring continuous updates to your warehouse on Google Cloud.

Comment 4

ID: 1102304 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:14 Selected Answer: C Upvotes: 1

Option C is the best choice given the large file sizes for the initial migration and the need for real-time updates after migration.

Specifically:

gsutil can transfer large files in parallel over multiple TCP connections to maximize bandwidth. This works well for the 90GB files during initial migration.
Pub/Sub allows real-time messaging of updates that can then be streamed into Cloud Dataflow. Dataflow provides scalable stream processing to handle transforming and writing those updates into BigQuery or other sinks.

Comment 4.1

ID: 1102305 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:14 Selected Answer: - Upvotes: 2

Option A is incorrect because Storage Transfer Service is better for scheduled batch transfers, not ad hoc large migrations.

Option B is incorrect because BigQuery Data Transfer Service is more focused on scheduled replication jobs, not ad hoc migrations.

Option D would not work well for real-time updates after migration is complete.

So option C leverages the right Google cloud services for the one-time migration and ongoing real-time processing.

Comment 5

ID: 1055166 User: xiangbobopopo Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Sat 27 Apr 2024 08:57 Selected Answer: C Upvotes: 1

agree with C

Comment 6

ID: 662134 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 10:02 Selected Answer: - Upvotes: 4

https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#gsutil_for_smaller_transfers_of_on-premises_data
Answer C

Comment 7

ID: 661080 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 13:20 Selected Answer: C Upvotes: 3

C seems legit

53. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 61

Sequence
173
Discussion ID
16747
Source URL
https://www.examtopics.com/discussions/google/view/16747-exam-professional-data-engineer-topic-1-question-61/
Posted By
jvg637
Posted At
March 16, 2020, 3:32 p.m.

Question

Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud
Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google
BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?

  • A. Migrate the workload to Google Cloud Dataflow
  • B. Use pre-emptible virtual machines (VMs) for the cluster
  • C. Use a higher-memory node so that the job runs faster
  • D. Use SSDs on the worker nodes so that the job can run faster

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 64739 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Mon 16 Mar 2020 15:32 Selected Answer: - Upvotes: 47

B. (Hadoop/Spark jobs are run on Dataproc, and the pre-emptible machines cost 80% less)
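
The discount's effect on this weekly 15-node, 30-minute job can be sketched with simple arithmetic; the per-hour price and exact discount below are hypothetical placeholders, not current GCP list prices (published preemptible/Spot discounts are commonly quoted in the 60-91% range):

```python
# Illustrative weekly cost comparison for the 15-node, 30-minute job.
# Prices and discount are assumed placeholders, not real GCP pricing.

STANDARD_VM_PER_HOUR = 0.20   # assumed $/hour per worker
PREEMPTIBLE_DISCOUNT = 0.80   # assumed 80% discount

def weekly_cost(nodes: int, hours: float, per_hour: float, discount: float = 0.0) -> float:
    return nodes * hours * per_hour * (1 - discount)

standard = weekly_cost(15, 0.5, STANDARD_VM_PER_HOUR)
preemptible = weekly_cost(15, 0.5, STANDARD_VM_PER_HOUR, PREEMPTIBLE_DISCOUNT)
print(round(standard, 2), round(preemptible, 2))  # 1.5 0.3
```

Whatever the real prices, the ratio is what matters: a short, restartable weekly batch job is exactly the workload shape where the preemptible discount is nearly free money.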

Comment 2

ID: 65042 User: rickywck Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 07:24 Selected Answer: - Upvotes: 18

I think the answer should be B:

https://cloud.google.com/dataproc/docs/concepts/compute/preemptible-vms

Comment 3

ID: 1326397 User: AmitK121981 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Sat 14 Dec 2024 10:06 Selected Answer: B Upvotes: 1

Everyone is saying preemptibles, but Spot VMs can only be used as secondary workers, not as the master or primary workers, so I'm not sure why this causes savings; secondary workers aren't mandatory either.

Comment 4

ID: 954534 User: theseawillclaim Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 20:37 Selected Answer: - Upvotes: 2

I believe it might be "B", but what if the job is mission critical?
Pre-emptible VMs would be of no use.

Comment 4.1

ID: 1256298 User: enivid007 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 27 Jul 2024 14:41 Selected Answer: - Upvotes: 1

Mission critical workloads can't be needed "weekly"

Comment 5

ID: 878927 User: abi01a Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 24 Apr 2023 03:16 Selected Answer: - Upvotes: 8

I believe ExamTopics ought to provide a brief explanation or a supporting link for its selected answers, such as this one. Option A may be correct from the viewpoint that Dataflow is a serverless service that is fast and cost-effective, and preemptible VMs, though heavily discounted, may not always be available. It would be great to know the reasoning behind the selected option.

Comment 6

ID: 784897 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 04:02 Selected Answer: - Upvotes: 4

B. Use pre-emptible virtual machines (VMs) for the cluster

Using pre-emptible VMs allows you to take advantage of lower-cost virtual machine instances that may be terminated by Google Cloud at any time and run for at most 24 hours. These instances can be a cost-effective way to handle workloads that can tolerate interruption, such as the batch processing job described in the question.

Option A is not ideal, as it would require you to migrate the workload to Google Cloud Dataflow, which may cause additional complexity and would not address the issue of cost optimization.
Option C is not ideal, as it would require you to use a higher-memory node which would increase the cost.
Option D is not ideal, as it would require you to use SSDs on the worker nodes which would increase the cost.

Using pre-emptible VMs is a better option as it allows you to take advantage of lower-cost virtual machine instances and handle workloads that can be interrupted, which can help to optimize the cost of the cluster.

Comment 7

ID: 766964 User: Rodolfo_Marcos Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 05 Jan 2023 20:02 Selected Answer: - Upvotes: 2

What is happening with this site's "correct answers"? A lot of the time they don't make any sense, like this one... It's clearly B.

Comment 8

ID: 747168 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:26 Selected Answer: B Upvotes: 2

Using preemptible machines is cost-effective, and it is suitable for the job mentioned here because the job is fault tolerant.

Comment 9

ID: 745495 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 14 Dec 2022 23:16 Selected Answer: B Upvotes: 1

Use preemptible VMs to save processing cost; the question wants a simple solution.

Comment 10

ID: 737693 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 10:52 Selected Answer: B Upvotes: 1

A - Dataflow is not cost-effective in comparison with Dataproc here.
B - Preemptible VM instances are available at a much lower price (a 60-91% discount) compared to standard VMs, so this is the answer.
C and D are more expensive.

Comment 11

ID: 662654 User: Remi2021 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Wed 07 Sep 2022 17:40 Selected Answer: B Upvotes: 1

B is the right way to go.

Comment 12

ID: 609410 User: FrankT2L Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Mon 30 May 2022 20:26 Selected Answer: B Upvotes: 1

Preemptible workers are the default secondary worker type. They are reclaimed and removed from the cluster if they are required by Google Cloud for other tasks. Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost

https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms

Comment 13

ID: 573754 User: Remi2021 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 23 Mar 2022 17:14 Selected Answer: - Upvotes: 4

B is the right answer. ExamTopics, update your answers or make your site free again.

Comment 14

ID: 568849 User: OmJanmeda Badges: - Relative Date: 3 years, 12 months ago Absolute Date: Wed 16 Mar 2022 08:46 Selected Answer: B Upvotes: 4

B is the right answer.
My experience with ExamTopics is not good; so many wrong answers.

Comment 15

ID: 540098 User: Yaa Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Fri 04 Feb 2022 00:31 Selected Answer: B Upvotes: 2

B should be the right answer.
I am amazed that almost 60% of the marked answers on the site are wrong.

Comment 16

ID: 531332 User: byash1 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 24 Jan 2022 14:44 Selected Answer: - Upvotes: 1

Ans: B.
Here we are trying to reduce cost, so preemptible machines are the best choice.

Comment 17

ID: 516761 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 16:18 Selected Answer: B Upvotes: 4

"this workload can run in approximately 30 minutes on a 15-node cluster,"
so you need performance for only 30 mins -> preemptible VMs

https://cloud.google.com/dataproc/docs/concepts/compute/preemptible-vms

54. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 280

Sequence
175
Discussion ID
130265
Source URL
https://www.examtopics.com/discussions/google/view/130265-exam-professional-data-engineer-topic-1-question-280/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 5:32 a.m.

Question

You are running a streaming pipeline with Dataflow and are using hopping windows to group the data as the data arrives. You noticed that some data is arriving late but is not being marked as late data, which is resulting in inaccurate aggregations downstream. You need to find a solution that allows you to capture the late data in the appropriate window. What should you do?

  • A. Use watermarks to define the expected data arrival window. Allow late data as it arrives.
  • B. Change your windowing function to tumbling windows to avoid overlapping window periods.
  • C. Change your windowing function to session windows to define your windows based on certain activity.
  • D. Expand your hopping window so that the late data has more time to arrive within the grouping.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 6 comments Click to expand

Comment 1

ID: 1117884 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 22:11 Selected Answer: A Upvotes: 7

- Watermarks: Watermarks in a streaming pipeline are used to specify the point in time when Dataflow expects all data up to that point to have arrived.
- Allow Late Data: configure the pipeline to accept and correctly process data that arrives after the watermark, ensuring it's captured in the appropriate window.
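
The watermark and allowed-lateness semantics described above can be illustrated with a toy sketch (a simplified model for intuition only, not actual Beam or Dataflow code):

```python
# Toy illustration of watermark semantics: an element is "late" when its
# event timestamp is older than the current watermark; it can still be
# captured in its original window if it falls within the allowed lateness.

def classify(event_ts: float, watermark: float, allowed_lateness: float) -> str:
    if event_ts >= watermark:
        return "on-time"
    if watermark - event_ts <= allowed_lateness:
        return "late-but-captured"   # emitted into its original window
    return "dropped"                 # too late even for allowed lateness

print(classify(event_ts=100, watermark=90, allowed_lateness=15))  # on-time
print(classify(event_ts=80,  watermark=90, allowed_lateness=15))  # late-but-captured
print(classify(event_ts=60,  watermark=90, allowed_lateness=15))  # dropped
```

In Beam terms, the watermark drives when a window's on-time pane fires, and allowed lateness controls how long the window's state is kept around so late elements can still be aggregated correctly.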

Comment 2

ID: 1325827 User: m_a_p_s Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 20:39 Selected Answer: A Upvotes: 1

A - https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#watermarks

Comment 3

ID: 1155359 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 08:11 Selected Answer: A Upvotes: 1

Option A

Comment 4

ID: 1121855 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 16:35 Selected Answer: A Upvotes: 3

Option A - https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#watermarks

Comment 5

ID: 1117607 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 15:37 Selected Answer: A Upvotes: 3

https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#watermarks

Comment 6

ID: 1113337 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 04:32 Selected Answer: A Upvotes: 1

A. Use watermarks to define the expected data arrival window. Allow late data as it arrives.

55. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 106

Sequence
176
Discussion ID
79777
Source URL
https://www.examtopics.com/discussions/google/view/79777-exam-professional-data-engineer-topic-1-question-106/
Posted By
AWSandeep
Posted At
Sept. 3, 2022, 2:06 p.m.

Question

You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?

  • A. Increase the cluster size with more non-preemptible workers.
  • B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
  • C. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
  • D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 11 comments Click to expand

Comment 1

ID: 762267 User: AzureDP900 Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 20:38 Selected Answer: - Upvotes: 6

D is right
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters#using_graceful_decommissioning

Comment 2

ID: 1090735 User: rocky48 Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sun 08 Dec 2024 04:02 Selected Answer: D Upvotes: 1

D is right
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters#using_graceful_decommissioning

Comment 3

ID: 753112 User: Prakzz Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 22 Dec 2023 10:06 Selected Answer: A Upvotes: 1

Should be A. You can't configure a preemptible worker for graceful decommissioning; it's for non-preemptible worker nodes.

Comment 3.1

ID: 763869 User: wan2three Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 02 Jan 2024 18:10 Selected Answer: - Upvotes: 1

Nope, they are not only for non-preemptible workers.

Comment 4

ID: 747003 User: yafsong Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 16 Dec 2023 10:30 Selected Answer: - Upvotes: 2

graceful decommissioning: to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
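
As a simplified illustration of why graceful decommissioning preserves work in progress (in real Dataproc it is a timeout passed on a cluster update, e.g. via the `--graceful-decommission-timeout` flag, during which YARN lets running work finish before the worker is removed), here is a toy model:

```python
# Toy model contrasting forceful vs graceful worker removal. This is a
# deliberate simplification of Dataproc/YARN decommissioning behavior.

def decommission(in_progress_tasks: list, graceful: bool):
    """Return (completed, lost) task lists when a worker is removed."""
    if graceful:
        # Worker drains: running tasks finish before the node is removed.
        return list(in_progress_tasks), []
    # Forceful removal: whatever was running on the node is lost
    # and must be rescheduled from scratch.
    return [], list(in_progress_tasks)

done, lost = decommission(["map-1", "map-2"], graceful=True)
print(done, lost)  # ['map-1', 'map-2'] []
```

That drain-before-remove behavior is why option D satisfies "without losing work in progress" while forceful decommissioning (option B) does not.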

Comment 5

ID: 738169 User: odacir Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 07 Dec 2023 18:27 Selected Answer: D Upvotes: 1

All your workers need to be the same kind. Use graceful decommissioning so you don't lose any data, and add more preemptible workers (increase the cluster) because they are more cost-effective.

Comment 6

ID: 716804 User: skp57 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 12 Nov 2023 17:33 Selected Answer: - Upvotes: 2

A. "graceful decommissioning" is not a configuration value but a parameter passed with scale down action - to decrease the number of workers to save money (see Graceful Decommissioning as an option to use when downsizing a cluster to avoid losing work in progress)

Comment 7

ID: 669044 User: John_Pongthorn Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Thu 14 Sep 2023 15:16 Selected Answer: D Upvotes: 3

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
Why scale a Dataproc cluster?
to increase the number of workers to make a job run faster
to decrease the number of workers to save money (see Graceful Decommissioning as an option to use when downsizing a cluster to avoid losing work in progress).
to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage

Comment 7.1

ID: 707719 User: hauhau Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 30 Oct 2023 10:05 Selected Answer: - Upvotes: 2

This is weird. The question mentions increasing the cluster, but graceful decommissioning is used when downscaling the cluster.

Comment 7.1.1

ID: 738167 User: odacir Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 07 Dec 2023 18:27 Selected Answer: - Upvotes: 1

All your workers need to be the same kind. Use graceful decommissioning so you don't lose any data, and add more preemptible workers because they are more cost-effective.

Comment 8

ID: 658416 User: AWSandeep Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 03 Sep 2023 14:06 Selected Answer: D Upvotes: 1

D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

56. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 253

Sequence
179
Discussion ID
130204
Source URL
https://www.examtopics.com/discussions/google/view/130204-exam-professional-data-engineer-topic-1-question-253/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 4:27 p.m.

Question

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?

  • A. Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.
  • B. Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.
  • C. Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.
  • D. Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 1114143 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 00:33 Selected Answer: D Upvotes: 5

- Private Google Access for services allows VM instances with only internal IP addresses in a VPC network or on-premises networks (via Cloud VPN or Cloud Interconnect) to reach Google APIs and services.
- When you launch a Dataflow job, you can specify that it should use worker instances without external IP addresses if Private Google Access is enabled on the subnetwork where these instances are launched.
- This way, your Dataflow workers will be able to access Cloud Storage and BigQuery without violating the organizational constraint of no external IPs.
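
As a sketch of how this looks when launching the job (flag names follow the Beam Python SDK's Dataflow worker options; the project, region, and subnetwork values are placeholders, and Private Google Access is assumed to be enabled on the subnetwork beforehand):

```python
# Illustrative pipeline flags for running Dataflow workers with internal
# IPs only. Project/region/subnetwork values are hypothetical.

def dataflow_internal_ip_args(project: str, region: str, subnetwork: str) -> list:
    return [
        f"--project={project}",
        f"--region={region}",
        "--runner=DataflowRunner",
        # Workers get internal IPs only; Private Google Access on the
        # subnetwork lets them still reach Cloud Storage and BigQuery.
        "--no_use_public_ips",
        f"--subnetwork={subnetwork}",
    ]

args = dataflow_internal_ip_args(
    "my-project", "us-central1",
    "regions/us-central1/subnetworks/my-subnet")
print(args)
```

Enabling Private Google Access itself is a one-time subnetwork setting (in the console or via a `gcloud compute networks subnets update` call), not something the pipeline configures.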

Comment 1.1

ID: 1115542 User: Jordan18 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 01:32 Selected Answer: - Upvotes: 3

why not C?

Comment 1.1.1

ID: 1117069 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 08 Jul 2024 22:23 Selected Answer: - Upvotes: 5

Even if you create a VPC Service Controls perimeter, your Dataflow workers will run on Compute Engine instances with private IPs only once the policy is enforced.
Without external IP addresses, you can still perform administrative and monitoring tasks: you can reach your workers over SSH through the options listed above. However, the pipeline cannot access the internet, and internet hosts cannot access your Dataflow workers.

Comment 1.1.1.1

ID: 1117072 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 08 Jul 2024 22:24 Selected Answer: - Upvotes: 4

ref - https://cloud.google.com/dataflow/docs/guides/routes-firewall

Comment 1.1.2

ID: 1119298 User: BIGQUERY_ALT_ALT Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 03:15 Selected Answer: - Upvotes: 3

VPC Service Controls are typically used to define and enforce security perimeters around APIs and services, restricting their access to a specified set of Google Cloud projects. In this scenario, the security constraint is focused on Compute Engine instances used by Dataflow, and VPC Service Controls might be considered a bit heavy-handed for just addressing the internal IP address requirement.

Comment 2

ID: 1226646 User: Lestrang Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sun 08 Dec 2024 11:53 Selected Answer: D Upvotes: 1

No way it is C.
The use case for a VPC Service Controls perimeter is not to establish secure connectivity on its own but to control access, e.g., allowing VMs inside the perimeter to reach services while blocking VMs outside it, even if they are in the same VPC.

D, on the other hand, makes complete sense.

Comment 3

ID: 1163248 User: Moss2011 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 01 Sep 2024 06:20 Selected Answer: C Upvotes: 1

According to this documentation: https://cloud.google.com/vpc-service-controls/docs/overview I think the correct answer is C. Take into account the phrase "organizational constraint" and the VPC Service Control allow you to do that.

Comment 4

ID: 1162975 User: Tryolabs Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 29 Aug 2024 20:01 Selected Answer: D Upvotes: 1

https://cloud.google.com/vpc/docs/private-google-access

"VM instances that only have internal IP addresses (no external IP addresses) can use Private Google Access. They can reach the external IP addresses of Google APIs and services."

Comment 5

ID: 1134398 User: pandeyspecial Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 28 Jul 2024 20:10 Selected Answer: C Upvotes: 1

It should be C

Comment 6

ID: 1121713 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 14:13 Selected Answer: C Upvotes: 1

Option D, as GCP001 said

Comment 6.1

ID: 1121714 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 14:14 Selected Answer: - Upvotes: 2

Missclicked the answer <.<

Comment 7

ID: 1117065 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 08 Jul 2024 22:19 Selected Answer: D Upvotes: 4

https://cloud.google.com/dataflow/docs/guides/routes-firewall

Comment 8

ID: 1112891 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 15:27 Selected Answer: C Upvotes: 1

C. Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.

Comment 8.1

ID: 1119299 User: BIGQUERY_ALT_ALT Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 03:15 Selected Answer: - Upvotes: 2

C is wrong. Option D is simple and straight forward. VPC Service Controls are typically used to define and enforce security perimeters around APIs and services, restricting their access to a specified set of Google Cloud projects. In this scenario, the security constraint is focused on Compute Engine instances used by Dataflow, and VPC Service Controls might be considered a bit heavy-handed for just addressing the internal IP address requirement.

57. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 86

Sequence
180
Discussion ID
17173
Source URL
https://www.examtopics.com/discussions/google/view/17173-exam-professional-data-engineer-topic-1-question-86/
Posted By
Rajokkiyam
Posted At
March 22, 2020, 4:20 a.m.

Question

You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?

  • A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • B. Deploy a Kafka cluster on GCE VM Instances with the Pub/Sub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • C. Deploy the Pub/Sub Kafka connector to your on-prem Kafka cluster and configure Pub/Sub as a Source connector. Use a Dataflow job to read from Pub/Sub and write to GCS.
  • D. Deploy the Pub/Sub Kafka connector to your on-prem Kafka cluster and configure Pub/Sub as a Sink connector. Use a Dataflow job to read from Pub/Sub and write to GCS.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 29 comments Click to expand

Comment 1

ID: 73525 User: Ganshank Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Tue 12 Oct 2021 06:08 Selected Answer: - Upvotes: 34

A.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330
The solution specifically mentions mirroring and minimizing the use of Kafka Connect plugin.
D would be the more Google Cloud-native way of implementing the same, but the requirement is better met by A.
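
As a minimal sketch of the mirroring setup option A describes (MirrorMaker 1 style; the host names and topic whitelist below are placeholders, not values from the question):

```python
# Minimal sketch of MirrorMaker configuration for option A: mirror
# on-prem topics to a Kafka cluster running on GCE. Hosts are placeholders.

consumer_props = {                      # reads from the on-prem cluster
    "bootstrap.servers": "onprem-kafka:9092",
    "group.id": "mirror-maker-group",
}
producer_props = {                      # writes to the cluster on GCE
    "bootstrap.servers": "gce-kafka:9092",
}

# These dicts would be written out as consumer.properties and
# producer.properties, then MirrorMaker is launched with a topic
# whitelist, e.g.:
#   kafka-mirror-maker.sh --consumer.config consumer.properties \
#       --producer.config producer.properties --whitelist "weblogs.*"
print(consumer_props["bootstrap.servers"], "->", producer_props["bootstrap.servers"])
```

From the GCE cluster onward, a Dataproc or Dataflow job consumes the mirrored topics and writes to Cloud Storage/BigQuery, with no Kafka Connect plugin involved anywhere.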

Comment 2

ID: 504066 User: hendrixlives Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 07:14 Selected Answer: A Upvotes: 6

"A" is the answer which complies with the requirements (specifically, "The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins"). Indeed, one of the uses of what is called "Geo-Replication" (or Cross-Cluster Data Mirroring) in Kafka is precisely cloud migrations: https://kafka.apache.org/documentation/#georeplication

However I agree with Ganshank, and the optimal "Google way" way would be "D", installing the Pub/Sub Kafka connector to move the data from on-prem to GCP.

Comment 3

ID: 916994 User: Qix Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sat 07 Dec 2024 11:02 Selected Answer: - Upvotes: 4

Pub/Sub Kafka connector requires Kafka Connect, as described here https://cloud.google.com/pubsub/docs/connect_kafka
Deployment of Kafka Connect is explicitly excluded by the requirements. So the only option available is A

Comment 4

ID: 798270 User: samdhimal Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 18:59 Selected Answer: - Upvotes: 3

Option A: Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.

This option involves setting up a separate Kafka cluster in Google Cloud, and then configuring the on-prem cluster to mirror the topics to this cluster. The data from the Google Cloud Kafka cluster can then be read using either a Dataproc cluster or a Dataflow job and written to Cloud Storage for analysis in BigQuery.

Comment 4.1

ID: 798271 User: samdhimal Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 19:00 Selected Answer: - Upvotes: 1

Option B: Deploy a Kafka cluster on GCE VM Instances with the Pub/Sub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.

This option is similar to Option A, but involves using the Pub/Sub Kafka connector as a sink connector instead of mirroring the topics from the on-prem cluster. This option would result in the same duplication of data and additional resources required as Option A, making it less desirable.

Comment 4.1.1

ID: 799010 User: samdhimal Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 05 Aug 2024 17:04 Selected Answer: - Upvotes: 1

Sorry, I messed up. The answer is probably A. My bad...

Comment 4.2

ID: 798273 User: samdhimal Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 19:01 Selected Answer: - Upvotes: 1

Option D: Deploy the Pub/Sub Kafka connector to your on-prem Kafka cluster and configure Pub/Sub as a Sink connector. Use a Dataflow job to read from Pub/Sub and write to GCS.

This option involves deploying the Pub/Sub Kafka connector on the on-prem cluster, but configuring it as a sink connector. In this case, the data from the on-prem Kafka cluster would be sent directly to Pub/Sub, which would act as the final destination for the data. A Dataflow job would then be used to read the data from Pub/Sub and write it to Cloud Storage for analysis in BigQuery. This option would result in the data being stored in both the on-prem cluster and Pub/Sub, making it less desirable compared to option C, where the data is only stored in Pub/Sub as an intermediary between the on-prem cluster and Google Cloud.

Comment 4.2.1

ID: 820500 User: musumusu Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 13:22 Selected Answer: - Upvotes: 1

You are pasting ChatGPT replies. If you instruct ChatGPT that plugins must not be used, as the question says, it will answer A.

Comment 4.3

ID: 798272 User: samdhimal Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 04 Aug 2024 19:00 Selected Answer: - Upvotes: 2

Option C: Deploy the Pub/Sub Kafka connector to your on-prem Kafka cluster and configure Pub/Sub as a Source connector. Use a Dataflow job to read from Pub/Sub and write to GCS.

This option involves deploying the Pub/Sub Kafka connector directly on the on-prem cluster, and configuring it as a source connector. The data from the on-prem Kafka cluster is then sent directly to Pub/Sub, which acts as an intermediary between the on-prem cluster and the data stored in Google Cloud. A Dataflow job is then used to read the data from Pub/Sub and write it to Cloud Storage for analysis in BigQuery. This option avoids the duplication of data and additional resources required by the other options, making it the preferred option.

Comment 5

ID: 700613 User: Afonya Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Sun 21 Apr 2024 09:29 Selected Answer: A Upvotes: 1

"The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins."

Comment 6

ID: 692865 User: somnathmaddi Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Fri 12 Apr 2024 10:39 Selected Answer: - Upvotes: 3

D is the right answer

Comment 7

ID: 676612 User: clouditis Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 23 Mar 2024 02:21 Selected Answer: - Upvotes: 2

D is the right answer

Comment 8

ID: 466264 User: gcp_k Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 22 Apr 2023 20:06 Selected Answer: - Upvotes: 3

Going with "D"

Refer: https://stackoverflow.com/questions/55277188/kafka-to-google-pub-sub-using-sink-connector

Comment 8.1

ID: 504354 User: baubaumiaomiao Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 17:26 Selected Answer: - Upvotes: 1

"avoid deployment of Kafka Connect plugins"

Comment 9

ID: 395279 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 01:58 Selected Answer: - Upvotes: 1

Vote for A

Comment 10

ID: 308257 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 19:08 Selected Answer: - Upvotes: 3

Answer: A
Description: Question says mirroring to avoid kafka connect plugins

Comment 11

ID: 287145 User: Allan222 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 09 Aug 2022 20:52 Selected Answer: - Upvotes: 1

Correct is D

Comment 11.1

ID: 402268 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 09 Jan 2023 01:29 Selected Answer: - Upvotes: 1

As per question - "avoid deployment of Kafka Connect plugins."

Comment 12

ID: 183110 User: vakati Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Sun 20 Mar 2022 18:40 Selected Answer: - Upvotes: 3

A.
the best solution would be D but given the restriction here to use mirroring and avoid connectors, A would be the natural choice

Comment 13

ID: 175667 User: Tanmoyk Badges: - Relative Date: 4 years ago Absolute Date: Tue 08 Mar 2022 09:04 Selected Answer: - Upvotes: 4

D should be the correct answer. Configure pub/sub as sink

Comment 14

ID: 162521 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 01:20 Selected Answer: - Upvotes: 2

C is correct.
https://docs.confluent.io/current/connect/kafka-connect-gcp-pubsub/index.html

Comment 14.1

ID: 162528 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 01:39 Selected Answer: - Upvotes: 1

Correct Answer: D
Why is this correct?
You can connect Kafka to GCP by using a connector. The 'downstream' service (Pub/Sub) will use a sink connector.

Comment 14.1.1

ID: 395277 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 01:55 Selected Answer: - Upvotes: 2

Question says : avoid deployment of Kafka Connect plugins.

Comment 15

ID: 151611 User: clouditis Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 06 Feb 2022 05:10 Selected Answer: - Upvotes: 3

It's D. Why would Google prefer Kafka in its own cert questions! :)

Comment 15.1

ID: 440465 User: Ral17 Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 19:24 Selected Answer: - Upvotes: 3

Because the question says to avoid deployment of Kafka Connect plugins

Comment 16

ID: 147492 User: Archy Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 30 Jan 2022 18:04 Selected Answer: - Upvotes: 1

Answer is D, as on-prem Kafka supports a sink connector for outgoing data.

Comment 17

ID: 66759 User: Rajokkiyam Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Wed 22 Sep 2021 03:20 Selected Answer: - Upvotes: 5

Correct Answer : D.

Comment 17.1

ID: 504356 User: baubaumiaomiao Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 17:27 Selected Answer: - Upvotes: 1

"avoid deployment of Kafka Connect plugins"

Comment 17.2

ID: 162527 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 01:39 Selected Answer: - Upvotes: 2

Correct Answer: D
Why is this correct?
You can connect Kafka to GCP by using a connector. The 'downstream' service (Pub/Sub) will use a sink connector.
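For readers unfamiliar with the sink-connector approach discussed above (and ruled out by the question's "avoid Kafka Connect plugins" constraint), here is a sketch of what such a deployment would involve. The connector class and `cps.*` property names follow Google's open-source Pub/Sub Kafka connector and may differ by version; the project and topic names are hypothetical placeholders.

```python
import json

# Sketch of a Kafka Connect sink configuration that forwards a Kafka topic
# into Pub/Sub. Connector class and property names are assumptions based on
# Google's open-source pubsub-kafka connector; verify against your version.
sink_config = {
    "name": "pubsub-sink",
    "config": {
        "connector.class": "com.google.pubsub.kafka.sink.CloudPubSubSinkConnector",
        "tasks.max": "1",
        "topics": "iot-events",           # Kafka topic to drain
        "cps.project": "my-gcp-project",  # hypothetical GCP project ID
        "cps.topic": "events",            # destination Pub/Sub topic
    },
}

# This JSON body would be POSTed to the Kafka Connect REST API.
request_body = json.dumps(sink_config)
```

Deploying this is exactly the plugin management the question asks to avoid, which is why mirroring-based answers are favored in the discussion above.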

58. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 121

Sequence
185
Discussion ID
17240
Source URL
https://www.examtopics.com/discussions/google/view/17240-exam-professional-data-engineer-topic-1-question-121/
Posted By
-
Posted At
March 22, 2020, 11:22 a.m.

Question

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally.
Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?

  • A. Edge TPUs as sensor devices for storing and transmitting the messages.
  • B. Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
  • C. An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
  • D. A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 16 comments Click to expand

Comment 1

ID: 69728 User: Rajokkiyam Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Thu 30 Sep 2021 07:13 Selected Answer: - Upvotes: 8

Answer C. Cloud native = Pub/Sub + Dataflow
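The scenario's core problem, edge devices buffering messages while offline and flushing them in one burst when connectivity returns, can be sketched with a toy model. All names here are illustrative, not a real gateway or Pub/Sub API; the point is that a fixed-size Kafka cluster must be sized for the burst, while a managed service like Pub/Sub absorbs it.

```python
from collections import deque

class EdgeGateway:
    """Toy model of an IoT edge gateway: buffers readings while offline,
    then flushes the whole backlog in one burst when connectivity returns."""

    def __init__(self, publish):
        self.publish = publish  # callable, e.g. a wrapper around a Pub/Sub publisher
        self.buffer = deque()
        self.online = False

    def record(self, reading):
        if self.online:
            self.publish(reading)
        else:
            self.buffer.append(reading)  # poor connectivity: hold at the edge

    def reconnect(self):
        self.online = True
        # This burst is what spikes load on a fixed-size Kafka cluster;
        # an autoscaling managed service absorbs it instead.
        while self.buffer:
            self.publish(self.buffer.popleft())

received = []
gw = EdgeGateway(received.append)
gw.record("m1"); gw.record("m2")  # offline: buffered at the edge
gw.reconnect()                    # burst: backlog arrives all at once
gw.record("m3")                   # online: delivered immediately
```

With answer C, the gateway publishes to Pub/Sub, and Dataflow autoscales its workers to drain whatever backlog the bursts create.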

Comment 2

ID: 131891 User: Rajuuu Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Tue 11 Jan 2022 07:32 Selected Answer: - Upvotes: 5

Answer is C. Pub/Sub is the messaging tool for Global.

Comment 3

ID: 909410 User: ga8our Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Fri 29 Nov 2024 15:56 Selected Answer: - Upvotes: 1

Can anyone pls explain what's wrong with D, the load balancing solution?

Comment 4

ID: 820795 User: musumusu Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 17:48 Selected Answer: - Upvotes: 2

Answer C:
What is wrong with D? Nothing as such: Cloud Load Balancing can shift traffic for high volume and poor connectivity in one region. It costs on average $0.01-0.25 per GB, or about $0.05 per hour for HTTP requests if volume is too high. D might be the answer if this exam were for a network engineer.

Comment 5

ID: 811704 User: musumusu Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 09:45 Selected Answer: - Upvotes: 1

Answer C, but it will not solve a bad internet connection; make sure there is around 100 Mbps of internet speed on the sensor side.

Comment 6

ID: 708655 User: MisuLava Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 30 Apr 2024 18:29 Selected Answer: - Upvotes: 1

"single on-premises Kafka cluster in a data center in the us-east region"
is it on-prem or in a datacenter in us-east ?

Comment 7

ID: 633686 User: JamesKarianis Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 19 Jan 2024 20:16 Selected Answer: C Upvotes: 1

Answer is C

Comment 8

ID: 553021 User: Prasanna_kumar Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 16:43 Selected Answer: - Upvotes: 1

Answer is option C

Comment 9

ID: 406803 User: ivanhsiav Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 15 Jan 2023 08:40 Selected Answer: - Upvotes: 4

Answer c
Kafka cluster on-premises for streaming messages;
Pub/Sub for streaming messages in the cloud.

Comment 10

ID: 397439 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 03 Jan 2023 11:59 Selected Answer: - Upvotes: 4

Vote for C

Comment 11

ID: 298639 User: Allan222 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 25 Aug 2022 00:39 Selected Answer: - Upvotes: 4

Should be C

Comment 12

ID: 293413 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 18 Aug 2022 13:57 Selected Answer: - Upvotes: 5

C is correct:
The main trick comes from A, and the response is that TPUs are only used when we have a deployed machine learning model, which we don't have here.

Comment 13

ID: 291347 User: ArunSingh1028 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 15 Aug 2022 23:32 Selected Answer: - Upvotes: 4

Answer - C

Comment 14

ID: 216558 User: Alasmindas Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 11:35 Selected Answer: - Upvotes: 5

Easy question: answer is Option C.
The alternative to Kafka among Google cloud-native services is Pub/Sub, and Dataflow paired with Pub/Sub is the Google-recommended option.

Comment 15

ID: 163998 User: atnafu2020 Badges: - Relative Date: 4 years ago Absolute Date: Wed 23 Feb 2022 03:10 Selected Answer: - Upvotes: 4

C
The issue with a single Kafka cluster is the need to scale automatically, which Dataflow provides.

Comment 16

ID: 163113 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 21:11 Selected Answer: - Upvotes: 4

C is correct

59. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 17

Sequence
186
Discussion ID
16730
Source URL
https://www.examtopics.com/discussions/google/view/16730-exam-professional-data-engineer-topic-1-question-17/
Posted By
-
Posted At
March 16, 2020, 11:26 a.m.

Question

Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

  • A. Create a Google Cloud Dataflow job to process the data.
  • B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
  • C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
  • D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
  • E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 475020 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 4 months ago Absolute Date: Tue 09 Nov 2021 20:11 Selected Answer: - Upvotes: 13

D is correct because it uses managed services, and also allows for the data to persist on GCS beyond the life of the cluster.
A is not correct because the goal is to re-use their Hadoop jobs and MapReduce and/or Spark jobs cannot simply be moved to Dataflow.
B is not correct because the goal is to persist the data beyond the life of the ephemeral clusters, and if HDFS is used as the primary attached storage mechanism, it will also disappear at the end of the cluster’s life.
C is not correct because the goal is to use managed services as much as possible, and this is the opposite.
E is not correct because the goal is to use managed services as much as possible, and this is the opposite.

Comment 1.1

ID: 1318682 User: certs4pk Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 27 Nov 2024 14:27 Selected Answer: - Upvotes: 1

B is incorrect because it did not say 'off-cluster' persistent HDFS disks.

Comment 2

ID: 214273 User: Radhika7983 Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Fri 06 Nov 2020 20:50 Selected Answer: - Upvotes: 6

The correct answer is D. Here is the explanation of why Dataproc and why not Dataflow.
When a company wants to move its existing on-premises Hadoop jobs to the cloud, it can simply move the jobs to Cloud Dataproc and replace hdfs:// with gs://, which is Google Cloud Storage. This way you keep compute and storage separate. Hence the correct answer is D. However, if the company wanted to create entirely new jobs and not reuse the existing Hadoop jobs running on premise, the option would be to create new Dataflow jobs.

Comment 3

ID: 1008581 User: suku2 Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:28 Selected Answer: D Upvotes: 2

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
Dataproc clusters can be created to lift and shift existing Hadoop jobs
Data stored in Google Cloud Storage extends beyond the life of a Dataproc cluster.

Comment 4

ID: 1027037 User: imran79 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:28 Selected Answer: - Upvotes: 1

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

Here's why:

Cloud Dataproc allows you to run Apache Hadoop jobs with minimal management. It is a managed Hadoop service.

Using the Google Cloud Storage (GCS) connector, Dataproc can access data stored in GCS, which allows data persistence beyond the life of the cluster. This means that even if the cluster is deleted, the data in GCS remains intact. Moreover, using GCS is often cheaper and more durable than using HDFS on persistent disks.

Comment 4.1

ID: 1318683 User: certs4pk Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 27 Nov 2024 14:28 Selected Answer: - Upvotes: 1

what if option B said, 'off cluster' persistent HDFS disks?

Comment 5

ID: 1050496 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:28 Selected Answer: D Upvotes: 3

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

Google Cloud Dataproc is a managed Hadoop and Spark service that allows you to easily create and manage Hadoop clusters in the cloud. By using the Google Cloud Storage connector, you can persist data in Google Cloud Storage, which provides durable storage beyond the cluster's lifecycle. This approach ensures data is retained even if the cluster is terminated, and it allows you to reuse your existing Hadoop jobs.

Option B (Creating a Dataproc cluster that uses persistent disks for HDFS) is another valid choice. However, using Google Cloud Storage for data storage and processing is often more cost-effective and scalable, especially when migrating to the cloud.

Options A, C, and E do not take full advantage of Google Cloud's services and the benefits of cloud-native data storage and processing with Google Cloud Storage and Dataproc.

Comment 6

ID: 1238709 User: fahadminhas Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 28 Jun 2024 14:18 Selected Answer: - Upvotes: 1

Option D is incorrect, as it would not provide persistent HDFS storage within the cluster itself. Rather, B should be the correct answer.

Comment 7

ID: 1008345 User: kshehadyx Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 15 Sep 2023 11:52 Selected Answer: - Upvotes: 1

Correct D

Comment 8

ID: 835665 User: bha11111 Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 05:49 Selected Answer: D Upvotes: 2

Hadoop --> Dataproc; persistent storage after the processing --> GCS

Comment 9

ID: 772685 User: samdhimal Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 11 Jan 2023 16:39 Selected Answer: D Upvotes: 1

D Seems right. Cloud storage can be used to achieve data storage even after the life of cluster.

Comment 10

ID: 768261 User: korntewin Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 07 Jan 2023 06:43 Selected Answer: D Upvotes: 1

The answer is D! With Dataproc there is no need for us to manage the infra, and Cloud Storage needs no management from us either!

Comment 11

ID: 741892 User: Nirca Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 11 Dec 2022 17:38 Selected Answer: D Upvotes: 1

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

Comment 12

ID: 712031 User: assU2 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 05 Nov 2022 23:27 Selected Answer: - Upvotes: 1

Seems like it is D. https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs
Never saw they mentioned persistent disks, although they are not deleted with the clusters...

Comment 12.1

ID: 712036 User: assU2 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 05 Nov 2022 23:29 Selected Answer: - Upvotes: 1

although:
By default, when no local SSDs are provided, HDFS data and intermediate shuffle data is stored on VM boot disks, which are Persistent Disks.

Comment 12.1.1

ID: 712041 User: assU2 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 05 Nov 2022 23:35 Selected Answer: - Upvotes: 2

and it says that only VM Boot disks are deleted when the cluster is deleted.

Comment 13

ID: 696989 User: achafill Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Mon 17 Oct 2022 08:25 Selected Answer: D Upvotes: 1

Correct Answer : D

Comment 14

ID: 681457 User: nkunwar Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 08:33 Selected Answer: D Upvotes: 1

The Dataproc cluster setup is ephemeral: it runs the Hadoop jobs and can be killed after job execution, which also deletes any HDFS storage that lives on the cluster.

Comment 15

ID: 648698 User: crisimenjivar Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 19 Aug 2022 04:35 Selected Answer: - Upvotes: 1

Anwer: D

Comment 16

ID: 612259 User: Asheesh1909 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Mon 06 Jun 2022 10:33 Selected Answer: - Upvotes: 1

Isn't it both A and D: Dataflow for reusable jobs and GCS for data persistence?

Comment 17

ID: 584107 User: kmaiti Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Mon 11 Apr 2022 10:04 Selected Answer: D Upvotes: 2

Two key points:
Managed hadoop cluster - dataproc
Persistent storage: GCS (dataproc uses gcs connector to connect to gcs)

60. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 278

Sequence
187
Discussion ID
130263
Source URL
https://www.examtopics.com/discussions/google/view/130263-exam-professional-data-engineer-topic-1-question-278/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 5:24 a.m.

Question

Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming job, that you wrote with the Apache Beam SDK, reads these messages, sends acknowledgment to Pub/Sub, applies some custom business logic in a DoFn instance, and writes the result to BigQuery. You want to ensure that if your business logic fails on a message, the message will be sent to a Pub/Sub topic that you want to monitor for alerting purposes. What should you do?

  • A. Enable retaining of acknowledged messages in your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the subscription/num_retained_acked_messages metric on this subscription.
  • B. Use an exception handling block in your Dataflow’s DoFn code to push the messages that failed to be transformed through a side output and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_unacked_messages_by_region metric on this new topic.
  • C. Enable dead lettering in your Pub/Sub pull subscription, and specify a new Pub/Sub topic as the dead letter topic. Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.
  • D. Create a snapshot of your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the snapshot/num_messages metric on this snapshot.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 11 comments Click to expand

Comment 1

ID: 1117872 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 22:57 Selected Answer: B Upvotes: 15

- Exception Handling in DoFn: Implementing an exception handling block within DoFn in Dataflow to catch failures during processing is a direct way to manage errors.
- Side Output to New Topic: Using a side output to redirect failed messages to a new Pub/Sub topic is an effective way to isolate and manage these messages.
- Monitoring: Monitoring the num_unacked_messages_by_region on the new topic can alert you to the presence of failed messages.
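The side-output pattern described above can be sketched in plain Python. In an actual Beam DoFn the failure branch would `yield pvalue.TaggedOutput('dead_letter', msg)` and the pipeline would split the outputs with `.with_outputs()` before publishing the dead-letter PCollection to the monitoring topic; the business logic below is a hypothetical stand-in, but the routing logic is the same.

```python
def run_with_dead_letter(messages, business_logic):
    """Plain-Python model of the Beam side-output (dead-letter) pattern:
    successful results go to the main output, failures are captured and
    routed to a dead-letter collection instead of crashing the job."""
    main, dead_letter = [], []
    for msg in messages:
        try:
            main.append(business_logic(msg))
        except Exception:
            dead_letter.append(msg)  # would be published to the alerting topic
    return main, dead_letter

def parse_measurement(msg):
    # Hypothetical stand-in for the custom business logic; fails on bad input.
    return float(msg)

good, bad = run_with_dead_letter(["1.5", "oops", "2.0"], parse_measurement)
```

The key point for the question: because the error handling lives inside the pipeline, it catches failures that happen after Pub/Sub has already acknowledged the message, which subscription-level dead lettering (option C) cannot do.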

Comment 2

ID: 1289100 User: chrissamharris Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 18:56 Selected Answer: - Upvotes: 2

Option C - dead letter topic is built in and requires no changes https://cloud.google.com/pubsub/docs/handling-failures

Enable dead lettering in your Pub/Sub pull subscription, and specify a new Pub/Sub topic as the dead letter topic. Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.

Comment 2.1

ID: 1317401 User: Positron75 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Mon 25 Nov 2024 10:33 Selected Answer: - Upvotes: 3

Dead lettering is used to handle messages that have not been acknowledged, but that's unrelated to the processing that Dataflow does, which takes place later in the chain. A message could still be acknowledged and fail processing for whatever reason, so it would not be sent to the dead letter topic.

Also, Google advises against using dead lettering with Dataflow anyway: https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#dead-letter-topics

Correct answer is B. The error handling has to be written into the Dataflow pipeline itself.

Comment 3

ID: 1283558 User: 7787de3 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 14 Sep 2024 10:59 Selected Answer: B Upvotes: 1

See here:
https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#unsupported-features
It's not recommended to use Pub/Sub dead-letter topics with Dataflow (...) Instead, implement the dead-letter pattern explicitly in the pipeline

Comment 4

ID: 1254092 User: Jeyaraj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 06:03 Selected Answer: - Upvotes: 1

Option B.

Here's why:

Side Output for Failed Messages: Dataflow allows you to use side outputs to handle messages that fail processing. In your DoFn, you can catch exceptions and write the failed messages to a separate PCollection. This PCollection can then be written to a new Pub/Sub topic.
New Pub/Sub Topic for Monitoring: Creating a dedicated Pub/Sub topic for failed messages allows you to monitor it specifically for alerting purposes. This provides a clear view of any issues with your business logic.
topic/num_unacked_messages_by_region Metric: This Cloud Monitoring metric tracks the number of unacknowledged messages in a Pub/Sub topic. By monitoring this metric on your new topic, you can identify when messages are failing to be processed correctly.

Comment 5

ID: 1192057 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 09 Apr 2024 08:36 Selected Answer: - Upvotes: 1

I would like to know why isn't anyone considering the option C.

Comment 5.1

ID: 1192062 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 09 Apr 2024 08:45 Selected Answer: - Upvotes: 2

I think that C is not right anyway: in order to use the dead-letter feature, the message CANNOT be acknowledged by the subscriber. This question says that the messages are first acknowledged and then the business logic is applied. So if there is an error in the business logic, we cannot use the dead-letter feature, because the message was already acknowledged. Thus, option B is the right one.

Comment 6

ID: 1174490 User: hanoverquay Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Fri 15 Mar 2024 21:48 Selected Answer: B Upvotes: 1

option B

Comment 7

ID: 1155336 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 08:30 Selected Answer: B Upvotes: 1

Option B

Comment 8

ID: 1121847 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 17:28 Selected Answer: B Upvotes: 1

Option B - Raaad explanation is complete

Comment 9

ID: 1113333 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 05:24 Selected Answer: B Upvotes: 1

B. Use an exception handling block in your Dataflow’s DoFn code to push the messages that failed to be transformed through a side output and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_unacked_messages_by_region metric on this new topic.

61. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 33

Sequence
188
Discussion ID
17054
Source URL
https://www.examtopics.com/discussions/google/view/17054-exam-professional-data-engineer-topic-1-question-33/
Posted By
-
Posted At
March 20, 2020, 3:42 p.m.

Question

Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud
Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?

  • A. Check the dashboard application to see if it is not displaying correctly.
  • B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
  • C. Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
  • D. Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 32 comments Click to expand

Comment 1

ID: 68581 User: [Removed] Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Sat 27 Mar 2021 13:50 Selected Answer: - Upvotes: 36

Answer: C
Description: Stackdriver can be used to check errors such as the number of unacknowledged messages, or the publisher pushing messages faster than they can be consumed

Comment 1.1

ID: 133299 User: tprashanth Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Tue 13 Jul 2021 00:46 Selected Answer: - Upvotes: 25

B.
Stackdriver Monitoring is for performance, not for logging missing data.

Comment 1.1.1

ID: 142010 User: mikey007 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Fri 23 Jul 2021 14:37 Selected Answer: - Upvotes: 2

https://cloud.google.com/pubsub/docs/monitoring

Comment 1.1.1.1

ID: 616354 User: ritinhabb Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 14 Jun 2023 20:37 Selected Answer: - Upvotes: 1

Exactly!

Comment 1.2

ID: 218203 User: snamburi3 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 13 Nov 2021 01:48 Selected Answer: - Upvotes: 10

All messages are being published to Cloud Pub/Sub successfully, so Stackdriver might not help.

Comment 1.2.1

ID: 442804 User: kubosuke Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 06:03 Selected Answer: - Upvotes: 12

Messages were sent successfully to the topic, but not to the subscription.
In this case, if Dataflow cannot handle messages correctly, it might not return acknowledgments to Pub/Sub, and these errors can be seen in Monitoring.
https://cloud.google.com/pubsub/docs/monitoring#monitoring_exp
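For concreteness, the Monitoring check described above would watch the backlog of unacknowledged messages on the subscription. The metric type below follows the Pub/Sub monitoring documentation; the subscription ID is a hypothetical placeholder, and the filter string would be passed to the Cloud Monitoring time-series API or an alerting policy.

```python
# Hypothetical subscription that the Dataflow job reads from.
SUBSCRIPTION = "dashboard-dataflow-sub"

# Filter for messages Pub/Sub has delivered but the subscriber has not
# acknowledged; a growing value suggests Dataflow is failing or falling behind.
metric_filter = (
    'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
    f'AND resource.labels.subscription_id="{SUBSCRIPTION}"'
)
```

Note the limitation raised elsewhere in this thread: such a metric gives counts and ages, not the contents of the missing messages themselves.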

Comment 1.2.1.1

ID: 532366 User: Tanzu Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Wed 25 Jan 2023 20:36 Selected Answer: - Upvotes: 1

To be more precise, the message goes first to the publisher,
- then forwards to the topic, with persistence for a while,
- then forwards to the subscriber,
- then to the subscription,
- then acknowledgement happens.

So at every step there is a possibility of errors.

Comment 1.2.1.1.1

ID: 738082 User: jkhong Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 07 Dec 2023 17:01 Selected Answer: - Upvotes: 2

Pub/Sub doesn't forward from subscriber to subscription. A topic sends the message to the subscription first, then to the subscriber.

Comment 2

ID: 66282 User: [Removed] Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Sat 20 Mar 2021 15:42 Selected Answer: - Upvotes: 25

Should be B

Comment 3

ID: 1050562 User: rtcpost Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 22 Oct 2024 14:35 Selected Answer: B Upvotes: 1

B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
* By running a fixed dataset through the Cloud Dataflow pipeline, you can determine if the problem lies within the data processing stage. This allows you to identify any issues with data transformation, filtering, or processing in your pipeline.
* Analyzing the output from this fixed dataset will help you isolate the problem and confirm whether it's related to data processing or the dashboard application.
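The suggestion above can be made concrete: feed the pipeline's core transform a fixed dataset with known-good and known-bad records, then compare output to input. A plain-Python sketch; the transform and field name are hypothetical stand-ins, and in Beam the same check could be done with TestPipeline and assert_that.

```python
import json

def pipeline(records):
    """Stand-in for the Dataflow transform chain: parse a simple JSON
    message and extract a field. Field name 'value' is hypothetical."""
    out = []
    for raw in records:
        try:
            out.append(json.loads(raw)["value"])
        except (ValueError, KeyError):
            pass  # silently dropped: exactly the bug a fixed dataset exposes
    return out

# A fixed dataset mixing well-formed and malformed messages.
fixed_dataset = ['{"value": 1}', 'not json', '{"other": 2}', '{"value": 3}']
output = pipeline(fixed_dataset)
# Comparing len(output) to len(fixed_dataset) pinpoints drops inside the
# pipeline, isolating it from the dashboard as the source of missing data.
```

If the fixed dataset passes through intact, the investigation moves downstream to the dashboard (option A); if records go missing here, the pipeline itself is the culprit.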

Comment 3.1

ID: 1058697 User: ruben82 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Thu 31 Oct 2024 11:50 Selected Answer: - Upvotes: 1

You must know what kind of data causes the errors. I think the first step is to get the erroneous data and then test with a sample of it.

Comment 4

ID: 1027204 User: imran79 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 07 Oct 2024 10:34 Selected Answer: - Upvotes: 1

B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output. If this results in the expected output, then the problem might be with the dashboard application (Option A), and that should be checked next.

Comment 5

ID: 919136 User: WillemHendr Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 09 Jun 2024 11:27 Selected Answer: B Upvotes: 3

"...to find the missing messages"
Up to that remark, Monitoring was a valid option as well. But missing messages cannot be found with monitoring.
It is simply not possible to find the exact missing message. I read this remark as a test of whether you know what is, and what isn't, possible with monitoring.

Comment 6

ID: 867236 User: izekc Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 13:16 Selected Answer: B Upvotes: 2

The question here is to determine the next step, not the best way to optimize the workload. So B is the correct next step.

Comment 6.1

ID: 887718 User: Jarek7 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 19:06 Selected Answer: - Upvotes: 1

B is not the next step. The next step is to check between Pub/Sub and Dataflow (C). B will not help with that at all. It could show whether the issue is in the pipeline or in the view, but it could also show nothing: you have no idea why some messages are not shown, so most probably it wouldn't get you any info. Definitely the next step is to check whether the issue is between Pub/Sub and Dataflow; then you could go with B.

Comment 7

ID: 866662 User: Adswerve Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 00:37 Selected Answer: D Upvotes: 1

Pull subscription is the correct one. Push subscription means Dataflow cannot keep up with the topic.

Comment 7.1

ID: 887722 User: Jarek7 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 19:11 Selected Answer: - Upvotes: 1

It could be the issue, but C would reveal whether it is the real one. If you don't check Stackdriver, you cannot be sure you really resolved the issue: even if it seems to work properly after switching to pull, that might be due to some other temporary factor.

Comment 8

ID: 818701 User: midgoo Badges: - Relative Date: 2 years ago Absolute Date: Fri 23 Feb 2024 03:44 Selected Answer: B Upvotes: 4

If Dataflow does not produce the expected output, something is wrong either at the input or in the pipeline. The chance that the issue is at the input (Pub/Sub) is very low, so it is likely the pipeline has a mistake (e.g. JSON parsing failed). We should follow B to debug the pipeline (using a snapshot as the test dataset, for example).

Comment 9

ID: 796981 User: ploer Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 03 Feb 2024 12:21 Selected Answer: B Upvotes: 7

The most efficient solution would be to run a fixed dataset through the Cloud Dataflow pipeline and analyze the output (Option B). This will allow you to determine if the issue is with the pipeline or with the dashboard application. By analyzing the output, you can see if the messages are being processed correctly and determine if there are any discrepancies or missing messages. If the issue is with the pipeline, you can then debug and make any necessary updates to ensure that all messages are processed correctly. If the issue is with the dashboard application, you can then focus on resolving that issue. This approach allows you to isolate and identify the root cause of the missing messages in a controlled and efficient manner.

Comment 10

ID: 791706 User: Lestrang Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 29 Jan 2024 15:21 Selected Answer: B Upvotes: 7

I've just skimmed over the Stackdriver docs, yes guys, it helps you check the number and age of messages that were not received/acknowledged, excellent, hurray.

So first off, C will not give us the missing messages; it will give us their count and age.
That means C is inherently incorrect.

Additionally, will knowledge of the number of messages make resolving the problem any easier? No, it is just confirming what we already know.

Meanwhile, approach B will allow us to see HOW and WHY it is missing some messages, which is the step that precedes the fix.

Comment 11

ID: 788617 User: PolyMoe Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 11:38 Selected Answer: C Upvotes: 2

Here is ChatGPT answer :
It's always a good practice to start by checking the logs and monitoring tools to see if there is any indication of an issue with the messages being published to Cloud Pub/Sub. In this case, you should use Google Stackdriver Monitoring to investigate if the missing messages have been published or not. You can also run a fixed dataset through the Cloud Dataflow pipeline to see if the pipeline is processing the messages correctly. If there is no issue found on the Cloud Pub/Sub and Cloud Dataflow, then you can check the dashboard application to see if it is not displaying the messages correctly. As a last resort, you can switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.

Comment 11.1

ID: 791710 User: Lestrang Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 29 Jan 2024 15:24 Selected Answer: - Upvotes: 4

I provided it with the question as input but added the metrics available in Stackdriver, here is the response:

B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.

If messages are being published successfully to Cloud Pub/Sub but are missing in the dashboard, the issue is likely to be with the Cloud Dataflow pipeline that processes the messages. To find the root cause of the problem, you should run a fixed dataset through the pipeline and analyze the output. This will allow you to see if the pipeline is correctly processing all messages, and identify any processing errors that might be causing messages to be lost. The output can be compared to the expected results to identify any discrepancies and resolve the issue.

Comment 11.2

ID: 887737 User: Jarek7 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 19:19 Selected Answer: - Upvotes: 2

I'm tired of these responses about what ChatGPT says. Most probably you've used the free 3.5 version, which is an absolute disaster as an all-knowing oracle. BTW, in this case I wouldn't believe even GPT-4. It is a difficult question that needs specific knowledge and experience which might not be available in the GPT training data. You cannot use any GPT up to 4 as an argument in such cases.

Comment 11.2.1

ID: 1076471 User: axantroff Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 21 Nov 2024 17:55 Selected Answer: - Upvotes: 1

Exactly. Sometimes it is total garbage

Comment 12

ID: 780245 User: desertlotus1211 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 18 Jan 2024 18:18 Selected Answer: - Upvotes: 4

The question is really not asking for a solution to the problem, per se, but more for what the next step would be to triage the issue....

Answer would be B over D. Answer D would be the recommended solution IF the question asked to rectify/fixed the issue.

Thoughts?

Comment 12.1

ID: 867237 User: izekc Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 13:17 Selected Answer: - Upvotes: 1

Agree with u

Comment 13

ID: 750380 User: Prakzz Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 20 Dec 2023 02:24 Selected Answer: D Upvotes: 3

D. Dataflow must PULL the data to process it in real-time. Missing messages in the dashboard, means that the Pub/Sub to Dataflow was misconfigured as PUSH.

Comment 13.1

ID: 782411 User: hasoweh Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 20 Jan 2024 16:27 Selected Answer: - Upvotes: 1

Pull can lead to latency because new data is not streamed on arrival but only passed on when Dataflow makes a pull request. So if data comes in at time 0:01 but pull requests happen only every 10 seconds, we can see up to a 9-second delay. Push automatically sends the data to subscribers as soon as it arrives, and thus is closer to real-time.

Comment 14

ID: 750297 User: Krish6488 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 19 Dec 2023 23:39 Selected Answer: B Upvotes: 3

To me, B sounds more logical for the following reason.
Option C would have been ideal, because any debugging starts with checking the logs; however, the option says to check Stackdriver for the missing messages themselves. Had it said "check Stackdriver to figure out the number of undelivered messages", C would have been more suitable. Given that slight bit of dodginess in option C, I would go with B.

Comment 15

ID: 744170 User: Nirca Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 13 Dec 2023 16:03 Selected Answer: B Upvotes: 1

Why check Pub/Sub again when it is already verified to be fine according to the question? Shouldn't you be checking the next stage in the flow, which is Dataflow?
Option B

Comment 16

ID: 743505 User: DGames Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Wed 13 Dec 2023 03:18 Selected Answer: B Upvotes: 1

Answer - B. Because already we know message is missing so better to test with fixed dataset and check code .

Comment 17

ID: 730800 User: Atnafu Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Wed 29 Nov 2023 20:53 Selected Answer: - Upvotes: 1

C
https://cloud.google.com/pubsub/docs/monitoring#:~:text=the%20specific%20metrics.-,Monitor%20message%20backlog,information%20about%20this%20metric%2C%20see%20the%20relevant%20section%20of%20this%20document.,-Create%20alerting%20policies

62. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 25

Sequence
194
Discussion ID
16286
Source URL
https://www.examtopics.com/discussions/google/view/16286-exam-professional-data-engineer-topic-1-question-25/
Posted By
jvg637
Posted At
March 11, 2020, 7:16 p.m.

Question

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

  • A. Make a call to the Stackdriver API to list all logs, and apply an advanced filter.
  • B. In the Stackdriver logging admin interface, enable a log sink export to BigQuery.
  • C. In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
  • D. Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 62604 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 6 months ago Absolute Date: Fri 11 Sep 2020 18:16 Selected Answer: - Upvotes: 49

I would choose D.
A and B are wrong since don't notify anything to the monitoring tool.
C has no filter on what will be notified. We want only some tables.

Comment 2

ID: 475602 User: MaxNRG Badges: Highly Voted Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 15:21 Selected Answer: - Upvotes: 15

D, as the key requirement is to get a notification for a particular table. This can be achieved using an advanced log filter to select only that table's logs and a project sink to Cloud Pub/Sub for the notification.
Refer to the GCP documentation - Advanced Logs Filters: https://cloud.google.com/logging/docs/view/advanced-queries
A is wrong: the advanced filter helps with filtering, but no notification is sent.
B is wrong: it would send all the logs, and BigQuery does not provide notifications.
C is wrong: it would send all the logs.
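
The "advanced log filter" in option D can be built as a string of Cloud Logging query clauses. The sketch below assembles one for a specific table; the exact audit-log field paths are assumptions here and should be verified against real log entries before creating the sink.

```python
# Sketch of the advanced log filter behind option D. The audit-log field
# paths below are assumptions; verify them against your own log entries.

def bq_insert_filter(dataset: str, table: str) -> str:
    """Build a Logging filter matching only completed load/insert jobs
    that target one specific BigQuery table."""
    dest = ("protoPayload.serviceData.jobCompletedEvent.job"
            ".jobConfiguration.load.destinationTable")
    return " AND ".join([
        'resource.type="bigquery_resource"',
        'protoPayload.methodName="jobservice.jobcompleted"',
        f'{dest}.datasetId="{dataset}"',
        f'{dest}.tableId="{table}"',
    ])

filt = bq_insert_filter("ads", "clicks")
print(filt)
```

The project sink itself would then be created with something along the lines of `gcloud logging sinks create my-sink pubsub.googleapis.com/projects/PROJECT/topics/TOPIC --log-filter="$FILTER"` (names illustrative), and the monitoring tool subscribes to that topic.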

Comment 3

ID: 1212403 User: suwalsageen12 Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sat 16 Nov 2024 14:49 Selected Answer: - Upvotes: 2

D is the correct answer because:
- we need to advance filtering to filter the logs for the specific table
- we need to use monitoring tool for notification.

Comment 4

ID: 1076348 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 14:09 Selected Answer: D Upvotes: 2

Good point by MaxNRG about reducing the number of logs sending to Pub/Sub

Comment 5

ID: 1058647 User: ruben82 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 30 Apr 2024 09:42 Selected Answer: - Upvotes: 1

Theoretically, Pub/Sub could filter logs to forward the right ones to the correct topic: https://cloud.google.com/pubsub/docs/subscription-message-filter
So C could be accepted, but it's better if filtering is performed earlier, so in this case D performs better.

Comment 6

ID: 1050531 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 14:04 Selected Answer: D Upvotes: 2

D. Using the Stackdriver API, create a project sink with an advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

This approach allows you to set up a custom log sink with an advanced filter that targets the specific table and then export the log entries to Google Cloud Pub/Sub. Your monitoring tool can subscribe to the Pub/Sub topic, providing you with instant notifications when relevant events occur without being inundated with notifications from other tables.

Options A and B do not offer the same level of customization and specificity in targeting notifications for a particular table.

Option C is almost correct but doesn't mention the use of an advanced log filter in the sink configuration, which is typically needed to filter the logs to a specific table effectively. Using the Stackdriver API for more advanced configuration is often necessary for fine-grained control over log filtering.

Comment 7

ID: 1008760 User: suku2 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 03:00 Selected Answer: D Upvotes: 1

D makes sense.

Comment 8

ID: 975414 User: GCP_PDE_AG Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 12:50 Selected Answer: - Upvotes: 1

D should be the answer

Comment 9

ID: 961252 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 10:07 Selected Answer: D Upvotes: 1

A and B mention nothing about notifications and C would push all data. It's D.

Comment 10

ID: 835673 User: bha11111 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 11 Sep 2023 05:09 Selected Answer: D Upvotes: 1

D makes sense

Comment 11

ID: 760106 User: Jackalski Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 28 Jun 2023 16:47 Selected Answer: D Upvotes: 2

"advanced log filter" is the key word here, all other options push all data ...

Comment 12

ID: 715495 User: Jasar Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 10 May 2023 19:13 Selected Answer: D Upvotes: 1

D is the best choice

Comment 13

ID: 588587 User: alecuba16 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 13:24 Selected Answer: D Upvotes: 4

Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Comment 14

ID: 580917 User: devric Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 22:46 Selected Answer: D Upvotes: 2

D. Option B doesn't make sense

Comment 15

ID: 530672 User: samdhimal Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sat 23 Jul 2022 16:24 Selected Answer: - Upvotes: 3

correct answer -> Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Option C is close to the right answer, but it doesn't have the filter. We don't want all the tables; we only want one. So the correct answer is D.

Logging sink - Using a Logging sink, you can direct specific log entries to your business logic. In this example, you can use Cloud Audit logs for Compute Engine which use the resource type gce_firewall_rule to filter for the logs of interest. You can also add an event type GCE_OPERATION_DONE to the filter to capture only the completed log events. Here is the Logging filter used to identify the logs. You can try out the query in the Logs Viewer.

Pub/Sub topic – In Pub/Sub, you can create a topic to which to direct the log sink and use the Pub/Sub message to trigger a cloud function.

Reference: https://cloud.google.com/blog/products/management-tools/automate-your-response-to-a-cloud-logging-event

Comment 16

ID: 521159 User: santoshindia Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sun 10 Jul 2022 21:55 Selected Answer: D Upvotes: 3

explained by MaxNRG

Comment 17

ID: 516525 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 04 Jul 2022 11:34 Selected Answer: D Upvotes: 3

as explained by MaxNRG

63. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 35

Sequence
195
Discussion ID
16658
Source URL
https://www.examtopics.com/discussions/google/view/16658-exam-professional-data-engineer-topic-1-question-35/
Posted By
jvg637
Posted At
March 15, 2020, 1:18 p.m.

Question

Flowlogistic Case Study -

Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background -
The company started as a regional trucking company, and then expanded into other logistics market. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment -
Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
3 physical servers
- Cassandra - metadata, tracking messages
10 Kafka servers - tracking message aggregation and batch insert
✑ Application servers - customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL Server storage
- Network-attached storage (NAS) - image storage, logs, backups
✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements -
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment

SEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO' s tracking technology.

CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

  • A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
  • B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD
  • C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage
  • D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 64253 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 12:18 Selected Answer: - Upvotes: 38

I would say A.
I think Pub/Sub can't directly send data to Cloud SQL.

Comment 2

ID: 669324 User: Dhamsl Badges: Highly Voted Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 23:51 Selected Answer: - Upvotes: 9

This site makes me feel that it intends to get users involved in the discussion by suggesting wrong answers.

Comment 3

ID: 1212586 User: billalltf Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sun 17 Nov 2024 00:17 Selected Answer: A Upvotes: 1

A is right answer

Comment 4

ID: 1087729 User: JOKKUNO Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 04 Jun 2024 15:03 Selected Answer: - Upvotes: 1

Given the requirements for ingesting data from global sources, processing and querying in real-time, and storing the data reliably for the real-time inventory tracking system, the most suitable combination of Google Cloud Platform (GCP) products is:
A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
Explanation:
Cloud Pub/Sub: It is a messaging service that allows you to asynchronously send and receive messages between independent applications.
Cloud Dataflow: It can handle both streaming and batch data, making it suitable for real-time processing of data from various sources.
Cloud Storage: Cloud Storage can be used to store the processed and analyzed data reliably. It provides scalable, durable, and globally accessible object storage, making it suitable for storing large volumes of data.

Comment 5

ID: 1050791 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 17:21 Selected Answer: A Upvotes: 2

A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

Here's why this combination is suitable:

Cloud Pub/Sub: It is used for ingesting real-time data from various global sources. It's a messaging service that can handle large volumes of data and is highly scalable.

Cloud Dataflow: It's a stream and batch data processing service that allows you to process and analyze the data in real-time. It can take data from Pub/Sub and perform transformations or aggregations as needed.

Cloud Storage: It provides reliable storage for the data. You can store the processed data in Cloud Storage for further analysis, and it is a scalable and durable storage solution.

Option B is not ideal because Local SSDs are not a suitable storage option for persisting data that needs to be reliably stored. Option C includes Cloud SQL, which is not typically used for ingesting and processing real-time data. Option D includes Cloud Load Balancing, which is not relevant to the use case of ingesting and processing data for the inventory tracking system.

Comment 6

ID: 963715 User: Vipul1600 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 15:40 Selected Answer: - Upvotes: 1

Since Cloud SQL is a fully managed service and Dataflow is serverless, we should opt for Dataflow, as the rule of thumb for Google is to choose a serverless product over a fully managed service.

Comment 7

ID: 955329 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 18 Jan 2024 14:41 Selected Answer: A Upvotes: 2

The technical requirements mention that the pipeline should handle both streaming and batch data. The solution should include DataFlow and not Cloud SQL. the answer is A.

Comment 8

ID: 808309 User: niketd Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 14 Aug 2023 11:27 Selected Answer: A Upvotes: 1

Pub/Sub to scale streaming data, Dataflow to processes both structured and unstructured data and cloud storage to store common data

Comment 9

ID: 788628 User: PolyMoe Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Wed 26 Jul 2023 10:50 Selected Answer: A Upvotes: 1

Option B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD is not a good option as Local SSD is not a scalable solution and could not handle large amount of data
Option C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage is not a good option as Cloud SQL is a relational database and is not suitable for real-time processing and querying large amounts of data
Option D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage is not a good option as Cloud Load Balancing is used for distributing traffic across multiple instances, it doesn't handle data processing and storage.

Comment 10

ID: 788625 User: PolyMoe Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Wed 26 Jul 2023 10:49 Selected Answer: A Upvotes: 1

A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage is the best combination of GCP products for the use case described.
Cloud Pub/Sub can be used to ingest data from a variety of global sources, as it allows for easy integration with external systems through its publish-subscribe messaging model.
Cloud Dataflow can be used to process and query the data in real-time, as it is a fully managed service for creating data pipelines that can handle both batch and streaming data.
Cloud Storage can be used to store the data reliably, as it is a fully managed object storage service that can handle large amounts of data and is highly durable and available.

Comment 11

ID: 778764 User: jkh_goh Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 09:53 Selected Answer: A Upvotes: 1

Answer is A. Cloud Dataflow for batch + streaming, Cloud Pub/Sub for streaming ingestion, Cloud Storage for long term data storage.

Comment 12

ID: 722208 User: Jay_Krish Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 19 May 2023 19:53 Selected Answer: - Upvotes: 2

Are scenario based questions still in the latest exam?? Are these still relevant?

Comment 13

ID: 698441 User: kastuarr Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 18 Apr 2023 20:33 Selected Answer: C Upvotes: 2

Existing inventory data is in SQL, and data ingested from Kafka will need to update inventory at some point. The existence of SQL in the current estate indicates SQL must be present in the cloud estate.

Comment 14

ID: 645242 User: Megmang Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Feb 2023 06:07 Selected Answer: A Upvotes: 3

Answer is clearly option A.

Comment 15

ID: 627783 User: ratnesh99 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 06 Jan 2023 10:57 Selected Answer: - Upvotes: 1

Answer A, because Cloud SQL is not suitable for global use.

Comment 16

ID: 585156 User: CedricLP Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 13 Oct 2022 13:03 Selected Answer: A Upvotes: 2

Only A can manage lots of data.
Target is Cloud Storage (obviously not SSD)
Input is Pub/Sub to replace Kafka
Cloud SQL + Storage makes no sense in this context

Comment 17

ID: 560219 User: Arkon88 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 16:37 Selected Answer: A Upvotes: 4

A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
as explained by JayZeeLee :
B is incorrect, because local SSD wouldn't satisfy the needs.
C is incorrect, because one of the requirements is 'Global', Cloud SQL is well suited for regional applications. Cloud Spanner is a better suit in that regard.
D is incorrect, because Load Balancer is for web traffic, not messages.
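
The option-A architecture discussed above (global ingestion via Pub/Sub, transformation via Dataflow, durable storage via Cloud Storage) can be sketched end-to-end with stdlib stand-ins. None of this is real GCP API code: the queue represents the Pub/Sub topic, the loop the Dataflow step, and the dict the Cloud Storage bucket.

```python
# Minimal stdlib sketch of the Pub/Sub -> Dataflow -> Cloud Storage flow.
# All objects here are illustrative stand-ins, not real GCP client APIs.
import queue

topic = queue.Queue()            # Pub/Sub stand-in: global ingestion buffer
storage: dict[str, dict] = {}    # Cloud Storage stand-in: durable object sink

# A variety of global sources publish parcel-tracking messages.
for i, loc in enumerate(["DE", "US", "JP"]):
    topic.put({"parcel_id": f"p{i}", "location": loc})

# The "Dataflow" step consumes, transforms, and writes results out.
while not topic.empty():
    msg = topic.get()
    enriched = {**msg, "status": "in_transit"}               # transformation
    storage[f"tracking/{msg['parcel_id']}.json"] = enriched  # durable write

print(sorted(storage))
```

The decoupling is the point: sources only need to publish to the topic, the processing step scales independently, and the sink is durable, which is exactly why B (Local SSD), C (Cloud SQL as the processor), and D (Load Balancing as ingestion) fall short.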

64. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 100

Sequence
200
Discussion ID
16982
Source URL
https://www.examtopics.com/discussions/google/view/16982-exam-professional-data-engineer-topic-1-question-100/
Posted By
jvg637
Posted At
March 19, 2020, 5:39 p.m.

Question

You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?

  • A. Use bq load to load a batch of sensor data every 60 seconds.
  • B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
  • C. Use the INSERT statement to insert a batch of data every 60 seconds.
  • D. Use the MERGE statement to apply updates in batch every 60 seconds.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 66024 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Thu 19 Mar 2020 17:39 Selected Answer: - Upvotes: 29

I think we need a pipeline, so it's B to me.

Comment 2

ID: 513456 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Thu 30 Dec 2021 15:47 Selected Answer: B Upvotes: 7

It's B: if we expect growth, we'll need some buffer (that will be Pub/Sub) and the Dataflow pipeline to stream data into BigQuery.
The tabledata.insertAll method is not valid here.

Comment 3

ID: 1306807 User: Erg_de Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Mon 04 Nov 2024 07:40 Selected Answer: B Upvotes: 1

For real-time analysis and quick data availability, the appropriate option is the combination of the pipeline with BigQuery.

Comment 4

ID: 1085479 User: Helinia Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 01 Dec 2023 21:20 Selected Answer: - Upvotes: 1

“need the data to be available within 1 minute of ingestion for real-time analysis” → low latency requirement → Dataflow streaming

The database can either be BQ or BigTable for this kind of requirement in data volume and latency. But it mentioned that the destination has to be BQ, so B.

Comment 5

ID: 972970 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 13:53 Selected Answer: - Upvotes: 3

Answer B.
I have the full question; if you need it, mail me:
neonitin6ATtherate......

Comment 5.1

ID: 999841 User: aryaavinash Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 05 Sep 2023 21:27 Selected Answer: - Upvotes: 1

full email id please ?

Comment 6

ID: 880648 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 18:49 Selected Answer: B Upvotes: 3

I think we need a pipeline, so it's B to me.))

Comment 7

ID: 872645 User: votinhluombikip Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 17 Apr 2023 13:38 Selected Answer: B Upvotes: 2

I think we need a pipeline, so it's B to me.

Comment 8

ID: 830638 User: JANCAI Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 09:32 Selected Answer: - Upvotes: 1

Why is the answer from the <reveal answer> C??

Comment 9

ID: 717539 User: Prasha123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 13 Nov 2022 22:52 Selected Answer: B Upvotes: 1

Need pipeline so its B

Comment 10

ID: 675317 User: sedado77 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 21 Sep 2022 18:24 Selected Answer: B Upvotes: 7

I got this question on sept 2022. Answer is B

Comment 11

ID: 518497 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 20:08 Selected Answer: B Upvotes: 2

omg. B only

Comment 12

ID: 504512 User: hendrixlives Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 19 Dec 2021 00:23 Selected Answer: B Upvotes: 3

B, streaming with dataflow

Comment 13

ID: 487806 User: JG123 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sat 27 Nov 2021 03:45 Selected Answer: - Upvotes: 1

Wrong answer shown again by examtopics.com
Ans: B

Comment 14

ID: 453452 User: Ysance_AGS Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 28 Sep 2021 15:58 Selected Answer: - Upvotes: 2

B => with dataflow you can parallelize data ingestion

Comment 14.1

ID: 472292 User: szefco Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Wed 03 Nov 2021 22:20 Selected Answer: - Upvotes: 1

And make it streaming

Comment 15

ID: 421798 User: sandipk91 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Sun 08 Aug 2021 21:26 Selected Answer: - Upvotes: 2

B is the right answer

Comment 16

ID: 396180 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 18:21 Selected Answer: - Upvotes: 4

Vote for B

Comment 17

ID: 244833 User: felixwtf Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Tue 15 Dec 2020 19:21 Selected Answer: - Upvotes: 6

You need a pipeline because this type of operation can be easily parallelized: the ingestion can be divided into chunks (PCollections) and handled by many workers.

Comment 17.1

ID: 244834 User: felixwtf Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Tue 15 Dec 2020 19:21 Selected Answer: - Upvotes: 3

so, B is the right answer.
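
felixwtf's parallelization point can be sketched with a stdlib thread pool: the 50,000 sensor readings split into chunks (loosely analogous to Beam bundling elements of a PCollection) that many workers process concurrently. The chunk size, worker count, and `insert_chunk` function are all illustrative, and the real pipeline would stream each chunk into BigQuery rather than just count rows.

```python
# Sketch of chunked, parallel ingestion. Pure stdlib; chunk size and
# worker count are illustrative, and insert_chunk stands in for a
# streaming insert into BigQuery.
from concurrent.futures import ThreadPoolExecutor

readings = [{"sensor_id": i, "value": i * 0.1} for i in range(50_000)]

def insert_chunk(chunk):
    # A real pipeline would stream these rows into BigQuery here;
    # we just count them to show the fan-out.
    return len(chunk)

chunks = [readings[i:i + 5_000] for i in range(0, len(readings), 5_000)]
with ThreadPoolExecutor(max_workers=10) as pool:
    inserted = sum(pool.map(insert_chunk, chunks))

print(inserted)  # 50000
```

This is what makes option B scale with "significant growth in data volume": Dataflow adds workers and processes more bundles in parallel, whereas a single `bq load`, INSERT, or MERGE every 60 seconds is a serial bottleneck.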

65. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 205

Sequence
202
Discussion ID
89458
Source URL
https://www.examtopics.com/discussions/google/view/89458-exam-professional-data-engineer-topic-1-question-205/
Posted By
Atnafu
Posted At
Nov. 30, 2022, 11:10 p.m.

Question

You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest available configurations from a container registry. Your GKE nodes need to have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently provision the data processing infrastructure and manage the deployment process. What should you do?

  • A. Use Compute Engine startup scripts to pull container images, and use gcloud commands to provision the infrastructure.
  • B. Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images.
  • C. Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure.
  • D. Use Dataflow to provision the data pipeline, and use Cloud Scheduler to run the job.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 1103585 User: MaxNRG Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 22 Dec 2023 18:43 Selected Answer: B Upvotes: 5

B is the best option to efficiently provision and manage the deployment process for this data processing application on GKE:

Comment 1.1

ID: 1103586 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 22 Dec 2023 18:43 Selected Answer: - Upvotes: 2

• Cloud Build allows you to automate the building, testing, and deployment of your application using Docker containers.
• Using Terraform with Cloud Build provides Infrastructure as Code capabilities to provision the GKE cluster with GPUs, SSDs, and network bandwidth.
• Terraform can be configured to pull the latest container images from the registry when deploying.
• Cloud Build triggers provide event-based automation to rebuild and redeploy when container images are updated.
• This provides an automated CI/CD pipeline to launch the application on GKE using the desired infrastructure and latest images.
• Dataflow and Cloud Scheduler don't directly provide infrastructure provisioning or deployment orchestration for GKE.
• gcloud commands can be used but don't provide the same automation benefits as Cloud Build + Terraform.

Comment 1.1.1

ID: 1103587 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 23 Dec 2023 05:17 Selected Answer: - Upvotes: 2

So using Cloud Build with Terraform templates provides the most efficient way to provision and deploy this data processing application on GKE.

Comment 2

ID: 1304169 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 21:22 Selected Answer: B Upvotes: 1

I would go with option B. From my point of view this is a CI/CD question. Only B covers the deployment and sets up the latest container image.

Comment 3

ID: 1244971 User: anyone_99 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 17:18 Selected Answer: - Upvotes: 2

another wrong answer?

Comment 4

ID: 1112141 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 02 Jan 2024 19:36 Selected Answer: B Upvotes: 2

- Dataflow is a fully managed service for stream and batch data processing and is well-suited for real-time data processing tasks like identifying longtail and outlier data points.
- Using BigQuery as a sink allows to efficiently store the cleansed and processed data for further analysis and serving it to AI models.

Comment 5

ID: 1064254 User: spicebits Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 06 Nov 2023 22:16 Selected Answer: B Upvotes: 1

I don't really like B or C, but given the choices I would go with B.
B. Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images. (The Terraform command is "terraform apply", not "Terraform build", and why not use a gcloud container command instead of introducing a third-party builder image? I don't like this choice, but it is the best one.)
C. Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure. (This doesn't handle building the infra or deploying the latest images; this one is clearly wrong, and I'm not sure why it is marked as the right choice.)

Comment 6

ID: 960753 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 21:01 Selected Answer: B Upvotes: 2

B is correct

Comment 7

ID: 870119 User: whorillo Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 14 Apr 2023 12:24 Selected Answer: B Upvotes: 1

B is correct

Comment 8

ID: 806611 User: charline Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Feb 2023 18:35 Selected Answer: B Upvotes: 1

B is OK

Comment 9

ID: 763432 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 02:15 Selected Answer: - Upvotes: 1

B. Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images.

Comment 10

ID: 734909 User: hauhau Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 04 Dec 2022 08:28 Selected Answer: B Upvotes: 3

Maybe B
ref: https://cloud.google.com/architecture/managing-infrastructure-as-code

Comment 11

ID: 732040 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 30 Nov 2022 23:10 Selected Answer: - Upvotes: 1

C is correct answer

Comment 11.1

ID: 732324 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 01 Dec 2022 07:20 Selected Answer: - Upvotes: 3

Sorry I meant B

66. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 227

Sequence
203
Discussion ID
129874
Source URL
https://www.examtopics.com/discussions/google/view/129874-exam-professional-data-engineer-topic-1-question-227/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:53 a.m.

Question

You stream order data by using a Dataflow pipeline, and write the aggregated result to Memorystore. You provisioned a Memorystore for Redis instance with Basic Tier, 4 GB capacity, which is used by 40 clients for read-only access. You are expecting the number of read-only clients to increase significantly to a few hundred and you need to be able to support the demand. You want to ensure that read and write access availability is not impacted, and any changes you make can be deployed quickly. What should you do?

  • A. Create a new Memorystore for Redis instance with Standard Tier. Set capacity to 4 GB and read replica to No read replicas (high availability only). Delete the old instance.
  • B. Create a new Memorystore for Redis instance with Standard Tier. Set capacity to 5 GB and create multiple read replicas. Delete the old instance.
  • C. Create a new Memorystore for Memcached instance. Set a minimum of three nodes, and memory per node to 4 GB. Modify the Dataflow pipeline and all clients to use the Memcached instance. Delete the old instance.
  • D. Create multiple new Memorystore for Redis instances with Basic Tier (4 GB capacity). Modify the Dataflow pipeline and new clients to use all instances.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 4 comments Click to expand

Comment 1

ID: 1113822 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 16:43 Selected Answer: B Upvotes: 10

- Upgrading to the Standard Tier and adding read replicas is an effective way to scale and manage increased read load.
- The additional capacity (5 GB) provides more space for data, and read replicas help distribute the read load across multiple instances.

Comment 1.1

ID: 1123608 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 20:44 Selected Answer: - Upvotes: 3

Described here:
https://cloud.google.com/memorystore/docs/redis/redis-tiers

Comment 2

ID: 1304660 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 20:08 Selected Answer: B Upvotes: 2

I don't like any answer. It seems option B makes more sense due to read replicas.

Comment 3

ID: 1109555 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:53 Selected Answer: B Upvotes: 3

Scalability for Read-Only Clients: Read replicas distribute read traffic across multiple instances, significantly enhancing read capacity to support a large number of clients without impacting write performance.
High Availability: Standard Tier ensures high availability with automatic failover, minimizing downtime in case of instance failure.
Minimal Code Changes: Redis clients can seamlessly connect to read replicas without requiring extensive code modifications, enabling a quick deployment.
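As a sketch of the "minimal code changes" point: with Standard Tier read replicas, a Memorystore instance exposes a primary endpoint for writes and a separate read endpoint that load-balances across the replicas. The endpoint addresses and command list below are illustrative assumptions, not values from the question:

```python
# Hypothetical endpoints for a Standard Tier instance with read replicas:
# one primary endpoint for writes, one read endpoint that load-balances
# across all replicas (addresses here are made up).
PRIMARY_ENDPOINT = ("10.0.0.5", 6379)
READ_ENDPOINT = ("10.0.0.6", 6379)

def endpoint_for(command):
    """Route read-only commands to the read endpoint so the read-only
    clients never contend with the Dataflow writer on the primary."""
    read_only = {"GET", "MGET", "HGETALL", "LRANGE", "SMEMBERS"}
    return READ_ENDPOINT if command.upper() in read_only else PRIMARY_ENDPOINT

print(endpoint_for("GET"))  # read endpoint
print(endpoint_for("SET"))  # primary endpoint
```

The Dataflow writer keeps using the primary endpoint while the hundreds of read-only clients simply change their host to the read endpoint; no client-side sharding logic is needed, unlike option D.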

67. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 237

Sequence
206
Discussion ID
130180
Source URL
https://www.examtopics.com/discussions/google/view/130180-exam-professional-data-engineer-topic-1-question-237/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 1:42 p.m.

Question

You are planning to load some of your existing on-premises data into BigQuery on Google Cloud. You want to either stream or batch-load data, depending on your use case. Additionally, you want to mask some sensitive data before loading into BigQuery. You need to do this in a programmatic way while keeping costs to a minimum. What should you do?

  • A. Use Cloud Data Fusion to design your pipeline, use the Cloud DLP plug-in to de-identify data within your pipeline, and then move the data into BigQuery.
  • B. Use the BigQuery Data Transfer Service to schedule your migration. After the data is populated in BigQuery, use the connection to the Cloud Data Loss Prevention (Cloud DLP) API to de-identify the necessary data.
  • C. Create your pipeline with Dataflow through the Apache Beam SDK for Python, customizing separate options within your code for streaming, batch processing, and Cloud DLP. Select BigQuery as your data sink.
  • D. Set up Datastream to replicate your on-premise data on BigQuery.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 1114009 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 20:13 Selected Answer: C Upvotes: 11

- Programmatic Flexibility: Apache Beam provides extensive control over pipeline design, allowing for customization of data transformations, including integration with Cloud DLP for sensitive data masking.
- Streaming and Batch Support: Beam seamlessly supports both streaming and batch data processing modes, enabling flexibility in data loading patterns.
- Cost-Effective Processing: Dataflow offers a serverless model, scaling resources as needed, and only charging for resources used, helping optimize costs.
- Integration with Cloud DLP: Beam integrates well with Cloud DLP for sensitive data masking, ensuring data privacy before loading into BigQuery.
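A minimal sketch of the masking step such a pipeline would apply before the BigQuery sink. Plain Python stands in for a Beam DoFn here, and the field names and local hashing rule are illustrative assumptions; a real pipeline would call the Cloud DLP de-identify API instead:

```python
import hashlib

def mask_record(record, sensitive_fields=("email", "ssn")):
    """Stand-in for a Beam DoFn: replace sensitive values with a
    truncated hash token before the row reaches the BigQuery sink.
    (A real pipeline would call Cloud DLP rather than hash locally.)"""
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:12]  # pseudonymous token
    return masked

row = {"order_id": 42, "email": "alice@example.com"}
print(mask_record(row)["order_id"])  # 42 -> non-sensitive fields pass through
```

Because the same function runs in both streaming and batch modes of the pipeline, the masking behavior stays identical regardless of how the data is loaded.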

Comment 1.1

ID: 1116645 User: qq589539483084gfrgrgfr Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Mon 08 Jan 2024 14:42 Selected Answer: - Upvotes: 2

The correct option is A, because you want a programmatic way, whereas Data Fusion is a codeless solution, and Dataflow is also cost-effective.

Comment 1.1.1

ID: 1122235 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 14 Jan 2024 04:27 Selected Answer: - Upvotes: 2

You are saying Option C

Comment 2

ID: 1153407 User: JyoGCP Badges: Most Recent Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 17:04 Selected Answer: C Upvotes: 1

Option C

Comment 3

ID: 1128460 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 22 Jan 2024 10:12 Selected Answer: C Upvotes: 2

C is correct: use Dataflow with Python as the programming language and BigQuery as the sink.

A is incorrect: Data Fusion's main purpose is code-free development.

Comment 4

ID: 1112749 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 13:42 Selected Answer: A Upvotes: 1

A.
Use Cloud Data Fusion to design your pipeline, use the Cloud DLP plug-in to de-identify data within your pipeline, and then move the data into BigQuery.

Comment 4.1

ID: 1304332 User: ggg24 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 09:00 Selected Answer: - Upvotes: 1

Data Fusion supports only batch, and streaming is required

Comment 4.2

ID: 1212421 User: chrissamharris Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 16 May 2024 15:03 Selected Answer: - Upvotes: 1

Incorrect, that's a low-code solution. It doesn't meet this specific requirement: "You need to do this in a programmatic way"

68. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 163

Sequence
210
Discussion ID
79472
Source URL
https://www.examtopics.com/discussions/google/view/79472-exam-professional-data-engineer-topic-1-question-163/
Posted By
PhuocT
Posted At
Sept. 2, 2022, 6:03 p.m.

Question

You have data pipelines running on BigQuery, Dataflow, and Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform. What should you do?

  • A. Export the information to Cloud Monitoring, and set up an Alerting policy
  • B. Run a Virtual Machine in Compute Engine with Airflow, and export the information to Cloud Monitoring
  • C. Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
  • D. Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 665954 User: John_Pongthorn Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 10:33 Selected Answer: A Upvotes: 5

A . Your preference is to use managed products or features of the platform

Comment 2

ID: 1303878 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 09:18 Selected Answer: A Upvotes: 1

A

Comment 3

ID: 1016262 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 25 Sep 2023 02:58 Selected Answer: A Upvotes: 2

Cloud Monitoring (formerly known as Stackdriver) is a fully managed monitoring service provided by GCP, which can collect metrics, logs, and other telemetry data from various GCP services, including BigQuery, Dataflow, and Dataproc.

Alerting Policies: Cloud Monitoring allows you to define alerting policies based on specific conditions or thresholds, such as pipeline failures, latency spikes, or other custom metrics. When these conditions are met, Cloud Monitoring can trigger notifications (e.g., emails) to alert the team managing the pipelines.

Cross-Project Monitoring: Cloud Monitoring supports monitoring resources across multiple GCP projects, making it suitable for your requirement to monitor pipelines in multiple projects.

Managed Solution: Cloud Monitoring is a managed service, reducing the operational overhead compared to running your own virtual machine instances or building custom solutions.

Comment 4

ID: 1002818 User: sergiomujica Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 09 Sep 2023 03:43 Selected Answer: A Upvotes: 1

use managed products

Comment 5

ID: 827980 User: whorillo Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 14:03 Selected Answer: A Upvotes: 2

Should be A

Comment 6

ID: 662270 User: pluiedust Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Wed 07 Sep 2022 11:00 Selected Answer: A Upvotes: 1

Should be A

Comment 7

ID: 657595 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 18:21 Selected Answer: A Upvotes: 2

A. Export the information to Cloud Monitoring, and set up an Alerting policy

Comment 8

ID: 657583 User: PhuocT Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 18:03 Selected Answer: A Upvotes: 1

Should be A

69. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 84

Sequence
213
Discussion ID
16835
Source URL
https://www.examtopics.com/discussions/google/view/16835-exam-professional-data-engineer-topic-1-question-84/
Posted By
rickywck
Posted At
March 17, 2020, 8:31 a.m.

Question

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?

  • A. Select random samples from the tables using the RAND() function and compare the samples.
  • B. Select random samples from the tables using the HASH() function and compare the samples.
  • C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
  • D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 24 comments Click to expand

Comment 1

ID: 65076 User: rickywck Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 08:31 Selected Answer: - Upvotes: 33

C is the only way which all records will be compared.

Comment 1.1

ID: 737828 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 13:10 Selected Answer: - Upvotes: 2

Agree with your argument

Comment 2

ID: 68743 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 28 Mar 2020 05:10 Selected Answer: - Upvotes: 16

Answer: C
Description: full comparison with this option; the rest are comparisons on samples, which doesn't ensure all the data will be OK

Comment 3

ID: 1302115 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Wed 23 Oct 2024 18:18 Selected Answer: C Upvotes: 1

Hash is always a good idea to compare the data

Comment 4

ID: 826548 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 07:45 Selected Answer: - Upvotes: 3

In practice, I would do B. It may miss errors due to randomness, but that is how we normally do validation/QA in general, i.e., we test random samples.

For this question, I would answer C.

Comment 5

ID: 809398 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Wed 15 Feb 2023 12:02 Selected Answer: - Upvotes: 1

Key words here: hash or collect a value on EACH table, after sorting the table.
Option C

Comment 6

ID: 792191 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 30 Jan 2023 01:04 Selected Answer: - Upvotes: 1

C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table. This approach will ensure that the data is read in a consistent order, and the hash function will provide a quick and efficient way to compare the contents of the tables and ensure that they are identical.
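The approach in option C can be sketched in plain Python; this toy version stands in for the Dataproc/Hadoop job, and the column names are illustrative assumptions:

```python
import hashlib

def table_fingerprint(rows, skip_columns=("load_timestamp",)):
    """Sort the rows, drop non-deterministic (timestamp) columns, and
    hash the rest, so two tables with identical content produce the
    same digest regardless of row order."""
    canonical = sorted(
        tuple(str(v) for k, v in sorted(row.items()) if k not in skip_columns)
        for row in rows
    )
    digest = hashlib.sha256()
    for row in canonical:
        digest.update("|".join(row).encode())
    return digest.hexdigest()

original = [{"id": 1, "amt": "9.99", "load_timestamp": "2020-01-01"},
            {"id": 2, "amt": "5.00", "load_timestamp": "2020-01-01"}]
migrated = [{"id": 2, "amt": "5.00", "load_timestamp": "2020-02-15"},
            {"id": 1, "amt": "9.99", "load_timestamp": "2020-02-15"}]

print(table_fingerprint(original) == table_fingerprint(migrated))  # True
```

Sorting makes the digest independent of row order (which matters since there is no primary key to join on), and skipping timestamp columns ignores values that legitimately differ between the original and migrated loads.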

Comment 6.1

ID: 792201 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 30 Jan 2023 01:23 Selected Answer: - Upvotes: 2

A. Selecting random samples from the tables using the RAND() function may not provide an accurate representation of the data and there is a risk that the comparison will not identify any differences between the tables.

B. Selecting random samples from the tables using the HASH() function may not be an effective method for comparison, as the HASH() function may return different results for equivalent data.

D. Creating stratified random samples using the OVER() function may not provide a comprehensive comparison between the tables as there is a risk that important differences could be missed in the sample data.

Comment 7

ID: 727757 User: Leeeeee Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 19:46 Selected Answer: C Upvotes: 1

All records

Comment 8

ID: 690925 User: hfuihe Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Mon 10 Oct 2022 11:22 Selected Answer: B Upvotes: 1

B is the only way which all records will be compared.

Comment 8.1

ID: 712576 User: cloudmon Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 06 Nov 2022 19:50 Selected Answer: - Upvotes: 2

You must have meant to say C

Comment 9

ID: 518467 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 19:29 Selected Answer: C Upvotes: 1

HASH() to compare data skipping dates and timestamps

Comment 9.1

ID: 594489 User: stefanop Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Fri 29 Apr 2022 15:28 Selected Answer: - Upvotes: 1

The hash in answer C is used to select a sample of the table, not to compare them

Comment 9.1.1

ID: 594491 User: stefanop Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Fri 29 Apr 2022 15:31 Selected Answer: - Upvotes: 1

Ignore my comment, it was about answer B.
I suggest you go with answer C, which is the only solution comparing all the rows/tables.

Comment 10

ID: 507645 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 23 Dec 2021 08:02 Selected Answer: C Upvotes: 2

Options A, B, and D will only determine that the tables "might" be identical, since each compares only a sample. HASH() can be helpful when doing bulk comparisons, but you still have to compare field by field to get the final answer.
The only one left is C, which looks good to me.

Comment 11

ID: 474450 User: JayZeeLee Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Mon 08 Nov 2021 20:23 Selected Answer: - Upvotes: 1

C.
The rest use RAND() at some point, which makes it hard to compare for consistency, unless there's a 'seed' option, which wasn't mentioned. So C.

Comment 12

ID: 458758 User: u_t_s Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Thu 07 Oct 2021 15:20 Selected Answer: - Upvotes: 3

Since there is no PK, it is possible that a set of values is common to some records, which would result in the same hash key for those records. But still, the answer is C.

Comment 13

ID: 395241 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Wed 30 Jun 2021 23:52 Selected Answer: - Upvotes: 1

Vote for 'C"

Comment 14

ID: 292191 User: daghayeghi Badges: - Relative Date: 5 years ago Absolute Date: Wed 17 Feb 2021 00:47 Selected Answer: - Upvotes: 3

B:
Because the question says the jobs were migrated to BigQuery, we don't need Dataproc, and using samples doesn't mean you don't compare all of the data.

Comment 14.1

ID: 447133 User: yoshik Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 18 Sep 2021 16:24 Selected Answer: - Upvotes: 1

A sample is a subset of the data, so you would have to ensure that the union of the samples covers the whole data set; excessively complicated.
You migrated to BigQuery but need to check BigQuery's output, which is why you should use another tool, Dataproc in this case.
Agreed that you would then have to verify Dataproc's output, but the suppositions are becoming too many.

Comment 15

ID: 163290 User: atnafu2020 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Sat 22 Aug 2020 04:12 Selected Answer: - Upvotes: 3

C
Using Cloud Storage with big data

Cloud Storage is a key part of storing and working with Big Data on Google Cloud. Examples include:

Loading data into BigQuery.

Using Dataproc, which automatically installs the HDFS-compatible Cloud Storage connector, enabling the use of Cloud Storage buckets in parallel with HDFS.

Using a bucket to hold staging files and temporary data for Dataflow pipelines.

For Dataflow, a Cloud Storage bucket is required. For BigQuery and Dataproc, using a Cloud Storage bucket is optional but recommended.

gsutil is a command-line tool that enables you to work with Cloud Storage buckets and objects easily and robustly, in particular in big data scenarios. For example, with gsutil you can copy many files in parallel with a single command, copy large files efficiently, calculate checksums on your data, and measure performance from your local computer to Cloud Storage.

Comment 16

ID: 162369 User: haroldbenites Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Thu 20 Aug 2020 18:51 Selected Answer: - Upvotes: 4

C is correct

Comment 16.1

ID: 162370 User: haroldbenites Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Thu 20 Aug 2020 18:53 Selected Answer: - Upvotes: 3

It says "...that they are identical", so you must not use samples.

Comment 17

ID: 126866 User: Rajuuu Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Sun 05 Jul 2020 15:20 Selected Answer: - Upvotes: 4

C is correct.

70. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 134

Sequence
215
Discussion ID
17233
Source URL
https://www.examtopics.com/discussions/google/view/17233-exam-professional-data-engineer-topic-1-question-134/
Posted By
-
Posted At
March 22, 2020, 10:43 a.m.

Question

You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time.
Consumers will receive the data in the following ways:
✑ Real-time event stream
✑ ANSI SQL access to real-time stream and historical data
✑ Batch historical exports
Which solution should you use?

  • A. Cloud Dataflow, Cloud SQL, Cloud Spanner
  • B. Cloud Pub/Sub, Cloud Storage, BigQuery
  • C. Cloud Dataproc, Cloud Dataflow, BigQuery
  • D. Cloud Pub/Sub, Cloud Dataproc, Cloud SQL

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 24 comments Click to expand

Comment 1

ID: 194737 User: itche_scratche Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Wed 07 Oct 2020 03:14 Selected Answer: - Upvotes: 12

D: not ideal, but the only option that works. You need Pub/Sub, then a processing layer (Dataflow or Dataproc), then storage (some SQL database).

Comment 1.1

ID: 251358 User: seiyassa Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Thu 24 Dec 2020 01:49 Selected Answer: - Upvotes: 3

I think Pub/Sub doesn't have a good connection to Dataproc, so D is not the answer

Comment 1.1.1

ID: 745216 User: jkhong Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 14 Dec 2022 16:32 Selected Answer: - Upvotes: 2

As of Dec 2022, there is the Pub/Sub Lite connector for Dataproc

Comment 1.2

ID: 745218 User: jkhong Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 14 Dec 2022 16:34 Selected Answer: - Upvotes: 1

Our Pub/Sub topics can have BigQuery subscriptions, where data is automatically streamed into our BQ tables. Autoscaling is handled automatically, so this renders Dataflow and Dataproc pretty irrelevant for this use case

Comment 1.2.1

ID: 921244 User: cetanx Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 12 Jun 2023 10:19 Selected Answer: - Upvotes: 2

Here is the reference:
https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics

Comment 2

ID: 519522 User: medeis_jar Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Sat 08 Jan 2022 13:56 Selected Answer: B Upvotes: 10

✑ Real-time event stream -> Pub/Sub
✑ ANSI SQL access to real-time stream and historical data -> BigQuery
✑ Batch historical exports -> Cloud Storage

Comment 3

ID: 1302577 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Thu 24 Oct 2024 19:46 Selected Answer: D Upvotes: 1

Why B? The main goal of the question is data storage, so BigQuery is not necessary for this situation. Option D, from my point of view, covers all the requirements: Pub/Sub for streaming data, Dataproc for data processing, Cloud SQL for storage.

Comment 4

ID: 1015426 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 04:10 Selected Answer: - Upvotes: 4

B. Cloud Pub/Sub, Cloud Storage, BigQuery.

Here's how this solution aligns with your requirements:
Real-time Event Stream: Cloud Pub/Sub is a managed messaging service that can handle real-time event streams efficiently. You can use Pub/Sub to ingest and publish real-time market data to consumers.
ANSI SQL Access: BigQuery supports ANSI SQL queries, making it suitable for both real-time and historical data analysis. You can stream data into BigQuery tables from Pub/Sub and provide ANSI SQL access to consumers.
Batch Historical Exports: Cloud Storage can be used for batch historical exports. You can export data from BigQuery to Cloud Storage in batch, making it available for consumers to download.

Comment 5

ID: 918053 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 16:16 Selected Answer: B Upvotes: 2

I was in doubt as I did not know that BQ handles real-time access to data without dataflow underneath.

https://cloud.google.com/bigquery/docs/write-api#:~:text=You%20can%20use%20the%20Storage,in%20a%20single%20atomic%20operation.

Comment 6

ID: 837715 User: midgoo Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 08:25 Selected Answer: B Upvotes: 1

Event Stream -> PubSub
PubSub has direct Write to BigQuery
Historical Exports to GCS

Comment 7

ID: 762710 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 17:48 Selected Answer: - Upvotes: 3

B. Cloud Pub/Sub, Cloud Storage, BigQuery

Comment 7.1

ID: 762713 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 17:49 Selected Answer: - Upvotes: 1

https://cloud.google.com/solutions/stream-analytics/

Comment 8

ID: 676172 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 22 Sep 2022 15:19 Selected Answer: B Upvotes: 3

B: https://cloud.google.com/solutions/stream-analytics/
Real-time made real easy
Adopt simple ingestion for complex events
Ingest and analyze hundreds of millions of events per second from applications or devices virtually anywhere on the globe with Pub/Sub. Or directly stream millions of events per second into your data warehouse for SQL-based analysis with BigQuery's streaming API.

Comment 9

ID: 667704 User: John_Pongthorn Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 13 Sep 2022 08:09 Selected Answer: B Upvotes: 1

No matter what, the last service must be BigQuery and the first service is Pub/Sub; I think the intermediate service should be Dataflow

Comment 10

ID: 578180 User: Motivated_Gamer Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 30 Mar 2022 12:20 Selected Answer: A Upvotes: 1

Dataflow: streaming data
Cloud SQL: for ANSI SQL support
Spanner: for batch historical data export

Comment 10.1

ID: 584809 User: tavva_prudhvi Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 12 Apr 2022 17:15 Selected Answer: - Upvotes: 2

You're going to use Spanner for batch historical exports? It's B!

Comment 11

ID: 553073 User: Prasanna_kumar Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 19:05 Selected Answer: - Upvotes: 1

Answer is B

Comment 12

ID: 520320 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 09 Jan 2022 17:20 Selected Answer: B Upvotes: 4

Cloud Pub/Sub, Cloud Dataflow, BigQuery
https://cloud.google.com/solutions/stream-analytics/

Comment 12.1

ID: 1099517 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Mon 18 Dec 2023 09:07 Selected Answer: - Upvotes: 1

B. Cloud Pub/Sub, Cloud Storage, BigQuery
The key requirements here are:
1. Real-time event stream (Pub/Sub)
2. ANSI SQL access to real-time and historical data (BigQuery)
3. Batch historical exports (Cloud Storage)
So Cloud Pub/Sub provides the real-time stream, BigQuery provides ANSI SQL access to stream and historical data, and Cloud Storage enables batch historical exports.
Option A is incorrect because Cloud Spanner does not offer batch exports and Dataflow is overkill for just SQL access.
Option C is incorrect as Dataproc is for Spark workloads, not serving consumer data.
Option D is incorrect as Cloud SQL does not provide batch export capabilities.
Therefore, option B with Pub/Sub, Cloud Storage, and BigQuery is the best solution given the stated requirements.
https://cloud.google.com/solutions/stream-analytics/

Comment 13

ID: 487117 User: JG123 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Fri 26 Nov 2021 07:41 Selected Answer: - Upvotes: 1

Correct: B

Comment 14

ID: 486164 User: AdrianMonter26 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Wed 24 Nov 2021 19:16 Selected Answer: - Upvotes: 3

I think it must be D, because you need Pub/Sub for streaming data, Dataflow or Dataproc to get the data from Pub/Sub and store it in a database, and finally the Cloud SQL database to store the data.
A and C can't be right because they are missing something for streaming data.
B can't be right because you need something to pass the data from Pub/Sub to Cloud Storage.

Comment 15

ID: 397825 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Sat 03 Jul 2021 21:45 Selected Answer: - Upvotes: 3

Vote for B

Comment 16

ID: 397824 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Sat 03 Jul 2021 21:44 Selected Answer: - Upvotes: 1

Vote for B

Comment 17

ID: 191969 User: rgpalop Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 05:54 Selected Answer: - Upvotes: 2

I think B

71. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 37

Sequence
216
Discussion ID
17059
Source URL
https://www.examtopics.com/discussions/google/view/17059-exam-professional-data-engineer-topic-1-question-37/
Posted By
-
Posted At
March 20, 2020, 4:42 p.m.

Question

Flowlogistic Case Study -

Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background -
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment -
Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server: user data, inventory, static data
3 physical servers
- Cassandra: metadata, tracking messages
10 Kafka servers: tracking message aggregation and batch insert
✑ Application servers: customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat: Java services
- Nginx: static content
- Batch servers
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN): SQL Server storage
- Network-attached storage (NAS): image storage, logs, backups
✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements -
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment

SEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO' s tracking technology.

CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single
Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in
Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?

  • A. Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.
  • B. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.
  • C. Use the NOW() function in BigQuery to record the event's time.
  • D. Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 335227 User: Manue Badges: Highly Voted Relative Date: 3 years, 11 months ago Absolute Date: Thu 14 Apr 2022 08:18 Selected Answer: - Upvotes: 37

"However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume."

Sure man, Kafka is not performing, let's use PubSub instead hahaha...

Comment 1.1

ID: 393797 User: ralf_cc Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 15:07 Selected Answer: - Upvotes: 8

lol this is a vendor exam...

Comment 1.2

ID: 730040 User: sfsdeniso Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Wed 29 Nov 2023 08:03 Selected Answer: - Upvotes: 2

Google sends its web indexes via Pub/Sub; twice a day the whole internet is sent via Pub/Sub.

Comment 2

ID: 403340 User: humza Badges: Highly Voted Relative Date: 3 years, 8 months ago Absolute Date: Sun 10 Jul 2022 14:41 Selected Answer: - Upvotes: 7

Answer: B
A. There is no indication that the application can do this. Moreover, due to networking problems, it is possible that Pub/Sub doesn't receive messages in order. This will make analysis difficult.
B. This makes sure that you have access to publishing timestamp which provides you with the correct ordering of messages.
C. If timestamps are already messed up, BigQuery will get wrong results anyways.
D. The timestamp we are interested in is when the data was produced by the publisher, not when it was received by Pub/Sub.

Comment 3

ID: 1050794 User: rtcpost Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 22 Oct 2024 17:24 Selected Answer: B Upvotes: 6

B. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

Here's why this approach is the most suitable:

By attaching a timestamp and Package ID at the point of origin (publisher device), you ensure that each message has a clear and consistent timestamp associated with it from the moment it is generated. This provides a reliable and accurate record of when each package-tracking message was created, which is crucial for analyzing the data over time.

This approach allows you to maintain the chronological order of events as they occurred at the source, which is important for real-time reporting and historical analysis.

Option A suggests attaching the timestamp in the Cloud Pub/Sub subscriber application. While this can work, it introduces a potential delay and the risk of timestamps not being accurate if there are issues with message processing.

Option C, using the NOW() function in BigQuery, records the time when the data is ingested into BigQuery, which may not reflect the actual time of the event.

Comment 4

ID: 802189 User: JJJJim Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 16:31 Selected Answer: B Upvotes: 1

Answer is B, attach the timestamp and ID is necessary to analyze data easily.

Comment 5

ID: 592114 User: nidmed Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 26 Apr 2023 07:41 Selected Answer: B Upvotes: 4

Answer: B

Comment 6

ID: 560228 User: Arkon88 Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 17:48 Selected Answer: B Upvotes: 1

we need package ID + Timestamp so B

Comment 7

ID: 530591 User: davidqianwen Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 15:34 Selected Answer: B Upvotes: 1

Answer: B

Comment 8

ID: 530197 User: exnaniantwort Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 02:51 Selected Answer: B Upvotes: 1

agree with humza

Comment 9

ID: 523214 User: sraakesh95 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 14 Jan 2023 01:28 Selected Answer: D Upvotes: 4

https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage

Comment 10

ID: 461323 User: Chelseajcole Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 13 Oct 2022 03:25 Selected Answer: - Upvotes: 1

It is about processing time and event time.. Answer is B.

Comment 10.1

ID: 532803 User: Tanzu Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 26 Jan 2023 11:45 Selected Answer: - Upvotes: 1

Not just timing, but also the package ID, because they are sending to 1 topic in GCP instead of to many in Kafka. That means some additional critical data must be added too.

Comment 11

ID: 461176 User: anji007 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 12 Oct 2022 18:26 Selected Answer: - Upvotes: 4

Ans: B
A: Adding the timestamp as messages are received is not a good option; messages may not arrive in order at the receiver/subscriber, e.g. due to connectivity or network issues.
B: The timestamp should be added here.
C: Doesn't make sense at all.
D: Ordering should be based on the order in which messages are generated at the publisher, not the order in which they reach Pub/Sub.

Comment 12

ID: 401951 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 15:22 Selected Answer: - Upvotes: 2

Vote for B

Comment 13

ID: 294426 User: funtoosh Badges: - Relative Date: 4 years ago Absolute Date: Sat 19 Feb 2022 17:51 Selected Answer: - Upvotes: 3

Better if the publisher attached the package ID and Timestamp as packages can come in an Asynchronous fashion.

Comment 14

ID: 285626 User: naga Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 07 Feb 2022 17:36 Selected Answer: - Upvotes: 3

Correct B

Comment 15

ID: 221067 User: Radhika7983 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Wed 17 Nov 2021 13:56 Selected Answer: - Upvotes: 6

The answer is B.

JSON representation
{
  "data": string,
  "attributes": {
    string: string,
    ...
  },
  "messageId": string,
  "publishTime": string,
  "orderingKey": string
}

In the attribute, we can have package id and timestamp.
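As a sketch of answer B, the publisher device attaches the package ID and event timestamp as message attributes before sending. This is illustrative plain Python, not the official client library; the `build_message` helper and attribute names are hypothetical. With the real client this would map to `publisher.publish(topic, data, packageId=..., eventTimestamp=...)`.

```python
import json
import time

# Hypothetical sketch: each publisher device attaches its own package ID
# and event timestamp as Pub/Sub message attributes (answer B). The names
# below are illustrative, not part of any API.
def build_message(package_id, payload, event_time=None):
    return {
        "data": json.dumps(payload).encode("utf-8"),
        "attributes": {
            "packageId": package_id,
            # Captured at the source: the event time, not the time
            # Pub/Sub happened to receive the message (publishTime).
            "eventTimestamp": str(event_time if event_time is not None
                                  else time.time()),
        },
    }

msg = build_message("pkg-42", {"lat": 1.0, "lng": 2.0}, event_time=1700000000)
```

Downstream, BigQuery rows can then be ordered by the eventTimestamp attribute, which reflects when the event occurred at the source rather than when Pub/Sub received it.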

Comment 16

ID: 220590 User: snamburi3 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Tue 16 Nov 2021 20:44 Selected Answer: - Upvotes: 1

D. "PublishTime- It must not be populated by the publisher in a topics.publish call." https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage

Comment 16.1

ID: 532812 User: Tanzu Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 26 Jan 2023 11:58 Selected Answer: - Upvotes: 1

You can add the timestamp as data, not as publishTime.

Comment 16.2

ID: 220593 User: snamburi3 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Tue 16 Nov 2021 20:47 Selected Answer: - Upvotes: 1

We can order the messages: https://cloud.google.com/pubsub/docs/ordering

Comment 16.2.1

ID: 306718 User: daghayeghi Badges: - Relative Date: 4 years ago Absolute Date: Wed 09 Mar 2022 23:26 Selected Answer: - Upvotes: 1

But no one can guarantee that the messages will finally be received in order, because their order can change during transmission. So using an identifier and timestamp would be a must.

Comment 17

ID: 161137 User: haroldbenites Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 00:58 Selected Answer: - Upvotes: 5

B Correct

72. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 88

Sequence
219
Discussion ID
79771
Source URL
https://www.examtopics.com/discussions/google/view/79771-exam-professional-data-engineer-topic-1-question-88/
Posted By
AWSandeep
Posted At
Sept. 3, 2022, 1:58 p.m.

Question

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (incl. being able to reprocess all failing data).
What should you do?

  • A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
  • B. Add a try… catch block to your DoFn that transforms the data, extract erroneous rows from logs.
  • C. Add a try… catch block to your DoFn that transforms the data, write erroneous rows to Pub/Sub directly from the DoFn.
  • D. Add a try… catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to Pub/Sub later.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 826605 User: midgoo Badges: Highly Voted Relative Date: 2 years, 6 months ago Absolute Date: Sat 02 Sep 2023 07:28 Selected Answer: D Upvotes: 14

C is a big NO. Writing to Pub/Sub inside the DoFn will cause a bottleneck in the pipeline. For IO we should always use the IO libraries (e.g. PubsubIO).
Using a sideOutput is the correct answer here. There is a Qwiklab about this; it is recommended to do that lab to understand more.

Comment 2

ID: 809892 User: jonathanthezombieboy Badges: Highly Voted Relative Date: 2 years, 6 months ago Absolute Date: Tue 15 Aug 2023 19:40 Selected Answer: D Upvotes: 8

Based on the given scenario, option D would be the best approach to improve the reliability of the pipeline.

Adding a try-catch block to the DoFn that transforms the data would allow you to catch and handle errors within the pipeline. However, storing erroneous rows in Pub/Sub directly from the DoFn (Option C) could potentially create a bottleneck in the pipeline, as it adds additional I/O operations to the data processing.

Option A of filtering the erroneous data would not allow the pipeline to reprocess the failing data, which could result in data loss.

Option D of using a sideOutput to create a PCollection of erroneous data would allow for reprocessing of the failed data and would not create a bottleneck in the pipeline. Storing the erroneous data in a separate PCollection would also make it easier to debug and analyze the failed data.

Therefore, adding a try-catch block to the DoFn that transforms the data and using a sideOutput to create a PCollection of erroneous data that can be stored to Pub/Sub later would be the best approach to improve the reliability of the pipeline.

Comment 3

ID: 1191496 User: Farah_007 Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Tue 08 Oct 2024 12:05 Selected Answer: D Upvotes: 2

I think it's D because here you can write data from Dataflow PCollection to pub/sub. https://cloud.google.com/dataflow/docs/guides/write-to-pubsub

Comment 4

ID: 961839 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 19:56 Selected Answer: C Upvotes: 1

Answer is C. Here is the GitHub repo and an example from the Qwiklab where they tag the output as 'parsed_rows' and 'unparsed_rows' before they send the data to GCS. I don't see how GCS or Pub/Sub would make a difference at this point. It seems like a more maintainable solution to just parse the data in the DoFn.

1) If the function does more than that, then it serves multiple purposes and that's not good software engineering. Unless there is a good reason, writing to Pub/Sub should be separated from the DoFn.

2) It's faster to write in mini-batches or one batch than to stream the errors out. What's the need for streaming out errors one by one? Literally no real advantage.

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/quests/dataflow_python/7_Advanced_Streaming_Analytics/solution/streaming_minute_traffic_pipeline.py

Comment 5

ID: 847959 User: tibuenoc Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 23 Sep 2023 08:11 Selected Answer: D Upvotes: 2

Output errors to new PCollection – Send to collector for later analysis (Pub/Sub is a good target)

Comment 6

ID: 809431 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 15 Aug 2023 11:44 Selected Answer: - Upvotes: 3

Option D is the right approach: collect the errors as a sideOutput. Apache Beam has its own scripting conventions, not as dynamic as plain Python, so follow the standard sideOutput pattern (with_outputs in the code). The syntax in a pipeline looks like:
'ProcessData' >> beam.ParDo(MyDoFn()).with_outputs('errors', main='valid')
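The with_outputs dead-letter pattern can be illustrated outside Beam with a plain-Python sketch of the same try/except routing logic (function and tag names here are hypothetical, not from the question):

```python
# Plain-Python sketch of the try/except dead-letter routing that a Beam
# DoFn with .with_outputs expresses; names are illustrative only.
def parse_row(line):
    parts = line.split(",")
    return {"id": int(parts[0]), "value": float(parts[1])}

def split_rows(lines):
    valid, errors = [], []
    for line in lines:
        try:
            valid.append(parse_row(line))
        except (ValueError, IndexError):
            # Equivalent of yielding TaggedOutput('errors', line): the
            # bad row goes to a side output instead of failing the job.
            errors.append(line)
    return valid, errors

good, bad = split_rows(["1,2.5", "oops", "3,4.0"])
```

In the real pipeline, the errors collection would be a PCollection written out through PubsubIO (or to BigQuery/GCS) so the failing rows can be reprocessed later.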

Comment 6.1

ID: 820520 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 13:47 Selected Answer: - Upvotes: 1

After using your try/catch, you can also send the erroneous records to a dead-letter sink in BigQuery:
``` outputTuple.get(deadLetterTag).apply(BigQueryIO.write(...)) ```

Comment 7

ID: 801803 User: abwey Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 08 Aug 2023 08:47 Selected Answer: D Upvotes: 3

blahblahblahblahblahblahblahblah

Comment 8

ID: 791590 User: waiebdi Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 29 Jul 2023 11:38 Selected Answer: D Upvotes: 2

It's D.
Use a try/catch block to direct erroneous rows into a side output. The PCollection of the side output can be sent efficiently to the Pub/Sub topic via Apache Beam's PubsubIO.

It's not C, because C means sending every single invalid row in a separate request to Pub/Sub, which is very inefficient when working with Dataflow as no batching is involved.

Comment 9

ID: 732669 User: hauhau Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 01 Jun 2023 13:50 Selected Answer: - Upvotes: 1

C
D: dataflow to pub/sub is weird

Comment 10

ID: 726639 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 10:15 Selected Answer: - Upvotes: 3

D
Side output is a great manner to branch the processing. Let's take the example of an input data source that contains both valid and invalid values. Valid values must be written in place #1 and the invalid ones in place#2. A naive solution suggests to use a filter and write 2 distinct processing pipelines. However this approach has one main drawback - the input dataset is read twice. If for the mentioned problem we use side outputs, we can still have 1 ParDo transform that internally dispatches valid and invalid values to appropriate places (#1 or #2, depending on value's validity).


https://www.waitingforcode.com/apache-beam/side-output-apache-beam/read

Comment 11

ID: 725886 User: sfsdeniso Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 13:55 Selected Answer: - Upvotes: 1

Answer is D

Comment 12

ID: 712588 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 19:01 Selected Answer: C Upvotes: 2

It's C.
In D, "storing to PubSub later" doesn't really make sense.

Comment 13

ID: 695068 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 15 Apr 2023 02:23 Selected Answer: C Upvotes: 2

Answer is C. You need to reprocess all the failing data, and yes, you can use Pub/Sub as a sink, according to the documentation: https://beam.apache.org/documentation/io/connectors/

Comment 14

ID: 681639 User: nickyshil Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 28 Mar 2023 13:29 Selected Answer: - Upvotes: 4

Answer C

Comment 15

ID: 681637 User: nickyshil Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 28 Mar 2023 13:27 Selected Answer: - Upvotes: 6

The error records are directly written to Pub/Sub from the DoFn (or its equivalent in Python).
You cannot directly write a PCollection to Pub/Sub. You have to extract each record and write one at a time. Why do the additional work, and why not write it using PubsubIO in the DoFn itself?
You can write the whole PCollection to BigQuery though, as explained in the reference below.

Reference:
https://medium.com/google-cloud/dead-letter-queues-simple-implementation-strategy-for-cloud-pub-sub-80adf4a4a800

Comment 16

ID: 658404 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 14:58 Selected Answer: D Upvotes: 3

D. Add a try-catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to Pub/Sub later.

73. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 145

Sequence
223
Discussion ID
16674
Source URL
https://www.examtopics.com/discussions/google/view/16674-exam-professional-data-engineer-topic-1-question-145/
Posted By
madhu1171
Posted At
March 15, 2020, 5:07 p.m.

Question

You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:
✑ Executing the transformations on a schedule
✑ Enabling non-developer analysts to modify transformations
✑ Providing a graphical tool for designing transformations
What should you do?

  • A. Use Dataprep by Trifacta to build and maintain the transformation recipes, and execute them on a scheduled basis
  • B. Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query
  • C. Help the analysts write a Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes
  • D. Use Apache Spark on Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 64364 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 16:07 Selected Answer: - Upvotes: 35

A should be the answer

Comment 2

ID: 487229 User: JG123 Badges: Highly Voted Relative Date: 3 years, 9 months ago Absolute Date: Thu 26 May 2022 09:42 Selected Answer: - Upvotes: 7

Why are there so many wrong answers? Examtopics.com, are you enjoying paid subscriptions by giving people random answers?
Ans: A

Comment 2.1

ID: 638347 User: duytran_d Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 28 Jan 2023 01:19 Selected Answer: - Upvotes: 1

this comment is being repeated and i really appreciate this feeling :D

Comment 3

ID: 1189562 User: CGS22 Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Sat 05 Oct 2024 00:32 Selected Answer: A Upvotes: 2

A. Use Dataprep by Trifacta to build and maintain the transformation recipes, and execute them on a scheduled basis

Addresses Requirements:
Scheduled Execution: Dataprep supports running transformations on a schedule.
Analyst-Friendly: Dataprep's visual interface is designed for non-developer analysts to build and modify transformations easily.
Graphical Tool: It provides a drag-and-drop environment for designing data transformations.
Schema Flexibility: Dataprep can handle schema changes. Analysts can adapt recipes using the visual interface

Comment 4

ID: 1015453 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:11 Selected Answer: A Upvotes: 1

Scheduled Transformations: Dataprep by Trifacta allows you to design and schedule transformation recipes to process data on a regular basis. You can automate the data cleansing process by scheduling it to run monthly.

User-Friendly Interface: Dataprep provides a user-friendly graphical interface that enables non-developer analysts to design, modify, and maintain transformation recipes without writing code. This empowers analysts to work with the data effectively.

Transformation Flexibility: Dataprep supports flexible data transformations, making it suitable for scenarios where the schema of the incoming data changes. Analysts can adapt the transformations to new schemas using the visual tools provided by Dataprep.

Comment 5

ID: 893193 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 17:50 Selected Answer: A Upvotes: 4

Providing a graphical tool for designing transformations is enough for A

Comment 6

ID: 822133 User: Dhruv28 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 26 Aug 2023 07:19 Selected Answer: - Upvotes: 1

Your company receives a lot of financial data in CSV files. The files need to be processed, cleaned and transformed before they are made available for analytics. The schema of the data also changes every third month. The Data analysts should be able to perform the tasks
1. No prior knowledge of any language with no coding
2. Provided a GUI tool to build and modify the schema
What solution best fits the need?

Comment 7

ID: 661094 User: arpitagrawal Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 13:31 Selected Answer: A Upvotes: 2

non-developer analysts

Comment 8

ID: 609536 User: devdimidved Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 30 Nov 2022 07:07 Selected Answer: A Upvotes: 1

Dataprep is for non developers

Comment 9

ID: 603613 User: amitsingla012 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 19 Nov 2022 08:35 Selected Answer: A Upvotes: 1

Option A -- Dataprep is the right answer

Comment 10

ID: 555856 User: Prasanna_kumar Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 25 Aug 2022 08:52 Selected Answer: - Upvotes: 1

Answer is A

Comment 11

ID: 520404 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 18:33 Selected Answer: A Upvotes: 2

A: https://cloud.google.com/dataprep/

Comment 12

ID: 519559 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 13:55 Selected Answer: A Upvotes: 1

Cloud Dataprep is a tool to do the job.

Comment 13

ID: 422082 User: sandipk91 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Wed 09 Feb 2022 14:29 Selected Answer: - Upvotes: 4

vote for option A

Comment 14

ID: 398344 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 15:58 Selected Answer: - Upvotes: 5

Vote for 'A', because of requirement - Enabling non-developer analysts to modify transformations

Comment 15

ID: 163563 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 14:38 Selected Answer: - Upvotes: 3

A is correct

Comment 16

ID: 131532 User: SSV Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Sun 10 Jan 2021 17:58 Selected Answer: - Upvotes: 1

Answer should be D. Dataprep will detect the schema automatically in the initial recipe. After 3 months, if the schema has changed, the scheduled Dataprep job cannot handle it. So D should be the option.

Comment 16.1

ID: 148760 User: Archy Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Tue 02 Feb 2021 00:55 Selected Answer: - Upvotes: 5

spark is not graphical tool.

Comment 16.2

ID: 167885 User: atnafu2020 Badges: - Relative Date: 5 years ago Absolute Date: Sun 28 Feb 2021 01:56 Selected Answer: - Upvotes: 8

A
you can use dataprep for continuously changing target schema
In general, a target consists of the set of information required to define the expected data in a dataset. Often referred to as a "schema," this target schema information can include:

Names of columns
Order of columns
Column data types
Data type format
Example rows of data
A dataset associated with a target is expected to conform to the requirements of the schema. Where there are differences between target schema and dataset schema, a validation indicator (or schema tag) is displayed.
https://cloud.google.com/dataprep/docs/html/Overview-of-RapidTarget_136155049

Comment 17

ID: 70331 User: Rajokkiyam Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Fri 02 Oct 2020 05:17 Selected Answer: - Upvotes: 5

Answer A

74. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 62

Sequence
225
Discussion ID
17104
Source URL
https://www.examtopics.com/discussions/google/view/17104-exam-professional-data-engineer-topic-1-question-62/
Posted By
-
Posted At
March 21, 2020, 3:16 p.m.

Question

Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period.
However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?

  • A. Set a single global window to capture all the data.
  • B. Set sliding windows to capture all the lagged data.
  • C. Use watermarks and timestamps to capture the lagged data.
  • D. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 22 comments Click to expand

Comment 1

ID: 784904 User: samdhimal Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 04:06 Selected Answer: - Upvotes: 9

C: Use watermarks and timestamps to capture the lagged data.

Watermarks are a way to indicate that some data may still be in transit and not yet processed. By setting a watermark, you can define a time period during which Dataflow will continue to accept late or out-of-order data and incorporate it into your processing. This allows you to maintain a predictable time period for processing while still allowing for some flexibility in the arrival of data.

Timestamps, on the other hand, are used to order events correctly, even if they arrive out of order. By assigning timestamps to each event, you can ensure that they are processed in the correct order, even if they don't arrive in that order.

Comment 1.1

ID: 784905 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 04:06 Selected Answer: - Upvotes: 4

Option A: Set a single global window to capture all the data is not a good idea because it may not allow for late or out-of-order data to be processed.

Option B: Set sliding windows to capture all the lagged data is not suitable for the case where you want to process the data over a predictable time period. Sliding windows are used when you want to process data over a period of time that is continuously moving forward, not a fixed period.

Option D: Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data is a good practice but not a complete solution, because it only ensures that data is ordered correctly, but it does not account for data that may be late.

Comment 2

ID: 67055 User: jvg637 Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Mon 22 Mar 2021 19:16 Selected Answer: - Upvotes: 8

We need a combination of window + watermark (timestamps) + trigger to handle the late data. So D.

Comment 3

ID: 1017066 User: MikkelRev Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 19:32 Selected Answer: C Upvotes: 1

option C: Use watermarks and timestamps to capture the lagged data.

Comment 4

ID: 1017065 User: MikkelRev Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 19:32 Selected Answer: - Upvotes: 1

option C: Use watermarks and timestamps to capture the lagged data.

Comment 5

ID: 778377 User: desertlotus1211 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 00:35 Selected Answer: - Upvotes: 1

Answer is C:

There is no such thing as sliding windows used by Dataflow.

Comment 5.1

ID: 783122 User: DeeData Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 21 Jan 2024 09:30 Selected Answer: - Upvotes: 1

I highly doubt that; Dataflow windowing is divided into three (3) types:

1. Fixed
2. Sliding
3. Session

Comment 5.2

ID: 959548 User: Mathew106 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 22 Jul 2024 15:20 Selected Answer: - Upvotes: 1

The naming in Apache Beam is: Fixed, Sliding, Session
In Dataflow it's: Tumbling, Hopping, Session.
I was very confused at first too when I saw "hopping" in a question.
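Whatever the naming, the two non-session shapes assign elements differently. A small stdlib sketch of the assignment rules (window bounds in epoch seconds; illustrative only, not the Beam/Dataflow implementation):

```python
def fixed_window(ts: int, size: int) -> tuple:
    """Fixed/tumbling: each timestamp belongs to exactly one window."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> list:
    """Sliding/hopping: a new window of length `size` starts every
    `period` seconds, so each timestamp falls into size // period
    overlapping windows (when period evenly divides size)."""
    last_start = ts - ts % period
    return [(s, s + size)
            for s in range(last_start - size + period, last_start + 1, period)
            if s <= ts < s + size]
```

With a 1-hour window sliding every 5 minutes (the classic moving-average setup), every element lands in 12 overlapping windows.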

Comment 6

ID: 766193 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 02:58 Selected Answer: - Upvotes: 1

Answer is Use watermarks and timestamps to capture the lagged data.

A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data.

Comment 7

ID: 745507 User: DGames Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 14 Dec 2023 23:36 Selected Answer: C Upvotes: 2

Watermark is used for late data.

Comment 8

ID: 609405 User: FrankT2L Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 20:23 Selected Answer: B Upvotes: 1

Preemptible workers are the default secondary worker type. They are reclaimed and removed from the cluster if they are required by Google Cloud for other tasks. Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost

https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms

Comment 8.1

ID: 609409 User: FrankT2L Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 20:26 Selected Answer: - Upvotes: 8

delete this answer. The answer belongs to another question

Comment 9

ID: 548747 User: Tanzu Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 17:12 Selected Answer: A Upvotes: 1

That's why we have watermarks in apache beam.

Comment 10

ID: 545841 User: VishalBule Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Feb 2023 13:47 Selected Answer: - Upvotes: 1

Answer is C Use watermarks and timestamps to capture the lagged data.

A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data.

Comment 11

ID: 516763 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 16:22 Selected Answer: C Upvotes: 3

"Watermark in implementation is a monotonically increasing timestamp. When Beam/Dataflow see a record with an event timestamp that is earlier than the watermark, the record is treated as late data."

Comment 12

ID: 505617 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 20 Dec 2022 19:18 Selected Answer: C Upvotes: 4

A is a direct no: if the data doesn't have timestamps, we only have the processing time and not the "event time".
B isn't either; sliding windows are not for this. Hopping/sliding windowing is useful for taking running averages of data, not for processing late data.
D looks correct but is missing one concept: the watermark, which tells us whether the processing time is consistent with the event time. I'm not 100% sure it's incorrect; since we have a "predictable time period", it might do. I mean, if our dashboard is shown after the last input data has arrived (single global window), this should be OK; we'd have a "perfect watermark". Either way we'd need triggering.

Comment 12.1

ID: 505621 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 20 Dec 2022 19:19 Selected Answer: - Upvotes: 3

C is, I think, the correct answer: Watermark is different from late data. Watermark in implementation is a monotonically increasing timestamp. When Beam/Dataflow see a record with an event timestamp that is earlier than the watermark, the record is treated as late data.
I’ll try to explain: Late data is inherent to Beam’s model for out-of-order processing. What does it mean for data to be late? The definition and its properties are intertwined with watermarks that track the progress of each computation across the event time domain. The simple intuition behind handling lateness is this: only late input should result in late data anywhere in the pipeline.
So it's not easy to decide between C and D. If you ask me, I'd say C, since D requires us to make some suppositions.

Comment 13

ID: 503894 User: Jlozano Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 17 Dec 2022 21:24 Selected Answer: C Upvotes: 4

"Expert Verified", but >50% of the questions have random answers. "Sliding window", really? Please, this can be fixed easily with our most-voted answer. Of course, the correct answer is C.

Comment 14

ID: 487538 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 19:04 Selected Answer: - Upvotes: 4

Why are there so many wrong answers? Examtopics.com, are you enjoying paid subscriptions while giving people random answers?
Ans: C

Comment 15

ID: 463633 User: anji007 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Mon 17 Oct 2022 18:51 Selected Answer: - Upvotes: 2

Ans: C

Comment 16

ID: 424532 User: safiyu Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sat 13 Aug 2022 22:41 Selected Answer: - Upvotes: 7

Answer should be C. sliding windows are meant for calculating running average and not lagging data. Watermark is best for this purpose

Comment 17

ID: 393224 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 28 Jun 2022 22:38 Selected Answer: - Upvotes: 4

vote for 'C'

75. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 97

Sequence
226
Discussion ID
16109
Source URL
https://www.examtopics.com/discussions/google/view/16109-exam-professional-data-engineer-topic-1-question-97/
Posted By
cleroy
Posted At
March 10, 2020, 3:10 p.m.

Question

You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.
What should you do?

  • A. Use Cloud Dataflow with Beam to detect errors and perform transformations.
  • B. Use Cloud Dataprep with recipes to detect errors and perform transformations.
  • C. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
  • D. Use federated tables in BigQuery with queries to detect errors and perform transformations.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 22 comments Click to expand

Comment 1

ID: 61727 User: cleroy Badges: Highly Voted Relative Date: 6 years ago Absolute Date: Tue 10 Mar 2020 15:10 Selected Answer: - Upvotes: 56

Use Dataprep ! It's THE tool for this

Comment 2

ID: 65103 User: rickywck Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 09:53 Selected Answer: - Upvotes: 53

Yes B.

Honestly speaking, sometimes I think the answers posted here are intentionally meant to mislead people who don't have proper knowledge of the subject and are just memorizing answers to pass the exam.

Comment 3

ID: 1001086 User: sergiomujica Badges: Most Recent Relative Date: 2 years, 6 months ago Absolute Date: Thu 07 Sep 2023 03:13 Selected Answer: A Upvotes: 1

A is the right way to do it... Dataprep is clumsy

Comment 3.1

ID: 1042130 User: Wudihero2 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 13 Oct 2023 01:36 Selected Answer: - Upvotes: 7

...Did you even read through the question? It says "not require programming or knowledge of SQL". YOU are the one who's clumsy, not dataprep.

Comment 4

ID: 971437 User: crazycosmos Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 03 Aug 2023 22:06 Selected Answer: B Upvotes: 1

no programming -> B

Comment 5

ID: 967283 User: FP77 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 30 Jul 2023 18:26 Selected Answer: B Upvotes: 2

I honestly do not understand what the deal is with this website. The correct answer is obviously Dataprep. How can they say it's A?

Comment 6

ID: 775217 User: Besss Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 14 Jan 2023 10:25 Selected Answer: B Upvotes: 2

It's B. Dataprep is the right tool.
https://cloud.google.com/dataprep

Comment 7

ID: 675314 User: sedado77 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 21 Sep 2022 18:23 Selected Answer: - Upvotes: 4

I got this question on sept 2022.

Comment 8

ID: 669495 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 15 Sep 2022 07:28 Selected Answer: B Upvotes: 2

B. Actually there are two tools that could fix this problem:
Dataprep relies on Dataflow.
Data Fusion relies on Dataproc.

Comment 8.1

ID: 1288940 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 11:00 Selected Answer: - Upvotes: 2

Data Fusion is low-code but not no-code. The only no-code system is Dataprep.

Comment 9

ID: 598981 User: diagniste Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 09 May 2022 12:18 Selected Answer: A Upvotes: 1

A is the best answer!

Comment 9.1

ID: 779387 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Wed 18 Jan 2023 00:12 Selected Answer: - Upvotes: 1

dataflow IS Apache Beam...

Comment 10

ID: 555403 User: Venkat007 Badges: - Relative Date: 4 years ago Absolute Date: Thu 24 Feb 2022 17:55 Selected Answer: B Upvotes: 2

B Dataprep

Comment 11

ID: 518492 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 20:03 Selected Answer: B Upvotes: 2

https://cloud.google.com/dataprep/

Comment 12

ID: 513448 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 30 Dec 2021 15:42 Selected Answer: B Upvotes: 6

B, “Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning”
https://cloud.google.com/dataprep/

Comment 12.1

ID: 633971 User: dattatray_shinde Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Wed 20 Jul 2022 12:06 Selected Answer: - Upvotes: 2

max you rock man!

Comment 13

ID: 466354 User: GirijaSrinivasan Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 23 Oct 2021 00:40 Selected Answer: - Upvotes: 4

Answer is B. Data prep. The keyword here is no programming skills required.

Comment 14

ID: 446283 User: nguyenmoon Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Fri 17 Sep 2021 05:07 Selected Answer: - Upvotes: 2

B- Dataprep

Comment 15

ID: 420633 User: pass_gcp Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Fri 06 Aug 2021 08:37 Selected Answer: - Upvotes: 2

Use Dataprep... that is the answer.

Comment 16

ID: 396129 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 17:36 Selected Answer: - Upvotes: 2

Vote for B

Comment 16.1

ID: 396149 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 17:58 Selected Answer: - Upvotes: 1

Cloud Dataprep - almost fully automated

Comment 17

ID: 265455 User: AnilKr Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Tue 12 Jan 2021 12:33 Selected Answer: - Upvotes: 1

Most of the answers are wrong, even though this one is simple.

76. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 124

Sequence
229
Discussion ID
81264
Source URL
https://www.examtopics.com/discussions/google/view/81264-exam-professional-data-engineer-topic-1-question-124/
Posted By
kenanars
Posted At
Sept. 8, 2022, 8:07 p.m.

Question

You are designing a cloud-native historical data processing system to meet the following conditions:
✑ The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Dataproc, BigQuery, and Compute
Engine.
✑ A batch pipeline moves daily data.
✑ Performance is not a factor in the solution.
✑ The solution design should maximize availability.
How should you design data storage for this solution?

  • A. Create a Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • B. Store the data in BigQuery. Access the data using the BigQuery Connector on Dataproc and Compute Engine.
  • C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Dataproc, BigQuery, and Compute Engine.
  • D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Dataproc, BigQuery, and Compute Engine.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 6 comments Click to expand

Comment 1

ID: 697377 User: jkhong Badges: Highly Voted Relative Date: 3 years, 4 months ago Absolute Date: Mon 17 Oct 2022 13:57 Selected Answer: D Upvotes: 7

Problem: How to store data?
Considerations: High availability, performance not an issue

A → avoid HDFS
C → multi-regional > regional in terms of availability

B could be the answer, but we're dealing with PDF documents, so we need blob storage (Cloud Storage). If we only had CSV or Avro, B might be the answer.

Comment 2

ID: 1287401 User: serch_engine Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Sat 21 Sep 2024 18:13 Selected Answer: D Upvotes: 1

D is the answer

Comment 3

ID: 762450 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 04:00 Selected Answer: - Upvotes: 1

D is right

Comment 4

ID: 761263 User: dconesoko Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 29 Dec 2022 17:36 Selected Answer: D Upvotes: 2

vote for D

Comment 5

ID: 696331 User: devaid Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 16 Oct 2022 17:24 Selected Answer: D Upvotes: 2

D of course

Comment 6

ID: 663892 User: kenanars Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 20:07 Selected Answer: D Upvotes: 1

D is the correct answer

77. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 42

Sequence
231
Discussion ID
16660
Source URL
https://www.examtopics.com/discussions/google/view/16660-exam-professional-data-engineer-topic-1-question-42/
Posted By
jvg637
Posted At
March 15, 2020, 1:43 p.m.

Question

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than before. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

  • A. Rewrite the job in Pig.
  • B. Rewrite the job in Apache Spark.
  • C. Increase the size of the Hadoop cluster.
  • D. Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 64260 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 12:43 Selected Answer: - Upvotes: 36

I would say B since Apache Spark is faster than Hadoop/Pig/MapReduce

Comment 1.1

ID: 1176580 User: Trocinek Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 18 Sep 2024 15:01 Selected Answer: - Upvotes: 1

But it requires much more memory, making it more expensive, which is not what we're aiming for here.

Comment 2

ID: 765322 User: ler_mp Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 06:48 Selected Answer: - Upvotes: 19

Wow, a question that does not recommend to use Google product

Comment 3

ID: 1076488 User: axantroff Badges: Most Recent Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 17:19 Selected Answer: B Upvotes: 1

Just a regular Spark. B

Comment 4

ID: 1073109 User: DataFrame Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 17 May 2024 08:30 Selected Answer: - Upvotes: 1

C. I think it should be C because the intent of the question is to highlight the problem of on-prem auto-scaling, not the optimization we achieve using Spark's in-memory features. It's a GCP exam; they want to highlight that if a Hadoop cluster's commodity hardware doesn't grow when data grows, it creates problems, unlike GCP. Hence, migrate to GCP.

Comment 5

ID: 948077 User: itsmynickname Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 15:12 Selected Answer: - Upvotes: 11

None. Being a GCP exam, it must be either Dataflow or BigQuery :D

Comment 6

ID: 880583 User: KHAN0007 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 17:33 Selected Answer: - Upvotes: 6

I would like to take a moment to thank you all guys
You guys are awesome!!!

Comment 7

ID: 753570 User: Whoswho Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 22 Jun 2023 18:27 Selected Answer: - Upvotes: 8

looks like he's trying to spark the company up.

Comment 7.1

ID: 948075 User: itsmynickname Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 15:10 Selected Answer: - Upvotes: 2

It seems he's not well paid.

Comment 8

ID: 750310 User: Krish6488 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 23:11 Selected Answer: B Upvotes: 4

Both Pig and Spark require rewriting the code, so there is additional overhead either way, but as an architect I would think about a long-lasting solution. Resizing the Hadoop cluster can resolve the problem for the workloads at that point in time, but not in the longer run. So Spark is the right choice: although it is a cost to start with, it will certainly be a longer-lasting solution.

Comment 9

ID: 622693 User: Mamta072 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 26 Dec 2022 20:08 Selected Answer: - Upvotes: 2

Ans is B . Apache spark.

Comment 10

ID: 588703 User: alecuba16 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 15:17 Selected Answer: B Upvotes: 4

SPARK > hadoop, pig, hive

Comment 11

ID: 544160 User: kped21 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 09 Aug 2022 23:31 Selected Answer: - Upvotes: 1

B - Apache Spark

Comment 11.1

ID: 862826 User: luamail Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 06 Oct 2023 11:42 Selected Answer: - Upvotes: 2

https://www.ibm.com/cloud/blog/hadoop-vs-spark

Comment 12

ID: 535007 User: kped21 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Thu 28 Jul 2022 23:44 Selected Answer: - Upvotes: 1

B Spark for optimization and processing.

Comment 13

ID: 523239 User: sraakesh95 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Thu 14 Jul 2022 01:13 Selected Answer: B Upvotes: 1

B: Spark is suitable for the given operation and is much more powerful.

Comment 14

ID: 516579 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 04 Jul 2022 12:28 Selected Answer: B Upvotes: 1

as explained by pr2web

Comment 15

ID: 498821 User: pr2web Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Fri 10 Jun 2022 16:13 Selected Answer: B Upvotes: 1

Ans B:
Spark can be up to 100 times faster because it utilizes in-memory processing instead of Hadoop MapReduce's two-stage disk-based paradigm.

Comment 16

ID: 478890 User: MaxNRG Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sun 15 May 2022 17:53 Selected Answer: - Upvotes: 1

B as Spark can improve the performance as it performs lazy in-memory execution.
Spark is important because it does part of its pipeline processing in memory rather than copying from disk. For some applications, this makes Spark extremely fast.

Comment 16.1

ID: 478891 User: MaxNRG Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sun 15 May 2022 17:54 Selected Answer: - Upvotes: 1

With a Spark pipeline, you have two different kinds of operations: transforms and actions. Spark builds its pipeline using an abstraction called a directed graph. Each transform adds nodes to the graph, but Spark doesn't execute the pipeline until it sees an action.
Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline. The process of waiting on transforms and executing on actions is called lazy execution. For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then it waits. An action triggers Spark to process the pipeline; the output is usually a result format, such as a text file, rather than an RDD.
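The transform/action split described above can be mimicked with a toy class. This is only an illustrative sketch of the lazy-execution idea, not real Spark; `LazyRDD` and its method names are invented for the example.

```python
class LazyRDD:
    """Toy model of Spark's lazy execution: transformations only
    append to a recorded plan (the 'directed graph'); an action
    actually runs the plan over the data."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []

    def map(self, f):                      # transformation: recorded, not run
        return LazyRDD(self._data, self._plan + [("map", f)])

    def filter(self, p):                   # transformation: recorded, not run
        return LazyRDD(self._data, self._plan + [("filter", p)])

    def collect(self):                     # action: executes the whole plan
        out = list(self._data)
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out


rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
result = rdd.collect()                     # nothing ran until this line
```

Building `rdd` does no work at all; only `collect()` walks the recorded plan, which is exactly why Spark can optimize the whole graph before running it.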

Comment 16.1.1

ID: 478892 User: MaxNRG Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sun 15 May 2022 17:54 Selected Answer: - Upvotes: 3

Option A is wrong as Pig is wrapper and would initiate Map Reduce jobs
Option C is wrong as it would increase the cost.
Option D is wrong Hive is wrapper and would initiate Map Reduce jobs. Also, reducing the size would reduce performance.

Comment 16.1.1.1

ID: 698847 User: kastuarr Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 19 Apr 2023 10:58 Selected Answer: - Upvotes: 1

Won't option B increase the cost? The cost of rewriting the job in Spark plus the cost of additional memory?

Comment 17

ID: 462731 User: anji007 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 15 Apr 2022 19:41 Selected Answer: - Upvotes: 2

Ans: B
Spark performs better than MapReduce due to in memory processing.

78. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 110

Sequence
233
Discussion ID
17252
Source URL
https://www.examtopics.com/discussions/google/view/17252-exam-professional-data-engineer-topic-1-question-110/
Posted By
-
Posted At
March 22, 2020, 2:51 p.m.

Question

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

  • A. Add a SideInput that returns a Boolean if the element is corrupt.
  • B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
  • C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
  • D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 185006 User: SteelWarrior Badges: Highly Voted Relative Date: 3 years, 11 months ago Absolute Date: Wed 23 Mar 2022 07:51 Selected Answer: - Upvotes: 7

Should be B. The Partition transform would require each element to be identified as valid/invalid before partitioning the PCollection, which means some logic must be executed before the Partition transform is invoked. That logic can be implemented in a ParDo transform, which can both identify valid/invalid records and generate two PCollections, one with valid records and the other with invalid records.

Comment 2

ID: 514695 User: MaxNRG Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Sat 01 Jul 2023 21:30 Selected Answer: B Upvotes: 6

B: ParDo is a Beam transform for generic parallel processing. ParDo is useful for common data processing operations, including:
a. Filtering a data set. You can use ParDo to consider each element in a PCollection and either output that element to a new collection, or discard it.
b. Formatting or type-converting each element in a data set.
c. Extracting parts of each element in a data set.
d. Performing computations on each element in a data set.
A does not help
C Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections. Again, does not help
D GroupByKey is a Beam transform for processing collections of key/value pairs. GroupByKey is a good way to aggregate data that has something in common

Comment 3

ID: 837490 User: midgoo Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Fri 13 Sep 2024 00:39 Selected Answer: B Upvotes: 3

A - SideInput is often used to validate data, however, we need to create the SideInput first. When using SideInput to filter data, it is actually another ParDo call.
C, D - This is common way to filter too, but we will need the key in order to partition or GroupByKey
B - ParDo is the most basic method, it can do anything to the PCollection

Comment 4

ID: 762278 User: AzureDP900 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 19:56 Selected Answer: - Upvotes: 1

B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.

Comment 5

ID: 518516 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 19:24 Selected Answer: B Upvotes: 2

Filtering with ParDo. ParDo is a Beam transform for generic parallel processing, useful for common data processing operations.

Comment 5.1

ID: 762274 User: AzureDP900 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 19:52 Selected Answer: - Upvotes: 1

I agree with B

Comment 6

ID: 396887 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 15:56 Selected Answer: - Upvotes: 4

vote for 'B', ParDo can discard the elements.

https://beam.apache.org/documentation/programming-guide/

Comment 7

ID: 265350 User: DeepakKhattar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 12 Jul 2022 07:30 Selected Answer: - Upvotes: 3

B seems to be the better option since we only need to filter out the corrupt data; the question does not say we need to store it in a different PCollection.
https://beam.apache.org/documentation/transforms/python/overview/
ParDo is general-purpose, whereas Partition splits the elements into different PCollections.
https://beam.apache.org/documentation/transforms/python/elementwise/partition/

Comment 8

ID: 222496 User: arghya13 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Thu 19 May 2022 06:35 Selected Answer: - Upvotes: 3

B is correct

Comment 9

ID: 162888 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 14:01 Selected Answer: - Upvotes: 3

B is correct

Comment 10

ID: 148150 User: Archy Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 31 Jan 2022 18:52 Selected Answer: - Upvotes: 4

B, ParDo is useful for a variety of common data processing operations, including:

Filtering a data set. You can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.

Comment 11

ID: 134269 User: tprashanth Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 13 Jan 2022 22:43 Selected Answer: - Upvotes: 2

Looks like C it is
https://beam.apache.org/documentation/programming-guide/

Comment 11.1

ID: 163935 User: atnafu2020 Badges: - Relative Date: 4 years ago Absolute Date: Wed 23 Feb 2022 00:19 Selected Answer: - Upvotes: 5

According to this link, it's ParDo:
* Filtering a data set. You can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.
* Partition, by contrast, just splits: it is a Beam transform for PCollection objects that store the same data type, and it splits a single PCollection into a fixed number of smaller collections.

Comment 11.1.1

ID: 241038 User: xrun Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Sat 11 Jun 2022 15:56 Selected Answer: - Upvotes: 1

Seems like two answers may be correct. With ParDo you can discard corrupt data. With Partition you can split the data into two PCollections: corrupt and ok. You stream ok data further to BigQuery and corrupt data to some other storage for analysis. If one is not interested in analysis, then ParDo is enough.

Comment 12

ID: 128971 User: dg63 Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 07 Jan 2022 15:53 Selected Answer: - Upvotes: 5

Correct answer should be "C". A ParDo transform allows the processing to happen in parallel using multiple workers. A Partition transform allows data to be partitioned into two different PCollections according to some logic. Using a Partition transform, one can split off the corrupted data and finally discard it.

Comment 13

ID: 127416 User: Rajuuu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 07:22 Selected Answer: - Upvotes: 4

Correct B.

Comment 14

ID: 121831 User: norwayping Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 28 Dec 2021 14:52 Selected Answer: - Upvotes: 5

Correct - B

79. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 153

Sequence
236
Discussion ID
17218
Source URL
https://www.examtopics.com/discussions/google/view/17218-exam-professional-data-engineer-topic-1-question-153/
Posted By
-
Posted At
March 22, 2020, 8:12 a.m.

Question

You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?

  • A. Consume the stream of data in Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
  • B. Consume the stream of data in Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
  • C. Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Bigtable in the last hour. If that number falls below 4000, send an alert.
  • D. Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 218912 User: Alasmindas Badges: Highly Voted Relative Date: 4 years, 10 months ago Absolute Date: Fri 14 May 2021 03:58 Selected Answer: - Upvotes: 7

Option A is the correct answer. Reasons:
a) Kafka IO with Dataflow is a valid interconnect option, regardless of where Kafka is located (on-premises, Google Cloud, or another cloud).
b) A sliding window will help calculate the moving average.

Options C and D are overkill and overly complex for the scenario in the question.
https://cloud.google.com/solutions/processing-messages-from-kafka-hosted-outside-gcp
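To make the sliding-window choice in option A concrete, here is a pure-Python sketch of the logic (in Beam this would be `WindowInto(SlidingWindows(size=3600, period=300))` applied to the KafkaIO stream); the function name, sample data, and threshold handling below are illustrative only:

```python
def sliding_window_alerts(events, window_secs=3600, period_secs=300, threshold=4000):
    """events: (timestamp_secs, messages_per_sec) samples.
    Return the end time of every 1-hour window, evaluated every 5 minutes,
    whose average message rate fell below the threshold."""
    if not events:
        return []
    start = min(t for t, _ in events)
    end = max(t for t, _ in events)
    alerts = []
    win_end = start + window_secs
    while win_end <= end + period_secs:
        # All samples whose timestamp falls inside the current 1-hour window.
        rates = [r for t, r in events if win_end - window_secs <= t < win_end]
        if rates and sum(rates) / len(rates) < threshold:
            alerts.append(win_end)
        win_end += period_secs
    return alerts

# One sample per minute: an hour at a healthy 5000 msg/s, then an hour at 3000.
healthy = [(i * 60, 5000) for i in range(60)]
degraded = [(3600 + i * 60, 3000) for i in range(60)]
alerts = sliding_window_alerts(healthy + degraded)
```

Note how the hourly average only crosses below 4000 once more than half of a window is degraded, which is exactly the smoothing behavior a moving average is meant to provide; a fixed (tumbling) window, by contrast, would only evaluate once per hour.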

Comment 2

ID: 70574 User: Rajokkiyam Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 02:33 Selected Answer: - Upvotes: 6

Should be A.

Comment 3

ID: 1165793 User: mothkuri Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 04 Sep 2024 16:28 Selected Answer: A Upvotes: 2

Option A is the correct answer.
Option B is not correct: with fixed windows, a drop spanning from the middle of one window to the middle of the next could go undetected.
Options C & D are out of scope.

Comment 4

ID: 1015838 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 16:41 Selected Answer: A Upvotes: 2

Dataflow with Sliding Time Windows: Dataflow allows you to work with event-time windows, making it suitable for time-series data like incoming IoT messages. Using sliding windows every 5 minutes allows you to compute moving averages efficiently.

Sliding Time Window: The sliding time window of 1 hour every 5 minutes enables you to calculate the moving average over the specified time frame.

Computing Averages: You can efficiently compute the average when each sliding window closes. This approach ensures that you have real-time visibility into the message rate and can detect deviations from the expected rate.

Alerting: When the calculated average drops below 4000 messages per second, you can trigger an alert from within the Dataflow pipeline, sending it to your desired alerting mechanism, such as Cloud Monitoring, Pub/Sub, or another notification service.

Scalability: Dataflow can scale automatically based on the incoming data volume, ensuring that you can handle the expected rate of 5000 messages per second.

Comment 5

ID: 963242 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 05:12 Selected Answer: A Upvotes: 2

Option A

Pros:

This option is relatively simple to implement.
It can be used to compute the moving average over any time window.
Cons:

This option can be computationally expensive, especially if the data stream is large.
It can be difficult to troubleshoot if the alert does not fire when it is supposed to.

Comment 6

ID: 893837 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 13:54 Selected Answer: A Upvotes: 2

The correct answer is between A and B, since it doesn't make sense to combine Pub/Sub with Kafka. For a moving average we should go with A, updating the average estimate every 5 minutes using the newly arrived data and dropping the oldest 5 minutes.

Comment 7

ID: 520128 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 10:44 Selected Answer: A Upvotes: 2

as explained by Alasmindas

Comment 8

ID: 499389 User: AACHB Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Sat 11 Jun 2022 12:25 Selected Answer: A Upvotes: 2

Correct Answer: A

Comment 9

ID: 486425 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 04:09 Selected Answer: - Upvotes: 1

Correct: A

Comment 10

ID: 455572 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 01 Apr 2022 15:20 Selected Answer: - Upvotes: 1

A is enough

Comment 11

ID: 294585 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 20:33 Selected Answer: - Upvotes: 2

A:
The correct answer is between A and B, but because the question asks for a "moving average", we should go with A.

Comment 12

ID: 256667 User: apnu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 05:48 Selected Answer: - Upvotes: 2

Yes, using KafkaIO we can connect to a Kafka cluster.

Comment 13

ID: 251440 User: ashuchip Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 24 Jun 2021 04:37 Selected Answer: - Upvotes: 3

Yes, A is correct, because only a sliding window helps here.

Comment 14

ID: 218906 User: Alasmindas Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Fri 14 May 2021 03:41 Selected Answer: - Upvotes: 6

Option A is the correct answer. Reasons:
a) Kafka IO with Dataflow is a valid interconnect option, regardless of where Kafka is located (on-premises, Google Cloud, or another cloud).
b) A sliding window will help calculate the moving average.

Options C and D are overkill and overly complex for the scenario in the question.

Comment 15

ID: 168573 User: atnafu2020 Badges: - Relative Date: 5 years ago Absolute Date: Sun 28 Feb 2021 18:37 Selected Answer: - Upvotes: 2

A
To take running averages of data, use hopping windows. You can use one-minute hopping windows with a thirty-second period to compute a one-minute running average every thirty seconds.

Comment 16

ID: 149209 User: Prakzz Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Tue 02 Feb 2021 18:20 Selected Answer: - Upvotes: 1

I don't think it's A or B. Dataflow can't connect directly to Kafka.

Comment 16.1

ID: 157745 User: FARR Badges: - Relative Date: 5 years ago Absolute Date: Sun 14 Feb 2021 05:16 Selected Answer: - Upvotes: 3

Yes, via KafkaIO. See the link in above comment

Comment 17

ID: 142724 User: kino2020 Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Sun 24 Jan 2021 15:35 Selected Answer: - Upvotes: 4

"You operate an IoT pipeline built around Apache Kafka"
That is what the question states, so building around Kafka is part of the requirements for this problem.

Just in case you are wondering, a case along with this problem is listed on google by the architects.
"Using Cloud Dataflow to Process Outside-Hosted Messages from Kafka"
https://cloud.google.com/solutions/processing-messages-from-kafka-hosted-outside-gcp

Therefore, A is the correct answer.

Comment 17.1

ID: 223233 User: SPutri Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Thu 20 May 2021 02:06 Selected Answer: - Upvotes: 2

the link that you share above is saying, "..illustrates a popular scenario: you use Dataflow to process the messages, where Kafka is hosted either on-premises or in another public cloud such as Amazon Web Services (AWS)."
But in this case we are processing data coming from an IoT pipeline, not from on-premises or another cloud, so I don't think A is the proper solution. I would consider option C instead.

80. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 155

Sequence
237
Discussion ID
80517
Source URL
https://www.examtopics.com/discussions/google/view/80517-exam-professional-data-engineer-topic-1-question-155/
Posted By
YorelNation
Posted At
Sept. 6, 2022, 7:57 a.m.

Question

Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are:
✑ The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
✑ Support for publish/subscribe semantics on hundreds of topics

✑ Retain per-key ordering
Which system should you choose?

  • A. Apache Kafka
  • B. Cloud Storage
  • C. Dataflow
  • D. Firebase Cloud Messaging

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 6 comments Click to expand

Comment 1

ID: 660855 User: YorelNation Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 08:57 Selected Answer: A Upvotes: 10

A I think it's the only technology that met the requirements

Comment 2

ID: 676138 User: dn_mohammed_data Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 15:49 Selected Answer: - Upvotes: 7

vote for A: topics, offsets --> apache kafka

Comment 3

ID: 1165796 User: mothkuri Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 04 Sep 2024 16:30 Selected Answer: A Upvotes: 1

Only Kafka can support publish/subscribe semantics on hundreds of topics

Comment 4

ID: 1015856 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 16:56 Selected Answer: - Upvotes: 4

Ability to Seek to a Particular Offset: Kafka allows consumers to seek to a specific offset in a topic, enabling you to read data from a specific point, including back to the start of all data ever captured. This is a fundamental capability of Kafka.

Support for Publish/Subscribe Semantics: Kafka supports publish/subscribe semantics through topics. You can have hundreds of topics in Kafka, and consumers can subscribe to these topics to receive messages in a publish/subscribe fashion.

Retain Per-Key Ordering: Kafka retains the order of messages within a partition. If you have a key associated with your messages, you can ensure per-key ordering by sending messages with the same key to the same partition.

Scalability: Kafka is designed to handle high-throughput data streaming and is capable of scaling to meet your needs.

Apache Kafka aligns well with the requirements you've outlined for centralized data ingestion and delivery. It's a robust choice for scenarios that involve data streaming, publish/subscribe, and retaining message ordering.
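The per-key ordering point above can be illustrated with a small pure-Python sketch of the partitioning idea. Note this is only an illustration: Kafka's real default partitioner hashes the key with murmur2 (not md5), and all names and values below are made up:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition, so every
    message with the same key lands in the same ordered log."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Each partition is an append-only log; order is preserved per partition.
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key, value in [("user-1", "click"), ("user-2", "view"),
                   ("user-1", "purchase"), ("user-1", "logout")]:
    partitions[partition_for(key)].append((key, value))
```

Because the mapping is deterministic, all of `user-1`'s events land in one partition in send order, which is exactly the per-key ordering guarantee the question asks for.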

Comment 5

ID: 812938 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 12:30 Selected Answer: - Upvotes: 3

Answer A: Apache Kafka
Key words: ingestion and delivery together (a combination of Pub/Sub for ingestion, with delivery via Dataflow plus a database in GCP).
Seeking to an offset in a topic, i.e. reprocessing a specific part of a topic partition, is not possible in Pub/Sub, which is designed for come-and-go consumption of a topic.
Per-key ordering means messages with the same key can be processed in order, since Kafka assigns them to the same partition.

Comment 6

ID: 680285 User: aquevedos91 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 27 Mar 2023 02:58 Selected Answer: - Upvotes: 1

It should be C, because it is always better to choose Google services.

81. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 156

Sequence
238
Discussion ID
17211
Source URL
https://www.examtopics.com/discussions/google/view/17211-exam-professional-data-engineer-topic-1-question-156/
Posted By
-
Posted At
March 22, 2020, 7:31 a.m.

Question

You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service. What should you do?

  • A. Deploy a Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
  • B. Deploy a Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
  • C. Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://
  • D. Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 15 comments Click to expand

Comment 1

ID: 128646 User: Rajuuu Badges: Highly Voted Relative Date: 5 years, 2 months ago Absolute Date: Thu 07 Jan 2021 08:45 Selected Answer: - Upvotes: 7

Answer is A: Cloud Dataproc for a managed, cloud-native deployment, and HDD for a cost-effective solution.

Comment 2

ID: 70613 User: Rajokkiyam Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 06:03 Selected Answer: - Upvotes: 5

Answer A

Comment 3

ID: 1165800 User: mothkuri Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 04 Sep 2024 16:33 Selected Answer: A Upvotes: 2

Options A is the right answer.
Option B using SSD persistent disk which will add more cost than default HDD
Option C & D are out of scope.

Comment 4

ID: 1015862 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 17:01 Selected Answer: A Upvotes: 2

Dataproc Managed Service: Dataproc is a fully managed service for running Apache Hadoop and Spark. It provides ease of management and automation.

Standard Persistent Disk: Using standard persistent disks for Dataproc workers ensures durability and is cost-effective compared to SSDs.

Preemptible Workers: By using 50% preemptible workers, you can significantly reduce costs while maintaining fault tolerance. Preemptible VMs are cheaper but can be preempted by Google, so having a mix of preemptible and non-preemptible workers provides cost savings with redundancy.

Storing Data in Cloud Storage: Storing data in Cloud Storage is highly durable, scalable, and cost-effective. It also makes data accessible to Dataproc clusters, and you can leverage native connectors for reading data from Cloud Storage.

Changing References to gs://: Updating your scripts to reference data in Cloud Storage using gs:// ensures that your jobs work seamlessly with the cloud storage infrastructure.
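As an illustration of that last migration step, a script-rewriting helper might look like the following. This is a hypothetical sketch, not a Google-provided tool; the function name, regex, and bucket name are all assumptions:

```python
import re

def hdfs_to_gcs(script: str, bucket: str) -> str:
    """Rewrite hdfs://<namenode[:port]>/path URIs in a job script to
    gs://<bucket>/path so jobs read from Cloud Storage instead of HDFS."""
    return re.sub(r"hdfs://[^/\s]*", f"gs://{bucket}", script)

job = "spark.read.parquet('hdfs://namenode:8020/data/events')"
migrated = hdfs_to_gcs(job, "my-migration-bucket")
```

After the rewrite the Spark call reads `gs://my-migration-bucket/data/events`; the Cloud Storage connector on Dataproc handles the `gs://` scheme transparently.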

Comment 5

ID: 893862 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 14:29 Selected Answer: A Upvotes: 1

Apache Hadoop -> Dataproc or Compute Engine with proper SW installation
cost-effective -> use standard persistent disk + store data in Cloud Storage
batch -> Dataproc or Compute Engine with proper SW installation
managed service -> Dataproc

Comment 6

ID: 665834 User: MounicaN Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 08:33 Selected Answer: A Upvotes: 1

it says cost effective , hence no SSD

Comment 7

ID: 486430 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 04:17 Selected Answer: - Upvotes: 2

Correct: A

Comment 8

ID: 408401 User: LORETOGOMEZ Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 17 Jan 2022 14:14 Selected Answer: - Upvotes: 2

Correct : A
Option B is usefull if you use HDFS, and in this case as you use preemtible machines it isn't worth use SSD disks.

Comment 9

ID: 292804 User: ArunSingh1028 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Tue 17 Aug 2021 19:08 Selected Answer: - Upvotes: 1

Answer - B

Comment 10

ID: 266684 User: StelSen Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Wed 14 Jul 2021 02:52 Selected Answer: - Upvotes: 5

Look at this link. https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
At first look I chose Option B, as the page mentions SSD is cost-effective in most cases. But after reading the whole page, it also mentions that for batch workloads HDD is suggested as long as the workload is not read-heavy. So I changed my mind to Option A (I assumed this is not a read-heavy process).

Comment 10.1

ID: 643562 User: NM1212 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 07 Feb 2023 05:01 Selected Answer: - Upvotes: 1

Caution about the link you provided as a reference: it's intended for Bigtable, which is Google Cloud's low-latency solution and a totally different requirement. Mentioning this only because on first read I thought SSD was the obvious choice.
Per the link below, SSD may not be required unless there is a low-latency requirement or a high I/O requirement. Since the question does not specify anything like that, A looks correct.
https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc

Comment 11

ID: 218915 User: Alasmindas Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Fri 14 May 2021 04:07 Selected Answer: - Upvotes: 4

Option B - SSD disks, reasons:-
The question asks "fault-tolerant and cost-effective as possible for long-running batch job".
3 Key words are - fault tolerant / cost effective / long running batch jobs..

The cost-efficiency part of the question could be addressed by 50% preemptible workers and by storing the data in Cloud Storage rather than HDFS.
For long-running batch jobs, and as a standard approach for Dataproc, we should always go with SSD disk types per Google's recommendations.

Comment 11.1

ID: 238814 User: beedle Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Wed 09 Jun 2021 01:10 Selected Answer: - Upvotes: 2

where is the proof...show me the link?

Comment 11.1.1

ID: 303270 User: Raangs Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 04 Sep 2021 10:03 Selected Answer: - Upvotes: 6

https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
As per this, SSD is only recommended for I/O-intensive workloads. Nowhere does the question mention high I/O intensity, and it asks for the most cost-effective option, so there is no need for SSD.
I will go with A.

Comment 12

ID: 161216 User: Ravivarma4786 Badges: - Relative Date: 5 years ago Absolute Date: Fri 19 Feb 2021 06:39 Selected Answer: - Upvotes: 2

Answer is B: for long-running jobs SSD is suitable. HDD maintenance will be an additional charge for long-running jobs.

82. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 230

Sequence
239
Discussion ID
130174
Source URL
https://www.examtopics.com/discussions/google/view/130174-exam-professional-data-engineer-topic-1-question-230/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 12:43 p.m.

Question

You need to modernize your existing on-premises data strategy. Your organization currently uses:
• Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication.
• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps.

You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?

  • A. Use Bigtable for your large workloads, with connections to Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
  • B. Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
  • C. Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Convert your ETL pipelines to Dataflow.
  • D. Use Dataproc to migrate your Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Use Cloud Data Fusion to visually design and deploy your ETL pipelines.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 6 comments Click to expand

Comment 1

ID: 1113840 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 16:01 Selected Answer: B Upvotes: 7

Straight forward

Comment 2

ID: 1166109 User: datasmg Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Thu 05 Sep 2024 00:16 Selected Answer: B Upvotes: 1

You can use Dataproc for the Apache Hadoop processing, Cloud Storage to replace HDFS, and Cloud Composer (built on Apache Airflow) as the orchestrator.

Comment 3

ID: 1156775 User: cuadradobertolinisebastiancami Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 22:29 Selected Answer: B Upvotes: 2

Airflow -> composer
Minimum changes -> Dataproc

Comment 4

ID: 1152734 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 17:57 Selected Answer: B Upvotes: 1

Option B

Comment 5

ID: 1121546 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 11:16 Selected Answer: B Upvotes: 2

definitely B

Comment 6

ID: 1112717 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 11:43 Selected Answer: B Upvotes: 4

Cloud Composer -> Airflow

83. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 266

Sequence
248
Discussion ID
130217
Source URL
https://www.examtopics.com/discussions/google/view/130217-exam-professional-data-engineer-topic-1-question-266/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 6:23 p.m.

Question

You are building a streaming Dataflow pipeline that ingests noise level data from hundreds of sensors placed near construction sites across a city. The sensors measure noise level every ten seconds, and send that data to the pipeline when levels reach above 70 dBA. You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes, but the window ends when no data has been received for 15 minutes. What should you do?

  • A. Use session windows with a 15-minute gap duration.
  • B. Use session windows with a 30-minute gap duration.
  • C. Use hopping windows with a 15-minute window, and a thirty-minute period.
  • D. Use tumbling windows with a 15-minute window and a fifteen-minute .withAllowedLateness operator.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 1126707 User: datapassionate Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Fri 19 Jan 2024 14:47 Selected Answer: A Upvotes: 11

to detect average noise levels from sensors, the best approach is to use session windows with a 15-minute gap duration (Option A). Session windows are ideal for cases like this where the events (sensor data) are sporadic. They group events that occur within a certain time interval (15 minutes in your case) and a new window is started if no data is received for the duration of the gap. This matches your requirement to end the window when no data is received for 15 minutes, ensuring that the average noise level is calculated over periods of continuous data

Comment 1.1

ID: 1152223 User: ashdam Badges: - Relative Date: 2 years ago Absolute Date: Fri 16 Feb 2024 21:57 Selected Answer: - Upvotes: 3

But you are not fulfilling this requirement "You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes". I would say C

Comment 2

ID: 1144622 User: saschak94 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 16:23 Selected Answer: A Upvotes: 10

You need a window that starts when data for a sensor arrives and ends when there's a gap in the data. That rules out hopping and tumbling windows.

-> Windows need to stay open as long as data keeps arriving (30+ mins)
-> Windows should close when no data has been received for 15 mins -> gap of 15 mins

Comment 3

ID: 1271741 User: shanks_t Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 18:27 Selected Answer: A Upvotes: 3

The problem requires detecting average noise levels when data is received for more than 30 minutes, but the window should end when no data has been received for 15 minutes.
Session windows are ideal for this scenario because:
They are designed to capture bursts of activity followed by periods of inactivity.
They dynamically size based on the data received, which fits well with the variable duration of noise events.
The gap duration can be set to define when a session ends.
The 15-minute gap duration aligns perfectly with the requirement to end the window when no data has been received for 15 minutes.
Session windows will naturally extend beyond 30 minutes if data keeps coming in, satisfying the requirement to detect levels for durations of more than 30 minutes.
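The session-window semantics described above can be sketched in plain Python. This is an illustration of the grouping logic only, not actual Apache Beam code (in Beam you would apply `WindowInto(Sessions(gap_size=15 * 60))`); the function name and sample readings are made up:

```python
def sessionize(readings, gap_secs=15 * 60):
    """Group (timestamp_secs, dBA) readings into sessions separated by
    gaps of more than gap_secs; return (duration_secs, avg_dBA) per session."""
    sessions, current = [], []
    for t, v in sorted(readings):
        # A gap longer than gap_secs closes the current session.
        if current and t - current[-1][0] > gap_secs:
            sessions.append(current)
            current = []
        current.append((t, v))
    if current:
        sessions.append(current)
    return [(s[-1][0] - s[0][0], sum(v for _, v in s) / len(s)) for s in sessions]

# A reading every 10s for ~40 minutes, a 20-minute silence, then 5 more minutes.
burst1 = [(i * 10, 75.0) for i in range(240)]
burst2 = [(2390 + 1200 + i * 10, 80.0) for i in range(30)]
# Keep only sessions that lasted more than 30 minutes, per the question.
result = [(dur, avg) for dur, avg in sessionize(burst1 + burst2) if dur > 30 * 60]
```

The 20-minute silence exceeds the 15-minute gap, so the two bursts form separate sessions, and only the ~40-minute session survives the duration filter.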

Comment 4

ID: 1267360 User: viciousjpjp Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 00:51 Selected Answer: D Upvotes: 1

D. Using a 15-minute window with a 15-minute tumbling window withAllowedLateness is the most suitable option for the following reasons:

Flexibility: By allowing a 15-minute delay, it can accommodate various situations such as network latency or sensor failures.
Processing efficiency: Using a fixed window improves processing efficiency.
Compliance with conditions: The window ends if no data is received for 15 minutes, meeting the specified condition.
Implementation points:

.withAllowedLateness: This operator allows delayed events to be included in the current window.
Trigger: When 30 minutes of data is collected, a trigger event is generated, and the average value is calculated based on this event.
Watermark: By setting a watermark, processing of old data can be terminated.

Comment 5

ID: 1261139 User: JamesKarianis Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 05 Aug 2024 19:38 Selected Answer: A Upvotes: 3

Without a doubt A: https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#session-windows

Comment 6

ID: 1260013 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Fri 02 Aug 2024 21:56 Selected Answer: A Upvotes: 1

The requirements are
- recieve data for a duration of MORE than 30 minutes
- end the window based on inactivity

Comment 7

ID: 1213921 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 19 May 2024 20:20 Selected Answer: - Upvotes: 3

Correct answer: A.
Use a session window to capture data and create an aggregation when the session is longer than 30 minutes.
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#session-windows
https://beam.apache.org/releases/javadoc/2.6.0/org/apache/beam/sdk/transforms/windowing/Sessions.html

Comment 8

ID: 1213609 User: f74ca0c Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 19 May 2024 07:05 Selected Answer: C Upvotes: 1

C- Running average: https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#hopping-windows

Comment 9

ID: 1190843 User: joao_01 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 07 Apr 2024 09:51 Selected Answer: - Upvotes: 2

Guys, I think it's B. I was considering C, but C calculates the 15-minute window's data every 30 minutes. That's not what the question wants. The question wants a solution that gets the average over a 30-minute window, so it's B.

Look at this relating to C:
"Use hopping windows with a 15-minute window, and a thirty-minute period" --> Wrong
(IS DIFFERENT THEN)
"Use hopping windows with a 30-minute window, and a 15-minute period" --> Right.

Thats why I think the B is the right answer.

Comment 9.1

ID: 1198392 User: joao_01 Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Fri 19 Apr 2024 09:10 Selected Answer: - Upvotes: 1

Actually, I think with B the window will never close, because the probability of a 30-minute period of inactivity is very low. In that case I think option A is the more correct one.

Comment 10

ID: 1167664 User: 342f1c6 Badges: - Relative Date: 2 years ago Absolute Date: Thu 07 Mar 2024 06:04 Selected Answer: C Upvotes: 1

To take running averages of data, use hopping windows. You can use one-minute hopping windows with a thirty-second period to compute a one-minute running average every thirty seconds.

Comment 11

ID: 1158699 User: kck6ra4214wm Badges: - Relative Date: 2 years ago Absolute Date: Sun 25 Feb 2024 13:30 Selected Answer: A Upvotes: 3

https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#session-windows
A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window.
So, we need a session window with a 15-minute gap duration (option A).

Comment 12

ID: 1131837 User: imiu Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 25 Jan 2024 17:11 Selected Answer: C Upvotes: 1

https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#hopping-windows To take running averages of data, use hopping windows. You can use one-minute hopping windows with a thirty-second period to compute a one-minute running average every thirty seconds.

Comment 13

ID: 1121771 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:02 Selected Answer: D Upvotes: 2

Option D to me, It aligns with the specified criteria for detecting the average noise level within a 30-minute duration and handling the end of the window when no data is received for 15 minutes.

Comment 13.1

ID: 1130667 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 14:54 Selected Answer: - Upvotes: 1

Agree D.
Data comes -> 30-minute duration.
Data didn't come for 15 minutes -> 15-minute duration.

Comment 14

ID: 1119356 User: BIGQUERY_ALT_ALT Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 05:38 Selected Answer: D Upvotes: 2

OPTION D is correct for the specific scenario where we want to detect the average noise level for a duration of more than 30 minutes but end the window when no data has been received for 15 minutes.

Explanation:

- Tumbling windows are non-overlapping windows, and in this case, you want to capture data continuously for 30-minute intervals.

- Using a tumbling window with a 15-minute window size aligns with your requirement to detect the average noise level for a duration of more than 30 minutes.

- Adding a .withAllowedLateness operator with a duration of fifteen minutes ensures that the window will still consider late-arriving data within that time frame. After fifteen minutes of no data, the window will be closed, and any late-arriving data will not be considered.

Option A and B invalid as they capture fixed logic with 15 or 30 mins. Option C captures only 15 min average with 30 min trigger hence not suitable.

Comment 15

ID: 1117492 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 14:02 Selected Answer: C Upvotes: 2

Hopping windows (called sliding windows in Apache Beam).
To take running averages of data, use hopping windows.

Comment 16

ID: 1112964 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 18:23 Selected Answer: C Upvotes: 3

C. Use hopping windows with a 15-minute window, and a thirty-minute period.

84. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 276

Sequence
249
Discussion ID
130223
Source URL
https://www.examtopics.com/discussions/google/view/130223-exam-professional-data-engineer-topic-1-question-276/
Posted By
scaenruy
Posted At
Jan. 3, 2024, 7:21 p.m.

Question

You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time. What should you do?

  • A. Submit duplicate pipelines in two different zones by using the --zone flag.
  • B. Set the pipeline staging location as a regional Cloud Storage bucket.
  • C. Specify a worker region by using the --region flag.
  • D. Create an Eventarc trigger to resubmit the job in case of zonal failure when submitting the job.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 5 comments Click to expand

Comment 1

ID: 1121828 User: Matt_108 Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 16:04 Selected Answer: C Upvotes: 10

Option C: https://cloud.google.com/dataflow/docs/guides/pipeline-workflows#zonal-failures

Comment 2

ID: 1117788 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 19:33 Selected Answer: C Upvotes: 6

- Specifying a worker region (instead of a specific zone) allows Google Cloud's Dataflow service to manage the distribution of resources across multiple zones within that region
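A sketch of the recommended submission (project, bucket, and pipeline names are hypothetical): passing --region without a worker zone lets the Dataflow service pick a healthy zone within the region at job submission time.

```shell
# Submit the batch pipeline with a worker region, not a zone.
python my_batch_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp \
  --staging_location=gs://my-bucket/staging
# Avoid pinning workers with --worker_zone (or the legacy --zone flag),
# which would reintroduce a single zonal point of failure.
```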

Comment 3

ID: 1155319 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 06:42 Selected Answer: C Upvotes: 1

Option C

Comment 4

ID: 1117605 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 15:35 Selected Answer: C Upvotes: 1

https://cloud.google.com/dataflow/docs/guides/pipeline-workflows#zonal-failures

Comment 5

ID: 1113016 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 18:21 Selected Answer: C Upvotes: 2

C. Specify a worker region by using the --region flag.

85. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 290

Sequence
250
Discussion ID
130293
Source URL
https://www.examtopics.com/discussions/google/view/130293-exam-professional-data-engineer-topic-1-question-290/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 11:15 a.m.

Question

You are designing a messaging system by using Pub/Sub to process clickstream data with an event-driven consumer app that relies on a push subscription. You need to configure the messaging system that is reliable enough to handle temporary downtime of the consumer app. You also need the messaging system to store the input messages that cannot be consumed by the subscriber. The system needs to retry failed messages gradually, avoiding overloading the consumer app, and store the failed messages after a maximum of 10 retries in a topic. How should you configure the Pub/Sub subscription?

  • A. Increase the acknowledgement deadline to 10 minutes.
  • B. Use immediate redelivery as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.
  • C. Use exponential backoff as the subscription retry policy, and configure dead lettering to the same source topic with maximum delivery attempts set to 10.
  • D. Use exponential backoff as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.

Suggested Answer

D

Comments 6 comments

Comment 1

ID: 1117956 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Wed 10 Jul 2024 00:13 Selected Answer: D Upvotes: 17

- Exponential Backoff: This retry policy gradually increases the delay between retries, which helps to avoid overloading the consumer app.
- Dead Lettering to a Different Topic: Configuring dead lettering sends messages that couldn't be processed after the specified number of delivery attempts (10 in this case) to a separate topic. This allows for handling of failed messages without interrupting the regular flow of new messages.
- Maximum Delivery Attempts Set to 10: This setting ensures that the system retries each message up to 10 times before considering it a failure and moving it to the dead letter topic.
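The configuration in option D maps directly onto subscription flags (topic, endpoint, and subscription names here are hypothetical):

```shell
# Push subscription with exponential backoff between retries and a
# dead-letter topic after a maximum of 10 delivery attempts.
gcloud pubsub topics create clickstream-dead-letter
gcloud pubsub subscriptions create clickstream-push-sub \
  --topic=clickstream-topic \
  --push-endpoint=https://consumer.example.com/pubsub/push \
  --min-retry-delay=10s \
  --max-retry-delay=600s \
  --dead-letter-topic=clickstream-dead-letter \
  --max-delivery-attempts=10
```

Note that the Pub/Sub service account also needs publisher permission on the dead-letter topic and subscriber permission on the source subscription for dead lettering to work.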

Comment 2

ID: 1155709 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 16:43 Selected Answer: D Upvotes: 1

Option D

Comment 3

ID: 1121905 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 17:25 Selected Answer: D Upvotes: 1

Option D - agree with other comments explanation

Comment 4

ID: 1115989 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 15:59 Selected Answer: - Upvotes: 3

D. Use exponential backoff as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10

This is the most suitable option for graceful retries and for storing failed messages.

Comment 5

ID: 1113524 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 10:15 Selected Answer: D Upvotes: 2

D. Use exponential backoff as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.

Comment 5.1

ID: 1116023 User: Smakyel79 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 16:52 Selected Answer: - Upvotes: 2

Exponential backoff will help in managing the load on the consumer app by gradually increasing the delay between retries. Configuring dead lettering to a different topic after a maximum of 10 delivery attempts ensures that undeliverable messages are stored separately, preventing them from being retried endlessly and cluttering the main message flow.

86. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 299

Sequence
252
Discussion ID
130328
Source URL
https://www.examtopics.com/discussions/google/view/130328-exam-professional-data-engineer-topic-1-question-299/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 2:34 p.m.

Question

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

  • A. 1. Create two regional Cloud Storage buckets, one in the us-central1 region and one in the us-south1 region.
    2. Have the upstream process write data to the us-central1 bucket. Use the Storage Transfer Service to copy data hourly from the us-central1 bucket to the us-south1 bucket.
    3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in that region.
    4. In case of regional failure, redeploy your Dataproc clusters to the us-south1 region and read from the bucket in that region instead.
  • B. 1. Create a Cloud Storage bucket in the US multi-region.
    2. Run the Dataproc cluster in a zone in the us-central1 region, reading data from the US multi-region bucket.
    3. In case of a regional failure, redeploy the Dataproc cluster to the us-central2 region and continue reading from the same bucket.
  • C. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
    2. Enable turbo replication.
    3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the us-south1 region.
    4. In case of a regional failure, redeploy your Dataproc cluster to the us-south1 region and continue reading from the same bucket.
  • D. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
    2. Enable turbo replication.
    3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.
    4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.

Suggested Answer

D

Comments 4 comments

Comment 1

ID: 1115434 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 20:57 Selected Answer: D Upvotes: 6

- Rapid Replication: Turbo replication ensures near-real-time data synchronization between regions, achieving an RPO of 15 minutes or less.
- Minimal Latency: Dataproc clusters can read from the bucket in the same region, minimizing data transfer latency and optimizing performance.
- Disaster Recovery: In case of regional failure, Dataproc clusters can seamlessly redeploy to the other region and continue reading from the same bucket, ensuring business continuity.
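A sketch of the bucket setup in option D (the bucket name is hypothetical): us-central1 plus us-south1 forms a configurable dual-region, and setting the RPO to ASYNC_TURBO enables turbo replication with its 15-minute replication target.

```shell
# Dual-region bucket spanning us-central1 and us-south1, with turbo
# replication enabled to meet the 15-minute RPO requirement.
gcloud storage buckets create gs://my-dual-region-bucket \
  --location=us \
  --placement=us-central1,us-south1 \
  --rpo=ASYNC_TURBO
```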

Comment 2

ID: 1156075 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 03:20 Selected Answer: D Upvotes: 1

Option D

Comment 3

ID: 1121930 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 17:45 Selected Answer: D Upvotes: 2

Option D, answers all needs from the request

Comment 4

ID: 1113705 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 13:34 Selected Answer: D Upvotes: 3

D.
1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
2. Enable turbo replication.
3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.
4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.

87. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 212

Sequence
262
Discussion ID
129859
Source URL
https://www.examtopics.com/discussions/google/view/129859-exam-professional-data-engineer-topic-1-question-212/
Posted By
e70ea9e
Posted At
Dec. 30, 2023, 9:36 a.m.

Question

You are troubleshooting your Dataflow pipeline that processes data from Cloud Storage to BigQuery. You have discovered that the Dataflow worker nodes cannot communicate with one another. Your networking team relies on Google Cloud network tags to define firewall rules. You need to identify the issue while following Google-recommended networking security practices. What should you do?

  • A. Determine whether your Dataflow pipeline has a custom network tag set.
  • B. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag.
  • C. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 on the subnet used by Dataflow workers.
  • D. Determine whether your Dataflow pipeline is deployed with the external IP address option enabled.

Suggested Answer

B

Comments 6 comments

Comment 1

ID: 1116098 User: MaxNRG Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 18:46 Selected Answer: B Upvotes: 10

The best approach would be to check if there is a firewall rule allowing traffic on TCP ports 12345 and 12346 for the Dataflow network tag.
Dataflow uses TCP ports 12345 and 12346 for communication between worker nodes. Using network tags and associated firewall rules is a Google-recommended security practice for controlling access between Compute Engine instances like Dataflow workers.

So the key things to check would be:

1. Ensure your Dataflow pipeline is using the Dataflow network tag on the worker nodes. This tag is applied by default unless overridden.
2. Check if there is a firewall rule allowing TCP 12345 and 12346 ingress and egress traffic for instances with the Dataflow network tag. If not, add the rule.

Options A, C and D relate to other networking aspects but do not directly address the Google recommended practice of using network tags and firewall rules.
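The fix described above can be sketched as a single firewall rule keyed on the Dataflow network tag (the network name is hypothetical):

```shell
# Allow Dataflow worker-to-worker traffic on TCP ports 12345 and 12346,
# scoped to instances carrying the default "dataflow" network tag.
gcloud compute firewall-rules create allow-dataflow-worker-traffic \
  --network=my-network \
  --action=allow \
  --direction=ingress \
  --target-tags=dataflow \
  --source-tags=dataflow \
  --rules=tcp:12345-12346
```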

Comment 2

ID: 1151156 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Thu 15 Aug 2024 17:27 Selected Answer: B Upvotes: 1

B. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag.

Comment 3

ID: 1121447 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 09:18 Selected Answer: B Upvotes: 1

B, check if there is a firewall rule allowing traffic on TCP ports 12345 and 12346 for the Dataflow network tag.

Comment 4

ID: 1115676 User: Smakyel79 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 09:28 Selected Answer: B Upvotes: 3

Because network tags are used and Dataflow uses TCP ports 12345 and 12346 as stated on
https://cloud.google.com/dataflow/docs/guides/routes-firewall

Comment 5

ID: 1112248 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 02 Jul 2024 21:00 Selected Answer: B Upvotes: 3

This option focuses directly on ensuring that the firewall rules are set up correctly for the network tags used by Dataflow worker nodes. It specifically addresses the potential issue of worker nodes not being able to communicate due to restrictive firewall rules blocking the necessary ports.

Comment 6

ID: 1109536 User: e70ea9e Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 08:36 Selected Answer: B Upvotes: 2

Focus on Network Tags:

Adheres to the recommended practice of using network tags for firewall configuration, enhancing security and flexibility.
Avoids targeting specific subnets, which can be less secure and harder to manage.

88. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 34

Sequence
273
Discussion ID
79622
Source URL
https://www.examtopics.com/discussions/google/view/79622-exam-professional-data-engineer-topic-1-question-34/
Posted By
ducc
Posted At
Sept. 3, 2022, 12:44 a.m.

Question

Flowlogistic Case Study -

Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background -
The company started as a regional trucking company, and then expanded into other logistics market. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment -
Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server: user data, inventory, static data
3 physical servers
- Cassandra: metadata, tracking messages
10 Kafka servers: tracking message aggregation and batch insert
✑ Application servers: customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat: Java services
- Nginx: static content
- Batch servers
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN): SQL Server storage
- Network-attached storage (NAS): image storage, logs, backups
✑ 10 Apache Hadoop /Spark servers
- Core Data Lake
- Data analysis workloads
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements -
Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment

CEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.

CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to
BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

  • A. Store the common data in BigQuery as partitioned tables.
  • B. Store the common data in BigQuery and expose authorized views.
  • C. Store the common data encoded as Avro in Google Cloud Storage.
  • D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Suggested Answer

C

Comments 20 comments

Comment 1

ID: 1050790 User: rtcpost Badges: Highly Voted Relative Date: 2 years, 4 months ago Absolute Date: Sun 22 Oct 2023 17:20 Selected Answer: C Upvotes: 6

C. Store the common data encoded as Avro in Google Cloud Storage.

This approach allows for interoperability between BigQuery and Hadoop/Spark as Avro is a commonly used data serialization format that can be read by both systems. Data stored in Google Cloud Storage can be accessed by both BigQuery and Dataproc, providing a bridge between the two environments. Additionally, you can set up data transformation pipelines in Dataproc to work with this data.
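One way to make the shared Avro files queryable from BigQuery, while Dataproc Spark jobs read the same files directly over gs:// via the Cloud Storage connector (bucket, dataset, and table names here are hypothetical):

```shell
# Build an external-table definition from the self-describing Avro
# files, then create a BigQuery external table over them.
bq mkdef --source_format=AVRO "gs://my-data-lake/common/*.avro" > /tmp/common_def.json
bq mk --external_table_definition=/tmp/common_def.json mydataset.common_data
```

This keeps a single copy of the common data in Cloud Storage, so neither workload has to maintain a duplicate.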

Comment 2

ID: 1259026 User: iooj Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Wed 31 Jul 2024 19:54 Selected Answer: C Upvotes: 1

In BigQuery we can use BigLake tables based on Avro for the historical data, plus Spark stored procedures.

Comment 3

ID: 1230087 User: dhvanil Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 13 Jun 2024 22:28 Selected Answer: - Upvotes: 1

Data lake, fully managed, data analytics, and storing structured and unstructured data are the keywords, so the answer is GCS: Option C.

Comment 4

ID: 1087727 User: JOKKUNO Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 15:59 Selected Answer: - Upvotes: 4

Given the scenario described for Flowlogistic's requirements and technical environment, the most suitable option for storing common data that is used by both Google BigQuery and Apache Hadoop/Spark workloads is:

C. Store the common data encoded as Avro in Google Cloud Storage.

Comment 5

ID: 967572 User: nescafe7 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 31 Jul 2023 00:57 Selected Answer: D Upvotes: 3

To simplify the question, Apache Hadoop and Spark workloads that cannot be moved to BigQuery can be handled by DataProc. So the correct answer is D.

Comment 6

ID: 961322 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 10:12 Selected Answer: B Upvotes: 2

B is the right answer. Common data will lie in BigQuery but will be accessible via the views with SQL in Hadoop workloads.

Comment 7

ID: 818845 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Thu 23 Feb 2023 07:11 Selected Answer: B Upvotes: 4

C should be the correct answer. However, please note that Google just released the BigQuery Connector for Hadoop, so if they ask the same question today, B will be the correct answer.
A could be correct too, but I cannot see why it has to be partitioned

Comment 7.1

ID: 942687 User: res3 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 13:31 Selected Answer: - Upvotes: 3

If you check the https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery, it unloads the BQ data to GCS, utilizes it, and then deletes it from the GCS. Storing common data twice (at BQ and GCS) will not be the best option compared to 'C' (using GCS as the main common dataset).

Comment 8

ID: 769007 User: korntewin Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 08 Jan 2023 02:12 Selected Answer: C Upvotes: 1

I would vote for C as it can be used for analysis with Bigquery. Furthermore, Hadoop workload can also be transferred to dataproc connected to GCS.

Comment 9

ID: 743516 User: DGames Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Tue 13 Dec 2022 03:31 Selected Answer: B Upvotes: 1

Answer B looks OK, because the question asks for a place to store common data that can be used by both workloads; since BigQuery is the primary analytical tool, it would be the best option and make it easy to analyze the common data.

Comment 10

ID: 731620 User: kelvintoys93 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 30 Nov 2022 15:48 Selected Answer: - Upvotes: 3

"Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data" - BigQuery cant take unstructured data so A and B are out.
Storing data in HDFS storage is never recommended unless latency is a requirement, so D is out.

That leaves us with GCS. Answer is C

Comment 10.1

ID: 763233 User: tunstila Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 16:03 Selected Answer: - Upvotes: 1

I thought you can now store unstructured data in BigQuery via the object tables announced during Google NEXT 2022... If that's possible, does that make B a better choice?

Comment 11

ID: 721297 User: drunk_goat82 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 18 Nov 2022 14:26 Selected Answer: C Upvotes: 2

BigQuery can use federated queries to connect to the Avro data in GCS while running Spark jobs on it. If you duplicate the data, you have to manage both data sets.

Comment 12

ID: 719822 User: wan2three Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 16 Nov 2022 17:13 Selected Answer: - Upvotes: 1

A
They wanted BigQuery, and a connector is all you need to run Hadoop or Spark. The Hadoop migration can be done using Dataproc.

Comment 12.1

ID: 719823 User: wan2three Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 16 Nov 2022 17:14 Selected Answer: - Upvotes: 2

Also, apparently they want all the data in one place, and they want BigQuery.

Comment 13

ID: 719733 User: gudiking Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 16 Nov 2022 15:28 Selected Answer: C Upvotes: 1

C as it can be used as an external table from BigQuery and with the Cloud Storage Connector it can be used by the Spark workloads (running in Dataproc)

Comment 14

ID: 718021 User: solar_maker Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 14 Nov 2022 15:30 Selected Answer: C Upvotes: 1

C, as both are capable of reading Avro, but the customer does not know what they want to do with the data yet.

Comment 15

ID: 710912 User: Leelas Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Fri 04 Nov 2022 05:56 Selected Answer: D Upvotes: 1

The technical requirements clearly mention that they need to migrate the existing Hadoop cluster, for which a Dataproc cluster is the replacement.

Comment 16

ID: 675089 User: vishal0202 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 21 Sep 2022 13:49 Selected Answer: - Upvotes: 4

C is the answer: Avro data can be accessed by Spark as well.

Comment 17

ID: 657882 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 00:44 Selected Answer: C Upvotes: 3

The answer is C

89. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 98

Sequence
275
Discussion ID
17255
Source URL
https://www.examtopics.com/discussions/google/view/17255-exam-professional-data-engineer-topic-1-question-98/
Posted By
-
Posted At
March 22, 2020, 4:23 p.m.

Question

Your company needs to upload their historic data to Cloud Storage. The security rules don't allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day. What should they do?

  • A. Execute gsutil rsync from the on-premises servers.
  • B. Use Dataflow and write the data to Cloud Storage.
  • C. Write a job template in Dataproc to perform the data transfer.
  • D. Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.

Suggested Answer

A

Comments 18 comments

Comment 1

ID: 76417 User: itche_scratche Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Sun 19 Apr 2020 16:22 Selected Answer: - Upvotes: 14

Should be A. Dataflow runs in the cloud, which is external; the rules "don't allow access from external IPs to their on-premises resources", so no Dataflow.

Comment 2

ID: 513453 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Thu 30 Dec 2021 15:44 Selected Answer: A Upvotes: 6

A is the best and simplest option, IF there is no problem having gsutil on our servers.
B and C: no way; the communication would go from GCP to on-premises, which the question says is not allowed.
D is valid; we could send the files with FTP, BUT FTP is not secure, and we would still need to move them to Cloud Storage afterwards, which is not detailed in the answer.
https://cloud.google.com/storage/docs/gsutil/commands/rsync

Comment 3

ID: 1255778 User: shroffshivangi Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Fri 26 Jul 2024 17:56 Selected Answer: - Upvotes: 2

The gcloud storage command is the standard tool for small- to medium-sized transfers over a typical enterprise-scale network, from a private data center or from another cloud provider to Google Cloud.

Comment 4

ID: 775219 User: Besss Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 14 Jan 2023 10:29 Selected Answer: A Upvotes: 1

A is correct

Comment 5

ID: 692896 User: somnathmaddi Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 12 Oct 2022 11:15 Selected Answer: A Upvotes: 2

Should be A

Comment 6

ID: 518494 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 20:05 Selected Answer: A Upvotes: 1

Without "The security rules don't allow access from external IPs to their on-premises resources", B would be the answer.

Comment 7

ID: 508539 User: am2005 Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 24 Dec 2021 15:42 Selected Answer: - Upvotes: 1

I am confused. Which one is correct, A or B?

Comment 8

ID: 504509 User: hendrixlives Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 19 Dec 2021 00:22 Selected Answer: A Upvotes: 1

A is correct.

Comment 9

ID: 463204 User: Chelseajcole Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 16 Oct 2021 19:16 Selected Answer: - Upvotes: 2

This is the link:https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#gsutil_for_smaller_transfers_of_on-premises_data

Comment 10

ID: 435577 User: manocha_01887 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Mon 30 Aug 2021 16:28 Selected Answer: - Upvotes: 4

How will rsync handle the private network?
"..The security rules don't allow access from external IPs to their on-premises resources.."

Comment 11

ID: 396174 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 01 Jul 2021 18:17 Selected Answer: - Upvotes: 2

Vote for A

Comment 12

ID: 308407 User: daghayeghi Badges: - Relative Date: 5 years ago Absolute Date: Thu 11 Mar 2021 23:22 Selected Answer: - Upvotes: 4

A:
https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets#options_available_from_google

Comment 12.1

ID: 493219 User: maurodipa Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Fri 03 Dec 2021 16:59 Selected Answer: - Upvotes: 2

How could gsutil connect to Cloud Storage if there is no access from external IPs? Should I understand that there is no access from outside to inside, but it is possible to send from inside to outside?

Comment 12.1.1

ID: 504148 User: szefco Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 18 Dec 2021 11:20 Selected Answer: - Upvotes: 4

Yes. There is no access to on-prem from external IPs, but on-prem can talk to external services.

Comment 13

ID: 163514 User: Ravivarma4786 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Sat 22 Aug 2020 11:57 Selected Answer: - Upvotes: 4

gsutil rsync will be used to transfer the files. Answer: A

Comment 14

ID: 162854 User: haroldbenites Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Fri 21 Aug 2020 12:08 Selected Answer: - Upvotes: 3

A is correct

Comment 15

ID: 128845 User: VishalB Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Tue 07 Jul 2020 13:06 Selected Answer: - Upvotes: 6

Ans : A
The gsutil rsync command makes the contents under dst_url the same as the contents under src_url, by copying any missing files/objects (or those whose data has changed), and (if the -d option is specified) deleting any extra files/objects. src_url must specify a directory, bucket, or bucket subdirectory
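A sketch of the workflow described above (local paths and bucket name are hypothetical). Only outbound connections from on-premises to Cloud Storage are needed, so the "no access from external IPs" rule is not violated:

```shell
# Initial bulk upload of the historic data, parallelized and recursive.
gsutil -m rsync -r /data/historic gs://my-archive-bucket/historic
# Daily incremental sync of new data, e.g. via a cron entry such as:
# 0 2 * * * gsutil -m rsync -r /data/daily gs://my-archive-bucket/daily
```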

Comment 16

ID: 128817 User: Devx198912233 Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Tue 07 Jul 2020 12:34 Selected Answer: - Upvotes: 4

option A
https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets#options_available_from_google

90. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 52

Sequence
278
Discussion ID
16822
Source URL
https://www.examtopics.com/discussions/google/view/16822-exam-professional-data-engineer-topic-1-question-52/
Posted By
rickywck
Posted At
March 17, 2020, 5:01 a.m.

Question

You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud
Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?

  • A. Restrict the Google Cloud Storage bucket so only you can see the files
  • B. Grant the Project Owner role to a service account, and run the job with it
  • C. Use a service account with the ability to read the batch files and to write to BigQuery
  • D. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery

Suggested Answer

C

Comments 20 comments

Comment 1

ID: 68022 User: digvijay Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Thu 25 Mar 2021 03:15 Selected Answer: - Upvotes: 34

A is wrong: if only I can see the bucket, no automation is possible; besides, something also needs to launch the Dataproc job.
B is too much; it does not follow security best practices.
C has one point missing… you still need to submit Dataproc jobs.
In D, the Viewer role will not be able to submit Dataproc jobs; the rest is OK.

Thus… the only one that would work is B! BUT this service account has too many permissions. It should have Dataproc Editor, BigQuery write, and bucket read.

Comment 1.1

ID: 123573 User: dambilwa Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Wed 30 Jun 2021 16:54 Selected Answer: - Upvotes: 15

Hence - Contextually, Option [C] looks to be the right fit

Comment 1.2

ID: 453569 User: retep007 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 19:47 Selected Answer: - Upvotes: 5

C doesn't need permission to submit Dataproc jobs; it's the workload SA. The job can be submitted by any other identity.

Comment 2

ID: 65007 User: rickywck Badges: Highly Voted Relative Date: 4 years, 12 months ago Absolute Date: Wed 17 Mar 2021 05:01 Selected Answer: - Upvotes: 31

Should be C

Comment 3

ID: 961438 User: Mathew106 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 11:56 Selected Answer: B Upvotes: 1

We need permissions for submitting dataproc jobs and writing to BigQuery. Project Owner will fix all of that even though it's not a good solution. The rest won't work at all.

Comment 4

ID: 869024 User: Adswerve Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 13 Apr 2024 04:55 Selected Answer: C Upvotes: 4

C
Project Owner is too much, violates the principle of least privilege

Comment 5

ID: 788744 User: PolyMoe Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 14:07 Selected Answer: C Upvotes: 4

C. Use a service account with the ability to read the batch files and to write to BigQuery

It is best practice to use service accounts with the least privilege necessary to perform a specific task when automating jobs. In this case, the job needs to read the batch files from Cloud Storage and write the results to BigQuery. Therefore, you should create a service account with the ability to read from the Cloud Storage bucket and write to BigQuery, and use that service account to run the job.

Comment 6

ID: 766378 User: Mkumar43 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 09:23 Selected Answer: B Upvotes: 1

B works for the given requirement

Comment 7

ID: 750763 User: Krish6488 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 20 Dec 2023 11:33 Selected Answer: - Upvotes: 2

Least privilege principle: Option C. The job can be submitted or triggered using cron or Composer, which uses another SA with a different set of privileges.

Comment 8

ID: 744655 User: DGames Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 14 Dec 2023 03:35 Selected Answer: B Upvotes: 2

B, because we need to run the job. Option C mentions permissions to read and write, but nothing about running the job. Granting Project Owner to a service account covers running the job as well as the read and write tasks.

Comment 9

ID: 590321 User: ThomasChoy Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 23 Apr 2023 03:43 Selected Answer: C Upvotes: 2

The answer is C, because a service account is the best way to access the BigQuery API if your application can run jobs associated with service credentials rather than an end-user's credentials, such as a batch processing pipeline.
https://cloud.google.com/bigquery/docs/authentication

Comment 10

ID: 522020 User: Bhawantha Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 12 Jan 2023 10:15 Selected Answer: C Upvotes: 4

Data owners can't create jobs or queries -> B out
We need a service account -> D out
Granting access only to me does not solve the problem -> A out
The answer is C (minimum rights to perform the job).

Comment 11

ID: 516728 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 15:41 Selected Answer: C Upvotes: 1

"taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud
Dataproc cluster, and depositing the results into Google BigQuery"

Comment 12

ID: 511457 User: prasanna77 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 28 Dec 2022 19:27 Selected Answer: - Upvotes: 1

C should be okay; since he is already a Project Owner, I guess the compute service account created will have access to run the jobs.

Comment 13

ID: 504405 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 18 Dec 2022 19:57 Selected Answer: C Upvotes: 1

C,
Project Owner role to a service account - is too much

Comment 14

ID: 487267 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 11:51 Selected Answer: - Upvotes: 6

Why are there so many wrong answers? ExamTopics.com, are you enjoying the paid subscriptions while giving random answers from people?
Ans: C

Comment 15

ID: 462781 User: anji007 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 15 Oct 2022 21:30 Selected Answer: - Upvotes: 3

Ans: C
See this: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#dataproc_service_accounts_2

Comment 16

ID: 442108 User: Blobby Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 09 Sep 2022 19:41 Selected Answer: - Upvotes: 4

C, as the service account is invoked to read the data from GCS and write to BQ once transformed via Dataproc. This assumes Dataproc can inherit the SA's authorization to perform the transform and propagate it.
B seems to violate the key IAM principle of enforcing least privilege:
https://cloud.google.com/iam/docs/recommender-overview

Comment 17

ID: 392357 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 27 Jun 2022 23:05 Selected Answer: - Upvotes: 4

Vote for 'C"

Comment 17.1

ID: 402133 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 20:06 Selected Answer: - Upvotes: 3

Vote for B (though it's too much access). C has one access missing (i.e., Dataproc job execution); thus B is correct.

91. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 277

Sequence
281
Discussion ID
130262
Source URL
https://www.examtopics.com/discussions/google/view/130262-exam-professional-data-engineer-topic-1-question-277/
Posted By
scaenruy
Posted At
Jan. 4, 2024, 5:14 a.m.

Question

You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

  • A. Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore.
  • B. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore.
  • C. Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.
  • D. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 11 comments Click to expand

Comment 1

ID: 1117838 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 21:43 Selected Answer: B Upvotes: 13

- Hopping Window: Hopping windows are fixed-sized, overlapping intervals.
- Aggregate data over the last 30 seconds, every 2 seconds, as hopping windows allow for overlapping data analysis.
- Memorystore: Ideal for low-latency access required for real-time visualization and analysis.
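The "last 30 seconds, every 2 seconds" pattern these bullets describe can be sketched in plain Python. In the Beam Python SDK this corresponds to `beam.WindowInto(window.SlidingWindows(size=30, period=2))`; the function below is a simplified stand-in, not the Beam API, and the event values are made up:

```python
def hopping_windows(events, size=30, period=2):
    """Assign (timestamp, value) events to overlapping fixed-size windows
    and sum each window. Windows start every `period` seconds and span
    [start, start + size), so each event lands in size // period windows
    (window starts before t=0 are omitted to keep the sketch short)."""
    windows = {}
    if not events:
        return windows
    horizon = max(t for t, _ in events)
    start = 0
    while start <= horizon:
        vals = [v for t, v in events if start <= t < start + size]
        if vals:
            windows[start] = sum(vals)
        start += period
    return windows

# Hypothetical demand events as (event_time_seconds, ride_requests):
demand = hopping_windows([(0, 1), (3, 2), (5, 4)], size=4, period=2)
```

Because consecutive windows overlap, the same event contributes to several window sums, which is what distinguishes a hopping window from a tumbling one.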

Comment 1.1

ID: 1193893 User: anushree09 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 11 Apr 2024 17:18 Selected Answer: - Upvotes: 2

Hopping windows are sliding windows. It makes sense to use them over a tumbling (fixed) window because the ask is to aggregate the last 30 seconds of data every 2 seconds.

Comment 2

ID: 1254091 User: Jeyaraj Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 06:00 Selected Answer: - Upvotes: 1

OPTION A. (IGNORE MY Previous Comment)

Tumbling windows are the best choice for this ride-hailing app because they provide accurate 2-second aggregations without the complexities of overlapping data. This is crucial for real-time decision-making and ensuring accurate visualization of supply and demand.
Hopping windows introduce potential inaccuracies and complexity, making them less suitable for this scenario. While they can be useful in other situations, they are not the optimal choice for real-time aggregation with strict accuracy requirements.

Comment 3

ID: 1254089 User: Jeyaraj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 05:59 Selected Answer: - Upvotes: 1

Option B.

Tumbling windows are the best choice for this ride-hailing app because they provide accurate 2-second aggregations without the complexities of overlapping data. This is crucial for real-time decision-making and ensuring accurate visualization of supply and demand.
Hopping windows introduce potential inaccuracies and complexity, making them less suitable for this scenario. While they can be useful in other situations, they are not the optimal choice for real-time aggregation with strict accuracy requirements.

Comment 4

ID: 1155334 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 08:26 Selected Answer: B Upvotes: 1

Option B

Comment 5

ID: 1152502 User: ashdam Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 12:13 Selected Answer: - Upvotes: 1

hopping window is clear but memorystore vs bigquery?? Why memorystore and not bigquery?

Comment 5.1

ID: 1153259 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 13:25 Selected Answer: - Upvotes: 1

Memorystore is an in-memory key-value database for use cases such as real-time applications.
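To make the low-latency point concrete: a dashboard writer would publish the freshest aggregates to Memorystore through a Redis client. Below is a minimal in-process stand-in with the same set/get surface; the key name and values are made up for illustration:

```python
class FakeRedis:
    """In-process stand-in for a Memorystore (Redis) client, exposing the
    same set/get surface as redis-py purely for illustration."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = str(value).encode()  # redis stores bytes
        return True
    def get(self, key):
        return self._data.get(key)  # None when the key is absent

# A dashboard writer would publish the freshest per-area demand count:
r = FakeRedis()  # in production: redis.Redis(host=..., port=6379)
r.set("demand:area42", 17)  # hypothetical key for the example
latest = r.get("demand:area42")
```

The dashboard then reads single keys by point lookup, which is what makes this faster for visualization than scanning an analytical store like BigQuery.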

Comment 5.1.1

ID: 1196162 User: ea2023 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 15 Apr 2024 20:09 Selected Answer: - Upvotes: 1

Let me complete your answer: Memorystore vs. BigQuery in this case is a matter of low latency, where Memorystore is the winner; but if precision over a large amount of data had been stated, then BigQuery would've been the best choice.

Comment 6

ID: 1136007 User: Jordan18 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 18:57 Selected Answer: - Upvotes: 1

why not D?

Comment 6.1

ID: 1159573 User: RenePetersen Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 10:59 Selected Answer: - Upvotes: 1

Because BigQuery is not a low latency system...

Comment 7

ID: 1113331 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 05:14 Selected Answer: B Upvotes: 2

B. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore.

92. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 135

Sequence
285
Discussion ID
17234
Source URL
https://www.examtopics.com/discussions/google/view/17234-exam-professional-data-engineer-topic-1-question-135/
Posted By
-
Posted At
March 22, 2020, 10:49 a.m.

Question

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
✑ Decoupling producer from consumer
✑ Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
✑ Near real-time SQL query
✑ Maintain at least 2 years of historical data, which will be queried with SQL
Which pipeline should you use to meet these requirements?

  • A. Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.
  • B. Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.
  • C. Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
  • D. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 17 comments Click to expand

Comment 1

ID: 487119 User: JG123 Badges: Highly Voted Relative Date: 4 years, 3 months ago Absolute Date: Fri 26 Nov 2021 07:43 Selected Answer: - Upvotes: 11

Why are there so many wrong answers? ExamTopics.com, are you enjoying the paid subscriptions while giving random answers from people?
Ans: D

Comment 2

ID: 116129 User: AJKumar Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Mon 22 Jun 2020 09:28 Selected Answer: - Upvotes: 5

A and B can be eliminated right away. Between C and D, C has no BigQuery. Answer D.

Comment 3

ID: 1252845 User: edre Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Mon 22 Jul 2024 07:47 Selected Answer: D Upvotes: 1

Google recommended approach

Comment 4

ID: 1016195 User: juliorevk Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 23:32 Selected Answer: D Upvotes: 2

D because pub/sub decouples while dataflow processes; Cloud Storage can be used to store the raw ingested data indefinitely and BQ can be used to query.

Comment 5

ID: 1015428 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 04:18 Selected Answer: D Upvotes: 3

Here's how this option aligns with your requirements:
Decoupling Producer from Consumer: Cloud Pub/Sub provides a decoupled messaging system where the producer publishes events, and consumers (like Dataflow) can subscribe to these events. This decoupling ensures flexibility and scalability.
Space and Cost-Efficient Storage: Storing data in Avro format is more space-efficient than JSON, and Cloud Storage is a cost-effective storage solution. Additionally, Cloud Pub/Sub and Dataflow allow you to process and transform data efficiently, reducing storage costs.
Near Real-time SQL Query: By using Dataflow to transform and load data into BigQuery, you can achieve near real-time data availability for SQL queries. BigQuery is well-suited for ad-hoc SQL queries and provides excellent query performance.
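The decoupling point above can be illustrated with an in-process queue standing in for the Pub/Sub topic. This is a sketch only: Pub/Sub additionally provides durability, fan-out subscriptions, and at-least-once delivery that a plain queue does not, and the event payloads are made up:

```python
import json
import queue
import threading

topic = queue.Queue()  # stand-in for the Pub/Sub topic

def producer(events):
    # The web app only publishes; it never waits on any consumer.
    for e in events:
        topic.put(json.dumps(e))
    topic.put(None)  # end-of-stream sentinel (demo only; Pub/Sub has none)

def consumer(out):
    # Dataflow-like subscriber: pulls at its own pace, transforms, "loads".
    while (msg := topic.get()) is not None:
        row = json.loads(msg)
        row["value_doubled"] = row["value"] * 2  # sample transform
        out.append(row)

rows = []
worker = threading.Thread(target=consumer, args=(rows,))
worker.start()
producer([{"value": 1}, {"value": 2}])
worker.join()
```

The producer finishes as soon as it has published; how fast the consumer drains the backlog is an independent concern, which is exactly what "decoupling producer from consumer" buys you.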

Comment 6

ID: 982737 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 16 Aug 2023 17:48 Selected Answer: D Upvotes: 1

Should be D

Comment 7

ID: 918054 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 11:07 Selected Answer: D Upvotes: 1

For sure D

Comment 8

ID: 911808 User: forepick Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 01 Jun 2023 08:35 Selected Answer: D Upvotes: 1

D is the most suitable; however, the stored format should be JSON, and Avro isn't JSON…

Comment 9

ID: 796941 User: OberstK Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 03 Feb 2023 10:54 Selected Answer: D Upvotes: 1

Correct - D

Comment 10

ID: 786041 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 02:29 Selected Answer: - Upvotes: 1

I believe this was also on the GCP PCA exam as well! ;)

Comment 11

ID: 762718 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 17:52 Selected Answer: - Upvotes: 1

D. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

Comment 12

ID: 717226 User: mbacelar Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 13 Nov 2022 11:00 Selected Answer: D Upvotes: 1

For sure D

Comment 13

ID: 676510 User: clouditis Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 22 Sep 2022 21:23 Selected Answer: - Upvotes: 1

D it is!

Comment 14

ID: 553074 User: Prasanna_kumar Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 19:08 Selected Answer: - Upvotes: 2

Answer is D

Comment 15

ID: 520324 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 09 Jan 2022 17:25 Selected Answer: D Upvotes: 4

D:
Cloud Pub/Sub, Cloud Dataflow, Cloud Storage, BigQuery https://cloud.google.com/solutions/stream-analytics/

Comment 16

ID: 519525 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 08 Jan 2022 14:03 Selected Answer: D Upvotes: 1

OMG only D

Comment 17

ID: 422055 User: sandipk91 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Mon 09 Aug 2021 12:43 Selected Answer: - Upvotes: 4

Answer is D for sure

93. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 204

Sequence
286
Discussion ID
89456
Source URL
https://www.examtopics.com/discussions/google/view/89456-exam-professional-data-engineer-topic-1-question-204/
Posted By
Atnafu
Posted At
Nov. 30, 2022, 11:08 p.m.

Question

You want to create a machine learning model using BigQuery ML and create an endpoint for hosting the model using Vertex AI. This will enable the processing of continuous streaming data in near-real time from multiple vendors. The data may contain invalid values. What should you do?

  • A. Create a new BigQuery dataset and use streaming inserts to land the data from multiple vendors. Configure your BigQuery ML model to use the "ingestion" dataset as the training data.
  • B. Use BigQuery streaming inserts to land the data from multiple vendors where your BigQuery dataset ML model is deployed.
  • C. Create a Pub/Sub topic and send all vendor data to it. Connect a Cloud Function to the topic to process the data and store it in BigQuery.
  • D. Create a Pub/Sub topic and send all vendor data to it. Use Dataflow to process and sanitize the Pub/Sub data and stream it to BigQuery.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 1244970 User: anyone_99 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 17:15 Selected Answer: - Upvotes: 1

Why is the answer A? After paying $44 I am getting wrong answers.

Comment 1.1

ID: 1253310 User: 987af6b Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 23 Jul 2024 00:28 Selected Answer: - Upvotes: 2

The discussion is where the real answer is.

Comment 2

ID: 1122035 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 21:36 Selected Answer: D Upvotes: 2

Option D

Comment 3

ID: 960761 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 21:08 Selected Answer: D Upvotes: 2

Option D -Dataflow provides a scalable and flexible way to process and clean the incoming data in real-time before loading it into BigQuery.

Comment 4

ID: 763431 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 02:13 Selected Answer: - Upvotes: 1

D. Create a Pub/Sub topic and send all vendor data to it. Use Dataflow to process and sanitize the Pub/Sub data and stream it to BigQuery.

Comment 5

ID: 739979 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 09 Dec 2022 11:08 Selected Answer: D Upvotes: 2

D is the best option to sanitize the data, so it's D.

Comment 6

ID: 734966 User: jkhong Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 04 Dec 2022 10:31 Selected Answer: D Upvotes: 2

Better to use Pub/Sub for streaming and reading message data.

Dataflow ParDo can perform filtering of data
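The kind of per-element filtering ParDo does can be sketched in plain Python. The record fields below are hypothetical; in Beam this logic would live in a DoFn passed to `beam.ParDo`, or in `beam.Filter`:

```python
def sanitize(records):
    """ParDo-style element-wise cleanup: drop records with missing fields
    or invalid values, and normalize the rest. In Beam this logic would
    live in a DoFn; here it is plain Python to show only the per-element
    decisions."""
    for rec in records:
        if "vendor" not in rec or "value" not in rec:
            continue                      # drop malformed records
        if rec["value"] is None or rec["value"] < 0:
            continue                      # drop invalid values
        yield {**rec, "value": float(rec["value"])}  # normalize the type

raw = [{"vendor": "a", "value": 3},      # valid
       {"vendor": "b", "value": -1},     # invalid value
       {"value": 2},                     # missing vendor
       {"vendor": "c", "value": None}]   # invalid value
clean = list(sanitize(raw))
```

Only sanitized rows would then be streamed into BigQuery, which is why option D beats landing raw vendor data directly.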

Comment 7

ID: 732427 User: vidts Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 01 Dec 2022 09:56 Selected Answer: D Upvotes: 2

It's D

Comment 8

ID: 732037 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 30 Nov 2022 23:08 Selected Answer: - Upvotes: 2

Answer is D

94. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 177

Sequence
289
Discussion ID
79540
Source URL
https://www.examtopics.com/discussions/google/view/79540-exam-professional-data-engineer-topic-1-question-177/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 8:30 p.m.

Question

You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting speed and processing requirements?

  • A. Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
  • B. Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
  • C. Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
  • D. Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 686992 User: devaid Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Wed 05 Apr 2023 17:12 Selected Answer: C Upvotes: 14

The answer is C, but not because of the SQL syntax, as you can perfectly use SparkSQL on Dataproc reading files from GCS. It's because of the "serverless" requirement.

Comment 2

ID: 1124344 User: GCP001 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Tue 16 Jul 2024 16:09 Selected Answer: A Upvotes: 2

A) Looks more suitable: a serverless approach for handling and performance.

Comment 3

ID: 1101930 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 20:49 Selected Answer: C Upvotes: 3

Option C is the best approach to meet the stated requirements. Here's why:

BigQuery SQL provides a fast, scalable, and serverless method for transforming structured data, easier to develop than PySpark.
Directly ingesting the raw Cloud Storage data into BigQuery avoids needing an intermediate processing cluster like Dataproc.
Transforming the data via BigQuery SQL queries will be faster than PySpark, especially since the data is already loaded into BigQuery.
Writing the transformed results to a new BigQuery table keeps the original raw data intact and provides a clean output.
So migrating to BigQuery SQL for transformations provides a fully managed serverless architecture that can significantly expedite development and reduce pipeline runtime versus PySpark. The ability to avoid clusters and conduct transformations completely within BigQuery is the most efficient approach here.
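The ELT flow described above (load raw data, transform with SQL, write the result to a new table) can be sketched with sqlite3 standing in for BigQuery. Table and column names are made up; in BigQuery the transform would be a `CREATE TABLE ... AS SELECT` in Standard SQL rather than SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "Ingest" the raw structured data (in BigQuery: a load job from Cloud Storage).
con.execute("CREATE TABLE raw_events (station TEXT, reading REAL)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("s1", 2.0), ("s1", 4.0), ("s2", 10.0)])
# Transform with SQL and write the result to a new table (ELT, not ETL).
con.execute("""
    CREATE TABLE station_avg AS
    SELECT station, AVG(reading) AS avg_reading
    FROM raw_events
    GROUP BY station
""")
rows = con.execute(
    "SELECT station, avg_reading FROM station_avg ORDER BY station").fetchall()
```

The original raw table stays intact while the transformed output lands in a new table, matching the "write the transformations to a new table" wording of option C.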

Comment 4

ID: 948733 User: MoeHaydar Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 09:25 Selected Answer: C Upvotes: 3

Note: Dataproc by itself is not serverless
https://cloud.google.com/dataproc-serverless/docs/overview

Comment 5

ID: 876431 User: Prudvi3266 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 21 Oct 2023 13:01 Selected Answer: C Upvotes: 3

because of serverless nature

Comment 6

ID: 813462 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 20:02 Selected Answer: - Upvotes: 1

Answer C: needing to set up a SQL-based job means the transformation is not very complex. And BigQuery SQL is faster than Spark SQL (Google claims).
However, I will run a test myself to check it.

Comment 7

ID: 786579 User: maci_f Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 13:53 Selected Answer: A Upvotes: 2

In the GCP Machine Learning Engineer practice question (Q4) there's the same question with similar answers and the correct answer is A since B "is incorrect, here transformation is done on Cloud SQL, which wouldn’t scale the process" and C "is incorrect as this process wouldn’t scale the data transformation routine. And, it is always better to transform data during ingestion": https://medium.com/@gcpguru/google-google-cloud-professional-machine-learning-engineer-practice-questions-part-1-3ee4a2b3f0a4

Comment 7.1

ID: 912556 User: evanfebrianto Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 02 Dec 2023 08:40 Selected Answer: - Upvotes: 2

Dataproc is not a serverless tool unless it mentions "Dataproc Serverless" explicitly.

Comment 8

ID: 725322 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 18:11 Selected Answer: - Upvotes: 1

C
D is incorrect because you are rebuilding your batch pipeline for structured data on Google Cloud.

Comment 8.1

ID: 725324 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 18:14 Selected Answer: - Upvotes: 2

A could be the answer if it were Dataproc Serverless and required no code conversion. Dataproc Serverless supports Scala, PySpark, Spark SQL, and SparkR.

Comment 9

ID: 675940 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 12:33 Selected Answer: C Upvotes: 4

This same question is there on Google's Professional Machine Learning Engineer, Question 4
Answer is C.

Comment 10

ID: 667255 User: Wasss123 Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Mar 2023 19:54 Selected Answer: C Upvotes: 2

I choose C
BigQuery SQL is more performant but more expensive. Here, it's a performance issue (time reduction).
Source : https://medium.com/paypal-tech/comparing-bigquery-processing-and-spark-dataproc-4c90c10e31ac

Comment 11

ID: 665291 User: John_Pongthorn Badges: - Relative Date: 3 years ago Absolute Date: Fri 10 Mar 2023 12:59 Selected Answer: - Upvotes: 1

C is the most likely; BigQuery is serverless and SQL.
D's Dataflow is also serverless, but it is wrong in using the Python SDK; if it used Beam SQL instead, it would be correct.

Comment 12

ID: 662165 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 10:28 Selected Answer: - Upvotes: 2

Answer C

Comment 13

ID: 657939 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 03:57 Selected Answer: A Upvotes: 1

A
- You have to maintain PySpark code -> Dataproc

Comment 13.1

ID: 658099 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 08:32 Selected Answer: - Upvotes: 1

After thinking a while, I think the question is not clear enough, to be honest.

Comment 13.1.1

ID: 658100 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 08:33 Selected Answer: - Upvotes: 1

A or C. I go for C because they said they want to use SQL syntax...

Comment 14

ID: 657717 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 21:30 Selected Answer: C Upvotes: 3

C. Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.

Keys: "Serverless" and "SQL"

Comment 14.1

ID: 657753 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 22:15 Selected Answer: - Upvotes: 3

Changing answer to A as this is a new question referring to Dataproc Serverless. Dataproc Serverless for Spark batch workloads supports Spark SQL. Why modify ETL to ELT and convert PySpark to BigQuery SQL when it can be similar to a lift-and-shift?

Comment 14.1.1

ID: 725321 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 18:09 Selected Answer: - Upvotes: 3

Dataproc is different from Dataproc Serverless; this question is talking about Dataproc.
By the way, Dataproc Serverless supports both PySpark and Spark SQL, so no conversion is needed.
C is the best answer.

Comment 14.2

ID: 658180 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 10:30 Selected Answer: - Upvotes: 1

The question said "use SQL syntax"
C might still be correct.

95. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 58

Sequence
292
Discussion ID
17090
Source URL
https://www.examtopics.com/discussions/google/view/17090-exam-professional-data-engineer-topic-1-question-58/
Posted By
-
Posted At
March 21, 2020, 11:31 a.m.

Question

You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?

  • A. Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
  • B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
  • C. Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
  • D. Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 184289 User: SteelWarrior Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 22 Sep 2020 11:11 Selected Answer: - Upvotes: 58

Should go with B, for two reasons. First, it is a cleaner approach, with a single job handling the calibration before the data is used in the pipeline. Second, doing this step in later stages can be complex, and maintaining those jobs in the future will become challenging.

Comment 1.1

ID: 419280 User: Yiouk Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Tue 03 Aug 2021 17:21 Selected Answer: - Upvotes: 7

B. different MR jobs execute in series, adding 1 more job makes sense in this case.

Comment 2

ID: 68716 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 28 Mar 2020 03:52 Selected Answer: - Upvotes: 20

Answer: A
Description: My take on this is that for sensor calibration you just need to update the transform function, rather than creating a whole new MapReduce job and storing/passing the values to the next job.

Comment 2.1

ID: 368336 User: Jphix Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Fri 28 May 2021 00:38 Selected Answer: - Upvotes: 11

It's B. A would involve changing every single job (notice it said jobS, plural, not a single job). If that is computationally intensive, which it is, you're needlessly repeating a computationally intense process several times. SteelWarrior and YuriP are right on this one.

Comment 2.1.1

ID: 1212886 User: mark1223jkh Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 17 May 2024 14:31 Selected Answer: - Upvotes: 1

Why all jobs? Change only the first job for calibration, right?

Comment 3

ID: 1242670 User: Marwan95 Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 11:38 Selected Answer: A Upvotes: 1

I'll choose A. Why? Because the process already takes days, and adding another step will increase the time further.

Comment 4

ID: 824474 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 07:53 Selected Answer: - Upvotes: 1

What kinds of sensor calibration exist? I don't understand how computation in the pipeline would be expensive due to calibration being omitted.

Comment 5

ID: 784886 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 03:45 Selected Answer: - Upvotes: 1

B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.

This approach ensures that sensor calibration is systematically carried out every time the ETL process runs, as the new MapReduce job is responsible for applying the calibration before the data is processed by the other steps. All data is therefore calibrated before being analyzed, avoiding the omission of the calibration step in the future.
It also lets you chain all other MapReduce jobs after this one, so that the calibrated data is used in all downstream jobs.
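
The chained-jobs idea behind option B can be sketched in plain Python. The function names (`calibrate`, `transform_a`, `transform_b`) are hypothetical stand-ins for the real MapReduce jobs; the point is only that calibration runs once, first, and every downstream stage consumes already-calibrated data:

```python
# Sketch: run calibration once, up front, then chain the remaining transforms.
# All names and the calibration factor are illustrative, not from the question.

def calibrate(record, factor=1.02):
    # First job in the chain: apply the sensor calibration to the raw reading.
    return {**record, "value": record["value"] * factor}

def transform_a(record):
    # Downstream job: no longer needs to know calibration exists.
    return {**record, "value": round(record["value"], 2)}

def transform_b(record):
    # Another downstream job, consuming already-calibrated data.
    return {**record, "flagged": record["value"] > 100}

def pipeline(raw_records):
    # Option B: calibration is the first job; everything else chains after it.
    calibrated = [calibrate(r) for r in raw_records]
    stage_a = [transform_a(r) for r in calibrated]
    return [transform_b(r) for r in stage_a]

print(pipeline([{"sensor": "s1", "value": 100.0}]))
```

Contrast with option A, where the `factor` would have to be threaded into every transform function separately, duplicating the work and the maintenance burden.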

Comment 5.1

ID: 784888 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 03:45 Selected Answer: - Upvotes: 1

Option A is not ideal: it would be time-consuming to modify all the transformMapReduce jobs to apply sensor calibration before doing anything else, and there is a risk of introducing bugs or errors.
Option C is not ideal: it relies on users applying sensor calibration themselves, which would be inefficient and could introduce inconsistencies in the data.
Option D is not ideal: it would require a lot of simulation and testing to develop an algorithm that predicts the variance of the data output accurately, and it may not be as accurate as calibrating the sensor data directly.

Comment 6

ID: 747157 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:16 Selected Answer: B Upvotes: 1

It is a much cleaner approach.

Comment 7

ID: 745463 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 14 Dec 2022 22:26 Selected Answer: B Upvotes: 1

The best approach is to make calibration a separate job: if we need to tune the calibration later, it can be maintained without worrying about all the other jobs.

Comment 8

ID: 737609 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 09:49 Selected Answer: B Upvotes: 1

Should be B. My reasoning: this is like an anti-corruption layer, and that's a good practice.
A: if you modify your transformMapReduce jobs, they will be harder to test and debug, so it's a bad practice.
C: the idea of introducing a manual operation is an anti-pattern and has a lot of problems.
D: it's overkill and doesn't make sense in this scenario.

Comment 9

ID: 525247 User: ZIMARAKI Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 16 Jan 2022 21:32 Selected Answer: B Upvotes: 3

SteelWarrior explanation is correct :)

Comment 10

ID: 524329 User: lord_ryder Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sat 15 Jan 2022 19:01 Selected Answer: B Upvotes: 1

SteelWarrior explanation is correct

Comment 11

ID: 516754 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 16:06 Selected Answer: B Upvotes: 1

SteelWarrior explanation is correct

Comment 12

ID: 504003 User: hendrixlives Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 18 Dec 2021 04:55 Selected Answer: B Upvotes: 1

SteelWarrior's answer is correct

Comment 13

ID: 462794 User: anji007 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Fri 15 Oct 2021 22:21 Selected Answer: - Upvotes: 1

Ans: B
Adding a new job at the beginning of the chain makes more sense than updating the existing chain of jobs.

Comment 14

ID: 392953 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Mon 28 Jun 2021 15:10 Selected Answer: - Upvotes: 5

Vote for 'B' (introduce a new job) over 'A' (modify the existing jobs).

Comment 15

ID: 149623 User: YuriP Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Mon 03 Aug 2020 10:24 Selected Answer: - Upvotes: 5

Should be B. It's a data-quality step which has to go right after raw ingest. Otherwise you repeat the same step an unknown number of times (see "job_s_" in A), possibly for no reason, thereby extending ETL time.

96. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 143

Sequence
294
Discussion ID
79678
Source URL
https://www.examtopics.com/discussions/google/view/79678-exam-professional-data-engineer-topic-1-question-143/
Posted By
ducc
Posted At
Sept. 3, 2022, 6:44 a.m.

Question

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?

  • A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name
  • B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name
  • C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code
  • D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 22 comments Click to expand

Comment 1

ID: 739061 User: odacir Badges: Highly Voted Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 12:48 Selected Answer: D Upvotes: 15

It's D. → Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy.
The new version involves major changes. Stopping with Drain and then launching the new code is the safer way.
We recommend that you attempt only smaller changes to your pipeline's windowing, such as changing the duration of fixed- or sliding-time windows. Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#changing_windowing

Comment 1.1

ID: 745375 User: maggieee Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 14 Jun 2023 18:41 Selected Answer: - Upvotes: 2

Since updating the job as in A does a compatibility check, wouldn't you want to try that first? Then if the compatibility check fails then you proceed to drain current pipeline and then launch new pipeline (Answer D)?

So A would be the correct answer, and if the compatibility check fails, you proceed to D.

https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#CCheck

Comment 1.1.1

ID: 1013102 User: ckanaar Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 16:18 Selected Answer: - Upvotes: 1

You're right in your reasoning, but since the documentation specifically uses this example for stopping and draining, it's safe to assume that the compatibility check will always fail with these adjustments. Therefore, we can go straight to D.

Furthermore, answer A doesn't state: "Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing name, if the compatibility check fails, THEN proceed to stopping the pipeline with the drain option", so in itself it is not the right answer if the check fails.

Comment 2

ID: 1109734 User: patitonav Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 13:01 Selected Answer: D Upvotes: 1

D seems the right way to go

Comment 3

ID: 1104475 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 24 Jun 2024 06:33 Selected Answer: D Upvotes: 1

Option A is the first approach to try, as it allows for an in-flight update with minimal disruption. However, if the changes in the new version of the pipeline are not compatible with an in-flight update (due to significant changes in windowing or triggering), then option D should be used. The Drain option ensures a graceful shutdown of the existing pipeline, reducing the risk of data loss, and then a new job can be started with the updated code.

Comment 4

ID: 1099828 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 15:40 Selected Answer: D Upvotes: 1

A is not an option as "You want to ensure that no data is lost during the update. ":
Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#change_windowing

Comment 5

ID: 1015447 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:06 Selected Answer: D Upvotes: 1

Drain Option: The "Drain" option allows the existing Dataflow job to complete processing of any in-flight data before stopping the job. This ensures that no data is lost during the transition to the new version.
Create a New Job: After draining the existing job, you create a new Cloud Dataflow job with the updated code. This new job starts fresh and continues processing data from where the old job left off.

Option A (updating the inflight pipeline with the --update option) may not guarantee no data loss, as the update could disrupt the existing job's operation and potentially cause data loss.

Option B (updating the inflight pipeline with the --update option and a new job name) is similar to option A and may not provide data loss guarantees.

Option C (stopping the pipeline with the Cancel option and creating a new job) will abruptly stop the existing job without draining, potentially leading to data loss.

Comment 6

ID: 963660 User: knith66 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 14:28 Selected Answer: - Upvotes: 1

Look D after seeing some docs. please check the below link https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline

Comment 7

ID: 963297 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 06:11 Selected Answer: D Upvotes: 1

I will go with option D. If you want to minimize the impact of the update, then option A is the best option; however, if you are not concerned about a temporary interruption in processing, then option D is also valid.
Option A - Pros: does not stop the pipeline, so no data is lost. Cons: requires you to create a new version of the pipeline.
Option B - Pros: creates a new job with the updated code, so you do not have to update the running pipeline. Cons: can lead to data loss if the new job does not process all of the data that was in the running pipeline.
Option C - Pros: stops the pipeline and drains any data that is currently in flight, so no data is lost. Cons: causes a temporary interruption in processing.

Comment 8

ID: 837774 User: midgoo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 13 Sep 2023 08:42 Selected Answer: D Upvotes: 3

A is not recommended for major changes in the pipeline.

Comment 9

ID: 812424 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 17 Aug 2023 21:49 Selected Answer: - Upvotes: 1

Answer A:
```gcloud dataflow jobs update <JOB_ID> --update <GCS_PATH_TO_UPDATED_PIPELINE> --region <REGION>```
The --update flag does not miss any data, and you can execute this command even while your pipeline is running. It's safe and fast; you can continuously make changes and re-run this command, no problem.
Stop and Drain is required when you want to test the pipeline and stop it without losing data.

Comment 9.1

ID: 820886 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 19:55 Selected Answer: - Upvotes: 3

Answer D: as per the latest documentation (02/2023), Google has removed the update flag.

Comment 10

ID: 753198 User: jkhong Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 22 Jun 2023 10:52 Selected Answer: D Upvotes: 4

agree with odacir

Comment 11

ID: 734224 User: hauhau Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sat 03 Jun 2023 05:38 Selected Answer: A Upvotes: 2

vote A
D: Drain doesn't update the Dataflow job; it just stops it and preserves the data.
A: replaces the existing job and preserves the data.
(When you update your job, the Dataflow service performs a compatibility check between your currently-running job and your potential replacement job. The compatibility check ensures that things like intermediate state information and buffered data can be transferred from your prior job to your replacement job.)

https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline

Comment 12

ID: 727307 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 26 May 2023 06:12 Selected Answer: - Upvotes: 1

D
A is not correct because the Dataflow service retains the job name but runs the replacement job with an updated Job ID.
Description:
When you update a job on the Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Dataflow service retains the job name, but runs the replacement job with an updated Job ID. This process can cause downtime while the existing job stops, the compatibility check runs, and the new job starts.'
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#python:~:text=When%20you%20update%20a,has%20the%20following%20transforms%3A
D is correct
Drain → clone → update → run

Comment 12.1

ID: 727324 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 26 May 2023 06:32 Selected Answer: - Upvotes: 1

Changed my mind to A
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#python_2:~:text=Set%20the%20%2D%2Djob_name,%2D%2Dtransform_name_mapping%20option.

Comment 13

ID: 726036 User: drunk_goat82 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 16:20 Selected Answer: D Upvotes: 3

Changing the windowing algorithm may break the pipeline.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#changing_windowing

Comment 14

ID: 724918 User: ovokpus Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 05:53 Selected Answer: A Upvotes: 1

No, do not drain the current job.

Comment 15

ID: 722420 User: dish11dish Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sat 20 May 2023 06:05 Selected Answer: D Upvotes: 1

In this scenario the pipeline is a streaming pipeline whose windowing algorithm and triggering strategy are being changed, with no data loss allowed, so it's better to go with the Drain option as it fulfills all the preconditions described in the scenario:
1. streaming
2. code changes to the windowing algorithm and triggering strategy
3. no loss of data during the update

References:
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain
Drain a job. This method applies only to streaming pipelines. Draining a job enables the Dataflow service to finish processing the buffered data while simultaneously ceasing the ingestion of new data. For more information, see Draining a job.
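
For reference, the drain-then-relaunch workflow discussed in this thread looks roughly like the following from the CLI. The job ID, region, bucket, and template path are placeholders, and the relaunch step depends on how the pipeline is deployed (a template launch is shown as one possibility):

```shell
# Drain the running streaming job: stop ingesting new data,
# finish processing the buffered data, then stop.
gcloud dataflow jobs drain JOB_ID --region=us-central1

# Once the job reaches the DRAINED state, launch the updated
# pipeline as a new job (template-based launch shown here).
gcloud dataflow jobs run my-updated-job \
  --gcs-location=gs://my-bucket/templates/my-template \
  --region=us-central1
```

Note that Drain only guarantees buffered data is processed; messages published to the source subscription between the drain completing and the new job starting will wait in the subscription backlog until the new job reads them.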

Comment 15.1

ID: 722422 User: dish11dish Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sat 20 May 2023 06:08 Selected Answer: - Upvotes: 1

If the pipeline were batch, then the answer would have been A.

Comment 16

ID: 713132 User: Mcloudgirl Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 07 May 2023 15:15 Selected Answer: - Upvotes: 3

D: They want to preserve data and updates might not be predictable.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#changing_windowing

Comment 17

ID: 712665 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 21:39 Selected Answer: A Upvotes: 3

It's A (https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#UpdateMechanics)
D would stop the pipeline, leading to loss of new data that would be sent into the pipeline before you start the new pipeline.

97. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 83

Sequence
298
Discussion ID
17262
Source URL
https://www.examtopics.com/discussions/google/view/17262-exam-professional-data-engineer-topic-1-question-83/
Posted By
-
Posted At
March 22, 2020, 6:13 p.m.

Question

Flowlogistic Case Study -

Company Overview -
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background -
The company started as a regional trucking company, and then expanded into other logistics market. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept -
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment -
Flowlogistic architecture resides in a single data center:
✑ Databases
- 8 physical servers in 2 clusters
- SQL Server – user data, inventory, static data
- 3 physical servers
- Cassandra – metadata, tracking messages
- 10 Kafka servers – tracking message aggregation and batch insert
✑ Application servers – customer front end, middleware for order/customs
- 60 virtual machines across 20 physical servers
- Tomcat – Java services
- Nginx – static content
- Batch servers
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) – image storage, logs, backups
✑ 10 Apache Hadoop /Spark servers
- Core Data Lake
- Data analysis workloads
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements -
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements -
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment

CEO Statement -
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement -
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.

CFO Statement -
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

  • A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
  • B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD
  • C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage
  • D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage
  • E. Cloud Dataflow, Cloud SQL, and Cloud Storage

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 26 comments Click to expand

Comment 1

ID: 67663 User: digvijay Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Fri 24 Sep 2021 10:34 Selected Answer: - Upvotes: 25

Seems like A. Data is ingested from multiple sources, which might be real-time or batch.

Comment 1.1

ID: 400518 User: navemula Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 07 Jan 2023 08:39 Selected Answer: - Upvotes: 3

How is it possible to query in real time with option A? It needs Dataflow.

Comment 1.1.1

ID: 400519 User: navemula Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 07 Jan 2023 08:40 Selected Answer: - Upvotes: 2

To use Dataflow SQL it needs BigQuery

Comment 2

ID: 179533 User: mikey007 Badges: Highly Voted Relative Date: 3 years, 12 months ago Absolute Date: Mon 14 Mar 2022 23:14 Selected Answer: - Upvotes: 11

Repeated Question see ques 35

Comment 2.1

ID: 399260 User: awssp12345 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 05 Jan 2023 18:47 Selected Answer: - Upvotes: 1

These exams make people over analyse. People who vote A earlier in 35 seem to be confused here.. haha

Comment 2.2

ID: 265514 User: StelSen Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 12 Jul 2022 13:08 Selected Answer: - Upvotes: 2

Well Done mikey007, Many people have already answered as A.

Comment 3

ID: 758602 User: Kyr0 Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Thu 27 Jun 2024 14:17 Selected Answer: A Upvotes: 1

Answer is A

Comment 4

ID: 712574 User: cloudmon Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 06 May 2024 18:47 Selected Answer: A Upvotes: 1

It's A

Comment 5

ID: 653669 User: ducc Badges: - Relative Date: 2 years ago Absolute Date: Thu 29 Feb 2024 01:19 Selected Answer: A Upvotes: 1

A is the answer

Comment 6

ID: 550672 User: RRK2021 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 19 Aug 2023 05:13 Selected Answer: - Upvotes: 2

Ingest data from a variety of global sources – Cloud Pub/Sub
Process and query in real-time – Cloud Dataflow
Store the data reliably – Cloud Storage

Comment 7

ID: 518465 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 18:27 Selected Answer: A Upvotes: 1

Pub/Sub (for global ingestion from multiple sources) + Dataflow (for processing and querying) + reliable storage (GCS).

Comment 8

ID: 483867 User: lifebegins Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 22 May 2023 05:02 Selected Answer: A Upvotes: 1

Using Dataflow you can apply the proprietary analytics, and you can push the data into Cloud Storage.

Comment 9

ID: 466261 User: gcp_k Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 22 Apr 2023 19:59 Selected Answer: - Upvotes: 1

Also read the technical requirements section. Not just the last 3 lines of the question.

When you do that, you'll know the answer is Pub/Sub (for global ingestion) + Dataflow (for processing and querying) + reliable storage (GCS).

Answer is: A

Comment 10

ID: 459475 User: ManojT Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 09 Apr 2023 05:33 Selected Answer: - Upvotes: 1

Answer C:
Look at the three requirements in the question: "ingest data from a variety of global sources, process and query in real-time, and store the data reliably."
Ingest data from global sources: Pub/Sub
Process and query in real time: Cloud SQL
Store reliably: Cloud Storage
I can understand Dataflow being required when you need to analyze and transform data, but the question does not refer to it.

Comment 10.1

ID: 620107 User: cualquiernick Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 22 Dec 2023 05:31 Selected Answer: - Upvotes: 1

Cloud SQL is not suitable or efficient for storing real-time data ingested from Pub/Sub, so A is the answer.

Comment 11

ID: 445557 User: nguyenmoon Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Thu 16 Mar 2023 04:04 Selected Answer: - Upvotes: 1

Correct is A.
Kafka → replaced by Pub/Sub; streaming → Dataflow; store the data reliably, with no other condition mentioned → Cloud Storage.

Comment 12

ID: 395240 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 00:46 Selected Answer: - Upvotes: 2

Vote for 'A'

SQL - will not handle the volume

Comment 13

ID: 308137 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 17:34 Selected Answer: - Upvotes: 1

Dataflow SQL cannot output to Cloud Storage:
https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations
but the main problem is that Cloud SQL can't do the processing, so the answer is A or C.

Comment 14

ID: 189517 User: kino2020 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 29 Mar 2022 09:57 Selected Answer: - Upvotes: 4

A
I don't expect this question to come up, but if I had to write the answer, it would be A.
The problem statement says: "Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system."
That is, the current servers cannot handle the volume; it doesn't say the volume can't be calculated.

Requirement definition: the system must be able to:
- ingest data from a variety of global sources
- process and query in real-time
- store the data reliably

If you look at the Google docs, they say:

Logging to multiple systems: for example, a Google Compute Engine instance can write logs to a monitoring system, to a database for later querying, and so on.
https://cloud.google.com/pubsub/docs/overview#scenarios

stream processing with Dataflow
https://cloud.google.com/pubsub/docs/pubsub-dataflow?hl=en-419

The answer is A, since it is stated above.

Comment 15

ID: 183053 User: vakati Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Sun 20 Mar 2022 18:00 Selected Answer: - Upvotes: 3

A. SQL queries can be written in Dataflow too.
https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro#running-queries

Comment 15.1

ID: 197072 User: aleedrew Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Sun 10 Apr 2022 03:50 Selected Answer: - Upvotes: 2

Dataflow SQL cannot output to Cloud Storage, only BigQuery... so I am confused on this one.

Comment 15.1.1

ID: 301412 User: Jay3244 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 01 Sep 2022 14:55 Selected Answer: - Upvotes: 1

https://cloud.google.com/pubsub/docs/pubsub-dataflow... It is possible to load the data into Cloud Storage; refer to the docs above.

Comment 15.1.1.1

ID: 308133 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sun 11 Sep 2022 17:29 Selected Answer: - Upvotes: 1

He is correct; Dataflow SQL cannot output to Cloud Storage:
https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations

Comment 15.1.1.1.1

ID: 440450 User: Ral17 Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 19:07 Selected Answer: - Upvotes: 2

Answer should be C then?

Comment 16

ID: 177197 User: kuntal8285 Badges: - Relative Date: 4 years ago Absolute Date: Thu 10 Mar 2022 18:54 Selected Answer: - Upvotes: 1

should be E

Comment 17

ID: 175539 User: Tanmoyk Badges: - Relative Date: 4 years ago Absolute Date: Tue 08 Mar 2022 05:38 Selected Answer: - Upvotes: 2

Should be A... the data needs to feed the proprietary system, and for that Dataflow is required.

98. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 94

Sequence
299
Discussion ID
16841
Source URL
https://www.examtopics.com/discussions/google/view/16841-exam-professional-data-engineer-topic-1-question-94/
Posted By
rickywck
Posted At
March 17, 2020, 9:34 a.m.

Question

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use?

  • A. Batch job, PubSubIO, side-inputs
  • B. Streaming job, PubSubIO, JdbcIO, side-outputs
  • C. Streaming job, PubSubIO, BigQueryIO, side-inputs
  • D. Streaming job, PubSubIO, BigQueryIO, side-outputs

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 65093 User: rickywck Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Thu 17 Sep 2020 08:34 Selected Answer: - Upvotes: 31

Why not C? Without BigQueryIO how can data be written back to BigQuery?

Comment 1.1

ID: 66253 User: xq Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sun 20 Sep 2020 12:02 Selected Answer: - Upvotes: 8

C should be right

Comment 2

ID: 432918 User: pals_muthu Badges: Highly Voted Relative Date: 4 years ago Absolute Date: Sun 27 Feb 2022 12:00 Selected Answer: - Upvotes: 6

Answer is C,
You need PubSubIO and BigQueryIO for streaming the data and writing the enriched data back to BigQuery. Side-inputs are a way to enrich the data.
https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs

Comment 3

ID: 1106419 User: JOKKUNO Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Wed 26 Jun 2024 22:15 Selected Answer: - Upvotes: 2

Side inputs
In addition to the main input PCollection, you can provide additional inputs to a ParDo transform in the form of side inputs. A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform’s DoFn while processing each element.

Side inputs are useful if your ParDo needs to inject additional data when processing each element in the input PCollection, but the additional data needs to be determined at runtime (and not hard-coded). Such values might be determined by the input data, or depend on a different branch of your pipeline.
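
The side-input enrichment pattern from this question can be simulated in plain Python without Beam. In the real pipeline the events would come from PubSubIO, the reference dict would be read once from BigQuery via BigQueryIO, and it would be passed to a ParDo as a side input (for example via beam.pvalue.AsDict); the data and names below are purely illustrative:

```python
# Sketch of side-input enrichment: a small, in-memory reference table
# is made available to the function processing each main-input element.

# Stand-in for the BigQuery reference data (small enough for one worker's memory).
reference = {"SKU-1": "Widget", "SKU-2": "Gadget"}

def enrich(event, ref):
    # Stand-in for a DoFn: each element of the main input can look up
    # the reference data while being processed.
    return {**event, "product_name": ref.get(event["sku"], "unknown")}

# Stand-in for the Pub/Sub main input.
events = [{"sku": "SKU-1", "qty": 3}, {"sku": "SKU-9", "qty": 1}]

enriched = [enrich(e, reference) for e in events]
print(enriched)
```

The key property, mirrored here, is that the reference data is loaded once and read many times, rather than being joined element-by-element against an external store.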

Comment 3.1

ID: 1106420 User: JOKKUNO Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 26 Jun 2024 22:16 Selected Answer: - Upvotes: 2

https://beam.apache.org/documentation/programming-guide/#side-inputs

Comment 4

ID: 981909 User: piyush7777 Badges: - Relative Date: 2 years ago Absolute Date: Thu 15 Feb 2024 21:17 Selected Answer: - Upvotes: 1

Why not side-output?

Comment 5

ID: 971694 User: TQM__9MD Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 04 Feb 2024 07:56 Selected Answer: B Upvotes: 1

B. Use multi-cluster routing to add a second cluster to the existing instance, utilizing a live traffic app profile for the regular workload and a batch analytics profile for the analytical workload.

Comment 6

ID: 960327 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 13:12 Selected Answer: C Upvotes: 2

The answer is C. It's a trap so that you answer A because of batch vs streaming but you need BigQueryIO. On the other hand, streaming is absolutely redundant here and will incur extra costs. C is right but would be better with batch.

Comment 7

ID: 764042 User: Siadd Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 02 Jul 2023 21:49 Selected Answer: - Upvotes: 1

A is the Answer.
A. Batch job, PubSubIO, side-inputs

Comment 8

ID: 675307 User: sedado77 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 21 Mar 2023 19:20 Selected Answer: C Upvotes: 3

I got this question on sept 2022. Answer is C

Comment 8.1

ID: 704162 User: chrismayola Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 22:30 Selected Answer: - Upvotes: 1

dear can you please help, i have some questions about how to prepare the cerification exam using this questionnaire. this is my email [email protected], ping me to have some conversation

Comment 9

ID: 539710 User: alex12441 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Wed 03 Aug 2022 13:04 Selected Answer: C Upvotes: 1

Answer: C

Comment 10

ID: 518488 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 06 Jul 2022 18:57 Selected Answer: C Upvotes: 5

I vote for C, because data will come from Pub/Sub, so it should be streaming. We'll need PubSubIO to be able to read from Pub/Sub and BigQueryIO to be able to write to BigQuery; finally, the side-inputs pattern lets us enrich the data.

Comment 11

ID: 511411 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 28 Jun 2022 17:52 Selected Answer: C Upvotes: 4

Static reference data from BigQuery will go as side-inputs and data from pub-sub will go as streaming data using PubSubIO and finally BigQueryIO is required to push the final data to BigQuery

Comment 12

ID: 487789 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Fri 27 May 2022 02:34 Selected Answer: - Upvotes: 1

Ans: C

Comment 13

ID: 427114 User: Meuter Badges: - Relative Date: 4 years ago Absolute Date: Sat 19 Feb 2022 02:26 Selected Answer: - Upvotes: 3

I choose C, because data will come from Pub/Sub, so it should be streaming. We'll need PubSubIO to be able to read from Pub/Sub and BigQueryIO to be able to write to BigQuery; finally, the side-inputs pattern lets us enrich the data.
https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html
https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs
https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html

Comment 14

ID: 308371 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 11 Sep 2021 21:46 Selected Answer: - Upvotes: 2

C:
We have to use a streaming job because of Pub/Sub, and side-inputs for the static reference data. We also have to leverage BigQueryIO, since we finally want to write the data to BigQuery. So C is the correct answer.

Comment 15

ID: 294162 User: someshsehgal Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 11:41 Selected Answer: - Upvotes: 1

Correct A. batch is cost-effective and no need to go for streaming

Comment 15.1

ID: 294613 User: funtoosh Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 21:20 Selected Answer: - Upvotes: 1

How are you going to write back to BQ?

Comment 16

ID: 284530 User: someshsehgal Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Fri 06 Aug 2021 04:01 Selected Answer: - Upvotes: 2

Correct A:
There are two points to defend it.
a. It's batch, hence free of cost
b. static reference data hence no need to opt for streaming

Comment 17

ID: 200391 User: arghya13 Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Thu 15 Apr 2021 10:51 Selected Answer: - Upvotes: 2

After so much confusion I think the correct answer is B. Reasons:
1. Pub/Sub is always used for streaming input, so there is no reason to mention it.
2. JdbcIO can be used by Apache Beam to connect to BigQuery for a lookup.
3. Side-outputs carry the result to put into BigQuery.

99. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 104

Sequence
300
Discussion ID
79318
Source URL
https://www.examtopics.com/discussions/google/view/79318-exam-professional-data-engineer-topic-1-question-104/
Posted By
damaldon
Posted At
Sept. 2, 2022, 9:40 a.m.

Question

You used Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?

  • A. Create a cron schedule in Dataprep.
  • B. Create an App Engine cron job to schedule the execution of the Dataprep job.
  • C. Export the recipe as a Dataprep template, and create a job in Cloud Scheduler.
  • D. Export the Dataprep job as a Dataflow template, and incorporate it into a Composer job.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 749481 User: jkhong Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 05:14 Selected Answer: - Upvotes: 12

I'd pick D because it's the only option which allows variable execution (since we need to execute the dataprep job only after the prior load job). Although D suggests the export of Dataflow templates, this discussion suggests that the export option is no longer available (https://stackoverflow.com/questions/72544839/how-to-get-the-dataflow-template-of-a-dataprep-job), there are already Airflow Operators for Dataprep which we should be using instead - https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/dataprep.html

Comment 2

ID: 833707 User: midgoo Badges: Highly Voted Relative Date: 2 years, 6 months ago Absolute Date: Sat 09 Sep 2023 08:20 Selected Answer: D Upvotes: 9

Since the load job execution time is unpredictable, scheduling the Dataprep job on a fixed time window may not work.
When the Dataprep job runs the first time, we can find its Dataflow job in the console. We can use that to create the template. With Composer's help to determine whether the load job has completed, we can then trigger the Dataflow job.

Comment 3

ID: 1102590 User: TVH_Data_Engineer Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 14:07 Selected Answer: A Upvotes: 2

Dataprep by Trifacta allows you to schedule the execution of recipes. You can set up a cron schedule directly within Dataprep to automatically run your recipe at specified intervals, such as daily.
WHY NOT D ? : This option involves significant additional complexity. Exporting the Dataprep job as a Dataflow template and then incorporating it into a Composer (Apache Airflow) job is a more complicated process and is typically used for more complex orchestration needs that go beyond simple scheduling.

Comment 4

ID: 1098818 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 10:57 Selected Answer: D Upvotes: 1

We have an external dependency ("after the load job with variable execution time completes"), which requires a DAG -> Airflow (Cloud Composer).

The reasons:
- A scheduler like Cloud Scheduler won't handle the dependency on the BigQuery load completion time.
- Using Composer allows creating a DAG workflow that can:
  - Trigger the BigQuery load
  - Wait for the BigQuery load to complete
  - Trigger the Dataprep Dataflow job
- A Dataflow template allows easy reuse of the Dataprep transformation logic.
- Composer coordinates everything based on the dependencies in an automated workflow.
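The orchestration MaxNRG describes could be sketched as a Composer (Airflow) DAG along these lines. The operator and sensor classes come from the Airflow Google provider, but the bucket, marker object, and template path are hypothetical placeholders:

```python
# Sketch of a Composer (Airflow) DAG expressing the dependency chain:
# BigQuery load (variable duration) -> Dataflow job built from the
# exported Dataprep template. Bucket, marker object, and template
# path below are hypothetical placeholders.
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
import pendulum

with DAG(
    dag_id="daily_dataprep_after_load",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # Wait for a marker object the load job writes when it finishes,
    # so the variable load time doesn't matter.
    wait_for_load = GCSObjectExistenceSensor(
        task_id="wait_for_load_done",
        bucket="my-etl-bucket",            # hypothetical
        object="loads/{{ ds }}/_SUCCESS",  # hypothetical marker
    )

    run_recipe = DataflowTemplatedJobStartOperator(
        task_id="run_dataprep_template",
        job_name="dataprep-recipe-{{ ds_nodash }}",
        template="gs://my-etl-bucket/templates/dataprep_recipe",  # hypothetical
        location="us-central1",
    )

    wait_for_load >> run_recipe
```

This is a configuration sketch, not a drop-in DAG; how the load job signals completion (marker file, BigQuery sensor, external task) depends on your setup.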

Comment 5

ID: 1090732 User: rocky48 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 08 Jun 2024 02:54 Selected Answer: D Upvotes: 1

I'd pick D because it's the only option which allows variable execution

Comment 6

ID: 994724 User: gaurav0480 Badges: - Relative Date: 2 years ago Absolute Date: Thu 29 Feb 2024 06:52 Selected Answer: - Upvotes: 3

The key here is "after the load job with variable execution time completes" which means the execution of this job depends on the completion of another job which has a variable execution time. Hence D

Comment 7

ID: 988172 User: god_brainer Badges: - Relative Date: 2 years ago Absolute Date: Fri 23 Feb 2024 13:07 Selected Answer: - Upvotes: 1

This approach ensures the dynamic triggering of the Dataprep job based on the completion of the preceding load job, ensuring data is processed accurately and in sequence.

Comment 8

ID: 871387 User: Adswerve Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 16 Oct 2023 02:18 Selected Answer: A Upvotes: 2

A is correct. D is too complicated.

A is correct, because you can schedule a job right from Dataprep UI.

https://cloud.google.com/blog/products/gcp/scheduling-and-sampling-arrive-for-google-cloud-dataprep
Scheduling and sampling arrive for Google Cloud Dataprep
Throughout our early releases, users’ most common request has been Flow scheduling. As of Thursday’s release, Flows can be scheduled with minute granularity at any frequency.

Comment 9

ID: 850793 User: lucaluca1982 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Tue 26 Sep 2023 09:13 Selected Answer: C Upvotes: 4

I think C; it is more straightforward.

Comment 10

ID: 820630 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 14:53 Selected Answer: - Upvotes: 3

Answer C: Use the Recipe Template feature of Dataprep. No need to change the service.

Comment 11

ID: 772438 User: jroig_ Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 11 Jul 2023 11:35 Selected Answer: C Upvotes: 1

Why not C?

Comment 12

ID: 736914 User: anicloudgirl Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 06 Jun 2023 14:51 Selected Answer: A Upvotes: 4

It's A. You can set a job directly in Dataprep and it will use Dataflow under the hood.

Comment 13

ID: 736912 User: anicloudgirl Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 06 Jun 2023 14:50 Selected Answer: - Upvotes: 2

It's A. You can set a job directly in Dataprep and it will use Dataflow under the hood. No need to export it nor incorporate it into a Composer job.
Dataprep by trifacta - https://docs.trifacta.com/display/DP/cron+Schedule+Syntax+Reference
Dataprep job uses dataflow - https://cloud.google.com/dataprep

Comment 13.1

ID: 749473 User: jkhong Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 05:00 Selected Answer: - Upvotes: 4

The question mentions a load job with variable time; I don't think setting a Dataprep cron job can address the issue of variable load times.

Comment 14

ID: 712613 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 19:50 Selected Answer: D Upvotes: 2

It's D

Comment 15

ID: 669134 User: John_Pongthorn Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 18:21 Selected Answer: D Upvotes: 2

Dataprep and Dataflow are in the same family.

Comment 16

ID: 658412 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 15:05 Selected Answer: D Upvotes: 4

D. Export the Dataprep job as a Dataflow template, and incorporate it into a Composer job.

Comment 17

ID: 657128 User: damaldon Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 10:40 Selected Answer: - Upvotes: 2

It’s D, use composer to schedule tasks

100. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 181

Sequence
301
Discussion ID
79924
Source URL
https://www.examtopics.com/discussions/google/view/79924-exam-professional-data-engineer-topic-1-question-181/
Posted By
ducc
Posted At
Sept. 4, 2022, 3:19 a.m.

Question

You need to give new website users a globally unique identifier (GUID) using a service that takes in data points and returns a GUID. This data is sourced from both internal and external systems via HTTP calls that you will make via microservices within your pipeline. There will be tens of thousands of messages per second, and they can be multi-threaded. You worry about the backpressure on the system. How should you design your pipeline to minimize that backpressure?

  • A. Call out to the service via HTTP.
  • B. Create the pipeline statically in the class definition.
  • C. Create a new object in the startBundle method of DoFn.
  • D. Batch the job into ten-second increments.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 28 comments Click to expand

Comment 1

ID: 683247 User: John_Pongthorn Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Thu 30 Mar 2023 07:44 Selected Answer: D Upvotes: 20

D: I have insisted on this choice all along.
Please read and find the keyword "massive backpressure":
https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1

if the call takes on average 1 sec, that would cause massive backpressure on the pipeline. In these circumstances you should consider batching these requests, instead.

Comment 1.1

ID: 713664 User: NicolasN Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 08 May 2023 10:09 Selected Answer: - Upvotes: 6

Thanks for sharing, you found exactly the same problem!
The document definitely proposes batching for this scenario.

I'm quoting another part from the same example that would be useful for a similar question with different conditions:
- If you're using a client in the DoFn that has heavy instantiation steps, rather than create that object in each DoFn call:
* If the client is thread-safe and serializable, create it statically in the class definition of the DoFn.
* If it's not thread-safe, create a new object in the startBundle method of DoFn. By doing so, the client will be reused across all elements of a bundle, saving initialization time.
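The two client-creation options NicolasN quotes can be sketched in plain Python. This is a stand-in for a DoFn's lifecycle, not the Beam API; class and method names are illustrative:

```python
# Plain-Python sketch of the two client-reuse options quoted above.
# Not the Beam API: the class mimics a DoFn's lifecycle hooks.

class ExpensiveClient:
    """Stand-in for a client with heavy instantiation steps."""
    instances_created = 0

    def __init__(self):
        ExpensiveClient.instances_created += 1

    def call(self, x):
        return x * 2

class EnrichFn:
    # Option 1: client is thread-safe and serializable -> create it
    # once, statically, in the class definition; it is shared by all
    # bundles and threads.
    shared_client = ExpensiveClient()

    def start_bundle(self):
        # Option 2: client is NOT thread-safe -> create one per bundle
        # here; it is then reused for every element of the bundle,
        # saving initialization time versus one client per element.
        self.bundle_client = ExpensiveClient()

    def process(self, element):
        return self.bundle_client.call(element)

fn = EnrichFn()
fn.start_bundle()                             # runner calls once per bundle
results = [fn.process(e) for e in [1, 2, 3]]  # many elements, one client
```

Either way, the point is the same: the expensive construction happens once per class or once per bundle, never once per element.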

Comment 1.1.1

ID: 725513 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 02:30 Selected Answer: - Upvotes: 2

By the way, if you look at the shared pseudocode, it's talking about startBundle and finishBundle of DoFn. The question is which one to choose to avoid backpressure.
You can see why you need to choose bundles instead of batching at the link below:
"Batching introduces some processing overhead as well as the need for a magic number to determine the key space.
Instead, use the StartBundle and FinishBundle lifecycle elements to batch your data. With these options, no shuffling is needed."
https://cloud.google.com/dataflow/docs/tutorials/ecommerce-java#micro-batch-calls

Comment 1.1.1.1

ID: 726510 User: NicolasN Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 08:29 Selected Answer: - Upvotes: 2

Valid points, but I don't change my mind regarding the requirements of this particular question:
- multi-threaded ability
- no mention of heavy initialization steps or a lot of disk I/O (where shuffling might be a problem).
And especially the excerpt:
"if the call takes on average 1 sec, that would cause massive backpressure on the pipeline. In these circumstances you should consider batching these requests, instead"
It's like the guys that authored the question had this sentence in front of their eyes.

Comment 2

ID: 679554 User: John_Pongthorn Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Sun 26 Mar 2023 10:51 Selected Answer: D Upvotes: 8

D
All of you, please read carefully the pattern "Calling external services for data enrichment":
https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
A, B, and C are all solutions for the normal case, but if you need to withstand backpressure, see the note in the last section: "When using this pattern, be sure to plan for the load that's placed on the external service and any associated backpressure. For example, imagine a pipeline that's processing tens of thousands of messages per second in steady state. If you made a callout per element, you would need the system to deal with the same number of API calls per second. Also, if the call takes on average 1 sec, that would cause massive backpressure on the pipeline. In these circumstances, you should consider batching these requests, instead."

Can anyone share ideas to debate with me?
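The batching advice in the quoted note can be sketched in plain Python: buffer elements and call the external service once per batch instead of once per element. The GUID service here is a hypothetical stand-in, not a real API:

```python
# Plain-Python sketch of the micro-batching pattern recommended above:
# instead of one external call per element, buffer elements and make
# one call per batch, cutting call volume (and backpressure) by the
# batch size. guid_service is a hypothetical stand-in.

calls_made = 0

def guid_service(batch):
    """Hypothetical external service: one call per *batch* of inputs."""
    global calls_made
    calls_made += 1
    return [f"guid-{item}" for item in batch]

class MicroBatcher:
    def __init__(self, max_batch=100):
        self.max_batch = max_batch
        self.buffer = []
        self.results = []

    def add(self, element):
        # Analogous to buffering in a DoFn's process() and flushing
        # when the buffer is full or in finishBundle.
        self.buffer.append(element)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.results.extend(guid_service(self.buffer))
            self.buffer = []

batcher = MicroBatcher(max_batch=100)
for event in range(250):   # 250 elements arrive...
    batcher.add(event)
batcher.flush()            # ...flush the remainder
# 250 elements cost 3 service calls instead of 250.
```

In Beam this buffering is exactly what the StartBundle/FinishBundle lifecycle (or a time-windowed batch, as in option D) gives you.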

Comment 3

ID: 1102293 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:08 Selected Answer: D Upvotes: 2

Option D is the best approach to minimize backpressure in this scenario. By batching the jobs into 10-second increments, you can throttle the rate at which requests are made to the external GUID service. This prevents too many simultaneous requests from overloading the service.

Option A would not help with backpressure since it just makes synchronous HTTP requests as messages arrive. Similarly, options B and C don't provide any inherent batching or throttling mechanism.

Batching into time windows is a common strategy in stream processing to deal with high velocity data. The 10-second windows allow some buffering to happen, rather than making a call immediately for each message. This provides a natural throttling that can be tuned based on the external service's capacity.

Comment 3.1

ID: 1102294 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:09 Selected Answer: - Upvotes: 1

To design a pipeline that minimizes backpressure, especially when dealing with tens of thousands of messages per second in a multi-threaded environment, it's important to consider how each option affects system performance and scalability. Let's examine each of your options:

Comment 3.1.1

ID: 1102295 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:09 Selected Answer: - Upvotes: 1

A. Call out to the service via HTTP: Making HTTP calls to an external service for each message can introduce significant latency and backpressure, especially at high throughput. This is due to the overhead of establishing a connection, waiting for the response, and handling potential network delays or failures.

Comment 3.1.1.1

ID: 1102296 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:09 Selected Answer: - Upvotes: 1

B. Create the pipeline statically in the class definition: While this approach can improve initialization time and reduce overhead during execution, it doesn't directly address the issue of backpressure caused by high message throughput.

Comment 3.1.1.1.1

ID: 1102297 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:10 Selected Answer: - Upvotes: 1

C. Create a new object in the startBundle method of DoFn: This approach is typically used in Apache Beam to initialize resources before processing a bundle of elements. While it can optimize resource usage and performance within each bundle, it doesn't inherently solve the backpressure issue caused by high message rates.

Comment 3.1.1.1.1.1

ID: 1102298 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:11 Selected Answer: - Upvotes: 1

D. Batch the job into ten-second increments: Batching messages can be an effective way to reduce backpressure. By grouping multiple messages into larger batches, you can reduce the frequency of external calls and distribute the processing load more evenly over time. This can lead to more efficient use of resources and potentially lower latency, as the system spends less time waiting on external services.

Comment 3.1.1.2

ID: 1102299 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 09:11 Selected Answer: - Upvotes: 1

Given these considerations, option D (Batch the job into ten-second increments) seems to be the most effective strategy for minimizing backpressure in your scenario. By batching messages, you can reduce the strain on your pipeline and external services, making the system more resilient and scalable under high load. However, the exact batch size and interval should be fine-tuned based on the specific characteristics of your workload and the capabilities of the external systems you are interacting with.

Additionally, it's important to consider other strategies in conjunction with batching, such as implementing efficient error handling, load balancing, and potentially using asynchronous I/O for external HTTP calls to further optimize performance and minimize backpressure.

Comment 4

ID: 886089 User: izekc Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 01 Nov 2023 13:21 Selected Answer: D Upvotes: 1

Option C is not correct because it does not address the issue of backpressure. Creating a new object in the startBundle method of DoFn will not help to reduce the number of calls that are made to the service, which can lead to backpressure.

Here are some reasons why C is not correct:

Creating a new object in the startBundle method of DoFn is not a scalable solution. As the number of messages increases, the number of objects that need to be created will also increase. This can lead to performance problems and memory usage issues.
Creating a new object in the startBundle method of DoFn does not address the issue of backpressure. The service may still experience backpressure if the number of messages exceeds the service's capacity.
A better solution would be to use batching to reduce the number of calls that are made to the service. This can help to improve performance and reduce backpressure.

Comment 5

ID: 885274 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 30 Oct 2023 16:43 Selected Answer: - Upvotes: 1

gpt: Option C is a better approach as it allows for object creation to occur in a more controlled manner within the DoFn, potentially reducing the pressure on the system. However, it could still create a large number of objects depending on the rate of incoming messages.

Option D of batching the job into ten-second increments can also be a good solution to reduce backpressure on the system. This way, you can limit the number of messages being processed at any given time, which can help prevent bottlenecks and reduce the likelihood of backpressure.

Therefore, the best approach would be to combine options C and D, creating a new object in the startBundle method of a DoFn, and batching the job into smaller time increments, such as 10 seconds. This way, you can control the rate of object creation and processing, which can help minimize backpressure on the system.

Comment 5.1

ID: 885279 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 30 Oct 2023 16:46 Selected Answer: - Upvotes: 1

Another vague question, as we see...
So, I'll choose D if I get this on the test.
"However, depending on the specifics of your use case, one option may be better suited than the other. For example, if you have a high volume of incoming messages with occasional spikes, option D of batching the job into smaller time increments may be more effective in managing the load. On the other hand, if the incoming messages are more evenly distributed over time, option C of creating a new object in the startBundle method of DoFn may be a better option.
Ultimately, it may be necessary to experiment with both approaches and determine which one works best for your specific use case."

Comment 6

ID: 854664 User: juliobs Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 29 Sep 2023 18:14 Selected Answer: D Upvotes: 1

D works.
Could be C, but who said that the pipeline is in Dataflow/Beam?

Comment 7

ID: 813489 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 20:38 Selected Answer: - Upvotes: 1

Answer C
A 10-second batch increment can improve load balancing, but for overall backpressure (messages being generated faster than they are consumed or published), in this case use startBundle in DoFn, or look at other options in future, like:
caching,
load shedding (prioritising message flow),
message queuing.
These options handle backpressure.
If your CPU is performing badly, then go with a change in the batch increment timing.

Comment 8

ID: 786742 User: maci_f Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 17:42 Selected Answer: D Upvotes: 3

I was hesitating between C and D, but then I realised this: https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
Here it says "If it's not thread-safe, create a new object in the startBundle method of DoFn." The task explicitly says "There will be tens of thousands of messages per second and that can be multi-threaded."
Correct me if I'm wrong, but multi-threaded == thread-safe. Therefore, no need to go for the C approach.

Comment 9

ID: 725512 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 02:27 Selected Answer: - Upvotes: 1

C
C is the answer because:
First of all, there is no doubt that we should avoid a single call per element; that's why we use multi-threading, otherwise it overwhelms the external service endpoint. To avoid this issue, batch calls to external systems.
Batching also has an issue: the GroupByKey transform or the Apache Beam Timer API.
These approaches both require shuffling, which introduces some processing overhead as well as the need for a magic number to determine the key space.
Instead, use the StartBundle and FinishBundle lifecycle elements to batch your data. With these options, no shuffling is needed.
Source:
https://cloud.google.com/dataflow/docs/tutorials/ecommerce-java#micro-batch-calls
https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
Summary:
StartBundle and FinishBundle do batching with no shuffling.

Comment 10

ID: 681308 User: AHUI Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 28 Mar 2023 01:58 Selected Answer: - Upvotes: 1

Ans C: reference https://cloud.google.com/architecture/e-commerce/patterns/batching-external-calls

Comment 11

ID: 672194 User: SMASL Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 18 Mar 2023 12:55 Selected Answer: C Upvotes: 3

Based on the answers in this discussion thread, I would go for C. The most important link to support this choice is as following: https://cloud.google.com/architecture/e-commerce/patterns/batching-external-calls

Comment 12

ID: 667243 User: Thobm Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Mar 2023 19:46 Selected Answer: D Upvotes: 1

Beam docs recommend batching
https://beam.apache.org/documentation/patterns/grouping-elements-for-efficient-external-service-calls/

Comment 13

ID: 665247 User: John_Pongthorn Badges: - Relative Date: 3 years ago Absolute Date: Fri 10 Mar 2023 11:39 Selected Answer: - Upvotes: 1

C. It is straightforward. You can take a look at https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
Pattern: Calling external services for data enrichment

Comment 14

ID: 662123 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 09:56 Selected Answer: - Upvotes: 1

Answer C
https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1

Comment 14.1

ID: 662125 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 09:57 Selected Answer: - Upvotes: 1

https://cloud.google.com/architecture/e-commerce/patterns/batching-external-calls
To support choice C

Comment 15

ID: 661076 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 13:16 Selected Answer: C Upvotes: 1

i think you are right gg

Comment 16

ID: 659959 User: nwk Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 11:59 Selected Answer: - Upvotes: 2

How about C?
https://cloud.google.com/architecture/e-commerce/patterns/batching-external-calls

Comment 17

ID: 659531 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Sat 04 Mar 2023 23:48 Selected Answer: D Upvotes: 2

D. Batch the job into ten-second increments.

101. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 185

Sequence
302
Discussion ID
79599
Source URL
https://www.examtopics.com/discussions/google/view/79599-exam-professional-data-engineer-topic-1-question-185/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 10:47 p.m.

Question

You issue a new batch job to Dataflow. The job starts successfully, processes a few elements, and then suddenly fails and shuts down. You navigate to the
Dataflow monitoring interface where you find errors related to a particular DoFn in your pipeline. What is the most likely cause of the errors?

  • A. Job validation
  • B. Exceptions in worker code
  • C. Graph or pipeline construction
  • D. Insufficient permissions

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 7 comments Click to expand

Comment 1

ID: 657833 User: AWSandeep Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 23:47 Selected Answer: B Upvotes: 13

B. Exceptions in worker code

While your job is running, you might encounter errors or exceptions in your worker code. These errors generally mean that the DoFns in your pipeline code have generated unhandled exceptions, which result in failed tasks in your Dataflow job.

Exceptions in user code (for example, your DoFn instances) are reported in the Dataflow monitoring interface.

Reference (Lists all answer choices and when to pick each one):
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline#Causes

Comment 2

ID: 1102353 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 10:18 Selected Answer: B Upvotes: 2

The most likely cause of the errors you're experiencing in Dataflow, particularly if they are related to a particular DoFn (Dataflow's parallel processing operation), is B. Exceptions in worker code.
When a Dataflow job processes a few elements successfully before failing, it suggests that the overall job setup, permissions, and pipeline graph are likely correct, as the job was able to start and initially process data. However, if it fails during execution and the errors are associated with a specific DoFn, this points towards issues in the code that executes within the workers. This could include:
1. Runtime exceptions in the code logic of the DoFn.
2. Issues handling specific data elements that might not be correctly managed by the DoFn code (e.g., unexpected data formats, null values, etc.).
3. Resource constraints or timeouts if the DoFn performs operations that are resource-intensive or long-running.

Comment 2.1

ID: 1102354 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 10:18 Selected Answer: - Upvotes: 2

To resolve these issues, you should:
1. Inspect the stack traces and error messages in the Dataflow monitoring interface for details on the exception.
2. Test the DoFn with a variety of data inputs, especially edge cases, to ensure robust error handling.
3. Review the resource usage and performance characteristics of the DoFn if the issue is related to resource constraints.

Comment 3

ID: 900183 User: vaga1 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 17 Nov 2023 16:24 Selected Answer: B Upvotes: 3

A. Job validation - since the job started successfully, it must have passed validation.
B. Exceptions in worker code - possible.
C. Graph or pipeline construction - same reasoning as A.
D. Insufficient permissions - nothing suggests this, and it would have caused a validation failure.

Comment 4

ID: 725555 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 04:36 Selected Answer: - Upvotes: 1

C
Code error

Comment 5

ID: 667563 User: pluiedust Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 04:09 Selected Answer: B Upvotes: 2

B is correct

Comment 6

ID: 658884 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Sat 04 Mar 2023 04:18 Selected Answer: B Upvotes: 1

B is correct

102. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 193

Sequence
305
Discussion ID
79644
Source URL
https://www.examtopics.com/discussions/google/view/79644-exam-professional-data-engineer-topic-1-question-193/
Posted By
ducc
Posted At
Sept. 3, 2022, 3:52 a.m.

Question

An aerospace company uses a proprietary data format to store its flight data. You need to connect this new data source to BigQuery and stream the data into
BigQuery. You want to efficiently import the data into BigQuery while consuming as few resources as possible. What should you do?

  • A. Write a shell script that triggers a Cloud Function that performs periodic ETL batch jobs on the new data source.
  • B. Use a standard Dataflow pipeline to store the raw data in BigQuery, and then transform the format later when the data is used.
  • C. Use Apache Hive to write a Dataproc job that streams the data into BigQuery in CSV format.
  • D. Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format.

Suggested Answer

D

Answer Description


Community Answer Votes

Comments (23)

Comment 1

ID: 708849 User: beanz00 Badges: Highly Voted Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 04:53 Selected Answer: - Upvotes: 17

This has to be D. How could it even be B? The source is a proprietary format. Dataflow wouldn't have a built-in template to read the file. You will have to create something custom.

Comment 2

ID: 697465 User: devaid Badges: Highly Voted Relative Date: 2 years, 10 months ago Absolute Date: Mon 17 Apr 2023 16:23 Selected Answer: D Upvotes: 12

For me it's clearly D.
It's between B and D, but read B again: store raw data in BigQuery? Using a Dataflow pipeline just to store raw data in BigQuery and transform it later would require another pipeline for that, which is not efficient.

Comment 3

ID: 1102811 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 17:41 Selected Answer: D Upvotes: 4

Option D is the best approach given the constraints - use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format.
The key reasons:
• Dataflow provides managed resource scaling for efficient stream processing
• Avro format has schema evolution capabilities and efficient serialization for flight telemetry data
• Apache Beam connectors avoid having to write much code to integrate proprietary data sources
• Streaming inserts data efficiently compared to periodic batch jobs
In contrast, option A uses Cloud Functions which lack native streaming capabilities. Option B stores data in less efficient JSON format. Option C uses Dataproc which requires manual cluster management.
So leveraging Dataflow + Avro + Beam provides the most efficient way to stream proprietary flight data into BigQuery while using minimal resources.
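The heart of option D is a custom parse step for the proprietary format. As a hedged sketch only: the record layout below (flight id plus two telemetry doubles) is entirely invented, and a real connector would wrap this kind of decoder in a Beam DoFn or custom Source and hand the resulting dicts to a BigQuery sink in Avro-compatible form.

```python
import struct

# Hypothetical proprietary record layout, invented for illustration:
# 4-byte big-endian unsigned flight id, then two 8-byte doubles
# (altitude in metres, airspeed in m/s) -- 20 bytes per record.
RECORD = struct.Struct(">Idd")

def parse_record(raw: bytes) -> dict:
    """Decode one fixed-width record into a BigQuery-friendly row dict."""
    flight_id, altitude_m, airspeed_ms = RECORD.unpack(raw)
    return {
        "flight_id": flight_id,
        "altitude_m": altitude_m,
        "airspeed_ms": airspeed_ms,
    }

# Round-trip one record to show the decoder's output shape.
raw = RECORD.pack(42, 10668.0, 250.5)
row = parse_record(raw)
print(row)  # {'flight_id': 42, 'altitude_m': 10668.0, 'airspeed_ms': 250.5}
```

Once records decode to dicts like this, the Avro schema for the BigQuery sink falls out of the field names and types.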

Comment 4

ID: 1096641 User: Aman47 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 16:23 Selected Answer: - Upvotes: 1

It's talking about streaming? None of the options talk about triggering a load to begin with. We need a trigger or schedule to run first.

Comment 5

ID: 1044098 User: AjoseO Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 15 Apr 2024 13:54 Selected Answer: D Upvotes: 2

Option D allows you to use a custom connector to read the proprietary data format and write the data to BigQuery in Avro format.

Comment 6

ID: 1002883 User: sergiomujica Badges: - Relative Date: 2 years ago Absolute Date: Sat 09 Mar 2024 06:51 Selected Answer: D Upvotes: 1

the keyword is streaming

Comment 7

ID: 965227 User: knith66 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 28 Jan 2024 05:30 Selected Answer: - Upvotes: 3

Between B and D. Firstly transformation is not mentioned in the question, So B is less probable. Then Efficient import is mentioned in the question, Converting to Avro will consume less space. I am going with D

Comment 8

ID: 814109 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 19 Aug 2023 13:22 Selected Answer: - Upvotes: 1

Answer is D ,
Why not B, changing data format before uploading to bigquery is good approach.

Comment 9

ID: 785337 User: cetanx Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 12:30 Selected Answer: B Upvotes: 1

I believe keyword here is "An aerospace company uses a proprietary data format"
So if we list the connectors available in Apache Beam, we are listed with these options;
https://beam.apache.org/documentation/io/connectors/

So I believe, we have to create our own custom connector to read from the proprietary data format hence the answer should be B

Comment 9.1

ID: 785339 User: cetanx Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 12:31 Selected Answer: - Upvotes: 1

sorry the answer should be D

Comment 10

ID: 763419 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 02 Jul 2023 00:54 Selected Answer: - Upvotes: 1

D is right

Comment 11

ID: 734927 User: hauhau Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 04 Jun 2023 07:54 Selected Answer: B Upvotes: 2

B is the most efficient

Comment 12

ID: 687656 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 06 Apr 2023 11:56 Selected Answer: - Upvotes: 1

https://cloud.google.com/spanner/docs/change-streams/use-dataflow#core-concepts

Comment 13

ID: 663133 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Wed 08 Mar 2023 08:09 Selected Answer: - Upvotes: 2

Ans B
https://cloud.google.com/architecture/streaming-avro-records-into-bigquery-using-dataflow
Is there a reason to use an Apache Beam connector when there is Dataflow, which is a standard solution for that scenario?

Comment 13.1

ID: 666712 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Mar 2023 10:49 Selected Answer: - Upvotes: 1

https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-data-ingestion

Comment 13.1.1

ID: 669078 User: learner2610 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 16:58 Selected Answer: - Upvotes: 1

Can standard Dataflow be used to ingest any proprietary file format?
Shouldn't we use a custom Apache Beam connector?
So I think it is D; though it isn't simple, in this scenario they have asked to use fewer resources to import the data.

Comment 13.1.1.1

ID: 669572 User: TNT87 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Wed 15 Mar 2023 10:03 Selected Answer: - Upvotes: 1

Option D streams, and that's not cost effective. We need something that is cost effective, hence B is the option.

Comment 13.1.1.1.1

ID: 675763 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 08:51 Selected Answer: - Upvotes: 1

I mean that consumes fewer resources

Comment 13.1.1.2

ID: 669569 User: TNT87 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Wed 15 Mar 2023 10:01 Selected Answer: - Upvotes: 1

Do you mind reading the links I provided and revisiting the question? Then you will understand why D isn't the best. Why use Apache Beam when there is Dataflow?

Comment 13.1.1.2.1

ID: 686054 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 04 Apr 2023 11:10 Selected Answer: - Upvotes: 2

D: just have your team develop a custom connector.
https://cloud.google.com/architecture/bigquery-data-warehouse#storage_management
"Internally, BigQuery stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads."

Note that this passage means BigQuery uses a proprietary format internally to do its work, whereas the question is about a proprietary format arriving as input for ingestion into BigQuery.

Comment 14

ID: 658040 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 06:52 Selected Answer: D Upvotes: 3

D. Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format.

Comment 15

ID: 657966 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 04:52 Selected Answer: B Upvotes: 2

B is the most efficient for me.

Comment 15.1

ID: 658160 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 10:18 Selected Answer: - Upvotes: 2

Sorry, D is correct

103. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 44

Sequence
307
Discussion ID
79762
Source URL
https://www.examtopics.com/discussions/google/view/79762-exam-professional-data-engineer-topic-1-question-44/
Posted By
AWSandeep
Posted At
Sept. 3, 2022, 1:26 p.m.

Question

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property
'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Manually configure the index in your index config as follows: image
  • B. Manually configure the index in your index config as follows: image
  • C. Set the following in your entity options: exclude_from_indexes = 'actors, tags'
  • D. Set the following in your entity options: exclude_from_indexes = 'date_published'

Suggested Answer

A

Answer Description


Community Answer Votes

Comments (11)

Comment 1

ID: 668430 User: Wasss123 Badges: Highly Voted Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 00:34 Selected Answer: A Upvotes: 7

Correct answer is A
Read in reference : https://cloud.google.com/datastore/docs/concepts/indexes#index_limits
In this case, you can circumvent the exploding index by manually configuring an index in your index configuration file:
indexes:
- kind: Task
  properties:
  - name: tags
  - name: created
- kind: Task
  properties:
  - name: collaborators
  - name: created
This reduces the number of entries needed to only (|tags| * |created| + |collaborators| * |created|), or 6 entries instead of 9.
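The arithmetic behind "exploding" versus manually configured indexes can be checked in a few lines. A sketch with illustrative values (three tags, three collaborators, one created timestamp, matching the counts quoted from the linked documentation):

```python
# Why splitting one composite index into two per-property indexes helps:
# a combined (tags x collaborators x created) index needs one entry per
# combination of values, while two separate (property, created) indexes
# only need one entry per value. Property values below are illustrative.

tags = ["fun", "programming", "learn"]
collaborators = ["alice", "bob", "charlie"]
created = ["2020-01-01"]  # single-valued property

# Entry per combination: |tags| * |collaborators| * |created|
exploding_entries = len(tags) * len(collaborators) * len(created)

# Entry per value in each manual index:
# |tags| * |created| + |collaborators| * |created|
manual_entries = len(tags) * len(created) + len(collaborators) * len(created)

print(exploding_entries, manual_entries)  # 9 6
```

The gap widens multiplicatively as more multi-valued properties (or more values) are added, which is exactly the combinatorial explosion the question is asking you to avoid.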

Comment 2

ID: 750783 User: jkhong Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 11:27 Selected Answer: A Upvotes: 1

you can circumvent the exploding index by manually configuring an index in your index configuration file:

https://cloud.google.com/datastore/docs/concepts/indexes#index_limits

Comment 3

ID: 750314 User: Krish6488 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 23:22 Selected Answer: D Upvotes: 3

Tempted to go with D, as the syntax in option A seems incorrect. D is still a possible answer because one of the ways to get rid of index errors is to exclude the properties that are causing the index to explode. In this case it's date_released, and hence D appears right to me.

Comment 4

ID: 744492 User: DGames Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 13 Jun 2024 22:12 Selected Answer: A Upvotes: 3

Option B & D reject because mention date_publised in question date_released is column
Option C also not correct, I would go with option A.

Comment 5

ID: 681862 User: Ender_H Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 28 Mar 2024 18:30 Selected Answer: D Upvotes: 3

Correct Answer D:

This is the way the DB is typically queried:
- movies with actor=<actorname> ordered by date_released
- movies with tag=Comedy ordered by date_released

so it seems that we need indices in actor,tag and date_released for sorting.

❌ A: this would be the correct answer, however, the format is incorrect, the correct format would be '- name: date_released' correctly indented.

❌ B: This seems to be unnecessary, since typically actor and tag are not queried together. also, there is a clear indentation issue

❌ C: We don't want to ignore actor and tag, we need those indices.

✅ D: If we leave datastore to automatically create the indices and if we specify that the 'date_released' property needs to be excluded from indices, then we would have less indices (but maybe slower queries when ordering them, but hey, how many 'comedies' there could be in the world)

Comment 5.1

ID: 681863 User: Ender_H Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 28 Mar 2024 18:30 Selected Answer: - Upvotes: 1

*Findings for this answer*:
Indices, if not defined, will be automatically created:
"By default, a Datastore mode database automatically predefines an index for each property of each entity kind. These single property indexes are suitable for simple types of queries."
source: https://cloud.google.com/datastore/docs/concepts/indexes

In the index limits section we see this:
"a Datastore mode database creates an entry in a predefined index for every property of every entity except those you have explicitly declared as excluded from your indexes."
source: https://cloud.google.com/datastore/docs/concepts/indexes#index_limits

Comment 5.2

ID: 681867 User: Ender_H Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 28 Mar 2024 18:34 Selected Answer: - Upvotes: 3

And here is the correct way to configure indices:
https://cloud.google.com/datastore/docs/tools/indexconfig

so this would be the best answer:
indexes:
- kind: Movie
  properties:
  - name: actors
  - name: date_released
    direction: asc  (this could be left out; it defaults to asc)
- kind: Movie
  properties:
  - name: tag
  - name: date_released
    direction: asc  (this could be left out; it defaults to asc)

Comment 6

ID: 672466 User: Hm92730 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 18 Mar 2024 17:43 Selected Answer: - Upvotes: 1

What do people think about C? The question is asking how to avoid a combinatorial explosion in the number of indexes. It says "You have entities with multiple properties, some of which can take on multiple values". Put this with the below text from the documentation for Datastore indexes, it seems they're looking for "exclude the properties that will cause combinatorial explosion" which would be C.

"The situation becomes worse in the case of entities with multiple properties, each of which can take on multiple values. To accommodate such an entity, the index must include an entry for every possible combination of property values. Custom indexes that refer to multiple properties, each with multiple values, can "explode" combinatorially, requiring large numbers of entries for an entity with only a relatively small number of possible property values."[1]
[1] https://cloud.google.com/datastore/docs/concepts/indexes#index_limits

Comment 7

ID: 661122 User: soichirokawa Badges: - Relative Date: 2 years ago Absolute Date: Wed 06 Mar 2024 13:59 Selected Answer: - Upvotes: 1

B. is correct.
To avoid combinatorial explosion of indexes:
"Two queries of the same form but with different filter values use the same index."
https://cloud.google.com/datastore/docs/concepts/indexes

Comment 7.1

ID: 668431 User: Wasss123 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 00:36 Selected Answer: - Upvotes: 1

Correct answer is A
In the same reference you provided
In this case, you can circumvent the exploding index by manually configuring an index in your index configuration file:
indexes:
- kind: Task
  properties:
  - name: tags
  - name: created
- kind: Task
  properties:
  - name: collaborators
  - name: created
This reduces the number of entries needed to only (|tags| * |created| + |collaborators| * |created|), or 6 entries instead of 9.

Comment 8

ID: 658383 User: AWSandeep Badges: - Relative Date: 2 years ago Absolute Date: Sun 03 Mar 2024 14:26 Selected Answer: A Upvotes: 1

A. Manually configure the index in your index config as follows:

104. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 167

Sequence
310
Discussion ID
79492
Source URL
https://www.examtopics.com/discussions/google/view/79492-exam-professional-data-engineer-topic-1-question-167/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 6:57 p.m.

Question

Your company currently runs a large on-premises cluster using Spark, Hive, and HDFS in a colocation facility. The cluster is designed to accommodate peak usage on the system; however, many jobs are batch in nature, and usage of the cluster fluctuates quite dramatically. Your company is eager to move to the cloud to reduce the overhead associated with on-premises infrastructure and maintenance and to benefit from the cost savings. They are also hoping to modernize their existing infrastructure to use more serverless offerings in order to take advantage of the cloud. Because of the timing of their contract renewal with the colocation facility, they have only 2 months for their initial migration. How would you recommend they approach their upcoming migration strategy so they can maximize their cost savings in the cloud while still executing the migration in time?

  • A. Migrate the workloads to Dataproc plus HDFS; modernize later.
  • B. Migrate the workloads to Dataproc plus Cloud Storage; modernize later.
  • C. Migrate the Spark workload to Dataproc plus HDFS, and modernize the Hive workload for BigQuery.
  • D. Modernize the Spark workload for Dataflow and the Hive workload for BigQuery.

Suggested Answer

B

Answer Description


Community Answer Votes

Comments (17)

Comment 1

ID: 1100929 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 19:16 Selected Answer: B Upvotes: 2

Based on the time constraint of 2 months and the goal to maximize cost savings, I would recommend option B - Migrate the workloads to Dataproc plus Cloud Storage; modernize later.
The key reasons are:
• Dataproc provides a fast, native migration path from on-prem Spark and Hive to the cloud. This allows meeting the 2 month timeline.
• Using Cloud Storage instead of HDFS avoids managing clusters for variable workloads and provides cost savings.
• Further optimizations and modernization to serverless (Dataflow, BigQuery) can happen incrementally later without time pressure.

Comment 1.1

ID: 1100930 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 19:16 Selected Answer: - Upvotes: 3

Option A still requires managing HDFS.
Option C and D require full modernization of workloads in 2 months which is likely infeasible.
Therefore, migrating to Dataproc with Cloud Storage fast tracks the migration within 2 months while realizing immediate cost savings, enabling the flexibility to iteratively modernize and optimize the workloads over time.

Comment 2

ID: 685444 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 03 Apr 2023 12:51 Selected Answer: B Upvotes: 3

B is most likely:
1. Migrate the jobs and infrastructure to Dataproc in the cloud.
2. Move any data from on-premises HDFS to Google Cloud Storage (Hive data included).
If you want to modernize Hive to BigQuery, you need to move it into GCS first (the preceding step) and then load it into BigQuery.
That is all.

https://cloud.google.com/blog/products/data-analytics/apache-hive-to-bigquery
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-data

Comment 3

ID: 669683 User: TNT87 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Wed 15 Mar 2023 11:34 Selected Answer: D Upvotes: 1

Answer D

Comment 3.1

ID: 676289 User: dn_mohammed_data Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 18:11 Selected Answer: - Upvotes: 1

you would have to migrate Spark to Apache Beam, which is not the case here

Comment 3.1.1

ID: 682388 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 29 Mar 2023 08:03 Selected Answer: - Upvotes: 1

apache beam for what???

Comment 3.1.1.1

ID: 693433 User: adarifian Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 13 Apr 2023 01:01 Selected Answer: - Upvotes: 1

dataflow uses apache beam

Comment 3.1.1.1.1

ID: 772487 User: TNT87 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 11 Jul 2023 12:40 Selected Answer: - Upvotes: 1

@adarifian Why use Apache Beam when there is Dataflow, an in-house GCP solution, to solve the problem? Hence I asked: Apache Beam for what?

Comment 3.1.1.1.1.1

ID: 1060883 User: ExamCtechs Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 20:40 Selected Answer: - Upvotes: 1

Dataflow IS Apache Beam; Dataflow is a Beam runner.
If you go for that solution you will need to modify your pipeline to use Beam.

Comment 4

ID: 665362 User: GyaneswarPanigrahi Badges: - Relative Date: 3 years ago Absolute Date: Fri 10 Mar 2023 14:40 Selected Answer: - Upvotes: 3

D isn't feasible within 2 months. Anyone who has worked on a Hadoop/big data warehousing or data lake project knows how little time 2 months is, given the amount of data and the complexities involved.

It should be B to begin with, and then gradually move towards D.

Comment 5

ID: 664238 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Thu 09 Mar 2023 08:07 Selected Answer: B Upvotes: 2

Ans B
- cost saving
- time factor
- Spark maps directly to Dataproc

Comment 5.1

ID: 664240 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Thu 09 Mar 2023 08:10 Selected Answer: - Upvotes: 1

Ans D is also relevant if you read this. On the other hand, Cloud Storage isn't serverless, but BigQuery is.
https://cloud.google.com/hadoop-spark-migration

Comment 6

ID: 661569 User: damaldon Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 21:32 Selected Answer: - Upvotes: 1

Ans.B as per the following link
https://blog.devgenius.io/migrating-spark-jobs-to-google-cloud-file-event-sensor-to-dynamically-create-spark-cluster-7eff2c75423d

Comment 7

ID: 660915 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 10:08 Selected Answer: B Upvotes: 2

For the time window of two month I would recommend B and then start to implement D.

Comment 8

ID: 658186 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 10:34 Selected Answer: - Upvotes: 2

It is B or D, still confusing

Comment 9

ID: 657627 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 19:57 Selected Answer: D Upvotes: 1

D because the Apache Spark Runner can be used to execute Beam pipelines using Apache Spark. Also, Hive to BigQuery is not a difficult modernization/migration.

Comment 9.1

ID: 1060884 User: ExamCtechs Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 20:41 Selected Answer: - Upvotes: 1

Dataflow is a runner of Beam itself.

105. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 178

Sequence
312
Discussion ID
79547
Source URL
https://www.examtopics.com/discussions/google/view/79547-exam-professional-data-engineer-topic-1-question-178/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 9:02 p.m.

Question

You are testing a Dataflow pipeline to ingest and transform text files. The files are compressed gzip, errors are written to a dead-letter queue, and you are using
SideInputs to join data. You noticed that the pipeline is taking longer to complete than expected; what should you do to expedite the Dataflow job?

  • A. Switch to compressed Avro files.
  • B. Reduce the batch size.
  • C. Retry records that throw an error.
  • D. Use CoGroupByKey instead of the SideInput.

Suggested Answer

D

Answer Description


Community Answer Votes

Comments (18)

Comment 1

ID: 679459 User: John_Pongthorn Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Sun 26 Mar 2023 08:10 Selected Answer: D Upvotes: 16

D: it is most likely.
There are plenty of reference docs comparing the two:
https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#choose_correctly_between_side_inputs_or_cogroupbykey_for_joins

https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-2

https://stackoverflow.com/questions/58080383/sideinput-i-o-kills-performance

Comment 2

ID: 1101938 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 21:02 Selected Answer: D Upvotes: 1

To expedite the Dataflow job that involves ingesting and transforming text files, especially if the pipeline is taking longer than expected, the most effective strategy would be:

D. Use CoGroupByKey instead of the SideInput.

Comment 2.1

ID: 1101940 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 21:02 Selected Answer: - Upvotes: 1

Here's why this approach is beneficial:

1. Efficiency in Handling Large Datasets: SideInputs are not optimal for large datasets because they require that the entire dataset be available to each worker. This can lead to performance bottlenecks, especially if the dataset is large. CoGroupByKey, on the other hand, is more efficient for joining large datasets because it groups elements by key and allows the pipeline to process each key-group separately.

2. Scalability: CoGroupByKey is more scalable than SideInputs for large-scale data processing. It distributes the workload more evenly across the Dataflow workers, which can significantly improve the performance of your pipeline.

3. Better Resource Utilization: By using CoGroupByKey, the Dataflow job can make better use of its resources, as it doesn't need to replicate the entire dataset to each worker. This results in faster processing times and better overall efficiency.

Comment 2.1.1

ID: 1101941 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 21:03 Selected Answer: - Upvotes: 1

The other options may not be as effective:

• A (Switch to compressed Avro files): While Avro is a good format for certain types of data processing, simply changing the file format from gzip to Avro may not address the underlying issue causing the delay, especially if the problem is related to the way data is being joined or processed.

• B (Reduce the batch size): Reducing the batch size could potentially increase overhead and might not significantly improve the processing time, especially if the bottleneck is due to the method of data joining.

• C (Retry records that throw an error): Retrying errors could be useful in certain contexts, but it's unlikely to speed up the pipeline if the delay is due to inefficiencies in data processing methods like the use of SideInputs.
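What CoGroupByKey actually produces can be sketched in plain Python. This is a simulation of the transform's grouping semantics only, not Beam code; in a real pipeline the grouping is done by the shuffle, which is why, unlike a side input, neither collection has to fit in a worker's memory.

```python
from collections import defaultdict

# Pure-Python sketch of CoGroupByKey semantics over two keyed collections:
# for each key, yield a pair of (values from left, values from right).

def co_group_by_key(left, right):
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

# Illustrative data: click events joined against user profiles.
clicks = [("user1", "ad_a"), ("user2", "ad_b"), ("user1", "ad_c")]
profiles = [("user1", "premium"), ("user3", "free")]

joined = co_group_by_key(clicks, profiles)
print(joined["user1"])  # (['ad_a', 'ad_c'], ['premium'])
print(joined["user3"])  # ([], ['free'])
```

Keys present on only one side still appear, paired with an empty list, which is what makes the transform usable for outer-join-style processing.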

Comment 3

ID: 813470 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 20:16 Selected Answer: - Upvotes: 1

Answer: B.
Reducing the batch size improves speed and CPU utilisation.
Dead-letter queues are generated for messages that fail acknowledgement, and it's fine to use side inputs to check a small number of errors in memory.
CoGroupByKey is not necessary for error messages.
I see only the batch size as something that can be customized to improve performance.
In a practical use case, you would check tools such as Stackdriver Monitoring and Logging, Cloud Trace, and Cloud Profiler, and try to find the cause, whether it's a compression/file-type issue or the batch size.

Comment 4

ID: 747564 User: Atnafu Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 16 Jun 2023 20:25 Selected Answer: - Upvotes: 1

D
Flatten will just merge all results into a single PCollection. To join them you can use CoGroupByKey

Comment 5

ID: 688345 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 07 Apr 2023 07:41 Selected Answer: A Upvotes: 2

When optimizing for load speed, Avro file format is preferred. Avro is a binary row-based format which can be split and read in parallel by multiple slots including compressed files.

Comment 5.1

ID: 696716 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 17 Apr 2023 02:54 Selected Answer: - Upvotes: 1

that is for BigQuery, isn't it?

Comment 5.1.1

ID: 772507 User: TNT87 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 11 Jul 2023 12:55 Selected Answer: - Upvotes: 1

Dataflow can use Avro format. It can stream or batch to BigQuery in Avro format.

Comment 6

ID: 688344 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 07 Apr 2023 07:40 Selected Answer: - Upvotes: 1

https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-data-ingestion

Comment 7

ID: 686993 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 05 Apr 2023 17:15 Selected Answer: D Upvotes: 1

D probably, side inputs have to fit in memory. If the p-collection in the side input doesn't fit well in memory it's better to use CoGroupByKey.

Comment 8

ID: 682352 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 29 Mar 2023 07:14 Selected Answer: A Upvotes: 1

Answer A
the same question is in number 70 you transform the files to Avro using Dataflow

Comment 8.1

ID: 931198 User: KC_go_reply Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 23 Dec 2023 07:03 Selected Answer: - Upvotes: 3

Avro requires the data to be at least semi-structured, because it wants a fixed schema. Text files are unstructured data, therefore it doesn't make sense to use Avro files for them

Comment 9

ID: 675285 User: csd1fggfhfgvh234 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 21 Mar 2023 18:57 Selected Answer: - Upvotes: 1

A: switching to Avro. No serialisation overhead.

Comment 10

ID: 662172 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 10:30 Selected Answer: - Upvotes: 2

Switch to Avro format
Answer A

Comment 10.1

ID: 675941 User: TNT87 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 22 Mar 2023 12:35 Selected Answer: - Upvotes: 1

https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-avro.html

Comment 11

ID: 661056 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 12:53 Selected Answer: D Upvotes: 3

D probably, side inputs have to fit in memory. If the p-collection in the side input doesn't fit well in memory it's better to use CoGroupByKey.

Comment 12

ID: 657748 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 22:02 Selected Answer: B Upvotes: 4

B. Reduce the batch size.

106. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 150

Sequence
314
Discussion ID
16868
Source URL
https://www.examtopics.com/discussions/google/view/16868-exam-professional-data-engineer-topic-1-question-150/
Posted By
rickywck
Posted At
March 17, 2020, 2:58 p.m.

Question

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence.
To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

  • A. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
  • B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
  • C. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
  • D. Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 70571 User: Rajokkiyam Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 02:28 Selected Answer: - Upvotes: 15

Answer B
Its google recommended approach to use LocalDisk/HDFS to store Intermediate result and use Cloud Storage for initial and final results.

Comment 1.1

ID: 455233 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 01 Apr 2022 03:43 Selected Answer: - Upvotes: 1

Any link to support this recommended approach?

Comment 2

ID: 218587 User: Alasmindas Badges: Highly Voted Relative Date: 4 years, 10 months ago Absolute Date: Thu 13 May 2021 15:37 Selected Answer: - Upvotes: 5

Correct answer is Option B - adding persistent disk space. Reasons:
- The question mentions that this particular job is "disk I/O intensive", so the word "disk" is explicitly mentioned.
- Option B also mentions local HDFS storage, which is generally a good option for I/O-intensive work.

Comment 3

ID: 1099867 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 16:40 Selected Answer: B Upvotes: 1

Local HDFS storage is a good option if:
- You have workloads that involve heavy I/O. For example, you have a lot of partitioned writes such as the following:
spark.read().write.partitionBy(...).parquet("gs://")
- You have I/O workloads that are especially sensitive to latency. For example, you require single-digit millisecond latency per storage operation.
- Your jobs require a lot of metadata operations—for example, you have thousands of partitions and directories, and each file size is relatively small.
- You modify the HDFS data frequently or you rename directories. (Cloud Storage objects are immutable, so renaming a directory is an expensive operation because it consists of copying all objects to a new key and deleting them afterwards.)
- You heavily use the append operation on HDFS files.

Comment 3.1

ID: 1099868 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 16:40 Selected Answer: - Upvotes: 1

We recommend using Cloud Storage as the initial and final source of data in a big-data pipeline. For example, if a workflow contains five Spark jobs in series, the first job retrieves the initial data from Cloud Storage and then writes shuffle data and intermediate job output to HDFS. The final Spark job writes its results to Cloud Storage.
https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#choose_storage_options

Comment 4

ID: 1058355 User: squishy_fishy Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 29 Apr 2024 23:41 Selected Answer: - Upvotes: 1

The correct answer is B.

Comment 5

ID: 1015476 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:56 Selected Answer: B Upvotes: 1

Disk I/O Performance: In a Cloud Dataproc cluster, the default setup uses local persistent disks for HDFS storage. These disks offer good disk I/O performance and are well-suited for storing intermediate data generated during Hadoop jobs.

Data Locality: Storing intermediate data on native HDFS allows for better data locality. This means that the data is stored on the same nodes where computation occurs, reducing the need for data transfer over the network. This can significantly improve the performance of disk I/O-intensive jobs.

Scalability: Cloud Dataproc clusters can be easily scaled up or down to meet the specific requirements of your jobs. You can allocate additional disk space as needed to accommodate the intermediate data generated by this particular Hadoop job.

Comment 6

ID: 1009176 User: DeepakVenkatachalam Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 18:46 Selected Answer: - Upvotes: 1

Correct: A
I'd choose A, as the doc states that adding more SSDs is good for disk-intensive jobs, especially those with many individual read and write operations.
https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs

Comment 6.1

ID: 1012723 User: DeepakVenkatachalam Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 04:29 Selected Answer: - Upvotes: 1

Typo: the correct answer is B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS.

Comment 7

ID: 985685 User: arien_chen Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 12:33 Selected Answer: A Upvotes: 1

I would choose A.

Google Storage is faster than HDFS in many cases.

https://cloud.google.com/architecture/hadoop#:~:text=It%27s%20faster%20than%20HDFS%20in%20many%20cases.

The question mentions "(8-core nodes with 100-GB RAM)" for the on-premises Hadoop.
The problem may be caused by insufficient memory,
and cost is not mentioned as an issue,
so the A "memory" approach would be the better option.

Comment 8

ID: 963263 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 05:31 Selected Answer: B Upvotes: 1

Best option is B. However allocating sufficient persistent disk space to the Hadoop cluster, and storing the intermediate data of that particular Hadoop job on native HDFS, would not improve the performance of the Hadoop job. In fact, it might even slow down the Hadoop job, as the data would have to be read and written to disk twice.

Comment 9

ID: 677249 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 23 Mar 2023 17:40 Selected Answer: B Upvotes: 1

https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs#choosing_primary_disk_options

Comment 10

ID: 653694 User: rrr000 Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 02:15 Selected Answer: - Upvotes: 3

B is not the right answer. The problem says that for intermediate data cloud storage is to be used, while B option says:

B ... the intermediate data of that particular Hadoop job on native HDFS

A is the right answer. If you have enough memory then the shuffle won't spill to disk.

Comment 10.1

ID: 653695 User: rrr000 Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 02:16 Selected Answer: - Upvotes: 2

Further, the question states that the original on-prem machines have 100 GB RAM:
8-core nodes with 100-GB RAM

Comment 11

ID: 520725 User: SoerenE Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sun 10 Jul 2022 07:57 Selected Answer: - Upvotes: 1

B should be the right answer: https://cloud.google.com/compute/docs/disks/performance#optimize_disk_performance

Comment 12

ID: 519575 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 14:15 Selected Answer: B Upvotes: 3

https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs

Comment 13

ID: 487260 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Thu 26 May 2022 10:44 Selected Answer: - Upvotes: 3

Why are there so many wrong answers? ExamTopics, are you enjoying paid subscriptions while giving random answers from people?
Ans: B

Comment 14

ID: 307088 User: RT30 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Fri 10 Sep 2021 10:32 Selected Answer: - Upvotes: 3

If your job is disk-intensive and is executing slowly on individual nodes, you can add more primary disk space. For particularly disk-intensive jobs, especially those with many individual read and write operations, you might be able to improve operation by adding local SSDs. Add enough SSDs to contain all of the space you need for local execution. Your local execution directories are spread across however many SSDs you add.
Its B
https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs

Comment 15

ID: 251439 User: ashuchip Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 24 Jun 2021 04:29 Selected Answer: - Upvotes: 2

yes B is correct

Comment 16

ID: 163573 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 14:49 Selected Answer: - Upvotes: 3

B is correct: disk I/O.

Comment 17

ID: 148775 User: Archy Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Tue 02 Feb 2021 01:23 Selected Answer: - Upvotes: 4

B. This job is high on I/O; local HDFS on disk is the best option.

107. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 28

Sequence
315
Discussion ID
79388
Source URL
https://www.examtopics.com/discussions/google/view/79388-exam-professional-data-engineer-topic-1-question-28/
Posted By
arthur2385
Posted At
Sept. 2, 2022, 1:45 p.m.

Question

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
(image: code snippet not reproduced)
You want to improve the performance of this data read. What should you do?

  • A. Specify the TableReference object in the code.
  • B. Use .fromQuery operation to read specific fields from the table.
  • C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
  • D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 657369 User: arthur2385 Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 14:45 Selected Answer: - Upvotes: 12

B BigQueryIO.read.fromQuery() executes a query and then reads the results received after the query execution. Therefore, this function is more time-consuming, given that it requires that a query is first executed (which will incur in the corresponding economic and computational costs).

Comment 2

ID: 689207 User: maxdataengineer Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Sat 08 Apr 2023 13:00 Selected Answer: - Upvotes: 7

Since we want to analyze data from a new ML feature (column), we only need to read values from that column. By doing fromQuery("SELECT featureColumn FROM table")
we optimize cost and performance, since we are not reading all columns.

https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_select_

Comment 2.1

ID: 689208 User: maxdataengineer Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 08 Apr 2023 13:01 Selected Answer: - Upvotes: 2

The answer is B

Comment 2.1.1

ID: 900155 User: cetanx Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 17 Nov 2023 15:54 Selected Answer: - Upvotes: 3

According to Chat GPT, it is also B
In general, if your "primary goal is to reduce the amount of data read and transferred", and the downstream processing mainly focuses on a subset of fields, using .fromQuery to select specific fields would be a good choice.

On the other hand, if you need to simplify downstream processing and optimize resource utilization, transforming data into TableRow objects might be more suitable.

Comment 3

ID: 1098035 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sun 16 Jun 2024 09:46 Selected Answer: B Upvotes: 1

B as BigQueryIO.read.from() directly reads the whole table from BigQuery.
This function exports the whole table to temporary files in Google Cloud Storage, where it will later be read from.
This requires almost no computation, as it only performs an export job, and later Dataflow reads from GCS (not from BigQuery).
BigQueryIO.read.fromQuery() executes a query and then reads the results received after the query execution. Therefore, this function is more time-consuming, given that it requires that a query is first executed (which will incur in the corresponding economic and computational costs).
https://stackoverflow.com/questions/54413681/bigqueryio-read-vs-fromquery

Comment 4

ID: 1076373 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 14:31 Selected Answer: B Upvotes: 1

B works for me

Comment 5

ID: 1075430 User: pue_dev_anon Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 20 May 2024 13:33 Selected Answer: B Upvotes: 1

We are trying to optimize reading each row is not optimal, we want columns

Comment 6

ID: 1050537 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 14:13 Selected Answer: B Upvotes: 5

B. Use the .fromQuery operation to read specific fields from the table.

Using the .fromQuery operation allows you to specify the exact fields you need to read from the table, which can significantly improve performance by reducing the amount of data that needs to be processed. This is particularly important when dealing with large and growing datasets.

Option A (specifying the TableReference object) provides information about the table but doesn't inherently improve the performance of reading specific fields.

Option C (using Google BigQuery TableSchema and TableFieldSchema classes) is related to specifying the schema of the data but doesn't directly address improving the performance of reading specific fields.

Option D (calling a transform that returns TableRow objects) is more about how the data is processed after it's read, not how it's initially read from BigQuery.
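The column-pruning idea behind .fromQuery can be sketched outside BigQuery entirely. The sqlite3 table and column names below are made up for illustration; the point is that projecting one column moves far less data than reading whole rows, which in BigQuery (a columnar store) translates directly into bytes scanned and billed.

```python
import sqlite3

# Hypothetical logs table; names are illustrative, not from the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (user_id TEXT, payload TEXT, feature REAL)")
con.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [(f"u{i}", "x" * 1000, float(i)) for i in range(100)],
)

# Analogue of BigQueryIO.read.from(table): every column of every row.
full_rows = con.execute("SELECT * FROM logs").fetchall()

# Analogue of BigQueryIO.read.fromQuery("SELECT feature FROM logs"):
# only the column the data scientists actually need.
feature_only = con.execute("SELECT feature FROM logs").fetchall()

full_bytes = sum(len(str(r)) for r in full_rows)
narrow_bytes = sum(len(str(r)) for r in feature_only)
assert narrow_bytes < full_bytes  # projection moves far less data
```

Row-oriented sqlite still reads whole rows internally, so this only models the data volume handed to the pipeline; in BigQuery the saving also applies at the storage scan level.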

Comment 7

ID: 1013306 User: emmylou Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 20:27 Selected Answer: - Upvotes: 2

When I have a different answer than the "Correct Answer", I run it through AI and it keeps saying ExamTopics is wrong. Is there any way to know if I am going to pass or fail this exam?

Comment 7.1

ID: 1076368 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 14:27 Selected Answer: - Upvotes: 1

AI is just a LLM model, not a silver bullet at all

Comment 8

ID: 1008770 User: suku2 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 03:23 Selected Answer: B Upvotes: 1

Since the requirement is to read the data for *new* key features in the logs, it makes sense to select only the required columns rather than using the .from() method, which exports the entire BigQuery table.
B makes sense here.

Comment 9

ID: 999763 User: gudguy1a Badges: - Relative Date: 2 years ago Absolute Date: Tue 05 Mar 2024 20:46 Selected Answer: B Upvotes: 1

SHOULD be B.
Not quite sure how D is the correct answer (Red herring....?) when you want to improve the query, which is .fromQuery and NOT transform and PCollection....

Comment 10

ID: 966774 User: odiez3 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 04:26 Selected Answer: - Upvotes: 1

Answer is D. Imagine that you don't have permission on BQ and can't see the table info or anything else about the table; if you are only working with Dataflow, the only way is to transform the data using Apache Beam.

Comment 11

ID: 961301 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 10:57 Selected Answer: B Upvotes: 1

I have seen people explain why B is not right because it doesn't optimize performance but only cost, which is not true, or because fromQuery is still not performant.

I think it's B because no other option is more performant, even if you claim it's not good.

As for option D, the transform given by the description is already a transform that provides as output a PCollection of TableRow objects. So how would that be any different?

https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html

Comment 12

ID: 954256 User: theseawillclaim Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 16:34 Selected Answer: - Upvotes: 1

Why should it be D?
"fromQuery()" allows us to read only the columns we want, I see no point in using a Transform for each row of a "SELECT *", which, moreover, is a bad BQ Practice.

Comment 13

ID: 778732 User: jkh_goh Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 17 Jul 2023 09:19 Selected Answer: B Upvotes: 1

Does BigQuery have a pCollections? I thought it's unique to Apache Beam i.e. Cloud Dataflow

Comment 14

ID: 731587 User: kelvintoys93 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 14:15 Selected Answer: - Upvotes: 4

Guys, how is B the answer? Like all the justifications given here, BigQueryIO.read.fromQuery() is time consuming and the question asked for a better performance solution.

Comment 14.1

ID: 791671 User: Lestrang Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 29 Jul 2023 13:36 Selected Answer: - Upvotes: 2

That part is the docs trying to explain the side effects of using it, however, the part that is important to us is the fact that it reads from a query. "Read" reads the whole table. If we specify a query we can say select col1 only, which makes it all more efficient.

Comment 15

ID: 693068 User: gcm7 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 12 Apr 2023 14:21 Selected Answer: B Upvotes: 6

reading only relevant cols

Comment 16

ID: 688938 User: devaid Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 08 Apr 2023 01:27 Selected Answer: D Upvotes: 1

Answer is D, apparently.

Comment 17

ID: 678576 User: Kowalski Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 25 Mar 2023 11:24 Selected Answer: - Upvotes: 3

Answer is Use .fromQuery operation to read specific fields from the table.

BigQueryIO.read.from() directly reads the whole table from BigQuery. This function exports the whole table to temporary files in Google Cloud Storage, where it will later be read from. This requires almost no computation, as it only performs an export job, and later Dataflow reads from GCS (not from BigQuery).

BigQueryIO.read.fromQuery() executes a query and then reads the results received after the query execution. Therefore, this function is more time-consuming, given that it requires that a query is first executed (which will incur in the corresponding economic and computational costs).

Reference:
https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_select_

108. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 172

Sequence
316
Discussion ID
79521
Source URL
https://www.examtopics.com/discussions/google/view/79521-exam-professional-data-engineer-topic-1-question-172/
Posted By
AWSandeep
Posted At
Sept. 2, 2022, 7:48 p.m.

Question

You are analyzing the price of a company's stock. Every 5 seconds, you need to compute a moving average of the past 30 seconds' worth of data. You are reading data from Pub/Sub and using DataFlow to conduct the analysis. How should you set up your windowed pipeline?

  • A. Use a fixed window with a duration of 5 seconds. Emit results by setting the following trigger: AfterProcessingTime.pastFirstElementInPane().plusDelayOf (Duration.standardSeconds(30))
  • B. Use a fixed window with a duration of 30 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow().plusDelayOf (Duration.standardSeconds(5))
  • C. Use a sliding window with a duration of 5 seconds. Emit results by setting the following trigger: AfterProcessingTime.pastFirstElementInPane().plusDelayOf (Duration.standardSeconds(30))
  • D. Use a sliding window with a duration of 30 seconds and a period of 5 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow ()

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 5 comments Click to expand

Comment 1

ID: 961974 User: vamgcp Badges: Highly Voted Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 21:27 Selected Answer: D Upvotes: 9

Option D: Sliding Window: Since you need to compute a moving average of the past 30 seconds' worth of data every 5 seconds, a sliding window is appropriate. A sliding window allows overlapping intervals and is well-suited for computing rolling aggregates.

Window Duration: The window duration should be set to 30 seconds to cover the required 30 seconds' worth of data for the moving average calculation.

Window Period: The window period or sliding interval should be set to 5 seconds to move the window every 5 seconds and recalculate the moving average with the latest data.

Trigger: The trigger should be set to AfterWatermark.pastEndOfWindow() to emit the computed moving average results when the watermark advances past the end of the window. This ensures that all data within the window is considered before emitting the result.
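The sliding-window semantics described above can be checked with a plain-Python sketch (no Beam required). The function below is an illustrative stand-in for what `SlidingWindows(size=30, period=5)` plus an after-watermark trigger would compute over (timestamp, price) pairs; the function name and its batch-style shape are assumptions, not Beam API.

```python
def moving_averages(events, window=30, period=5):
    """Emit (window_end, average) every `period` seconds over the
    trailing `window` seconds of (timestamp, price) events.
    Pure-Python sketch of sliding-window semantics; a real pipeline
    would use Beam's SlidingWindows with size=30 and period=5."""
    events = sorted(events)                      # order by event time
    out = []
    end = period
    last = max(t for t, _ in events)
    while end <= last + period:
        # Half-open window [end - window, end), sliding by `period`.
        in_window = [p for t, p in events if end - window <= t < end]
        if in_window:
            out.append((end, sum(in_window) / len(in_window)))
        end += period
    return out
```

Note that consecutive windows share 25 seconds of data: each tick every 5 seconds re-averages the trailing 30 seconds, which is exactly the moving average the question asks for.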

Comment 2

ID: 657674 User: AWSandeep Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 19:48 Selected Answer: D Upvotes: 7

D. Use a sliding window with a duration of 30 seconds and a period of 5 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow ()

Comment 3

ID: 1230759 User: Anudeep58 Badges: Most Recent Relative Date: 1 year, 9 months ago Absolute Date: Sat 15 Jun 2024 05:30 Selected Answer: D Upvotes: 1

Option D is the correct configuration because it uses a sliding window of 30 seconds with a period of 5 seconds, ensuring that the moving average is computed every 5 seconds based on the past 30 seconds of data. The trigger AfterWatermark.pastEndOfWindow() ensures timely and accurate results are emitted as the watermark progresses.

Comment 4

ID: 1085214 User: Kimich Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 01 Dec 2023 13:49 Selected Answer: - Upvotes: 2

AfterWatermark is an essential triggering condition in Dataflow that allows computations to be triggered based on event time rather than processing time. That eliminates A & C. Comparing B & D: B would emit a result only every 30 seconds, which is not what we want.

D. Using a sliding window with a duration of 30 seconds and a period of 5 seconds, with the trigger AfterWatermark.pastEndOfWindow(), produces results every 5 seconds, and each result covers data from the past 30 seconds. In other words, every 5 seconds you get the average of the most recent 30 seconds' data, with consecutive windows sliding by 5 seconds. This is what we want.

Comment 5

ID: 663421 User: pluiedust Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 11:20 Selected Answer: D Upvotes: 4

Moving average ——> sliding window

109. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 111

Sequence
319
Discussion ID
17248
Source URL
https://www.examtopics.com/discussions/google/view/17248-exam-professional-data-engineer-topic-1-question-111/
Posted By
-
Posted At
March 22, 2020, 1:15 p.m.

Question

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the
Data Science team runs a query filtered on a date column and limited to 30-90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.
What should you do?

  • A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
  • B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
  • C. Modify your pipeline to maintain the last 30-90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
  • D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 21 comments Click to expand

Comment 1

ID: 222497 User: arghya13 Badges: Highly Voted Relative Date: 3 years, 9 months ago Absolute Date: Thu 19 May 2022 06:40 Selected Answer: - Upvotes: 5

I will go with Option A

Comment 2

ID: 185008 User: SteelWarrior Badges: Highly Voted Relative Date: 3 years, 11 months ago Absolute Date: Wed 23 Mar 2022 07:58 Selected Answer: - Upvotes: 5

Should be A. With partitions the performance will improve for selecting 30-90 days data. Also the storage cost will reduce as the old partitions (not updated in last 90 days) will qualify for Long-Term storage rates.
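Option A's mechanism can be sketched in plain Python: with a date-partitioned layout, a 30-90-day date filter touches only the matching partitions instead of the whole table. The DDL string follows BigQuery's documented `PARTITION BY DATE(...)` syntax, but the dataset, table, and column names are illustrative, and the dict below is only a toy model of partition pruning.

```python
from datetime import date, timedelta

# Date-partitioned DDL as option A recommends (illustrative names):
DDL = """
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_ts) AS
SELECT * FROM mydataset.events
"""

# Toy model of partition pruning: rows grouped by day, 10 rows per day.
partitions = {
    date(2020, 1, 1) + timedelta(days=d): [f"row-{d}-{i}" for i in range(10)]
    for d in range(365)
}

def scan(partitions, start=None, end=None):
    """Return the rows scanned; with a date filter, only matching
    partitions are read, which is roughly what BigQuery bills for."""
    rows = []
    for day, day_rows in partitions.items():
        if start and day < start:
            continue
        if end and day >= end:
            continue
        rows.extend(day_rows)
    return rows

full = scan(partitions)                                       # full table scan
month = scan(partitions, date(2020, 6, 1), date(2020, 7, 1))  # pruned scan
```

Here the pruned scan reads 300 rows instead of 3,650, which is the cost saving the commenters describe; long-term storage pricing on cold partitions is a separate, additional saving.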

Comment 3

ID: 738233 User: odacir Badges: Most Recent Relative Date: 1 year, 9 months ago Absolute Date: Fri 07 Jun 2024 18:38 Selected Answer: A Upvotes: 1

Answer: A. There is no cost to reload the data, and partitioning is the solution for reducing both cost and time.

Comment 4

ID: 673958 User: John_Pongthorn Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 20 Mar 2024 11:58 Selected Answer: A Upvotes: 1

It is certainly not D in the sense of cost-effectiveness; read the limitation below.
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#limitations
Currently, cached results are not supported for queries against multiple tables using a wildcard even if the Use Cached Results option is checked. If you run the same wildcard query multiple times, you are billed for each query.

Comment 5

ID: 673957 User: John_Pongthorn Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 20 Mar 2024 11:52 Selected Answer: A Upvotes: 1

https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard
Partitioning is recommended over table sharding, because partitioned tables perform better

Comment 6

ID: 668621 User: John_Pongthorn Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 08:07 Selected Answer: A Upvotes: 1

A and D are the most likely choices, but the question wants to resolve the
issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.
A single partitioned table is likely cheaper, so partitioning is better than wildcards.

Comment 7

ID: 591753 User: Didine_22 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 15:50 Selected Answer: A Upvotes: 2

answer A

Comment 8

ID: 518518 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 19:25 Selected Answer: A Upvotes: 2

https://cloud.google.com/bigquery/docs/partitioned-tables

Comment 9

ID: 516247 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 05:54 Selected Answer: A Upvotes: 1

A. Partiotioning
https://cloud.google.com/bigquery/docs/partitioned-tables

Comment 10

ID: 513451 User: Tomi1313 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 14:43 Selected Answer: - Upvotes: 2

Why not D? You can use SQL.
This is the cheapest and fastest option
https://cloud.google.com/bigquery/docs/querying-wildcard-tables

Comment 10.1

ID: 683636 User: John_Pongthorn Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 30 Mar 2024 16:59 Selected Answer: - Upvotes: 1

Partitioning is recommended over table sharding, because partitioned tables perform better.
This is Google's recommendation nowadays.

Comment 11

ID: 490676 User: StefanoG Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 12:40 Selected Answer: A Upvotes: 4

Option D is easily discarded.
The requirement is not only for the last 30-90 days, so C is not the right solution either.
In addition, the request asks to keep the ability to run queries, so B is the worst option.
It is not mandatory to run queries while you make the change, so the right answer is A.

Comment 12

ID: 475024 User: JayZeeLee Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 09 May 2023 19:16 Selected Answer: - Upvotes: 1

B sounds more feasible.
The point is 'historical' data, not a new table/data. Recreating tables from the past three years is a lot of work. Might as well export the table and run analyses there. There is no cost for exporting from BigQuery.

Comment 13

ID: 396958 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 17:34 Selected Answer: - Upvotes: 5

Vote for A

Comment 14

ID: 216354 User: Alasmindas Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 04:20 Selected Answer: - Upvotes: 3

I will go with Option A, although at first instance I felt Option C would be correct.
Option A: partitioning will address both concerns mentioned in the question, i.e. faster queries and reduced cost.
Option C: modifying the data pipeline to store the last 30-90 days of data would have worked if it were stated that only the latest data (30-90 days) is kept and the older data beyond 90 days is moved to the master table. Since that is not mentioned, we would end up having multiple 30-90-day tables in addition to the master table.

Comment 14.1

ID: 294804 User: karthik89 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 20 Aug 2022 05:57 Selected Answer: - Upvotes: 2

But how will you append the data that is older than 90 days into the master table?

Comment 15

ID: 216018 User: Cloud_Enthusiast Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 09 May 2022 15:06 Selected Answer: - Upvotes: 4

Answer is A. Recreating the table with DDL and a new partitioning scheme is easy and does not require any changes to applications that read data from it.

Comment 16

ID: 163073 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 19:21 Selected Answer: - Upvotes: 3

A is correct

Comment 17

ID: 127419 User: Rajuuu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 07:29 Selected Answer: - Upvotes: 2

Partitioning the tables is the key to query improvement.

Comment 17.1

ID: 131916 User: Rajuuu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 11 Jan 2022 09:07 Selected Answer: - Upvotes: 1

I think C will be more cost-effective than A, as recreating the whole table via DDL is more expensive.

Comment 17.1.1

ID: 134271 User: tprashanth Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 13 Jan 2022 22:46 Selected Answer: - Upvotes: 4

No: if a separate table is maintained for the last 30-90 days of data, we end up creating a new table on a daily basis.