Google Professional Data Engineer Storage and Data Modeling

1. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 41

Sequence: 5
Discussion ID: 16659
Source URL: https://www.examtopics.com/discussions/google/view/16659-exam-professional-data-engineer-topic-1-question-41/
Posted By: jvg637
Posted At: March 15, 2020, 1:42 p.m.

Question

MJTelco Case Study -

Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe creating many-to-many relationship between data consumers and provides in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments `" development/test, staging, and production `" to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Technical Requirements -
Ensure secure and efficient transport and storage of telemetry data

✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.

CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day.
Which schema should you use?

A. Rowkey: date#device_id Column data: data_point
B. Rowkey: date Column data: device_id, data_point
C. Rowkey: device_id Column data: date, data_point
D. Rowkey: data_point Column data: device_id, date
E. Rowkey: date#data_point Column data: device_id

Community Answer Votes

A: 20 most voted
C: 18
D: 2

Comments 26 comments Click to expand

Comment 1

ID: 76117 User: itche_scratche Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Sat 18 Apr 2020 17:59 Selected Answer: - Upvotes: 97

None, rowkey should be Device_Id+Date(reverse)

Comment 1.1

ID: 134576 User: Rajuuu Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Tue 14 Jul 2020 07:02 Selected Answer: - Upvotes: 6

A is a better option then other ..though not perfect as you mentioned.

Comment 1.2

ID: 523237 User: sraakesh95 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Fri 14 Jan 2022 02:11 Selected Answer: - Upvotes: 1

Totally agree if we have to avoid hotspotting! , but, incase we need to choose one of the options below, would you be going for A?

Comment 1.3

ID: 401970 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 08 Jul 2021 15:52 Selected Answer: - Upvotes: 4

For READ operation it's is correct. i.e. Date#Device (so that data read from single node) -
For write operation it should be DeviceID#Date (so that data write via multiple nodes)

Comment 1.4

ID: 142235 User: Ankit267 Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Thu 23 Jul 2020 20:34 Selected Answer: - Upvotes: 17

A - Key should be less granular item first to more granular item, there are more devices than date key (every 15 min)

Comment 2

ID: 64259 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 13:42 Selected Answer: - Upvotes: 19

think is A, since “The most common query is for all the data for a given device for a given day”, rowkey should have info for both devcie and date.

Comment 2.1

ID: 441952 User: michaelkhan3 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 09 Sep 2021 13:56 Selected Answer: - Upvotes: 13

Google specifically mentions that it's a bad idea to use a timestamp at the start of a rowkey
https://cloud.google.com/bigtable/docs/schema-design#row-keys-avoid
The answer really should be Device_id#Timestamp but with the answers we were given you would be better off leaving the timestamp out all together

Comment 2.1.1

ID: 753567 User: Whoswho Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 22 Dec 2022 19:24 Selected Answer: - Upvotes: 2

I remember seeing it as well. the answer should be A. (reversed)

Comment 2.1.2

ID: 722620 User: wan2three Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 20 Nov 2022 14:07 Selected Answer: - Upvotes: 5

but it didnt say cant use date, date and timestamp are different

Comment 2.1.2.1

ID: 990598 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 26 Aug 2023 10:34 Selected Answer: - Upvotes: 1

The date is even worse than timestamp for the problem of hot-spotting

Comment 3

ID: 1711582 User: 9a58d2c Badges: Most Recent Relative Date: 1 month ago Absolute Date: Wed 04 Feb 2026 10:03 Selected Answer: C Upvotes: 1

the only option that follows the correct design principle is the C; the one that starts with device_id. If device_id#date were an option, it would be perfect.

Comment 4

ID: 1335053 User: Ronn27 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 31 Dec 2024 23:13 Selected Answer: A Upvotes: 2

Its very confusing but what I found is timebucket concept and day can be used instead of timestamp.

https://cloud.google.com/bigtable/docs/schema-design-time-series#time-buckets

Comment 5

ID: 1318857 User: cloud_rider Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 27 Nov 2024 20:23 Selected Answer: C Upvotes: 3

The correct option should be device_id#Date as it will distribute the load while writing and also be performant while reading. C is the second best option in my understanding as device Id will ensure that data sent by all the devices on a day is distributed between nodes and will not create hotspot.

Comment 6

ID: 1301393 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 22 Oct 2024 06:20 Selected Answer: A Upvotes: 3

I would go to date#device_id. However, i don't find this combination. A should be then chosen.

Comment 7

ID: 1264653 User: cmira123 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Mon 12 Aug 2024 14:50 Selected Answer: - Upvotes: 1

A-https://cloud.google.com/bigtable/docs/schema-design-time-series?hl=es-419#use_tall_and_narrow_tables

Comment 8

ID: 1243377 User: Lenifia Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 15:19 Selected Answer: A Upvotes: 2

showed up in my exam. picked A. passed the exam. still not sure it's correct though

Comment 9

ID: 1213871 User: 39405bb Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 19 May 2024 18:11 Selected Answer: - Upvotes: 1

A. Rowkey: date#device_id Column data: data_point

Explanation:

Optimized for Most Common Query: The most common query is for all data for a given device on a given day. This schema directly matches the query pattern by including both date and device_id in the row key. This enables efficient retrieval of the required data using a single row key prefix scan.
Scalability: As the number of devices and data points increases, this schema distributes the data evenly across nodes in the Bigtable cluster, avoiding hotspots and ensuring scalability.
Data Organization: By storing data points as column values within each row, you can easily add new data points or timestamps without modifying the table structure.

Comment 10

ID: 1212797 User: mark1223jkh Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 17 May 2024 10:50 Selected Answer: - Upvotes: 1

Answer C:
https://cloud.google.com/bigtable/docs/schema-design#time-based:~:text=Don%27t%20use%20a%20timestamp%20by%20itself%20or%20at%20the%20beginning%20of%20a%20row%20key%2C

Comment 11

ID: 1166641 User: 0725f1f Badges: - Relative Date: 2 years ago Absolute Date: Tue 05 Mar 2024 18:20 Selected Answer: C Upvotes: 2

c without any doubt

Comment 12

ID: 1140358 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 04 Feb 2024 18:24 Selected Answer: - Upvotes: 1

The right answer should be Reverse A, but since we don't have that, the best answer is C.

Comment 13

ID: 1126660 User: gise Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 19 Jan 2024 13:32 Selected Answer: C Upvotes: 3

C. This schema is best suited for historical analysis of device data over time when the most common query is to retrieve all data for a **specific device** on a **given day**.

* **Row Key as `device_id`:** This allows for efficient retrieval of all data points related to a particular device in a single operation. Bigtable sorts data lexicographically by row key, so all data for a single device will be stored together.

* **Column with `date` and `data_point`:**
- Using `date` as a column name or part of the column qualifier allows you to quickly filter and retrieve data for specific date ranges.
- Storing `data_point` as the column value provides the actual data associated with each timestamp.

**Example:**

With this schema, a query to get all data for `device_12345` on `2023-12-20` would efficiently target the specific row key `device_12345` and fetch the relevant columns (with dates around `2023-12-20`).

Comment 14

ID: 1096145 User: JonFrow Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 14 Dec 2023 08:37 Selected Answer: - Upvotes: 1

C - the answer should the right answer.
Key is "all the data for a given device for a given day"
as in, Device first, and all the data + data points after.
This has nothing to do with Date-based search.

Comment 15

ID: 1065335 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 08 Nov 2023 06:38 Selected Answer: A Upvotes: 1

A - Key should be less granular item first to more granular item, there are more devices than date key (every 15 min)

Comment 16

ID: 1027681 User: imran79 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 08 Oct 2023 04:27 Selected Answer: - Upvotes: 1

the closest match to this in the provided options is:

C. Rowkey: device_id Column data: date, data_point

Thus, option C would be the best choice from the given option

Comment 17

ID: 906801 User: kenwilliams Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 19:05 Selected Answer: A Upvotes: 3

It all comes down to the most common query

Comment 17.1

ID: 990599 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 26 Aug 2023 10:37 Selected Answer: - Upvotes: 1

Exactly
"all the data for a given device for a given day"
That's why the answer is C. You start by selecting the device and then the date. This solution is not prone to hot-spotting, yours is.

2. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 21

Sequence: 14
Discussion ID: 16929
Source URL: https://www.examtopics.com/discussions/google/view/16929-exam-professional-data-engineer-topic-1-question-21/
Posted By: -
Posted At: March 18, 2020, 4:13 p.m.

Question

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiency?

A. Assign global unique identifiers (GUID) to each data entry.
B. Compute the hash value of each data entry, and compare it with all historical data.
C. Store each data entry as the primary key in a separate database and apply an index.
D. Maintain a database table to store the hash value and other metadata for each data entry.

Community Answer Votes

A: 44 most voted
D: 33
B: 11

Comments 35 comments Click to expand

Comment 1

ID: 126113 User: dg63 Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Sat 04 Jul 2020 13:24 Selected Answer: - Upvotes: 69

The best answer is "A".
Answer "D" is not as efficient or error-proof due to two reasons
1. You need to calculate hash at sender as well as at receiver end to do the comparison. Waste of computing power.
2. Even if we discount the computing power, we should note that the system is sending inventory information. Two messages sent at different can denote same inventory level (and thus have same hash). Adding sender time stamp to hash will defeat the purpose of using hash as now retried messages will have different timestamp and a different hash.
if timestamp is used as message creation timestamp than that can also be used as a UUID.

Comment 1.1

ID: 202898 User: retax Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Tue 20 Oct 2020 03:06 Selected Answer: - Upvotes: 13

If the goal is to ensure at least ONE of each pair of entries is inserted into the db, then how is assigning a GUID to each entry resolving the duplicates? Keep in mind if the 1st entry fails, then hopefully the 2nd (duplicate) is successful.

Comment 1.1.1

ID: 502054 User: MarcoDipa Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 15 Dec 2021 11:35 Selected Answer: - Upvotes: 5

Answer is D. Using Hash values we can remove duplicate values from a database. Hash values will be same for duplicate data and thus can be easily rejected. Obviously you won't check hash for timestmp.
D is better thatn B because maintaning a different table will reduce cost for hash computation for all historical data

Comment 1.1.1.1

ID: 955216 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 18 Jul 2023 12:13 Selected Answer: - Upvotes: 1

Why can't it be A, where the GUID is a hash value? Why do we need to store the hash with the metadata in a separate database to do the deduplication?

Comment 1.1.2

ID: 392857 User: ralf_cc Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Mon 28 Jun 2021 13:21 Selected Answer: - Upvotes: 12

A - In D, same message with different timestamp will have different hash, though the message content is the same.

Comment 1.1.2.1

ID: 407482 User: omakin Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Fri 16 Jul 2021 01:57 Selected Answer: - Upvotes: 8

Strong Answer is A - in another question on the gcp sample questions: the correct answer to that particular question was "You are building a new real-time data warehouse for your company and will use BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?"
This means you need a "uniqueid" and timestamps to properly dedupe a data.

Comment 1.1.2.1.1

ID: 530397 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 10:05 Selected Answer: - Upvotes: 1

U need a uniqueid but in this scenario, there is none. So u have to calculate by hashing w/ some of the fields in the dataset.

A is assigning guid in processing side will not solve the issue. Cause u will assign diff. ids...

Comment 1.1.2.1.1.1

ID: 786609 User: cetanx Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 15:36 Selected Answer: - Upvotes: 5

Answer - D
Key statement is "Transmitted data includes a payload of several fields and the timestamp of the transmission."

So the timestamp is appended to message while sending, in other words that field is subject to change if message is retransmitted. However, adding a GUID doesn't help much because if message is transmitted twice you will have different GUID for both messages but they will be the same/duplicate data.

You can simply calculate a hash based on not all data but from a select of columns (with the payload of several fields AND definitely by excluding the timestamp). By doing so, you can assure a different hash for each message.

Comment 1.1.2.2

ID: 528632 User: MaxNRG Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 20 Jan 2022 18:48 Selected Answer: - Upvotes: 2

agreed, the key here is "payload of several fields and the timestamp"

Comment 1.1.2.2.1

ID: 528633 User: MaxNRG Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 20 Jan 2022 18:49 Selected Answer: - Upvotes: 2

"payload of several fields and the timestamp of the transmission"

Comment 1.1.2.2.1.1

ID: 537740 User: BigDataBB Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 01 Feb 2022 10:03 Selected Answer: - Upvotes: 1

Hi Max, I also think that the hash value would be worng because the timestamp is part of payload and is not written that the hash value is generated without the ts; but it also not written if GUID is linked or not with sending. Often this is a point where the answer is vague. Because don't specify if the GUID is related to the data or to the send.

Comment 1.2

ID: 1024081 User: emmylou Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Tue 03 Oct 2023 18:45 Selected Answer: - Upvotes: 5

If you add a unique ID aren't you by definition not getting a duplicate record. Honestly I hate all these answers.

Comment 1.2.1

ID: 1212409 User: billalltf Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 16 May 2024 14:04 Selected Answer: - Upvotes: 1

You can add a function or condition that verifies if the global unique id already exists or just do a deduplication later

Comment 2

ID: 516514 User: medeis_jar Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 12:20 Selected Answer: A Upvotes: 7

Transmitted data includes fields and timestamp of transmission.
So, hash value changes with re-transmission ==> Option B & D are wrong.

Comment 2.1

ID: 530401 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 10:12 Selected Answer: - Upvotes: 1

U dont have to put timestamp in your hash algorithm.

Comment 2.1.1

ID: 568743 User: JK007 Badges: - Relative Date: 3 years, 12 months ago Absolute Date: Wed 16 Mar 2022 04:03 Selected Answer: - Upvotes: 3

But this question is not saying how you are calculating the hash algo. It says hash of "data entry" which includes other fields and timestamp field. So taking hash of the entire data entry the hash value will be different as timestamp will be different each time.

Comment 2.2

ID: 529719 User: exnaniantwort Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sat 22 Jan 2022 10:18 Selected Answer: - Upvotes: 1

The clearest explanation why D is wrong.

Comment 2.2.1

ID: 530406 User: Tanzu Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sun 23 Jan 2022 10:17 Selected Answer: - Upvotes: 2

Hashing is inevitable. So only b or d is appropriate.
The main diff. b/w them is persistence. At least 6+6 hours, u need to persist the hashes in either storage or a table.

The is no persistence in b. So d is more ok then ..

Comment 3

ID: 1702979 User: arwa_eiad Badges: Most Recent Relative Date: 2 months, 1 week ago Absolute Date: Thu 01 Jan 2026 01:42 Selected Answer: D Upvotes: 1

Why not the others:
A. GUIDs: You can only assign GUIDs if the source system supports it. Here, the system retransmits without changing the payload, so GUIDs won't help unless added upstream.
B. Compute hash and compare with all historical data: This is inefficient because you'd need to scan all previous entries every time.
C. Store each data entry as the primary key: Payloads can be large and variable, making them unsuitable as primary keys. Also, indexing on large text fields is costly.

Comment 4

ID: 1699963 User: lmch Badges: - Relative Date: 2 months, 3 weeks ago Absolute Date: Wed 17 Dec 2025 00:42 Selected Answer: D Upvotes: 1

Option D correctly specifies that the hash value must be stored in a database table (or a high-speed key/value store like Redis/Memcached) with an index applied to it. This allows the ingestion service to perform a single, fast primary key lookup using the hash.

Comment 5

ID: 1607186 User: Pulkit2706 Badges: - Relative Date: 6 months ago Absolute Date: Mon 08 Sep 2025 07:04 Selected Answer: A Upvotes: 1

The correct answer is: Assign global unique identifiers (GUID) to each data entry. This method allows highly efficient deduplication because each entry carries a unique ID, making it easy to identify and ignore duplicates during ingestion or later analysis.

Assigning a GUID to each entry ensures that every record—even across re-transmissions—has a unique identifier. At ingestion, the deduplication system can simply filter by GUID, drastically reducing the need for costly comparisons or hash storage. This approach avoids computational overhead and is scalable for frequent data transmissions.

Alternative methods like comparing hashes with all historical data, indexing the full data payload, or maintaining tables of hashes and metadata are less efficient due to increased computational, storage, or operational complexity. Using GUIDs is the simplest and most scalable solution for deduplication in this scenario.

Comment 6

ID: 1604926 User: Bugnumber1 Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Sun 31 Aug 2025 20:51 Selected Answer: D Upvotes: 1

This question is bullsh*t, because the correct answer would be "A", but for that you have to imagine that this GUID gets added to the original message. Instead, it says "assign". For me that sounds like when the message arrives, you assign it an ID, which is utterly useless for deduplication.

D is not efficient, but is the most efficient of the bunch here, plus, it's not that different from what "A" does. In fact, once you do "A", you still have to store that in a database and look it up, but it doesn't say that anywhere in the answer.

Me? I believe "A" it's a trap. But as we'll never get official confirmation, feel free to believe it's just a question where you assume "A" has all the magic things that make deduplication happen.

Comment 7

ID: 1601729 User: 1479 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 14:37 Selected Answer: B Upvotes: 1

Maintain a table with hash can be painfull and we can write it in hystorical data.

Comment 8

ID: 1600159 User: Surabhi20 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Wed 20 Aug 2025 12:48 Selected Answer: A Upvotes: 1

timestamp may be changed while re-transmitting the same message

Comment 9

ID: 1590172 User: jsg Badges: - Relative Date: 7 months, 3 weeks ago Absolute Date: Fri 25 Jul 2025 03:06 Selected Answer: D Upvotes: 1

Answer-D , A- Assign GUIDs: This only works if the source system can guarantee unique IDs per original transmission. But if re-transmissions recreate the GUID, or if the system doesn't assign one, this fails.

Comment 10

ID: 1578243 User: Annie00000 Badges: - Relative Date: 8 months, 4 weeks ago Absolute Date: Tue 17 Jun 2025 10:54 Selected Answer: A Upvotes: 2

Assigning a GUID (Globally Unique Identifier) at the SOURCE system to each payload ensures:
Idempotency: You can identify if a record has already been processed.
Efficiency: A simple lookup on the GUID is much faster than comparing hashes or checking entire payloads.
Scalability: Works well even when millions of records are transmitted.

Comment 10.1

ID: 1601725 User: 1479 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 14:35 Selected Answer: - Upvotes: 1

When you compare GUID, your system will hash it first ...
Compute can only copare digits

Comment 11

ID: 1570062 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Mon 19 May 2025 07:20 Selected Answer: B Upvotes: 2

Use hashing (answer B) for efficient, scalable, and fast deduplication — it's a well-established best practice in streaming and batch data pipelines.

Comment 11.1

ID: 1601724 User: 1479 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 14:34 Selected Answer: - Upvotes: 1

Why maintain an hastable as you can add it directly in a new column ? B better than D.
GUID is poorly designed as it need to hash the GUID's to compare it ... waste of compute

Comment 12

ID: 1570061 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Mon 19 May 2025 07:17 Selected Answer: D Upvotes: 1

Even If you defien GUID. How will the source know about it? he hold no such ID.

Comment 13

ID: 1559574 User: fassil Badges: - Relative Date: 11 months ago Absolute Date: Thu 10 Apr 2025 14:19 Selected Answer: D Upvotes: 1

A is incorrect. how can you find duplicates if you assign a unique id to every record? The answer is D.

Comment 14

ID: 1426961 User: Mo5454545454 Badges: - Relative Date: 11 months, 1 week ago Absolute Date: Thu 03 Apr 2025 11:34 Selected Answer: D Upvotes: 1

The most efficient way to deduplicate your inventory data would be:
D. Maintain a database table to store the hash value and other metadata for each data entry.
This approach is optimal because:

It creates a lightweight reference table that stores just the hash values and essential metadata (like timestamps) rather than the full payload data
Hash values can be quickly compared to identify duplicates without expensive full-data comparisons
The metadata can help with auditing and troubleshooting transmission issues
This solution scales well as your data volume grows

Option A (using GUIDs) doesn't address the retransmission scenario well, as new GUIDs might be generated each time. Option B requires comparing against all historical data, which becomes increasingly inefficient over time. Option C creates unnecessary storage overhead by using entire data entries as primary keys when only a hash value is needed for comparison.

Comment 15

ID: 1398865 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 15:09 Selected Answer: D Upvotes: 2

Deduplicate data with retransmissions. Use a database table with hash

Comment 16

ID: 1364868 User: Abizi Badges: - Relative Date: 1 year ago Absolute Date: Tue 04 Mar 2025 12:09 Selected Answer: A Upvotes: 1

most obvious answer

Comment 17

ID: 1331533 User: Rav761 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 25 Dec 2024 13:06 Selected Answer: D Upvotes: 1

Option D: Maintain a database table to store the hash value and other metadata for each data entry.

This approach is efficient and scalable. By storing a computed hash value (as a compact representation of the data) along with metadata, deduplication can be performed by comparing new entries with the stored hashes. This minimizes storage requirements and improves lookup efficiency.

3. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 159

Sequence: 19
Discussion ID: 17213
Source URL: https://www.examtopics.com/discussions/google/view/17213-exam-professional-data-engineer-topic-1-question-159/
Posted By: -
Posted At: March 22, 2020, 7:42 a.m.

Question

You need to choose a database for a new project that has the following requirements:
✑ Fully managed
✑ Able to automatically scale up
✑ Transactionally consistent
✑ Able to scale up to 6 TB
✑ Able to be queried using SQL
Which database do you choose?

A. Cloud SQL
B. Cloud Bigtable
C. Cloud Spanner
D. Cloud Datastore

Community Answer Votes

C: 33 most voted
A: 29

Comments 21 comments Click to expand

Comment 1

ID: 68173 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Wed 25 Mar 2020 17:45 Selected Answer: - Upvotes: 37

Correct: A
It asks for scaling up which can be done in cloud sql, horizontal scaling is not possible in cloud sql
Automatic storage increase
If you enable this setting, Cloud SQL checks your available storage every 30 seconds. If the available storage falls below a threshold size, Cloud SQL automatically adds additional storage capacity. If the available storage repeatedly falls below the threshold size, Cloud SQL continues to add storage until it reaches the maximum of 30 TB.

Comment 1.1

ID: 168458 User: google_learner123 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Fri 28 Aug 2020 14:54 Selected Answer: - Upvotes: 10

C - CloudSQL does not scale automatically.

Comment 1.1.1

ID: 188979 User: zxing233 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Mon 28 Sep 2020 13:33 Selected Answer: - Upvotes: 5

Cloud SQL can automatically scale up storage capacity when you are near your limit

Comment 1.1.1.1

ID: 690411 User: dmzr Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sun 09 Oct 2022 19:53 Selected Answer: - Upvotes: 4

it does not say about type of scaling, Cloud SQL scale up automatically with storage, that should works

Comment 2

ID: 465745 User: gcp_k Badges: Highly Voted Relative Date: 4 years, 4 months ago Absolute Date: Thu 21 Oct 2021 17:02 Selected Answer: - Upvotes: 9

Answer could be : A

Both Cloud Spanner and Cloud SQL does not have auto scaling feature natively. You need to build automation around it based on metrics.

Both Cloud SQL and Cloud Spanner supports SQL.

With Cloud SQL, you can go up to 10 TB of storage which also satisfies the other requirement.

Consistency - Of course, with Cloud SQL, you have single master and read replicas. So technically the data will be consistent across all instances so to speak.

The reason why I didn't choose Spanner is - There is no requirement for HA DR, multi region, secondary indexes, etc.. So I choose "A"

Comment 3

ID: 1700144 User: lmch Badges: Most Recent Relative Date: 2 months, 3 weeks ago Absolute Date: Wed 17 Dec 2025 23:51 Selected Answer: C Upvotes: 1

Cloud Spanner can automatically scale in both storage and compute.

Comment 4

ID: 1590999 User: arnauredi Badges: - Relative Date: 7 months, 2 weeks ago Absolute Date: Mon 28 Jul 2025 12:46 Selected Answer: A Upvotes: 1

Cloud SQL can scale up to 64To

Comment 5

ID: 1580981 User: Ben_oso Badges: - Relative Date: 8 months, 2 weeks ago Absolute Date: Fri 27 Jun 2025 04:01 Selected Answer: C Upvotes: 1

Correct: C - Cloud Spanner
hint is automatically scale, CloudSQL dont scale

Comment 6

ID: 1361310 User: MBNR Badges: - Relative Date: 1 year ago Absolute Date: Tue 25 Feb 2025 04:44 Selected Answer: C Upvotes: 1

Cloud SQL
Cloud SQL can store up to 30 TB of data.
It offers limited scalability as per the lesser load.
You can easily work with MySQL code in Cloud SQL.
Cloud SQL is a cost-effective service.

Cloud Spanner:
Cloud Spanner is used to store more than 30 TB of data.
It provides better scalability and SLOs.
Cloud Spanner is an expensive service.
It provides strong transactional consistency.
It is built on Google Cloud’s dedicated network that ensures low latency, security, and reliability.

Comment 7

ID: 1352648 User: Siahara Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Thu 06 Feb 2025 21:24 Selected Answer: C Upvotes: 1

C. Cloud Spanner and an explanation of the other options:
Why Cloud Spanner is the best fit:
Fully managed service: Cloud Spanner is fully managed by Google, simplifying database administration.
Automatic scaling It handles scaling seamlessly, both horizontally and vertically.
Transactional consistency: Cloud Spanner is known for its strong transactional consistency, including globally distributed transactions.
Scalable up to 6 TB (and beyond): It easily accommodates your 6 TB requirement and can scale much larger if needed.
SQL Support: Cloud Spanner offers a familiar SQL interface.

Comment 8

ID: 1335510 User: b3e59c2 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 02 Jan 2025 11:54 Selected Answer: C Upvotes: 2

Although the type of automatic scaling isn't specified, Cloud SQL does not allow for outright dynamic capacity auto scaling, so I believe the answer would be C

Comment 9

ID: 1332122 User: Rav761 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 26 Dec 2024 22:59 Selected Answer: C Upvotes: 4

C. Cloud Spanner

Here's why:

Fully Managed: Cloud Spanner is a fully managed database service provided by Google Cloud.

Automatic Scaling: It automatically scales horizontally to handle increased workloads and data volumes.

Transactional Consistency: Cloud Spanner provides strong transactional consistency with support for ACID transactions.

Scalability: It can easily scale up to and beyond 6 TB while maintaining performance and consistency.

SQL Queries: Cloud Spanner supports SQL queries, making it compatible with existing SQL-based analytics and applications.

Comment 10

ID: 1303868 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 08:46 Selected Answer: C Upvotes: 3

I would go to C.

Comment 11

ID: 1291908 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 01 Oct 2024 12:28 Selected Answer: C Upvotes: 2

All the arguments below can essentially be boiled down to two questions. 1) is Cloud SQL fully managed? (yes), 2) Does it autoscale? It depends, is the answer. The question is horrifyingly worded, as it comes down to an ambiguity coinflip. I'm going with C, it feels like it's a better fit.

Comment 12

ID: 1263492 User: JamesKarianis Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 14:57 Selected Answer: A Upvotes: 2

Did not refer to global, thus not spanner. A is correct

Comment 13

ID: 1247653 User: Anudeep58 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 14 Jul 2024 08:47 Selected Answer: C Upvotes: 1

Transactional consistency: spanner provides strong consistency across rows, regions, and continents.

Comment 14

ID: 1163914 User: mothkuri Badges: - Relative Date: 2 years ago Absolute Date: Sat 02 Mar 2024 05:17 Selected Answer: - Upvotes: 1

Answer : A
Cloud SQL and Cloud Spanner are the options for the questions. But here as per requirement they don't need horizontal scaling, they want manager SQL instance and it should support 6 TB of storage. Cloud SQL can support up to 64 TB of storage
https://cloud.google.com/sql/docs/quotas#:~:text=Cloud%20SQL%20storage%20limits,core%3A%20Up%20to%203%20TB.

Comment 15

ID: 1151232 User: cuadradobertolinisebastiancami Badges: - Relative Date: 2 years ago Absolute Date: Thu 15 Feb 2024 19:37 Selected Answer: A Upvotes: 2

Not horizontal scaling is required, cloud SQL will work for 10 TB

Comment 16

ID: 1144035 User: williamvinct Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Thu 08 Feb 2024 05:10 Selected Answer: - Upvotes: 1

i will go with A, since the question doesnt specifically say it should be scaling horizontally

Comment 17

ID: 1082840 User: arturido Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 28 Nov 2023 19:29 Selected Answer: A Upvotes: 2

"Able to scale up to 6 TB" -seems to be the key
it looks like autoscaling is related to storage - possible in case of Cloud SQL

Comment 17.1

ID: 1101302 User: LaxmanTiwari Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 20 Dec 2023 07:25 Selected Answer: - Upvotes: 1

no way , u can automaticly scale the Cloud SQL , please read the documents of Cloud SQL, Spanner is the solution .

4. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 70

Sequence: 21
Discussion ID: 16631
Source URL: https://www.examtopics.com/discussions/google/view/16631-exam-professional-data-engineer-topic-1-question-70/
Posted By: Rajokkiyam
Posted At: March 15, 2020, 6:11 a.m.

Question

You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices. What should you do?

A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.
B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
C. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.
D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.

Community Answer Votes

B: 35 most voted
A: 15

Comments 25 comments Click to expand

Comment 1

ID: 73310 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 11 Apr 2020 14:08 Selected Answer: - Upvotes: 59

B.
The question is focused on designing storage for very large files, with support for compression, ANSI SQL queries, and parallel loading from the input locations. This can be met using GCS for storage and Bigquery permanent tables with external data source in GCS.

Comment 1.1

ID: 146057 User: atnafu2020 Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Tue 28 Jul 2020 21:35 Selected Answer: - Upvotes: 10

why GCS as external since Bigquery can be used as storage as well?

Comment 1.1.1

ID: 146058 User: atnafu2020 Badges: - Relative Date: 5 years, 7 months ago Absolute Date: Tue 28 Jul 2020 21:35 Selected Answer: - Upvotes: 11

A seems correct for me

Comment 1.1.1.1

ID: 162644 User: atnafu2020 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Fri 21 Aug 2020 05:17 Selected Answer: - Upvotes: 4

Since its best practice, i go by with B not A

Comment 1.1.1.2

ID: 308909 User: gopinath_k Badges: - Relative Date: 5 years ago Absolute Date: Fri 12 Mar 2021 15:23 Selected Answer: - Upvotes: 2

They want to store the files if you try with bq I think you will need to strike the word compression.

Comment 1.1.2

ID: 748114 User: jkhong Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 17 Dec 2022 15:24 Selected Answer: - Upvotes: 5

The question focuses on "designing storage", rather than designing a data warehouse.

Comment 2

ID: 66574 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 21 Mar 2020 16:38 Selected Answer: - Upvotes: 15

Should be A

Comment 2.1

ID: 616556 User: tavva_prudhvi Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 15 Jun 2022 08:04 Selected Answer: - Upvotes: 7

Not A : Importing data into BigQuery may take more time compared to creating external tables on data. Additional storage costs by BigQuery is another issue which can be more expensive than Google Storage.

Comment 3

ID: 1699340 User: 50336e5 Badges: Most Recent Relative Date: 2 months, 4 weeks ago Absolute Date: Sun 14 Dec 2025 12:13 Selected Answer: A Upvotes: 1

It's not B because Permanent linked tables it's not recommanded for a large data

Comment 4

ID: 1602412 User: forepick Badges: - Relative Date: 6 months, 2 weeks ago Absolute Date: Mon 25 Aug 2025 19:22 Selected Answer: B Upvotes: 1

The task here is to design a storage for very big FILES, not tables. And only then, these big files should be queries.
So it's B.

Comment 5

ID: 1571235 User: AdriHubert Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Thu 22 May 2025 08:28 Selected Answer: A Upvotes: 1

Here's why:
Cloud Dataflow is a fully managed service for stream and batch data processing. It’s ideal for transforming large text files into a more efficient format like Avro, which supports schema evolution and is optimized for BigQuery ingestion.
Avro is a row-based storage format that supports compression and is well-suited for BigQuery.
BigQuery is Google Cloud’s serverless, highly scalable, and cost-effective multi-cloud data warehouse that supports ANSI SQL.
Using compressed Avro allows for parallel loading into BigQuery, which is a Google-recommended best practice for performance and cost-efficiency.
Why not the others?
B: Using Cloud Storage with permanent external tables in BigQuery is possible, but it’s less performant and flexible than loading the data into native BigQuery storage.
C: Gzip-compressed text files are not as efficient for parallel processing or schema enforcement as Avro.
D: Cloud Bigtable is not designed for SQL queries; it’s a NoSQL wide-column store, and not suitable for ANSI SQL workloads.

Comment 6

ID: 1400769 User: oussama7 Badges: - Relative Date: 11 months, 4 weeks ago Absolute Date: Wed 19 Mar 2025 23:25 Selected Answer: A Upvotes: 1

Avro is a Google-recommended format for BigQuery because it supports schema evolution, efficient compression, and parallel processing. Using Cloud Dataflow ensures scalable transformation, and storing the data in BigQuery allows for optimized ANSI SQL queries.

Comment 7

ID: 1398899 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 16:01 Selected Answer: A Upvotes: 1

Storing large text files with SQL support**

Avro in BigQuery (A) supports compression and efficient queries.

Comment 8

ID: 1364820 User: dcruzado Badges: - Relative Date: 1 year ago Absolute Date: Tue 04 Mar 2025 10:01 Selected Answer: B Upvotes: 1

For me it would make more sense to store the data directly in BQ, but A does not make sense, because why compress to Avro if you dont store the avro files and directly saves the data to BQ?
You are not using the compression

Comment 9

ID: 1357082 User: imarri876 Badges: - Relative Date: 1 year ago Absolute Date: Sun 16 Feb 2025 00:14 Selected Answer: A Upvotes: 1

BigQuery now has physical storage, which makes storage cost fairly cheap on BigQuery with compression. I would go with A.

Comment 10

ID: 1325210 User: deineiveu Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 11 Dec 2024 21:03 Selected Answer: B Upvotes: 1

Gros fichier + sql = GCS + Bigquery

Comment 11

ID: 1273931 User: Nittin Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 28 Aug 2024 11:08 Selected Answer: B Upvotes: 1

copy to gcs and use external tble in bq

Comment 12

ID: 1255232 User: carmltekai Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 22:13 Selected Answer: A Upvotes: 2

Should be A.

Check this link for the advantage of load Avro data to BigQuery https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#advantages_of_avro

"""The Avro binary format:
* Is faster to load. The data can be read in parallel, even if the data blocks are compressed.
* Doesn't require typing or serialization.
* Is easier to parse because there are no encoding issues found in other formats such as ASCII.
When you load Avro files into BigQuery, the table schema is automatically retrieved from the self-describing source data."""

Comment 12.1

ID: 1255233 User: carmltekai Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 22:15 Selected Answer: - Upvotes: 1

While option B can work, it introduces additional complexity by linking Cloud Storage with BigQuery. Directly storing data in BigQuery is more efficient for querying purposes.

There are no requirements about cost, So simple is better

Comment 13

ID: 1185264 User: SK1594 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Fri 29 Mar 2024 07:42 Selected Answer: - Upvotes: 2

B makes sense

Comment 14

ID: 1098715 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 17 Dec 2023 08:42 Selected Answer: B Upvotes: 3

1. Store Avro files in GCS
2. Query them in BigQuery (federated tables)

Comment 15

ID: 911135 User: forepick Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 31 May 2023 13:17 Selected Answer: B Upvotes: 6

Answer is B.
The requirements are:
- storage for compressed text files
- parallel loads to SQL tool

AVRO is a compressed format for text files, which makes it possible to load chunks of a very large file in parallel to BigQuery.

gzip files are seamless in GCS though, but cannot load in parallel to BQ.

Comment 16

ID: 784746 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 22 Jan 2023 23:35 Selected Answer: - Upvotes: 4

Correct Answer:
A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.

This option offers several advantages:

- Transforming the text files to compressed Avro using Cloud Dataflow allows for parallel processing of the input data, improving the efficiency of the pipeline.

- Compressing the data in Avro format further reduces the storage space required and improves data transfer performance.

- Storing the data in BigQuery supports ANSI SQL queries and allows for easy querying of the data.

- BigQuery is a fully-managed data warehousing solution, it's scalable and can handle large datasets and concurrent queries, so it's suitable for large text files.

Comment 16.1

ID: 784747 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 22 Jan 2023 23:35 Selected Answer: - Upvotes: 1

Option B is similar to option A but it's using a permanent linked table between Cloud Storage and BigQuery, this approach is not recommended as it's not efficient and could lead to data duplication, and it doesn't take advantage of the parallel processing capabilities of Cloud Dataflow.

Option C and D are incorrect because they don't take advantage of the parallel processing capabilities of Cloud Dataflow, and they don't use Avro format for compression which is more efficient and recommended by Google. Storing the data in Cloud Bigtable also doesn't support ANSI SQL queries which is a requirement for this use case.

Comment 17

ID: 748123 User: jkhong Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 17 Dec 2022 15:32 Selected Answer: B Upvotes: 3

Designing storage solution, not data warehousing -> So Cloud Storage.

Support compression -> just use Avro
Parallel load -> refers to upload from input locations, NOT download.

Load in parallel using -m flag for gsutil cp

https://cloud.google.com/storage/docs/uploads-downloads#console

5. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 188

Sequence: 25
Discussion ID: 79606
Source URL: https://www.examtopics.com/discussions/google/view/79606-exam-professional-data-engineer-topic-1-question-188/
Posted By: AWSandeep
Posted At: Sept. 2, 2022, 11:06 p.m.

Question

Your startup has a web application that currently serves customers out of a single region in Asia. You are targeting funding that will allow your startup to serve customers globally. Your current goal is to optimize for cost, and your post-funding goal is to optimize for global presence and performance. You must use a native
JDBC driver. What should you do?

A. Use Cloud Spanner to configure a single region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.
B. Use a Cloud SQL for PostgreSQL highly available instance first, and Bigtable with US, Europe, and Asia replication after securing funding.
C. Use a Cloud SQL for PostgreSQL zonal instance first, and Bigtable with US, Europe, and Asia after securing funding.
D. Use a Cloud SQL for PostgreSQL zonal instance first, and Cloud SQL for PostgreSQL with highly available configuration after securing funding.

Community Answer Votes

A: 35 most voted
D: 20

Comments 23 comments Click to expand

Comment 1

ID: 657843 User: AWSandeep Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 23:06 Selected Answer: A Upvotes: 11

A. Use Cloud Spanner to configure a single region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.

When you create a Cloud Spanner instance, you must configure it as either regional (that is, all the resources are contained within a single Google Cloud region) or multi-region (that is, the resources span more than one region).

You can change the instance configuration to multi-regional (or global) at anytime.

Comment 2

ID: 886148 User: izekc Badges: Highly Voted Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 13:17 Selected Answer: D Upvotes: 9

Although A is good, but concerning about the cost. Then D will be much more suitable

Comment 3

ID: 1626231 User: b2aaace Badges: Most Recent Relative Date: 3 months, 3 weeks ago Absolute Date: Mon 17 Nov 2025 00:01 Selected Answer: D Upvotes: 1

Native JDBC → eliminates Bigtable (B & C)
Cost optimization initially → eliminates Spanner (A)
Global expansion later → Cloud SQL can add read replicas

Comment 4

ID: 1304103 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 19:32 Selected Answer: D Upvotes: 2

I go to D. Cloud SQL is usually used for web application (CRM) (https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained?hl=en)

Comment 5

ID: 1285930 User: 4a8ffd7 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 18 Sep 2024 19:10 Selected Answer: D Upvotes: 1

Although A is good, but concerning about the cost. Then D will be much more suitable

Comment 6

ID: 1190126 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 06 Apr 2024 00:25 Selected Answer: D Upvotes: 1

Although A is good, but concerning about the cost. Then D will be much more suitable

Comment 6.1

ID: 1336818 User: Ronn27 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 05 Jan 2025 17:57 Selected Answer: - Upvotes: 2

While a zonal instance is cost-effective, transitioning to a highly available Cloud SQL instance does not support global replication. Cloud SQL lacks the scalability and global presence needed for your post-funding goals.

So I beleive Spanner is the right answer

Comment 7

ID: 1123221 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 10:27 Selected Answer: D Upvotes: 1

I think is D

The best for Web app is Cloud SQL, and Spanner is the best for data more than 30GB

Comment 7.1

ID: 1336817 User: Ronn27 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 05 Jan 2025 17:56 Selected Answer: - Upvotes: 1

While a zonal instance is cost-effective, transitioning to a highly available Cloud SQL instance does not support global replication. Cloud SQL lacks the scalability and global presence needed for your post-funding goals.

Comment 8

ID: 1102375 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 21 Dec 2023 12:03 Selected Answer: A Upvotes: 6

A - This option allows for optimization for cost initially with a single region Cloud Spanner instance, and then optimization for global presence and performance after funding with multi-region instances.
Cloud Spanner supports native JDBC drivers and is horizontally scalable, providing very high performance. A single region instance minimizes costs initially. After funding, multi-region instances can provide lower latency and high availability globally.
Cloud SQL does not scale as well and has higher costs for multiple high availability regions. Bigtable does not support JDBC drivers natively. Therefore, Spanner is the best choice here for optimizing both for cost initially and then performance and availability globally post-funding.

Comment 9

ID: 847809 User: lucaluca1982 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 23 Mar 2023 05:40 Selected Answer: - Upvotes: 3

Spanner has some limitations with JDBC. Maybe the quetion wants to help us tp choose Cloud SQL

Comment 10

ID: 821030 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Sat 25 Feb 2023 00:51 Selected Answer: - Upvotes: 1

Answer D:
Cost effective transactional database Cloud SQL. Spanner is good case for data more than 30 GB

Comment 11

ID: 739961 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 09 Dec 2022 10:35 Selected Answer: A Upvotes: 7

B and C has no sense because of the driver.
D looks like a good option, but HA it's not to improve performance or global presence:
The purpose of an HA configuration is to reduce downtime when a zone or instance becomes unavailable. This might happen during a zonal outage, or when an instance runs out of memory. With HA, your data continues to be available to client applications.
So the best option is A.

Comment 12

ID: 675780 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 22 Sep 2022 08:18 Selected Answer: A Upvotes: 5

https://cloud.google.com/spanner/docs/jdbc-drivers
Ans A
https://cloud.google.com/spanner/docs/instance-configurations#tradeoffs_regional_versus_multi-region_configurations
The last part of the question makes it easy

Comment 13

ID: 667702 User: TNT87 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 13 Sep 2022 08:07 Selected Answer: - Upvotes: 3

Yes Spanner is expensive , but the question expresslty states that "after securing funding you want to have a global presence" the word globally is repeatedly stated there.
Answer is A.

Comment 13.1

ID: 669579 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 15 Sep 2022 09:13 Selected Answer: - Upvotes: 1

https://cloud.google.com/spanner/docs/instance-configurations#tradeoffs_regional_versus_multi-region_configurations
Ans A

Comment 14

ID: 666660 User: badrisrinivas9 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 08:35 Selected Answer: D Upvotes: 3

Spanner is expensive, they haven't mentioned the size of db... optimize for cost then option is Cloud SQL which cost effective and highly available in case of multi region.

Comment 15

ID: 664270 User: Quevedo Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 09 Sep 2022 08:00 Selected Answer: A Upvotes: 3

A is the best option. It is globally scalable and it also meets the cost goal as it says that initially it will be configurated as single region wich is cheaper than multi region.

Comment 16

ID: 661105 User: YorelNation Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 06 Sep 2022 12:47 Selected Answer: D Upvotes: 2

Spanner is expensive can't be A

Would choose D

Comment 16.1

ID: 661114 User: YorelNation Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 06 Sep 2022 12:53 Selected Answer: - Upvotes: 1

Actually maybe C as you don't really need relational database for a webapp and BigTable is super performant and highly available

Comment 16.1.1

ID: 688297 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Fri 07 Oct 2022 06:53 Selected Answer: - Upvotes: 1

no its A

Comment 16.2

ID: 669586 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 15 Sep 2022 09:16 Selected Answer: - Upvotes: 2

The fact that its global cloud spanner is the answer. Secondly Option D, the fact that it has to be highly avaible and multi regional its already more expensive than Cloud spanner Regional instance. https://cloud.google.com/spanner/docs/instance-configurations#tradeoffs_regional_versus_multi-region_configurations

Comment 17

ID: 657953 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 03:36 Selected Answer: A Upvotes: 3

Spanner still support JDBC
https://cloud.google.com/spanner/docs/jdbc-drivers

6. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 18

Sequence: 36
Discussion ID: 16654
Source URL: https://www.examtopics.com/discussions/google/view/16654-exam-professional-data-engineer-topic-1-question-18/
Posted By: jvg637
Posted At: March 15, 2020, 12:32 p.m.

Question

Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)

A. Supervised learning to determine which transactions are most likely to be fraudulent.
B. Unsupervised learning to determine which transactions are most likely to be fraudulent.
C. Clustering to divide the transactions into N categories based on feature similarity.
D. Supervised learning to predict the location of a transaction.
E. Reinforcement learning to predict the location of a transaction.
F. Unsupervised learning to predict the location of a transaction.

Community Answer Votes

BCD: 26 most voted
ABC: 7
BCE: 2
ACD: 1

Comments 19 comments Click to expand

Comment 1

ID: 64232 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 12:32 Selected Answer: - Upvotes: 71

BCD makes more sense to me. Its for sure not unsupervised, since locations are in the data already. Reinforcement also doesn't fit, as there no AI and no interactions with data from the observer.

Comment 1.1

ID: 435583 User: sergio6 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Mon 30 Aug 2021 16:38 Selected Answer: - Upvotes: 4

D make sense, but i have a doubt: location is a discrete value (no regression), so a multiclass classification model should be applied ... to predict locations?

Comment 1.1.1

ID: 459440 User: hellofrnds Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 09 Oct 2021 02:53 Selected Answer: - Upvotes: 5

yes. multiclass classification model should be applied

Comment 2

ID: 487242 User: StefanoG Badges: Highly Voted Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:29 Selected Answer: BCD Upvotes: 8

As wrote by RP123
B - Not labelled as Fraud or not. So Unsupervised.
C - Clustering can be done based on location, amount etc.
D - Location is already given. So labelled. Hence supervised.

Comment 3

ID: 1617377 User: af17139 Badges: Most Recent Relative Date: 4 months, 4 weeks ago Absolute Date: Wed 15 Oct 2025 13:58 Selected Answer: ABC Upvotes: 1

No sense in predicting the location in a dataset that already contains it.

Comment 4

ID: 1398851 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:53 Selected Answer: ABC Upvotes: 1

The three most applicable machine learning applications for analyzing bank transactions are

Comment 5

ID: 351647 User: Bulleen Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:30 Selected Answer: - Upvotes: 7

BCD makes sense, but I now agree that BCE is the correct answer.
Say the model predict a location, guessing US or Sweden are both wrong when the answer is Canada. But US is closer, the distance from the correct location can be used to calculate a reward. Through reinforcement learning (E) the model could guess a location with better accuracy than supervised (D).

Comment 6

ID: 462668 User: anji007 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:30 Selected Answer: - Upvotes: 2

Ans: B, C and D
i) Fraudulent transaction, is nothing but anomaly detection which falls under Unsupervised.
ii) All transactions can be categorized using type etc - clustering algorithm.
iii) Using location as a label, supervised classification can be developed to predict location.

Comment 7

ID: 754051 User: ler_mp Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:29 Selected Answer: BCD Upvotes: 1

BCD makes more sense. B, C should not be controversial. For D vs E, in this use case D fits better than usage of reinfocement learning

Comment 8

ID: 757341 User: Kyr0 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:29 Selected Answer: BCD Upvotes: 1

BCD makes more sens to me

Comment 9

ID: 819124 User: musumusu Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:29 Selected Answer: - Upvotes: 5

Anwer: BCD
Things to understand:
Supervised learning will only predict the column that is labeled. In this case, there is not Fraud or not Fraud column inside which he will train on. So Option A, wrong.
option D: Supervised learning for column (transaction location) is possible as column exist to train on.
Option C: Custering N-type is possible and also an unsupervised learning to make cluster of similar pattern.
Option B: Its a weaker point here, User should be able to know which clusters are fraud in history. As it doesn't give enough information about past analysis whether user knows potential frauds or not. Ignore this option, if question asked for 2 right options only.

Comment 10

ID: 1258331 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 30 Jul 2024 19:55 Selected Answer: ABC Upvotes: 3

Why would you need to predict a location...

Comment 11

ID: 1244854 User: Roulle Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 12:16 Selected Answer: ACD Upvotes: 1

C and D are good for sureCliquez pour utiliser cette solution et E, F wrong for sure.

Then, to choose between A and B. Both options indicate that we know which transactions are fraudulent and which are not. Indeed, in order to use unsupervised classification to determine the characteristics of fraudulent transactions, we must already know which ones are fraudulent, either because all transactions in the dataset are fraudulent, or because a variable allows us to identify them. If all transactions were fraudulent, this would probably have been specified in the statement. It is therefore more likely that the "type of transaction" variable can be used to distinguish fraudulent transactions from others.

In this case, we have a target variable to predict, enabling us to build interpretable supervised models to understand the typology of fraudulent transactions. I therefore opt for A, C and D

Comment 12

ID: 1078284 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 23 Nov 2023 10:19 Selected Answer: - Upvotes: 3

Options B, E, and F are not as suitable for the given scenario:

B. Unsupervised learning to determine which transactions are most likely to be fraudulent.

Unsupervised learning, while useful for anomaly detection, might not be as effective for fraud detection without labeled data indicating which transactions are fraudulent.
E. Reinforcement learning to predict the location of a transaction.

Reinforcement learning is more suitable for scenarios where an agent learns to make decisions through trial and error, which doesn't seem to align with predicting transaction locations.
F. Unsupervised learning to predict the location of a transaction.

Unsupervised learning typically doesn't involve predicting specific values (like location) without labeled data for training.
In summary, A, C, and D are the most appropriate machine learning applications for investigating the provided bank transactions dataset.

Comment 13

ID: 1063261 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 05 Nov 2023 21:37 Selected Answer: BCD Upvotes: 1

Answer: BCD

Comment 14

ID: 1006426 User: Waqasghaloo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 13 Sep 2023 11:41 Selected Answer: - Upvotes: 3

Location is already given as attribite so what value is served with predicting location?

Comment 15

ID: 978401 User: youare87 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 11 Aug 2023 09:20 Selected Answer: - Upvotes: 1

A, B: Data features without the definition of fraudulent, so we can not obtain the answer even if using the unsupervise learning.
C: Kmeans solve this.
D: logistic regression. Just put the location into target.
E: Give the positive reward when the model predicts correct location.
F: Same as C. Use all features but locations, and use similarity to predict new data.

Comment 16

ID: 974452 User: xiaofeng_0226 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 07 Aug 2023 08:43 Selected Answer: BCD Upvotes: 1

Absolutely

Comment 17

ID: 971699 User: Dip1994 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 04 Aug 2023 07:16 Selected Answer: BCD Upvotes: 1

makes more sense

7. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 234

Sequence: 44
Discussion ID: 130177
Source URL: https://www.examtopics.com/discussions/google/view/130177-exam-professional-data-engineer-topic-1-question-234/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 12:56 p.m.

Question

You migrated a data backend for an application that serves 10 PB of historical product data for analytics. Only the last known state for a product, which is about 10 GB of data, needs to be served through an API to the other applications. You need to choose a cost-effective persistent storage solution that can accommodate the analytics requirements and the API performance of up to 1000 queries per second (QPS) with less than 1 second latency. What should you do?

A. 1. Store the historical data in BigQuery for analytics.
2. Use a materialized view to precompute the last state of a product.
3. Serve the last state data directly from BigQuery to the API.
B. 1. Store the products as a collection in Firestore with each product having a set of historical changes.
2. Use simple and compound queries for analytics.
3. Serve the last state data directly from Firestore to the API.
C. 1. Store the historical data in Cloud SQL for analytics.
2. In a separate table, store the last state of the product after every product change.
3. Serve the last state data directly from Cloud SQL to the API.
D. 1. Store the historical data in BigQuery for analytics.
2. In a Cloud SQL table, store the last state of the product after every product change.
3. Serve the last state data directly from Cloud SQL to the API.

Community Answer Votes

D: 15 most voted
A: 6

Comments 15 comments Click to expand

Comment 1

ID: 1116609 User: einchkrein Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Mon 08 Jan 2024 13:49 Selected Answer: - Upvotes: 7

Serve the last state data directly from Cloud SQL to the API.
Here's why this option is most suitable:

BigQuery for Analytics: BigQuery is an excellent choice for storing and analyzing large datasets like your 10 PB of historical product data. It is designed for handling big data analytics efficiently and cost-effectively.

Cloud SQL for Last State Data: Cloud SQL is a fully managed relational database that can effectively handle the storage of the last known state of products. Storing this subset of data (about 10 GB) in Cloud SQL allows for optimized and faster query performance for your API needs. Cloud SQL can comfortably handle the requirement of up to 1000 QPS with sub-second latency.

Separation of Concerns: This approach separates the analytics workload (BigQuery) from the operational query workload (Cloud SQL). This separation ensures that analytics queries do not interfere with the operational performance of the API and vice versa.

Comment 2

ID: 1124012 User: datapassionate Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 09:20 Selected Answer: D Upvotes: 7

D. 1. Store the historical data in BigQuery for analytics.
2. In a Cloud SQL table, store the last state of the product after every product change.
3. Serve the last state data directly from Cloud SQL to the AP

This approach leverages BigQuery's scalability and efficiency for handling large datasets for analytics. BigQuery is well-suited for managing the 10 PB of historical product data. Meanwhile, Cloud SQL provides the necessary performance to handle the API queries with the required low latency. By storing the latest state of each product in Cloud SQL, you can efficiently handle the high QPS with sub-second latency, which is crucial for the API's performance. This combination of BigQuery and Cloud SQL offers a balanced solution for both the large-scale analytics and the high-performance API needs.

Comment 3

ID: 1353806 User: zanhsieh Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sun 09 Feb 2025 10:00 Selected Answer: D Upvotes: 1

Why not A? Because BQ API 100 reqs per second for API method. Other possible limits do not meet the 1000 QPS requirement. Yes, max number of tabledata.list reqs per second is 1000 but we won't always call tabledata.list all time.
https://cloud.google.com/bigquery/quotas#api_request_quotas

Comment 4

ID: 1326611 User: clouditis Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 14 Dec 2024 22:14 Selected Answer: A Upvotes: 2

A is the most plausible option - Cloud SQL can not retrieve results out with 1 second latency as the requirement here is, with BQ MV"s that could be a possibility as its pre-computed.

Comment 5

ID: 1305583 User: ToiToi Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Thu 31 Oct 2024 21:28 Selected Answer: A Upvotes: 1

Why A? Because:

Materialized View for API: A materialized view in BigQuery pre-computes the last known state of each product. This ensures that your API can quickly retrieve the latest product information without needing to query the entire historical dataset.
BigQuery for API Serving: BigQuery can handle high query volumes with low latency, meeting your requirement of 1000 QPS with sub-second latency.
Cost-Effectiveness: This solution avoids the need for a separate database like Cloud SQL, minimizing costs and management overhead.

Why not D:
While Cloud SQL is a good option for transactional workloads, it's not as cost-effective or scalable as BigQuery for analytical queries on 10 PB of data. It might also not be the ideal choice for serving high-volume API requests with low latency.

Comment 5.1

ID: 1606653 User: Rip696 Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Sat 06 Sep 2025 16:06 Selected Answer: - Upvotes: 1

In option D, it says to store the historical data in BQ not in CSQL.

Comment 6

ID: 1225117 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Thu 06 Jun 2024 04:46 Selected Answer: D Upvotes: 1

Why not A:
Serving data directly from BigQuery to the API may not meet the low latency requirements for high QPS operations, as BigQuery is optimized for analytical queries rather than transactional workloads.

Comment 6.1

ID: 1348040 User: Ryannn23 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 28 Jan 2025 17:59 Selected Answer: - Upvotes: 1

what transactional workload? you just need to provide latest status for each product through an API. Select from a 10GB BQ MV will provide the result in 1 sec.

Comment 7

ID: 1213474 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 23:21 Selected Answer: A Upvotes: 1

Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. Materialized views can optimize queries with high computation cost and small dataset results. https://cloud.google.com/bigquery/docs/materialized-views-intro#use_cases
https://cloud.google.com/bigquery/docs/materialized-views-intro

Comment 8

ID: 1191238 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 08 Apr 2024 00:53 Selected Answer: D Upvotes: 1

Why D is the best choice:

Cost-Effective Analytics: BigQuery excels at handling large datasets (10 PB) and complex analytical queries. Its columnar storage and massively parallel processing make it ideal for analyzing historical product data.
High-Performance API: Cloud SQL provides a managed relational database service optimized for transactional workloads. It can easily handle the 1000 QPS requirement with low latency, ensuring fast API responses.
Separation of Concerns: Storing historical data in BigQuery and the last known state in Cloud SQL separates analytical and transactional workloads, optimizing performance and cost for each use case.

Comment 9

ID: 1154428 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 03:29 Selected Answer: D Upvotes: 1

Option D

Comment 10

ID: 1152216 User: ML6 Badges: - Relative Date: 2 years ago Absolute Date: Fri 16 Feb 2024 21:27 Selected Answer: D Upvotes: 1

BigQuery = data warehouse that is optimized for querying and analyzing large datasets using SQL. Can easily process petabytes of data.
Cloud SQL = designed for transactional workloads and traditional relational database use cases, such as web applications, e-commerce platforms, and content management systems.

Comment 11

ID: 1121562 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:30 Selected Answer: D Upvotes: 3

Option D is the right one, compared to option A, Cloud SQL is more efficient and cost effective for the amount of time the data needs to be accessed by the api

Comment 12

ID: 1112724 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 12:56 Selected Answer: A Upvotes: 2

A. 1. Store the historical data in BigQuery for analytics.
2. Use a materialized view to precompute the last state of a product.
3. Serve the last state data directly from BigQuery to the API.

Comment 12.1

ID: 1159802 User: RenePetersen Badges: - Relative Date: 2 years ago Absolute Date: Mon 26 Feb 2024 14:42 Selected Answer: - Upvotes: 2

I believe the latency of BigQuery is too high to accommodate the sub-second latency requirement.

8. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 298

Sequence: 46
Discussion ID: 129911
Source URL: https://www.examtopics.com/discussions/google/view/129911-exam-professional-data-engineer-topic-1-question-298/
Posted By: chickenwingz
Posted At: Dec. 30, 2023, 9:18 p.m.

Question

One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re- encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?

A. Rotate the Cloud KMS key version. Continue to use the same Cloud Storage bucket.
B. Create a new Cloud KMS key. Set the default CMEK key on the existing Cloud Storage bucket to the new one.
C. Create a new Cloud KMS key. Create a new Cloud Storage bucket. Copy all objects from the old bucket to the new one bucket while specifying the new Cloud KMS key in the copy command.
D. Create a new Cloud KMS key. Create a new Cloud Storage bucket configured to use the new key as the default CMEK key. Copy all objects from the old bucket to the new bucket without specifying a key.

Community Answer Votes

D: 24 most voted
C: 2
A: 1

Comments 10 comments Click to expand

Comment 1

ID: 1115441 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 06 Jul 2024 21:04 Selected Answer: D Upvotes: 12

- New Key Creation: A new Cloud KMS key ensures a secure replacement for the compromised one.
- New Bucket: A separate bucket prevents potential conflicts with existing objects and configurations.
- Default CMEK: Setting the new key as default enforces encryption for all objects in the bucket, reducing the risk of unencrypted data.
- Copy Without Key Specification: Copying objects without specifying a key leverages the default key, simplifying the process and ensuring consistent encryption.
- Old Key Deletion: After copying, the compromised key can be safely deleted.

Comment 2

ID: 1109955 User: chickenwingz Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 20:18 Selected Answer: D Upvotes: 8

Wrong:
A - rotating external key doesn't trigger re-encryption of data already in GCS: https://cloud.google.com/kms/docs/rotate-key#rotate-external-coordinated
C - Setting key during copy doesn't take care of objects that are later uploaded to the bucket, that will still use the default key

Comment 3

ID: 1606040 User: judy_data Badges: Most Recent Relative Date: 6 months, 1 week ago Absolute Date: Thu 04 Sep 2025 08:47 Selected Answer: C Upvotes: 1

passing the cloud KMS key in the command is more secure and explicit
https://cloud.google.com/storage/docs/encryption/customer-managed-keys?utm_source=chatgpt.com#key-replacement

Comment 4

ID: 1411721 User: desertlotus1211 Badges: - Relative Date: 11 months, 2 weeks ago Absolute Date: Sat 29 Mar 2025 15:01 Selected Answer: C Upvotes: 1

If no key is specified, and the bucket's default CMEK key is used, there's a risk that some objects might fall back to Google-managed encryption, especially if misconfigured

Comment 5

ID: 1156062 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 03:11 Selected Answer: D Upvotes: 1

Option D

Comment 6

ID: 1153573 User: ML6 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 18 Aug 2024 21:11 Selected Answer: D Upvotes: 3

The correct answer is D. Rotating the key does not seem to re-encrypt:

In the event that a key is compromised, regular rotation (!!) limits the number of actual messages vulnerable to compromise (!!).
If you suspect that a key version is compromised, disable it and revoke access to it as soon as possible.
Source: https://cloud.google.com/kms/docs/key-rotation#why_rotate_keys

Comment 6.1

ID: 1153575 User: ML6 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 18 Aug 2024 21:13 Selected Answer: - Upvotes: 3

Note: When you rotate a key, data encrypted with previous key versions is not automatically re-encrypted with the new key version. You can learn more about re-encrypting data.
Source: https://cloud.google.com/kms/docs/key-rotation#how_often_to_rotate_keys

Comment 7

ID: 1131093 User: Medmah Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 24 Jul 2024 20:02 Selected Answer: - Upvotes: 2

I don't understand why only Matt select A

https://cloud.google.com/sdk/gcloud/reference/kms/keys/update

This seems to do the job, am I wrong ?

Comment 8

ID: 1121928 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 17:43 Selected Answer: A Upvotes: 1

Definitely A

Comment 8.1

ID: 1153572 User: ML6 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 18 Aug 2024 21:10 Selected Answer: - Upvotes: 1

Rotating does not mean you re-encrypt data.

9. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 300

Sequence: 51
Discussion ID: 130318
Source URL: https://www.examtopics.com/discussions/google/view/130318-exam-professional-data-engineer-topic-1-question-300/
Posted By: scaenruy
Posted At: Jan. 4, 2024, 12:51 p.m.

Question

You currently have transactional data stored on-premises in a PostgreSQL database. To modernize your data environment, you want to run transactional workloads and support analytics needs with a single database. You need to move to Google Cloud without changing database management systems, and minimize cost and complexity. What should you do?

A. Migrate and modernize your database with Cloud Spanner.
B. Migrate your workloads to AlloyDB for PostgreSQL.
C. Migrate to BigQuery to optimize analytics.
D. Migrate your PostgreSQL database to Cloud SQL for PostgreSQL.

Community Answer Votes

B: 34 most voted
D: 33

Comments 18 comments Click to expand

Comment 1

ID: 1238272 User: 8ad5266 Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Thu 27 Jun 2024 17:59 Selected Answer: D Upvotes: 7

Minimize cost. https://cloud.google.com/alloydb?hl=en

AlloyDB offers superior performance, 4x faster than standard PostgreSQL for transactional workloads. That does not come without cost.

Comment 2

ID: 1602984 User: MvazM Badges: Most Recent Relative Date: 6 months, 2 weeks ago Absolute Date: Tue 26 Aug 2025 23:59 Selected Answer: D Upvotes: 1

D is clean

Comment 3

ID: 1581315 User: Ben_oso Badges: - Relative Date: 8 months, 2 weeks ago Absolute Date: Sat 28 Jun 2025 02:11 Selected Answer: D Upvotes: 1

D is clean

Comment 4

ID: 1560787 User: duers Badges: - Relative Date: 11 months ago Absolute Date: Tue 15 Apr 2025 08:28 Selected Answer: D Upvotes: 1

AlloyDB for PostgreSQL is a fully managed, PostgreSQL-compatible database service offered by Google Cloud. It's designed for high-performance transactional and analytical workloads and offers performance and scalability benefits over standard PostgreSQL. While it meets the requirement of not changing the database system in a broad sense (as it's PostgreSQL-compatible), it's a different service than standard PostgreSQL and might introduce a level of complexity and cost beyond simply migrating to Cloud SQL for PostgreSQL.

Comment 5

ID: 1560628 User: aaaaaaaasdasdasfs Badges: - Relative Date: 11 months ago Absolute Date: Mon 14 Apr 2025 18:02 Selected Answer: B Upvotes: 3

The correct answer is B. Migrate your workloads to AlloyDB for PostgreSQL.
Here's why:
Your requirements are:

Run both transactional and analytics workloads in a single database
Stay with PostgreSQL (don't change database systems)
Minimize cost and complexity

AlloyDB for PostgreSQL is specifically designed for this scenario - it's fully PostgreSQL-compatible but optimized for both transactional and analytical workloads. It offers:

PostgreSQL compatibility (minimizing migration complexity)
Enhanced analytics capabilities with column store indexes
Better performance for mixed workloads

Comment 6

ID: 1560345 User: rajshiv Badges: - Relative Date: 11 months ago Absolute Date: Sun 13 Apr 2025 16:53 Selected Answer: D Upvotes: 1

While B looks good too but It’s more expensive than Cloud SQL and better suited when you need advanced analytics and heavy transactional performance. I think it is an Overkill if you're looking to minimize cost/complexity which the question states.

Comment 7

ID: 1354945 User: mednoun Badges: - Relative Date: 1 year ago Absolute Date: Tue 11 Feb 2025 11:23 Selected Answer: B Upvotes: 2

The question specifies that the analytical needs need to reside in a single database. This can't be done using Cloud SQL. The database that supports all of that is AlloyDB that's why I will go with the B answer.

Comment 8

ID: 1351344 User: plum21 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 04 Feb 2025 13:35 Selected Answer: B Upvotes: 2

"support analytics needs" -> columnar storage -> AlloyDB

Comment 9

ID: 1346547 User: juliorevk Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 25 Jan 2025 17:47 Selected Answer: D Upvotes: 1

Cloud SQL natively supports PostgreSQL
AlloyDB for PostgreSQL is a great option if you're specifically looking for high performance in both transactional and analytical workloads. However, it might be more complex and costly than Cloud SQL

Comment 10

ID: 1328161 User: joelcaro Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 17 Dec 2024 22:24 Selected Answer: B Upvotes: 3

B
AlloyDB es la mejor opción para modernizar el entorno, mantener compatibilidad con PostgreSQL y manejar tanto cargas transaccionales como analíticas en un único sistema, minimizando costos y complejidad.

Comment 11

ID: 1294706 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 08 Oct 2024 14:22 Selected Answer: B Upvotes: 4

In real life clearly how performant it needed to be would be a massive factor. AlloyDB is more expensive (see https://cloud.google.com/alloydb/pricing, vs https://cloud.google.com/sql/pricing), but when they say "minimise cost" is that per query, or is it per year assuming similar instance size. There's no way for us to know, we have to guess. I'm guessing AlloyDB, as the question seem to be telegraphing that, but it could just as easily be CloudSQL postgres based on the cheaper costs. We simply cannot know.

Comment 12

ID: 1246868 User: Antmal Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 12 Jul 2024 17:51 Selected Answer: B Upvotes: 4

Because AlloyDB is optimised for hybrid transactional and analytical processing (HTAP), meaning you can run both transactional workloads and analytics on the same database with excellent performance.

Comment 13

ID: 1246441 User: Anudeep58 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 12 Jul 2024 05:09 Selected Answer: B Upvotes: 2

AlloyDB

Comment 14

ID: 1236980 User: finixd Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 25 Jun 2024 18:12 Selected Answer: B Upvotes: 2

It's a little complicated, considering it says minimize costs (Cloud SQL) and run transactional workloads and support analytics needs (AlloyDB). I consider B. because you can minimize costs in the long-term instead of doing it immediately with possible extra costs in the long-term. Think about it

Comment 15

ID: 1228116 User: extraego Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 10 Jun 2024 23:23 Selected Answer: D Upvotes: 3

AlloyDB is for large scale and more expensive. We want to minimize cost and complexity, so the answer is D.

Comment 16

ID: 1220726 User: virat_kohli Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 29 May 2024 08:25 Selected Answer: B Upvotes: 2

B. Migrate your workloads to AlloyDB for PostgreSQL.

Comment 16.1

ID: 1220728 User: virat_kohli Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 29 May 2024 08:28 Selected Answer: - Upvotes: 2

Sorry its D. Migrate your PostgreSQL database to Cloud SQL for PostgreSQL.

Comment 17

ID: 1184577 User: omkarr24 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 28 Mar 2024 06:34 Selected Answer: D Upvotes: 4

They currently have transactional data stored on-premises in a PostgreSQL database and they want to modernize their database that supports transactional workloads and analytics .If they select cloud Sql (postgreSQL) it will minimize the cost and complexity.
and for analytics purpose they can create federated queries over cloudSql(postgreSql)
https://cloud.google.com/bigquery/docs/federated-queries-intro
This approach will minimze the cost

10. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 47

Sequence: 55
Discussion ID: 16820
Source URL: https://www.examtopics.com/discussions/google/view/16820-exam-professional-data-engineer-topic-1-question-47/
Posted By: rickywck
Posted At: March 17, 2020, 4:50 a.m.

Question

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
✑ The user profile: What the user likes and doesn't like to eat
✑ The user account information: Name, address, preferred meal times
✑ The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

A. BigQuery
B. Cloud SQL
C. Cloud Bigtable
D. Cloud Datastore

Community Answer Votes

A: 45 most voted
B: 44
D: 20

Comments 22 comments Click to expand

Comment 1

ID: 67066 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sun 22 Mar 2020 20:12 Selected Answer: - Upvotes: 62

You want to optimize the data schema + Machine Learning --> Bigquery. So A

Comment 1.1

ID: 444423 User: yoshik Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 14 Sep 2021 10:39 Selected Answer: - Upvotes: 28

BigQuery is a datawarehouse, not a transactional db. You need to store transactional data as a requirement.

Comment 1.1.1

ID: 458615 User: alexmirmao Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Thu 07 Oct 2021 10:41 Selected Answer: - Upvotes: 9

In my opinion transactional data doesnt mean transactions they could be grouped so there is no need to write register by register.

Comment 1.1.1.1

ID: 463261 User: yoshik Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 16 Oct 2021 21:37 Selected Answer: - Upvotes: 5

In other questions they talk about 'transactional log data' when referring to past transactions, but you could be right, agree. In that case ok A BigQuery. Nevertheless, the question is formulated ambiguously.

Comment 1.1.2

ID: 632192 User: alecuba16 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sat 16 Jul 2022 15:59 Selected Answer: - Upvotes: 4

Biquery Supports transactions:
https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions
, but indeed is not a good DB for OLTP.

But I would said or CloudSQL or BigQuery

Comment 2

ID: 76122 User: itche_scratche Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Sat 18 Apr 2020 18:30 Selected Answer: - Upvotes: 9

A; ML should be BQ; stored all transactional data (not use for transactional) so should BQ

Comment 2.1

ID: 219072 User: GeeBeeEl Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Sat 14 Nov 2020 12:10 Selected Answer: - Upvotes: 1

Do you have a link to back this up?

Comment 3

ID: 1601747 User: 1479 Badges: Most Recent Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 17:34 Selected Answer: A Upvotes: 1

BQ for storing ML tables

Comment 4

ID: 1601745 User: 1479 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Sat 23 Aug 2025 17:33 Selected Answer: A Upvotes: 1

bq storing for machine learning

Comment 5

ID: 1581490 User: 56d02cd Badges: - Relative Date: 8 months, 2 weeks ago Absolute Date: Sat 28 Jun 2025 20:26 Selected Answer: D Upvotes: 1

The docs for Datastore call out user profiles as a use case. https://cloud.google.com/appengine/docs/legacy/standard/go111/datastore#what_its_good_for

Comment 6

ID: 1574701 User: 22c1725 Badges: - Relative Date: 9 months, 1 week ago Absolute Date: Wed 04 Jun 2025 09:11 Selected Answer: A Upvotes: 1

"for a machine learning-based"
This is the why. So, the objective of storege here is not (ACID) opeartion at all. there is only two requirment here:
Store the data.
Do machine learning.

Cloud SQL is perfect for ACID opeartion which the question didn't hint. You only want to store the data, improve data schame for ML. I think alot of people here are misunderstanding the question.

Comment 7

ID: 1410172 User: abhaya2608 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Tue 25 Mar 2025 22:36 Selected Answer: A Upvotes: 1

Cloud SQL doesn't support ML so the right answer will be BigQuery

Comment 8

ID: 1362559 User: dcruzado Badges: - Relative Date: 1 year ago Absolute Date: Thu 27 Feb 2025 15:32 Selected Answer: B Upvotes: 1

Transactional db -> CloudSQL

Comment 9

ID: 1362549 User: dcruzado Badges: - Relative Date: 1 year ago Absolute Date: Thu 27 Feb 2025 15:09 Selected Answer: B Upvotes: 1

Since they need transactional data i would say B
However thinking on machine learning is better A

Comment 10

ID: 1349827 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 01 Feb 2025 11:30 Selected Answer: B Upvotes: 2

The answer should be Cloud SQL because it is a relational database suitable for transactional data.

BigQuery is for analytics and querying - not suitable for transactional workload.
Bigtable is for unstructured and time series data.
Datastore is a nosql document database for semi structured data.

Comment 11

ID: 1344714 User: Yad_datatonic Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Wed 22 Jan 2025 11:58 Selected Answer: A Upvotes: 1

For a machine learning-based food ordering service that requires optimised storage of transactional data, Google Cloud BigQuery is a suitable choice

Comment 12

ID: 1342439 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 07:55 Selected Answer: A Upvotes: 1

ithin Google Cloud, the database that most readily allows for data schema optimization is BigQuery; it provides features like schema auto-detection, columnar storage, and the ability to manually define your schema to tailor it for efficient querying and analysis of large datasets.

Comment 13

ID: 1337505 User: manikolbe Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 07 Jan 2025 10:26 Selected Answer: B Upvotes: 3

Cloud SQL is the best choice for your application as it provides relational database management and is optimized for storing transactional data with SQL querying capabilities. It is well-suited for managing user profiles, account information, and orders, ensuring data integrity, and supporting complex queries necessary for the food ordering service.

Comment 14

ID: 1335711 User: Ronn27 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 02 Jan 2025 19:46 Selected Answer: B Upvotes: 2

Use BigQuery for analyzing aggregated data (e.g., predicting food trends or training ML models).
Use Cloud Bigtable for large-scale real-time recommendation engines if needed in the future.
Use Firestore for dynamic, semi-structured data with real-time updates if you need flexibility over transactional consistency.
Cloud SQL strikes the right balance for this use case due to its support for structured data, transactions, and easy integration with other GCP services.
So B. CloudSQL is the right answer

Comment 15

ID: 1329504 User: sravi1200 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 20 Dec 2024 15:34 Selected Answer: B Upvotes: 1

Cloud SQL can store transactional data not Big Query. Big Query is an analytical service.

Comment 16

ID: 1328861 User: DGames Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 19 Dec 2024 06:27 Selected Answer: A Upvotes: 1

Easy implement data schema + Machine Learning model in Big Query

Comment 17

ID: 1324109 User: julydev82 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Mon 09 Dec 2024 16:06 Selected Answer: B Upvotes: 1

database will be used to storage all transactional data.... I think that you need a relational database for that, then federated tables to bigquery to analysis.

11. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 22

Sequence: 61
Discussion ID: 16624
Source URL: https://www.examtopics.com/discussions/google/view/16624-exam-professional-data-engineer-topic-1-question-22/
Posted By: Rajokkiyam
Posted At: March 15, 2020, 3:36 a.m.

Question

Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a
Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks.
What should you do?

A. Run a local version of Jupiter on the laptop.
B. Grant the user access to Google Cloud Shell.
C. Host a visualization tool on a VM on Google Compute Engine.
D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.

Community Answer Votes

D: 55 most voted
A: 1

Comments 22 comments Click to expand

Comment 1

ID: 64120 User: Rajokkiyam Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 02:36 Selected Answer: - Upvotes: 48

Answer should be D.

Comment 2

ID: 64981 User: rickywck Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Thu 17 Sep 2020 02:45 Selected Answer: - Upvotes: 7

Obviously the answer is D

Comment 3

ID: 1364870 User: Abizi Badges: Most Recent Relative Date: 1 year ago Absolute Date: Tue 04 Mar 2025 12:13 Selected Answer: D Upvotes: 1

D is the right answer

Comment 4

ID: 1217191 User: VictorBa Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sun 24 Nov 2024 05:46 Selected Answer: D Upvotes: 2

Google Cloud Datalab is a powerful interactive tool for data exploration, analysis, and machine learning.

Comment 5

ID: 1208682 User: trashbox Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Sat 09 Nov 2024 07:08 Selected Answer: D Upvotes: 5

My answer is Google Cloud Datalab, but since that service has already been discontinued, I question whether a problem like this would actually be asked on the actual exam.

Comment 5.1

ID: 1600162 User: Surabhi20 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Wed 20 Aug 2025 12:52 Selected Answer: - Upvotes: 1

The intention is to build the logic which will be ofcourse always helpful for any exam

Comment 6

ID: 1124928 User: GCanteiro Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 17 Jul 2024 12:05 Selected Answer: D Upvotes: 1

D sounds good for me

Comment 7

ID: 1096332 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 10:22 Selected Answer: A Upvotes: 1

Hash Value for Deduplication: By computing a hash value for each data entry, you create a unique identifier based on the content of the data. This allows you to efficiently identify duplicates, as entries with identical content will have the same hash value.

Storing Hash Value and Metadata: Maintaining a database table that includes the hash value and other relevant metadata (like the timestamp of transmission) allows for quick lookups and comparisons. This way, when new data is received, you can check if an entry with the same hash value already exists.
Assign global unique identifiers (GUID) to each data entry: While GUIDs are unique, they do not inherently identify duplicate content. Two transmissions of the same data would have different GUIDs.

Comment 7.1

ID: 1600163 User: Surabhi20 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Wed 20 Aug 2025 12:53 Selected Answer: - Upvotes: 1

Wrong question palette

Comment 7.2

ID: 1356827 User: simpa17 Badges: - Relative Date: 1 year ago Absolute Date: Sat 15 Feb 2025 13:00 Selected Answer: - Upvotes: 1

You mistakenly answered the question above haha

Comment 8

ID: 1076326 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 13:51 Selected Answer: D Upvotes: 1

D sounds good for me

Comment 9

ID: 1065187 User: RT_G Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 07 May 2024 21:51 Selected Answer: D Upvotes: 1

Agree with D

Comment 10

ID: 1050525 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 13:58 Selected Answer: D Upvotes: 3

D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.

Google Cloud Datalab is a powerful interactive tool for data exploration, analysis, and machine learning. By deploying it to a VM on Google Compute Engine, you can provide her with a robust and scalable environment where she can work with large datasets, create labeled datasets, and perform data analyses efficiently.

Option A (running a local version of Jupyter on her laptop) might not be sufficient for very large datasets, and her laptop's limited power could still be a bottleneck.

Option B (granting access to Google Cloud Shell) is useful for running command-line tools but may not provide the interactive and visual capabilities she needs.

Option C (hosting a visualization tool on a VM on Google Compute Engine) is beneficial for visualization tasks but does not cover the full spectrum of data analysis and machine learning tasks that Google Cloud Datalab offers.

Comment 11

ID: 999556 User: gudguy1a Badges: - Relative Date: 2 years ago Absolute Date: Tue 05 Mar 2024 16:32 Selected Answer: D Upvotes: 1

D - as it is a FULL set up, not a shell that is needed...

Comment 12

ID: 994675 User: sergiomujica Badges: - Relative Date: 2 years ago Absolute Date: Thu 29 Feb 2024 05:54 Selected Answer: - Upvotes: 2

Nowadays it should be similar to D, deploy a Vertex workbench

Comment 13

ID: 984230 User: yash12 Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 09:45 Selected Answer: - Upvotes: 1

As per Options , Correct Answer should be D. ie Datalab
However Datalab is no longer used in GCP (Deprecated in Sep2022), It is Vertex AI or Deep Learning VM Images

Comment 14

ID: 984170 User: HeoMaTo Badges: - Relative Date: 2 years ago Absolute Date: Sun 18 Feb 2024 07:19 Selected Answer: D Upvotes: 1

I think.
Answer is D

Comment 15

ID: 960075 User: Acocado Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 07:56 Selected Answer: - Upvotes: 2

Datalab is deprecated. This question should appear in the exam.

Comment 15.1

ID: 960076 User: Acocado Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 07:56 Selected Answer: - Upvotes: 6

typo- should NOT appear in the exam

Comment 15.1.1

ID: 1057110 User: axantroff Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 29 Apr 2024 19:44 Selected Answer: - Upvotes: 1

Good point - https://cloud.google.com/datalab/deprecation-notice. Google recommends using Vertex AI Workbench instead

Comment 16

ID: 919251 User: dgteixeira Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 09 Dec 2023 14:44 Selected Answer: D Upvotes: 3

Should be D, because Cloud shell alone does not provide access to what they need.
Nowadays is Vertex AI, but still, correct answer is D

Comment 17

ID: 909750 User: Maurilio_Cardoso Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 30 Nov 2023 01:02 Selected Answer: D Upvotes: 3

Google Cloud Datalab is now Vertex AI. So, letter D make more sense.

12. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 29

Sequence: 62
Discussion ID: 17029
Source URL: https://www.examtopics.com/discussions/google/view/17029-exam-professional-data-engineer-topic-1-question-29/
Posted By: -
Posted At: March 20, 2020, 8:08 a.m.

Question

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

A. Use a row key of the form <timestamp>.
B. Use a row key of the form <sensorid>.
C. Use a row key of the form <timestamp>#<sensorid>.
D. Use a row key of the form >#<sensorid>#<timestamp>.

Community Answer Votes

D: 13 most voted

Comments 21 comments Click to expand

Comment 1

ID: 391401 User: sumanshu Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Sun 26 Dec 2021 19:23 Selected Answer: - Upvotes: 12

Vote for 'D' - Store multiple delimited values in each row key. (But avoid starting with Timestamp)

"Row keys to avoid"
https://cloud.google.com/bigtable/docs/schema-design

Comment 1.1

ID: 401851 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 08 Jan 2022 14:30 Selected Answer: - Upvotes: 9

A is not correct because this will cause most writes to be pushed to a single node (known as hotspotting)
B is not correct because this will not allow for multiple readings from the same sensor as new readings will overwrite old ones.
C is not correct because this will cause most writes to be pushed to a single node (known as hotspotting)
D is correct because it will allow for retrieval of data based on both sensor id and timestamp but without causing hotspotting.

Comment 1.1.1

ID: 1600170 User: Surabhi20 Badges: - Relative Date: 6 months, 3 weeks ago Absolute Date: Wed 20 Aug 2025 13:45 Selected Answer: - Upvotes: 1

Agree, this is the correct justification for the options

Comment 2

ID: 530873 User: samdhimal Badges: Highly Voted Relative Date: 3 years, 7 months ago Absolute Date: Sat 23 Jul 2022 22:52 Selected Answer: - Upvotes: 9

A. Use a row key of the form <timestamp>.
---> Incorrect, because google says don't use a timestamp by itself or at the beginning of a row key.
B. Use a row key of the form <sensorid>.
--->Incorrect, because google says Include a timestamp as part of your row key.
C. Use a row key of the form <timestamp>#<sensorid>.
---> Incorrect, because google says don't use a timestamp by itself or at the beginning of a row key.
D. Use a row key of the form >#<sensorid>#<timestamp>.
---> Correct answer, because of option A,B,C reasons.
- Timestamp isn't by itself, neither at the beginning.
- Timestamp is included.

Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys

Comment 3

ID: 1562087 User: vosang5299 Badges: Most Recent Relative Date: 10 months, 3 weeks ago Absolute Date: Sun 20 Apr 2025 03:32 Selected Answer: D Upvotes: 1

D is correct

Comment 4

ID: 1076375 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 14:35 Selected Answer: D Upvotes: 2

Looks like D is the best option
Reference: https://cloud.google.com/bigtable/docs/schema-design#time-based

Comment 4.1

ID: 1212704 User: mark1223jkh Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sun 17 Nov 2024 08:04 Selected Answer: - Upvotes: 1

Thank you that is right.

Comment 5

ID: 1050538 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 14:14 Selected Answer: D Upvotes: 3

D. Use a row key of the form <sensorid>#<timestamp>.

By using the sensor ID as the prefix in the row key, you can achieve better distribution of data across Bigtable tablets. This can help balance the workload and prevent hotspots in the table. Additionally, placing the timestamp after the sensor ID allows you to perform range scans for a specific sensor and retrieve data efficiently within a time frame.

Option C (using a row key of the form <timestamp>#<sensorid>) can work for some use cases but may not be as efficient for range scans when you want to retrieve data for a specific sensor within a time range.

Option A (using a row key of the form <timestamp>) may lead to hotspots and inefficient range scans because it doesn't consider sensor IDs.

Option B (using a row key of the form <sensorid>) is not optimal because it doesn't allow for efficient time-based filtering and could lead to uneven data distribution in Bigtable.

Comment 6

ID: 766085 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 21:00 Selected Answer: - Upvotes: 1

D is right
Best practices of bigtable states that rowkey should not be only timestamp or have timestamp at starting. It’s better to have sensorid and timestamp as rowkey.

Reference:
https://cloud.google.com/bigtable/docs/schema-design

Comment 7

ID: 744018 User: Nirca Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 13 Jun 2023 12:54 Selected Answer: D Upvotes: 5

#<sensorid>#<timestamp> ------> low cardinality # high cardinality
This is current Bigtable Best Practice (to avoid Hotspots on the inserts)

Comment 8

ID: 689230 User: maxdataengineer Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 08 Apr 2023 13:15 Selected Answer: D Upvotes: 1

Discard:
A -> timestamp unique id could not be unique in the case that sensors transmit data at the same time.
B -> sensorId repeated id for messages coming from the same sensor
C -> a bad performance choice

D -> BEST CHOICE. Each time BigTable looks for data in a table it does a scan and sort operations. By starting each unique id by sensorId it will make it easier to group and sort data since it has the lowest cardinality
https://cloud.google.com/bigtable/docs/schema-design#general-concepts

Comment 9

ID: 663153 User: John_Pongthorn Badges: - Relative Date: 3 years ago Absolute Date: Wed 08 Mar 2023 08:29 Selected Answer: - Upvotes: 2

as I look at https://cloud.google.com/bigtable/docs/schema-design#row-keys
asia#india#bangalore
asia#india#mumbai
they didn't have # ahead of this first value.
asia#india#bangalore OR #asia#india#bangalore
Are both valid?

Comment 10

ID: 649479 User: crisimenjivar Badges: - Relative Date: 3 years ago Absolute Date: Mon 20 Feb 2023 18:24 Selected Answer: - Upvotes: 1

ANSWER: D

Comment 11

ID: 617190 User: som_420 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 12:36 Selected Answer: D Upvotes: 1

Answer is D

Comment 12

ID: 461146 User: anji007 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 12 Apr 2022 17:57 Selected Answer: - Upvotes: 2

Ans: D

Comment 13

ID: 285009 User: naga Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Fri 06 Aug 2021 18:05 Selected Answer: - Upvotes: 2

Correct D

Comment 14

ID: 243362 User: NamitSehgal Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Mon 14 Jun 2021 07:37 Selected Answer: - Upvotes: 3

Should be D
Reverse of timestamp even better but no options for that.
Also changing sensor ID if they are in sequential to hash or changing data to bits even better.
Idea is not to use timestamp or sequential ID as first key.

Comment 14.1

ID: 531428 User: Tanzu Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sun 24 Jul 2022 15:51 Selected Answer: - Upvotes: 1

reverse TS or hashing is not always first choice or better. never.

Comment 15

ID: 220259 User: Radhika7983 Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Sun 16 May 2021 10:50 Selected Answer: - Upvotes: 3

The correct answer is D.
Refer to the link https://cloud.google.com/bigtable/docs/schema-design for Big table schema design.

C is not the right answer becuase
Timestamps
If you often need to retrieve data based on the time when it was recorded, it's a good idea to include a timestamp as part of your row key. Using the timestamp by itself as the row key is not recommended, as most writes would be pushed onto a single node. For the same reason, avoid placing a timestamp at the start of the row key.

For example, your application might need to record performance-related data, such as CPU and memory usage, once per second for a large number of machines. Your row key for this data could combine an identifier for the machine with a timestamp for the data (for example, machine_4223421#1425330757685).

Comment 16

ID: 210571 User: arghya13 Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Sat 01 May 2021 15:42 Selected Answer: - Upvotes: 2

answer would be D to avoid hotspoting..

Comment 17

ID: 114456 User: ch3n6 Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Sun 20 Dec 2020 08:44 Selected Answer: - Upvotes: 4

correct: D
why not C? Using the timestamp by itself as the row key is not recommended, as most writes would be pushed onto a single node. For the same reason, avoid placing a timestamp at the start of the row key. https://cloud.google.com/bigtable/docs/schema-design#row-keys

13. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 160

Sequence: 66
Discussion ID: 17214
Source URL: https://www.examtopics.com/discussions/google/view/17214-exam-professional-data-engineer-topic-1-question-160/
Posted By: -
Posted At: March 22, 2020, 7:43 a.m.

Question

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20
TB in size. Which database should you choose?

A. Cloud SQL
B. Cloud Bigtable
C. Cloud Spanner
D. Cloud Datastore

Community Answer Votes

A: 6 most voted
C: 5

Comments 24 comments Click to expand

Comment 1

ID: 68975 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 28 Mar 2020 21:18 Selected Answer: - Upvotes: 33

A. Cloud SQL (30TB)

Comment 1.1

ID: 249441 User: Gcpyspark Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Mon 21 Dec 2020 16:23 Selected Answer: - Upvotes: 2

Sure, however in future if the capacity grows beyond 30 TB then Cloud SQL won't work right then Spanner would be the option?

Comment 1.1.1

ID: 789216 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 27 Jan 2023 02:02 Selected Answer: - Upvotes: 3

you can always call GCP to add quota.. .Spanner is for global reach, ideally...

Comment 1.2

ID: 88006 User: vindahake Badges: - Relative Date: 5 years, 10 months ago Absolute Date: Wed 13 May 2020 01:53 Selected Answer: - Upvotes: 7

Up to 30,720 GB, depending on the machine type. This looks like correct choice.
https://cloud.google.com/sql/docs/quotas#fixed-limits

Comment 1.2.1

ID: 739185 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 08 Dec 2022 15:54 Selected Answer: - Upvotes: 3

https://cloud.google.com/sql/docs/quotas#storage_limits
64TB

Comment 1.2.2

ID: 1267864 User: Satishjuly18 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 18 Aug 2024 00:59 Selected Answer: - Upvotes: 1

65 TB now in Aug 2024

Comment 1.2.2.1

ID: 1267865 User: Satishjuly18 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 18 Aug 2024 01:00 Selected Answer: - Upvotes: 1

*64 TB

Comment 1.3

ID: 452808 User: dagoat Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 28 Sep 2021 00:15 Selected Answer: - Upvotes: 14

65 TB now in Sept 2021

Comment 2

ID: 128920 User: Rajuuu Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Tue 07 Jul 2020 14:20 Selected Answer: - Upvotes: 6

A as limit is now 30 TB for Cloud SQL

Comment 3

ID: 1586176 User: imrane1995 Badges: Most Recent Relative Date: 8 months ago Absolute Date: Sun 13 Jul 2025 17:11 Selected Answer: C Upvotes: 1

C. Cloud Spanner
📌 Breakdown of Requirements:
20 TB transactional data → A large-scale, transactional, and reliable database is needed.

From an operational system → Implies high consistency, availability, and performance.
Moving to GCP → Should use a cloud-native, fully managed GCP service.

Comment 4

ID: 1303871 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 08:55 Selected Answer: A Upvotes: 1

Cloud SQL storage limit: dedicated core up to 64 TB.

Comment 5

ID: 1053758 User: drpay Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 15:32 Selected Answer: C Upvotes: 2

two keywords: Transactional data, 20 TB

Comment 6

ID: 1015892 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 16:26 Selected Answer: C Upvotes: 2

Scalability: Cloud Spanner is designed to handle large volumes of data, making it suitable for a 20 TB database. It can scale horizontally and vertically to accommodate growing data needs.

Global Distribution: Cloud Spanner allows you to distribute data globally for low-latency access across regions, which can be advantageous for operational systems.

Strong Consistency: It provides strong transactional consistency, which is important for operational systems that require ACID compliance.

SQL Support: Cloud Spanner supports SQL, which is a familiar query language for developers.

While Cloud SQL, Cloud Bigtable, and Cloud Datastore have their use cases, Cloud Spanner is better suited for larger databases with strong consistency requirements, making it a suitable choice for migrating a 20 TB operational system database to GCP.

Comment 7

ID: 1007822 User: ashu381 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Thu 14 Sep 2023 19:07 Selected Answer: - Upvotes: 1

Cloud SQL, upto 64 TB now, you can always call GCP for increasing the quota though !!

Comment 8

ID: 928383 User: vaga1 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 20 Jun 2023 13:16 Selected Answer: A Upvotes: 2

Cloud SQL is generally better for OLTP, and Cloud SQL is up to 64 TB now.
https://cloud.google.com/sql/docs/quotas#storage_limits

Comment 9

ID: 895840 User: vaga1 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Fri 12 May 2023 12:43 Selected Answer: - Upvotes: 2

"move its operational system transaction data from an on-premises database to GCP". Cloud SQL may be plug-and-play

Comment 10

ID: 812980 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Sat 18 Feb 2023 14:08 Selected Answer: - Upvotes: 1

Not 100% in favour of A, Should i recommend my client Cloud SQL, when they are coming to me with 20TB already 30TB is limit, its transactional data, which i can't compromise. I will propose cloud spanner. There is nothing mentioned that they want to save cost.

Comment 11

ID: 762826 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 19:30 Selected Answer: - Upvotes: 1

A. Cloud SQL

Comment 12

ID: 722573 User: Jay_Krish Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 20 Nov 2022 12:38 Selected Answer: A Upvotes: 3

With the given requirements A. Cloud SQL is more than sufficient. Don't try to overthink scenarios like what if it grows.. what if there's additional requirement in future.. what if this what if that.. just look at the question and see the stated requirement. If there are more than one answer try to see which is simple and doesn't come with extra frills.

Comment 13

ID: 717775 User: Atnafu Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 14 Nov 2022 08:52 Selected Answer: - Upvotes: 2

A
65 TB now in Nov 2022

Comment 14

ID: 638261 User: WZH Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Wed 27 Jul 2022 19:49 Selected Answer: - Upvotes: 1

it is already 20 TB at the moment, and you probably want to change the database because the capacity of your current storage solution is not enough. Then you decide to change it to Cloud SQL(up to 30 TB) which may not increase much capacity? I am not sure about the answer but A looks weird imho.

Comment 15

ID: 633513 User: Dan226 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 19 Jul 2022 13:24 Selected Answer: - Upvotes: 1

Cloud SQL can store 64 Tb, but in the intial set up the operation are 20tb. It will reach the limitation soon if you choose Cloud SQL

Comment 16

ID: 467075 User: gcp_k Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sun 24 Oct 2021 20:00 Selected Answer: - Upvotes: 4

Depends.. I mean, C is correct if the exam is not updated. A is correct if the exam is updated. So ... kinda in catch 22 situation ...

Comment 17

ID: 464334 User: KokkiKumar Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Tue 19 Oct 2021 01:21 Selected Answer: - Upvotes: 2

Hi everyone, Can i purchase this exam? is it worthable?

14. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 183

Sequence: 67
Discussion ID: 79580
Source URL: https://www.examtopics.com/discussions/google/view/79580-exam-professional-data-engineer-topic-1-question-183/
Posted By: AWSandeep
Posted At: Sept. 2, 2022, 10:25 p.m.

Question

You are using Bigtable to persist and serve stock market data for each of the major indices. To serve the trading application, you need to access only the most recent stock prices that are streaming in. How should you design your row key and tables to ensure that you can access the data with the simplest query?

A. Create one unique table for all of the indices, and then use the index and timestamp as the row key design.
B. Create one unique table for all of the indices, and then use a reverse timestamp as the row key design.
C. For each index, have a separate table and use a timestamp as the row key design.
D. For each index, have a separate table and use a reverse timestamp as the row key design.

Community Answer Votes

A: 36 most voted
B: 26
D: 20

Comments 19 comments Click to expand

Comment 1

ID: 679610 User: John_Pongthorn Badges: Highly Voted Relative Date: 3 years, 5 months ago Absolute Date: Mon 26 Sep 2022 12:13 Selected Answer: - Upvotes: 17

This is special case , plese Take a look carefully the below link and read at last paragraph at the bottom of this comment, let everyone share idea, We will go with B, C
https://cloud.google.com/bigtable/docs/schema-design#time-based

Don't use a timestamp by itself or at the beginning of a row key, because this will cause sequential writes to be pushed onto a single node, creating a hotspot.

If you usually retrieve the most recent records first, you can use a reversed timestamp in the row key by subtracting the timestamp from your programming language's maximum value for long integers (in Java, java.lang.Long.MAX_VALUE). With a reversed timestamp, the records will be ordered from most recent to least recent.

Comment 1.1

ID: 714755 User: Mcloudgirl Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Wed 09 Nov 2022 18:10 Selected Answer: - Upvotes: 2

I agree, based on the docs, B. Leading with a non-reversed timestamp will lead to hotspotting, reversed is the way to go.

Comment 1.2

ID: 1350814 User: Ryannn23 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 03 Feb 2025 10:39 Selected Answer: - Upvotes: 2

According to the link you provided:
If you usually retrieve the most recent records first in your queries, a pattern to consider is using reversed timestamps in the row key. This pattern causes rows to be ordered from most recent to least recent, so more recent data is earlier in the table.
--------------------- READ CAREFULY --------------------------------------------------------------------------------
As with any timestamp, avoid starting a row key with a reversed timestamp so that you don't cause hotspots.
-----------------------------------------------------------------------------------------------------------------------------------
You can get a reversed timestamp by subtracting the timestamp from your programming language's maximum value for long integers (in Java, java.lang.Long.MAX_VALUE).

Hence, starting row key with timestamp should be avoided (normal or reversed).

That leads to answer A, which is a best practice.

Comment 2

ID: 986534 User: arien_chen Badges: Highly Voted Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 15:20 Selected Answer: D Upvotes: 6

Option B using reverse timestamp only, this is not the answer.
the right answer should be using the index and revers timestamp as the row key.

So, Option D is the only answer, because not A,B,C .

Comment 3

ID: 1588719 User: imrane1995 Badges: Most Recent Relative Date: 7 months, 3 weeks ago Absolute Date: Sun 20 Jul 2025 17:18 Selected Answer: D Upvotes: 1

When designing for Bigtable, particularly for time-series data like stock prices, your row key design directly impacts read efficiency and data locality. You want to:

Access latest data quickly

Avoid hotspots

Keep queries simple and fast

Comment 4

ID: 1350815 User: Ryannn23 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 03 Feb 2025 10:40 Selected Answer: A Upvotes: 3

Vote A, as explained by Augustax:

Agree A is the best option because:
1. Multi-tenancy solution
2. As with any timestamp, avoid starting a row key with a reversed timestamp so that you don't cause hotspots.

Comment 5

ID: 1343962 User: Augustax Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Tue 21 Jan 2025 02:57 Selected Answer: A Upvotes: 4

Agree A is the best option because:
1. Multi-tenancy solution
2. As with any timestamp, avoid starting a row key with a reversed timestamp so that you don't cause hotspots.

Comment 6

ID: 1328892 User: shangning007 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 19 Dec 2024 08:49 Selected Answer: D Upvotes: 4

I don't think any answer is correct.
A lot people upvote for B, but based on https://cloud.google.com/bigtable/docs/schema-design#time-based, "As with any timestamp, avoid starting a row key with a reversed timestamp so that you don't cause hotspots."

Comment 7

ID: 1305798 User: ToiToi Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 01 Nov 2024 13:30 Selected Answer: D Upvotes: 2

Why other options are not as suitable:

A and B (One table for all indices): Storing all indices in a single table can lead to performance issues as the table grows larger. It also makes it harder to scale individual indices independently.
C (Timestamp as row key): Using a regular timestamp would place the most recent data at the end of the table, making it less efficient to retrieve the latest prices.

Comment 8

ID: 1303917 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 11:17 Selected Answer: D Upvotes: 2

Option B and Option D are both from my point of view correct. It depens on the situation. If there is need to get the information from each stock index, then D is more suitable. Otherwise B.

Comment 9

ID: 1278776 User: mayankazyour Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 05 Sep 2024 09:52 Selected Answer: D Upvotes: 2

1. Reverse Timestamp for most recent stock prices
2. Having different table for each stock is more efficient, improves the query performance and option B doesn't specify stock in row key.

Comment 10

ID: 1260426 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 03 Aug 2024 21:22 Selected Answer: A Upvotes: 6

Row keys that start with a timestamp (irrespective reversed or not) causes sequential writes to be pushed onto a single node, creating a hotspot. If you put a timestamp in a row key, precede it with a high-cardinality value (index in our case) to avoid hotspots.

The ideal option would be: "use the index and reversed timestamp as the row key design".

Comment 11

ID: 1122671 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 14 Jan 2024 17:27 Selected Answer: B Upvotes: 5

B is a correct answer because "you need to access only the most recent stock prices"

"If you usually retrieve the most recent records first, you can use a reversed timestamp in the row key by subtracting the timestamp from your programming language's maximum value for long integers (in Java, java.lang.Long.MAX_VALUE). With a reversed timestamp, the records will be ordered from most recent to least recent."
https://cloud.google.com/bigtable/docs/schema-design#time-based

Comment 12

ID: 1104546 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 24 Dec 2023 11:35 Selected Answer: B Upvotes: 2

B. One unique table for all indices, reverse timestamp as row key:

A single table for all indices keeps the structure simple.
Using a reverse timestamp as part of the row key ensures that the most recent data comes first in the sorted order. This design is beneficial for quickly accessing the latest data.
For example: you can convert the timestamp to a string and format it in reverse order, like "yyyyMMddHHmmss", ensuring newer dates and times are sorted lexicographically before older ones.

Comment 13

ID: 1008346 User: kshehadyx Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 15 Sep 2023 11:54 Selected Answer: - Upvotes: 1

Correct Is B

Comment 14

ID: 968055 User: Lanro Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 31 Jul 2023 13:58 Selected Answer: B Upvotes: 2

https://cloud.google.com/bigtable/docs/schema-design#row-keys - If you usually retrieve the most recent records first, you can use a reversed timestamp
B it is.

Comment 15

ID: 945802 User: Chom Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 07 Jul 2023 17:55 Selected Answer: A Upvotes: 2

A is the answer

Comment 16

ID: 929460 User: vaga1 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 21 Jun 2023 14:12 Selected Answer: B Upvotes: 2

the answer relieves on whether the application need to access the whole indexes at the same time or not. If yes then is B, if no is A.

in mind the answer is yes, so B makes more sense: I retrieve all the list at the same time.

Comment 17

ID: 919089 User: ajdf Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 09 Jun 2023 10:34 Selected Answer: B Upvotes: 2

https://cloud.google.com/bigtable/docs/schema-design#time-based If you usually retrieve the most recent records first, you can use a reversed timestamp in the row key by subtracting the timestamp from your programming language's maximum value for long integers (in Java, java.lang.Long.MAX_VALUE). With a reversed timestamp, the records will be ordered from most recent to least recent.

15. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 3

Sequence: 68
Discussion ID: 16635
Source URL: https://www.examtopics.com/discussions/google/view/16635-exam-professional-data-engineer-topic-1-question-3/
Posted By: -
Posted At: March 15, 2020, 8:14 a.m.

Question

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Community Answer Votes

C: 18 most voted

Comments 18 comments Click to expand

Comment 1

ID: 473987 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 4 months ago Absolute Date: Sun 07 Nov 2021 18:37 Selected Answer: - Upvotes: 10

C is correct because this option provides the least amount of inconvenience over using pre-specified date ranges or one table per clinic while also increasing performance due to avoiding self-joins.
A is not correct because adding additional compute resources is not a recommended way to resolve database schema problems.
B is not correct because this will reduce the functionality of the database and make running reports more difficult.
D is not correct because this will likely increase the number of tables so much that it will be more difficult to generate reports vs. the correct option.
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#explicit-alias-visibility

Comment 1.1

ID: 1332101 User: gord_nat Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Thu 26 Dec 2024 20:51 Selected Answer: - Upvotes: 2

Why are we assuming the database in question is BigQuery? There are several other RDBMS options in GCP. ..
Also, why was the original db pushed to prod without being normalized first? Typically, the normalized db is released to prod. When the data set becomes larger, you would add partitioning.
The scenario being presented is unrealistic.

Comment 2

ID: 280556 User: balseron99 Badges: Highly Voted Relative Date: 5 years, 1 month ago Absolute Date: Sun 31 Jan 2021 14:24 Selected Answer: - Upvotes: 8

A is incorrect because adding space won't solve the problem of query performance.
B is incorrect because there is nothing related to the report generation which is specified and sharding tables on date ranges is not a good option as it will create many tables.
C is CORRECT because the statement says "the scope of the project has expanded. The database must now store 100 times more patient records". As the data increases there would be difficulty in managing the tables and querying it. Hence creating different table is correct as per the need.
D is Incorrect as it Partitions on each clinic. We have to adjust the database design so that it performs optimally when generating reports.
Also nothing is specified for generation of reports in the required statement.

Comment 3

ID: 1583843 User: Nanto90 Badges: Most Recent Relative Date: 8 months, 1 week ago Absolute Date: Sun 06 Jul 2025 22:43 Selected Answer: C Upvotes: 1

Because you can optimize the storage of each table; furthermore, you will avoid self-joins.

Comment 4

ID: 1362312 User: Ahamada Badges: - Relative Date: 1 year ago Absolute Date: Wed 26 Feb 2025 22:38 Selected Answer: C Upvotes: 1

answer is C, the problem here is the self-join (avoid self-join if possible) on a Denormalized table. So the solution is to Normalize

Comment 5

ID: 1339951 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 13 Jan 2025 16:26 Selected Answer: C Upvotes: 1

Normalizing the database into separate Patients and Visits tables, along with creating other necessary tables, is the best solution for handling the increased data size while ensuring efficient query performance and maintainability. This approach addresses the root problem instead of applying temporary fixes.

Comment 6

ID: 1300757 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 06:41 Selected Answer: C Upvotes: 1

C is the most suitable solution for this situation. It provides a better way for scalability and monitoring. B has a constraint on predefined date range, which is usually not suitable for reporting.

Comment 7

ID: 1060866 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 02 Nov 2023 21:24 Selected Answer: C Upvotes: 4

Normalization is a technique used to organize data in a relational database to reduce data redundancy and improve data integrity. Breaking the patient records into separate tables (patient and visits) and eliminating self-joins will make the database more scalable and improve query performance. It also helps maintain data integrity and makes it easier to manage large datasets efficiently.

Options A, B, and D may provide some benefits in specific cases, but for a scenario where the project scope has expanded significantly and there are performance issues with self-joins, normalization (Option C) is the most robust and scalable solution.

Comment 8

ID: 1050463 User: rtcpost Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 22 Oct 2023 12:50 Selected Answer: C Upvotes: 3

Normalization is a technique used to organize data in a relational database to reduce data redundancy and improve data integrity. Breaking the patient records into separate tables (patient and visits) and eliminating self-joins will make the database more scalable and improve query performance. It also helps maintain data integrity and makes it easier to manage large datasets efficiently.

Options A, B, and D may provide some benefits in specific cases, but for a scenario where the project scope has expanded significantly and there are performance issues with self-joins, normalization (Option C) is the most robust and scalable solution.

Comment 9

ID: 901955 User: vaga1 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 19 May 2023 15:08 Selected Answer: C Upvotes: 1

"100 times more patient records"immediately brings to create a patient dimensional table to save space on disk if a generical relational database is mentioned.

Comment 10

ID: 839145 User: maurilio_cardoso_multiedro Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 19:45 Selected Answer: - Upvotes: 1

C - https://cloud.google.com/bigquery/docs/best-practices-performance-patterns

Comment 11

ID: 835649 User: bha11111 Badges: - Relative Date: 3 years ago Absolute Date: Sat 11 Mar 2023 05:13 Selected Answer: C Upvotes: 1

C- This is correct have verified from different sources

Comment 12

ID: 810161 User: Morock Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 02:26 Selected Answer: C Upvotes: 1

Should be C. Basic ER design...

Comment 13

ID: 771070 User: GCPpro Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 10 Jan 2023 06:13 Selected Answer: - Upvotes: 1

c - is the correct one.

Comment 14

ID: 767528 User: testoneAZ Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 06 Jan 2023 11:43 Selected Answer: - Upvotes: 1

C should be the correct answer

Comment 15

ID: 755064 User: Brillianttyagi Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 24 Dec 2022 18:53 Selected Answer: C Upvotes: 1

C- Is the correct answer!

Comment 16

ID: 559254 User: Arkon88 Badges: - Relative Date: 4 years ago Absolute Date: Wed 02 Mar 2022 09:11 Selected Answer: C Upvotes: 2

C - based on Google documentation, self-join is an anti-pattern:
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns

Comment 17

ID: 544481 User: ch1nczyk Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 10 Feb 2022 12:35 Selected Answer: C Upvotes: 1

Correct

16. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 275

Sequence: 69
Discussion ID: 130517
Source URL: https://www.examtopics.com/discussions/google/view/130517-exam-professional-data-engineer-topic-1-question-275/
Posted By: Smakyel79
Posted At: Jan. 7, 2024, 5:17 p.m.

Question

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

A. Import the ORC files to Bigtable tables for the data scientist team.
B. Import the ORC files to BigQuery tables for the data scientist team.
C. Copy the ORC files on Cloud Storage, then deploy a Dataproc cluster for the data scientist team.
D. Copy the ORC files on Cloud Storage, then create external BigQuery tables for the data scientist team.

Community Answer Votes

D: 15 most voted
C: 4
B: 1

Comments 11 comments Click to expand

Comment 1

ID: 1117778 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 20:20 Selected Answer: D Upvotes: 8

- It leverages the strengths of BigQuery for SQL-based exploration while avoiding additional costs and complexity associated with data transformation or migration.
- The data remains in ORC format in Cloud Storage, and BigQuery's external tables feature allows direct querying of this data.

Comment 1.1

ID: 1273786 User: nadavw Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 28 Aug 2024 06:41 Selected Answer: - Upvotes: 1

There is a requirement to use a 'hive query engine'', and BQ is using only the hive metastore and his own engine, so 'D' seems a better fit here.

Comment 2

ID: 1175921 User: kaisarfarel Badges: Highly Voted Relative Date: 1 year, 12 months ago Absolute Date: Sun 17 Mar 2024 16:35 Selected Answer: - Upvotes: 7

I think C is the correct answer, DS want to explore the data in a "similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine". Dataproc can help to create clusters quickly with the Hadoop cluster. CMIIW

Comment 2.1

ID: 1333029 User: apoio.certificacoes.closer Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 28 Dec 2024 16:35 Selected Answer: - Upvotes: 2

I think "Similar" is doing a lot of heavy lift on the confusion. If it was equal, I'd say C. Since it similar, it can be GoogleSQL (Bigquery).

Comment 3

ID: 1582619 User: 56d02cd Badges: Most Recent Relative Date: 8 months, 1 week ago Absolute Date: Thu 03 Jul 2025 00:14 Selected Answer: C Upvotes: 1

It says that scientists need to "explore the data with SQL on the Hive query engine". That excludes BigQuery.

Comment 4

ID: 1305205 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 22:58 Selected Answer: B Upvotes: 1

using external tables have always limitations - affecting performance, no preview of the data and no cost estimation. So, why option D is correct?

Comment 5

ID: 1174741 User: hanoverquay Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 06:12 Selected Answer: D Upvotes: 1

option d

Comment 6

ID: 1171210 User: 0725f1f Badges: - Relative Date: 2 years ago Absolute Date: Mon 11 Mar 2024 19:00 Selected Answer: C Upvotes: 3

it is talking about partition as well

Comment 7

ID: 1155315 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 07:36 Selected Answer: D Upvotes: 1

Option D

Comment 8

ID: 1121824 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 16:59 Selected Answer: D Upvotes: 2

Option D - leverages BigQuery for SQL-based exploration on direct querying to cloud storage

Comment 9

ID: 1116001 User: Smakyel79 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 17:17 Selected Answer: D Upvotes: 3

This approach leverages BigQuery's powerful analytics capabilities without the overhead of data transformation or maintaining a separate cluster, while also allowing your team to use SQL for data exploration, similar to their experience with the on-premises Hadoop/Hive environment.

17. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 281

Sequence: 71
Discussion ID: 130269
Source URL: https://www.examtopics.com/discussions/google/view/130269-exam-professional-data-engineer-topic-1-question-281/
Posted By: scaenruy
Posted At: Jan. 4, 2024, 6:28 a.m.

Question

You work for a large ecommerce company. You store your customer's order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

A. Set the expiring values of the column families to 29 days and keep the number of versions to 1.
B. Use a timestamp range filter in the query to fetch the customer's data for a specific range.
C. Schedule a job daily to scan the data in the table and delete data older than 30 days.
D. Set the expiring values of the column families to 30 days and set the number of versions to 2.

Community Answer Votes

B: 18 most voted

Comments 8 comments Click to expand

Comment 1

ID: 1121856 User: Matt_108 Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 16:38 Selected Answer: B Upvotes: 8

Agree with others https://cloud.google.com/bigtable/docs/garbage-collection

Comment 1.1

ID: 1131343 User: AllenChen123 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 05:58 Selected Answer: - Upvotes: 10

Agree. https://cloud.google.com/bigtable/docs/garbage-collection#data-removed
"Because it can take up to a week for expired data to be deleted, you should never rely solely on garbage collection policies to ensure that read requests return the desired data. Always apply a filter to your read requests that excludes the same values as your garbage collection rules. You can filter by limiting the number of cells per column or by specifying a timestamp range."

Comment 2

ID: 1159253 User: cuadradobertolinisebastiancami Badges: Highly Voted Relative Date: 1 year, 6 months ago Absolute Date: Sun 25 Aug 2024 23:51 Selected Answer: B Upvotes: 6

Agree with MAtt_108 and AllenChen 123.
"Garbage collection is a continuous process in which Bigtable checks the rules for each column family and deletes expired and obsolete data accordingly. In general, it can take up to a week from the time that data matches the criteria in the rules for the data to actually be deleted. You are not able to change the timing of garbage collection."

"Always apply a filter to your read requests that exclude the same values as your garbage collection rules. "

Ref: https://cloud.google.com/bigtable/docs/garbage-collection#data-removed

Comment 3

ID: 1581308 User: Ben_oso Badges: Most Recent Relative Date: 8 months, 2 weeks ago Absolute Date: Sat 28 Jun 2025 01:28 Selected Answer: B Upvotes: 1

Its "B", but the filter responsability transfer to user, so, this dont ensure that him filter the data.

Comment 4

ID: 1325825 User: m_a_p_s Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 12 Dec 2024 20:29 Selected Answer: B Upvotes: 1

"Because it can take up to a week for expired data to be deleted, you should never rely solely on garbage collection policies to ensure that read requests return the desired data. Always apply a filter to your read requests that excludes the same values as your garbage collection rules. You can filter by limiting the number of cells per column or by specifying a timestamp range."

https://cloud.google.com/bigtable/docs/garbage-collection#data-removed

Comment 5

ID: 1118322 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 10 Jul 2024 07:58 Selected Answer: B Upvotes: 1

I will go for B too

Comment 6

ID: 1115752 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 11:15 Selected Answer: - Upvotes: 3

B. Use a timestamp range filter in the query to fetch the customer's data for a specific range.

Always use query filter as garbage collectore runs on it's way - https://cloud.google.com/bigtable/docs/garbage-collection

Comment 7

ID: 1113352 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 05:28 Selected Answer: B Upvotes: 1

B. Use a timestamp range filter in the query to fetch the customer's data for a specific range.

18. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 162

Sequence: 81
Discussion ID: 16902
Source URL: https://www.examtopics.com/discussions/google/view/16902-exam-professional-data-engineer-topic-1-question-162/
Posted By: rickywck
Posted At: March 18, 2020, 2:11 a.m.

Question

You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the `Trust No One` (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data. What should you do?

A. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
B. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
C. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
D. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.

Community Answer Votes

D: 22 most voted
A: 12

Comments 18 comments Click to expand

Comment 1

ID: 70284 User: dhs227 Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Thu 02 Apr 2020 01:50 Selected Answer: - Upvotes: 46

The correct answer must be D
A and B can be eliminated immediately since kms generated keys are considered potentially accessible by CSP.
C is incorrect because memory store is essentially a cache service.

Additional authenticated data (AAD) acts as a "salt", it is not a cipher.

Comment 1.1

ID: 171351 User: mikey007 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Tue 01 Sep 2020 14:42 Selected Answer: - Upvotes: 4

AAD is bound to the encrypted data, because you cannot decrypt the ciphertext unless you know the AAD, but it is not stored as part of the ciphertext. AAD also does not increase the cryptographic strength of the ciphertext. Instead it is an additional check by Cloud KMS to authenticate a decryption request.

Comment 2

ID: 68924 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sat 28 Mar 2020 17:54 Selected Answer: - Upvotes: 15

Answer: A
Description: AAD is used to decrypt the data so better to keep it outside GCP for safety

Comment 3

ID: 1573555 User: grimoren Badges: Most Recent Relative Date: 9 months, 2 weeks ago Absolute Date: Fri 30 May 2025 18:02 Selected Answer: D Upvotes: 2

The "prevent the cloud provider staff from decrypting your data" makes me lean more towards D than A.

Comment 4

ID: 1561423 User: aaaaaaaasdasdasfs Badges: - Relative Date: 10 months, 4 weeks ago Absolute Date: Thu 17 Apr 2025 12:28 Selected Answer: D Upvotes: 2

Based on the requirement for a "Trust No One" (TNO) approach where even the cloud provider cannot decrypt your data, the correct answer is:
D. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
This is the best option because:

Customer-supplied encryption keys (CSEKs) are encryption keys that you manage and provide to Google Cloud, rather than using Google-managed keys.
When you use CSEKs, Google Cloud uses your key to encrypt your data but does not store the key. This means Google Cloud cannot decrypt your data without you providing the key again.
Storing the CSEK in a different project that only the security team can access ensures that the key is securely stored but separated from the encrypted data.

Comment 5

ID: 1559564 User: aaaaaaaasdasdasfs Badges: - Relative Date: 11 months ago Absolute Date: Thu 10 Apr 2025 14:01 Selected Answer: D Upvotes: 2

This is the correct option because:

Customer-supplied encryption keys (CSEKs) provide client-side encryption where you fully control the keys.
By specifying the CSEK in the .boto configuration file, the data is encrypted before it reaches Google's servers.
Storing the keys in a different project with restricted access ensures proper separation.
This approach keeps the encryption keys entirely under your control, following the TNO principle.

Comment 6

ID: 1230469 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 14:01 Selected Answer: A Upvotes: 2

Keep AAD Outside of Google Cloud:

Keeping the AAD outside of Google Cloud ensures that Google cannot access the additional context required to decrypt the files, thus implementing the TNO approach.

Option C:
Customer-Supplied Encryption Key (CSEK) in .boto File:
Storing the CSEK in Cloud Memorystore or any cloud service introduces a risk where the key could be potentially accessed by cloud provider staff.
Option D:
Customer-Supplied Encryption Key (CSEK) in a Different Project:
While storing the CSEK in a different project adds some security, it still leaves the keys within the Google Cloud environment, which does not fully meet the TNO approach.

Comment 7

ID: 1076470 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 17:55 Selected Answer: - Upvotes: 1

I just cannot understand this question. If you can't trust the provider, in this case Google, then how can you use the KMS approach. In my mind you have to generate the key locally and upload but I'm clearly wrong and don't get why.

Comment 8

ID: 1020900 User: shanwford Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 29 Sep 2023 17:20 Selected Answer: D Upvotes: 4

IMO must be (D) : to reach TNO goal keys must be customer supplied.

Comment 9

ID: 1016261 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 25 Sep 2023 02:56 Selected Answer: D Upvotes: 2

Customer-Supplied Encryption Key (CSEK): CSEK allows you to provide your encryption keys, ensuring that the cloud provider staff does not have access to the keys and cannot decrypt your data.

Separate Project for Key Management: Saving the CSEK in a different project that only the security team can access adds an additional layer of security. It isolates the encryption keys from the project where the data is stored, ensuring that even within the same cloud provider, only authorized personnel can access the keys.

Use of .boto Configuration: Specifying the CSEK in the .boto configuration file ensures that it is applied consistently when interacting with Cloud Storage through tools like gsutil. This way, every archival file is encrypted using your keys.

Options A and B involve using Google Cloud Key Management Service (KMS) to manage keys, which does not align with the TNO approach because cloud provider staff could potentially access the keys stored in Google Cloud KMS.

Comment 10

ID: 972371 User: NewDE2023 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 04 Aug 2023 20:09 Selected Answer: D Upvotes: 4

CSEKs are used when an organization needs complete control over key management.

Comment 11

ID: 963965 User: tavva_prudhvi Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Wed 26 Jul 2023 18:27 Selected Answer: - Upvotes: 2

Option A is not the best choice for the "Trust No One" (TNO) approach because it involves using Google Cloud's Key Management Service (KMS) to create and manage encryption keys. This means that the cloud provider will have access to the keys, which could potentially enable their staff to decrypt the data.

Comment 12

ID: 837828 User: midgoo Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 11:14 Selected Answer: A Upvotes: 3

D may work, but 'Trust No One' = do not trust GCP too. So D cannot be the answer.

Comment 13

ID: 813385 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Sat 18 Feb 2023 19:29 Selected Answer: - Upvotes: 2

answer A: KMS + AAD is more secure than CSEK

Comment 14

ID: 722583 User: Jay_Krish Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 20 Nov 2022 12:58 Selected Answer: D Upvotes: 4

CSEK with only security team having access seems to be right approach. Not sure how A can be better.

Comment 15

ID: 711274 User: cloudmon Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Fri 04 Nov 2022 18:20 Selected Answer: A Upvotes: 2

It’s A, because you cannot decrypt the ciphertext unless you know the AAD (https://cloud.google.com/kms/docs/additional-authenticated-data)

Comment 16

ID: 686473 User: devaid Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 05 Oct 2022 01:07 Selected Answer: A Upvotes: 1

Answer: A

Comment 17

ID: 675703 User: clouditis Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 22 Sep 2022 05:23 Selected Answer: - Upvotes: 1

D it is

19. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 284

Sequence: 86
Discussion ID: 130273
Source URL: https://www.examtopics.com/discussions/google/view/130273-exam-professional-data-engineer-topic-1-question-284/
Posted By: scaenruy
Posted At: Jan. 4, 2024, 7:37 a.m.

Question

You have a network of 1000 sensors. The sensors generate time series data: one metric per sensor per second, along with a timestamp. You already have 1 TB of data, and expect the data to grow by 1 GB every day. You need to access this data in two ways. The first access pattern requires retrieving the metric from one specific sensor stored at a specific timestamp, with a median single-digit millisecond latency. The second access pattern requires running complex analytic queries on the data, including joins, once a day. How should you store this data?

A. Store your data in BigQuery. Concatenate the sensor ID and timestamp, and use it as the primary key.
B. Store your data in Bigtable. Concatenate the sensor ID and timestamp and use it as the row key. Perform an export to BigQuery every day.
C. Store your data in Bigtable. Concatenate the sensor ID and metric, and use it as the row key. Perform an export to BigQuery every day.
D. Store your data in BigQuery. Use the metric as a primary key.

Community Answer Votes

B: 25 most voted

Comments 8 comments Click to expand

Comment 1

ID: 1117914 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Wed 10 Jan 2024 00:14 Selected Answer: B Upvotes: 16

- Bigtable excels at incredibly fast lookups by row key, often reaching single-digit millisecond latencies.
- Constructing the row key with sensor ID and timestamp enables efficient retrieval of specific sensor readings at exact timestamps.
- Bigtable's wide-column design effectively stores time series data, allowing for flexible addition of new metrics without schema changes.
- Bigtable scales horizontally to accommodate massive datasets (petabytes or more), easily handling the expected data growth.

Comment 2

ID: 1571651 User: 22c1725 Badges: Most Recent Relative Date: 9 months, 3 weeks ago Absolute Date: Fri 23 May 2025 18:30 Selected Answer: B Upvotes: 1

If anyone in the future answers other then B I wouldn't be suprise.

Comment 3

ID: 1231859 User: fitri001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 11:57 Selected Answer: B Upvotes: 2

agree with raaad

Comment 4

ID: 1174366 User: hanoverquay Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Fri 15 Mar 2024 17:04 Selected Answer: B Upvotes: 1

voted b

Comment 5

ID: 1155430 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Wed 21 Feb 2024 11:16 Selected Answer: B Upvotes: 1

Option B

Comment 6

ID: 1121866 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 17:45 Selected Answer: B Upvotes: 1

Option B - agree with raaad

Comment 7

ID: 1113373 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 07:37 Selected Answer: B Upvotes: 3

B. Store your data in Bigtable. Concatenate the sensor ID and timestamp and use it as the row key. Perform an export to BigQuery every day.

Comment 7.1

ID: 1116010 User: Smakyel79 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 17:38 Selected Answer: - Upvotes: 4

Based on your requirements, Option B seems most suitable. Bigtable's design caters to the low-latency access of time-series data (your first requirement), and the daily export to BigQuery enables complex analytics (your second requirement). The use of sensor ID and timestamp as the row key in Bigtable would facilitate efficient access to specific sensor data at specific times.

20. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 317

Sequence: 91
Discussion ID: 152975
Source URL: https://www.examtopics.com/discussions/google/view/152975-exam-professional-data-engineer-topic-1-question-317/
Posted By: noiz
Posted At: Dec. 14, 2024, 2:24 a.m.

Question

You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?

A. Use BigQuery Data Transfer Service to move files into BigQuery.
B. Use Storage Transfer Service to move files into Cloud Storage
C. Use Dataflow to move files into Cloud Storage
D. Use Cloud Data Fusion to move files into Cloud Storage.

Community Answer Votes

D: 10 most voted
B: 4

Comments 6 comments Click to expand

Comment 1

ID: 1571082 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 21:56 Selected Answer: D Upvotes: 1

I would go with "D" since GUI is required.

Comment 2

ID: 1332665 User: apoio.certificacoes.closer Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 23:43 Selected Answer: D Upvotes: 2

I have read in previous questions that Transfer Services only uses CMEK in-transit.
https://cloud.google.com/storage-transfer/docs/on-prem-security#in-flight

Comment 3

ID: 1331217 User: m_a_p_s Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 24 Dec 2024 20:27 Selected Answer: D Upvotes: 4

D - only Cloud Data Fusion is a GUI-based solution.

Comment 4

ID: 1331166 User: skycracker Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Tue 24 Dec 2024 16:24 Selected Answer: D Upvotes: 3

data fusion allows encryption

Comment 5

ID: 1326308 User: noiz Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sat 14 Dec 2024 02:24 Selected Answer: B Upvotes: 4

Is B incorrect?
Transfer service + CloudKMS

Comment 5.1

ID: 1571083 User: 22c1725 Badges: - Relative Date: 9 months, 3 weeks ago Absolute Date: Wed 21 May 2025 21:57 Selected Answer: - Upvotes: 1

There is no GUI.

21. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 258

Sequence: 96
Discussion ID: 130211
Source URL: https://www.examtopics.com/discussions/google/view/130211-exam-professional-data-engineer-topic-1-question-258/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 5:12 p.m.

Question

You have several different file type data sources, such as Apache Parquet and CSV. You want to store the data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?

A. Use Storage Transfer Service to move files into Cloud Storage.
B. Use Cloud Data Fusion to move files into Cloud Storage.
C. Use Dataflow to move files into Cloud Storage.
D. Use BigQuery Data Transfer Service to move files into BigQuery.

Community Answer Votes

B: 21 most voted
A: 6

Comments 12 comments Click to expand

Comment 1

ID: 1114557 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 15:25 Selected Answer: B Upvotes: 10

- Cloud Data Fusion is a fully managed, code-free, GUI-based data integration service that allows you to visually connect, transform, and move data between various sources and sinks. - It supports various file formats and can write to Cloud Storage.
- You can configure it to use Customer-Managed Encryption Keys (CMEK) for the buckets where it writes data.

Comment 1.1

ID: 1127098 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 20 Jan 2024 07:55 Selected Answer: - Upvotes: 4

Agree. https://cloud.google.com/data-fusion/docs/how-to/customer-managed-encryption-keys#create-instance

Comment 2

ID: 1127560 User: Helinia Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sun 21 Jan 2024 01:54 Selected Answer: B Upvotes: 6

Even though storage transfer service can be used in GUI, it does not support CMEK which is required in this question.

"Storage Transfer Service does not encrypt data on your behalf, such as in customer-managed encryption keys (CMEK). We only encrypt data in transit."

Ref: https://cloud.google.com/storage-transfer/docs/on-prem-security

Comment 3

ID: 1564601 User: aaaaaaaasdasdasfs Badges: Most Recent Relative Date: 10 months, 2 weeks ago Absolute Date: Tue 29 Apr 2025 07:32 Selected Answer: A Upvotes: 2

A. Use Storage Transfer Service
• ✅ GUI-based: Yes, can be set up in the Google Cloud Console.
• ✅ Can move data from on-premises, AWS, or even between GCS buckets.
• ✅ Works with different file types, including CSV, Parquet.
• ✅ Supports writing into buckets protected with CMEK.
• ✅ Best suited when you’re moving or syncing raw files into Cloud Storage

Comment 4

ID: 1362477 User: vishavpreet Badges: - Relative Date: 1 year ago Absolute Date: Thu 27 Feb 2025 11:14 Selected Answer: B Upvotes: 1

Cloud Data Fusion is a fully-managed, cloud native, enterprise data integration service for quickly building and managing data pipelines.
Graphically, no coding solution.

Comment 5

ID: 1332161 User: hussain.sain Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 27 Dec 2024 02:20 Selected Answer: B Upvotes: 1

B is answer as requirement is for GUI and sink

Comment 6

ID: 1295490 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 10 Oct 2024 10:29 Selected Answer: - Upvotes: 1

I acknolwedge that the question wants the answer to be B, so I'm calling it as B. I don't like this though, as can't we just create a bucket with a CMEK upfront and then use the storage transfer service? It would be easier and cheaper, and achieve the same thing.
The language of "sink" strongly suggests to me they intend this to be B though, as datafusion uses that terminology, and the CMEK thing is probably indicating that datafusion can encrypt for you with CMEK.

Comment 7

ID: 1177686 User: hanoverquay Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 19 Mar 2024 21:03 Selected Answer: B Upvotes: 1

option 8

Comment 8

ID: 1157919 User: ashdam Badges: - Relative Date: 2 years ago Absolute Date: Sat 24 Feb 2024 13:08 Selected Answer: B Upvotes: 1

B just because tranfer service does not support CMEK
A:
* GUI + encryption but no CMEK.
B:
* Its GUI ETL + support CMEK but not sure why you need an ETL tool for transfering something once? (No scheduling or event-driven trigger is mentioned)

Comment 9

ID: 1147132 User: casadocc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 11 Feb 2024 11:17 Selected Answer: - Upvotes: 1

B if data fusion creates the bucket. We could create the bucket and associate, in this case A is better.

Comment 10

ID: 1119556 User: task_7 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 11 Jan 2024 10:37 Selected Answer: A Upvotes: 4

A. Use Storage Transfer Service to move files into Cloud Storage.
move files into Cloud Storage should be Storage Transfer Service
Cloud Data Fusion is like using a tank to kill an ant

Comment 11

ID: 1112919 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 17:12 Selected Answer: B Upvotes: 1

B. Use Cloud Data Fusion to move files into Cloud Storage.

22. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 13

Sequence: 99
Discussion ID: 16279
Source URL: https://www.examtopics.com/discussions/google/view/16279-exam-professional-data-engineer-topic-1-question-13/
Posted By: jvg637
Posted At: March 11, 2020, 6:22 p.m.

Question

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.
Which Google database service should you use?

A. Cloud SQL
B. BigQuery
C. Cloud Bigtable
D. Cloud Datastore

Community Answer Votes

D: 61 most voted
A: 35
C: 10

Comments 30 comments Click to expand

Comment 1

ID: 101332 User: DeepakKhattar Badges: Highly Voted Relative Date: 5 years, 9 months ago Absolute Date: Wed 03 Jun 2020 02:57 Selected Answer: - Upvotes: 78

Initially, thinking D is the best answer but when question is re-re-read, A seems to be correct answer for following reasons
1. Is payment TRANSACTION -- DB should able to perform full blown transaction (updating inventory, sales info etc, though not specified) , not just ATOMIC which DataStore provides
2. Its point-of-sale application, not ONLINE STORE where HIGH number of concurrent users ordering stuff.
3. User Base could grow exponentially - again more users does mot mean concurrent users and more processing power. Its only about storage.
4. Do not want to Manage infrastructure scaling. - Cloud SQL can scale in terms of storage.
5. CloudStore is poor selection for OLTP application
- Each property is index - so higher latency

Not sure, during exam 2 min is enough to think on various point..
I may be wrong or wrong path ... lets brainstrom..

Comment 1.1

ID: 473508 User: canon123 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 06 Nov 2021 16:15 Selected Answer: - Upvotes: 11

CloudSql does not auto scale.

Comment 1.1.1

ID: 494899 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Mon 06 Dec 2021 05:51 Selected Answer: - Upvotes: 3

https://cloud.google.com/architecture/elastically-scaling-your-mysql-environment#objectives

Please read. It can be configured for autoscaling.

Comment 1.1.1.1

ID: 503303 User: hendrixlives Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 17 Dec 2021 03:29 Selected Answer: - Upvotes: 4

That link explains how to set MySQL autoscaling with Google Compute Engine instances (you install and manage MySQL on the VM). This can not be applied to Cloud SQL (managed service). In Cloud SQL, only the storage can be automatically increased, and changing the Cloud SQL instance size requires a manual edit of the instance type.

Comment 1.1.1.1.1

ID: 706617 User: MisuLava Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Fri 28 Oct 2022 18:21 Selected Answer: - Upvotes: 3

yes. and that is ok since this is a point of sale. an exponential increase in number of clients still means reduced parallel processing (how many customers can buy in the very same time) so an increase in memory and CPU is very unlikely to be necessary. yes, an exponential increase in the number of customers means more memory, and more storage, which in cloud SQL increases automatically.

Comment 1.1.1.2

ID: 681374 User: nkunwar Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 05:43 Selected Answer: - Upvotes: 2

C SQL doesn't AUTO SCALE, you need to manually edit , Please show where does it says AUTO SCALING

Comment 1.2

ID: 431181 User: Blobby Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Wed 25 Aug 2021 08:01 Selected Answer: - Upvotes: 3

Can't online be considered PoS? CloudSQL does have constraints for scaling and Google seem to specifically be selling Datastore for transactional use cases so going with D:
https://cloud.google.com/datastore/docs/concepts/transactions

Comment 1.2.1

ID: 445000 User: Blobby Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Wed 15 Sep 2021 07:09 Selected Answer: - Upvotes: 5

Based on a re-read of the above comments and other later questions agree with A.
pls ignore my first answer.

Comment 2

ID: 62577 User: jvg637 Badges: Highly Voted Relative Date: 6 years ago Absolute Date: Wed 11 Mar 2020 18:22 Selected Answer: - Upvotes: 38

D seems to be the right one. Cloud SQL doesn't automatically scale

Comment 2.1

ID: 494895 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Mon 06 Dec 2021 05:47 Selected Answer: - Upvotes: 2

Cloud SQL does scale automatically. THERE IS A SETTING WHERE YOU DEFINE INCREASE MEMORY SPACE WHEN IT REACHED 70%.

https://cloud.google.com/sql/docs/features#features_3

Here it say's
-> Fully managed SQL Server databases in the cloud.
-> Custom machine types with up to 624 GB of RAM and 96 CPUs.
-> Up to 64 TB of storage available, with the ability to automatically increase storage size as needed.

Comment 2.1.1

ID: 503306 User: hendrixlives Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 17 Dec 2021 03:33 Selected Answer: - Upvotes: 6

Storage scale is automatic (e.g. you begin with a 50GB disk and it grows automatically as needed), but the instance size (CPU/memory) will be the same. The questions states that the user base may increase exponentially. Even if you have enough disk space to store all your user data, the increase in users will cause problems if your instance (CPU/memory) is too small, since the instance will not be able to process all the queries at the required speed.

Comment 2.1.1.1

ID: 739564 User: imsaikat50 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 08 Dec 2022 23:57 Selected Answer: - Upvotes: 2

I believe the key point is it's a POS, not an e-commerce. Keeping that in mind, exponential user increase in POS might not mean concurrent user increase, which could be a huge consideration in case of it is being e-commerce.

I would rather go with 'Cloud SQL' as the best answer.

Comment 3

ID: 1559549 User: fassil Badges: Most Recent Relative Date: 11 months ago Absolute Date: Thu 10 Apr 2025 13:42 Selected Answer: D Upvotes: 1

You don't need to go far, guys. Cloud SQL does not support auto scale; cut the shit here based on the requirement—the question specifically says "you do not want to manage infrastructure scaling."

Comment 4

ID: 1398846 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:39 Selected Answer: D Upvotes: 2

POS app needing scalable DB without managing infrastructure. Cloud Datastore or Firestore? But options are Cloud SQL, BigQuery, Bigtable, Datastore. Since it's for transactions and scale, maybe Cloud Datastore (D) or Cloud Bigtable. But Cloud Bigtable is for high throughput. The question says not to manage scaling. Cloud Datastore is serverless. So D.

Comment 5

ID: 1366712 User: monyu Badges: - Relative Date: 1 year ago Absolute Date: Sun 09 Mar 2025 00:52 Selected Answer: D Upvotes: 1

B and C are discarded Since we are dealing with transactional data.
A is discarded since we require to deal with Infrastructure scaling in case the base of users increase and also (not necessarily) the concurrent transactions.

Users base is gonna grow and we DO NOT WANT TO DEAL with infrastructure scaling. D is the most appropriated

Comment 6

ID: 1342430 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 06:54 Selected Answer: D Upvotes: 1

The answer is D, Cloud Datastore (now Firestore in Datastore mode), because it supports auto scaling and low latency reads and writes. Cloud SQL is not the correct answer because it requires more active management for scaling.

Comment 7

ID: 1301210 User: GHill1982 Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 20:03 Selected Answer: C Upvotes: 3

For handling payment transactions in a point-of-sale application with potential exponential growth and without the need to manage infrastructure scaling, Cloud Bigtable would be the best choice.

Comment 8

ID: 214051 User: Radhika7983 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:20 Selected Answer: - Upvotes: 14

D seems to be the answer. This is what I think based on my analysis below.
POS is OLTP system but now a days NOSQL with ACID properties also are used for OLTP,
Cloud sql is good for relational database and it would have been an option here but it clearly says that "you do not want to manage infrastructure scaling". In cloud SQL, which is managed service and not server less, you need to manually do vertical scaling(scale up and scale down).
Hence I believe CLOUD SQL is not the option here.
I also tried creating a datastore using google cloud console and it gives 2 options now that is cloud firestore in native mode and cloud firestore in data store mode. automatic scaling is available in both where there is no manual scaling up or down is required. Also, both firestore in native and datastore provides ACID properties. Also, firestore is now optimized for OLTP. Please see below
https://cloud.google.com/solutions/building-scalable-apps-with-cloud-firestore
Though the question only talks about datastore, I am just providing additional information.
Considering all what I read through, D is the answer.

Comment 8.1

ID: 397093 User: awssp12345 Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Fri 02 Jul 2021 19:52 Selected Answer: - Upvotes: 1

I agree. I think people are missing the part of the question that mentions they don't want to maintain the DB.

Comment 8.2

ID: 397097 User: awssp12345 Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Fri 02 Jul 2021 19:53 Selected Answer: - Upvotes: 1

This should be accepted and the highest voted answer.

Comment 9

ID: 506855 User: kishanu Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:19 Selected Answer: D Upvotes: 2

D is the hero here.
Though Cloud SQL has an upper hand when it comes to transactions(OLTP), it does not autoscale its computing capabilities as compared to datastore.
Do visit: https://cloud.google.com/datastore/docs/concepts/overview#what_its_good_for

Comment 10

ID: 503320 User: hendrixlives Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:19 Selected Answer: D Upvotes: 4

D is correct: Datastore (currently Firestore in native or datastore mode). It is a fully managed and serverless solution that allows for transactions and will autoscale (storage and compute) without the need to manage any infrastructure.
A is wrong: Cloud SQL is fully a managed transactional DB, but only the storage grows automatically. As your user base increases, you will need to increase the CPU/memory of the instance, and to do that you must edit the instance manually (and the questions specifically says "you do not want to manage infrastructure scaling")
B is wrong: Bigquery is OLAP (for analytics). NoOps, fully managed, autoscales and allows transactions, but it is not designed for this use case.
C is wrong: Bigtable is a NoSQL database for massive writes, and to scale (storage and CPU) you must add nodes, so it is completely out of this use case.

Comment 10.1

ID: 510662 User: kuik Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 28 Dec 2021 00:49 Selected Answer: - Upvotes: 2

May be some history can help to decide which is best answer.
Datastore built by Google uses BigTable as it's storage, while the company who built FireStore uses Cloud Spanner as it's storage. Google decided that they like the FireStore technology and acquired it.
If Cloud Spanner is an option I would choose it. So, D for me, although it's json storage format, but the Cloud Spanner it uses as storage fits all the requirements.

Comment 11

ID: 1056987 User: axantroff Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:16 Selected Answer: D Upvotes: 1

B - not an option
C - lack of ACID transactions
A - lack of resource automatic scalability
D - (correct, IMHO) support ACID, suitable for OLPT and scalable enough

Comment 12

ID: 1062181 User: rocky48 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:16 Selected Answer: D Upvotes: 3

B - not an option
C - lack of ACID transactions
A - lack of resource automatic scalability
D - (correct, IMHO) support ACID, suitable for OLPT and scalable enough

Comment 13

ID: 1096236 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:16 Selected Answer: D Upvotes: 1

Cloud Datastore (now part of Google Cloud Firestore in Datastore mode) is designed for high scalability and ease of management for applications. It is a NoSQL document database built for automatic scaling, high performance, and ease of application development. It's serverless, meaning it handles the scaling, performance, and management automatically, fitting your requirement of not wanting to manage infrastructure scaling.

Cloud SQL, while a fully-managed relational database service that makes it easy to set up, manage, and administer your SQL databases, is not as automatically scalable as Datastore. It's better suited for applications that require a traditional relational database.

Comment 14

ID: 1286842 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Fri 20 Sep 2024 15:29 Selected Answer: A Upvotes: 2

This actually is A. I was initially in the D camp, and have spent considerable time reading about it now (circa 1 hour). D is explicitly not suitable for payment transactions, as Datastore supports ACID transactions, but only within entity groups, which are small, localized sets of data. This restriction means that transactions are not suitable for scenarios requiring multi-entity consistency across the entire database.
The only two products recommended for payments in the Google ecosystem are Cloud Spanner and Cloud SQL. Is Cloud SQL managed? I'd say it wasn't really, due to the need to configure instances, but that is trumped by the fact it is the only choice suitable for a payment transaction system.

Comment 15

ID: 1263997 User: SatyamKishore Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 11 Aug 2024 11:39 Selected Answer: - Upvotes: 1

Cloud Datastore is a fully managed, NoSQL document database that is highly scalable and designed to automatically handle large increases in traffic without requiring manual intervention. It's well-suited for applications with a rapidly growing user base.

Comment 16

ID: 1258318 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 30 Jul 2024 19:21 Selected Answer: D Upvotes: 1

Firestore extension of Datastore can handle acid transactions and allows autoscaling

Comment 17

ID: 1238168 User: jamalkhan Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 27 Jun 2024 15:14 Selected Answer: A Upvotes: 1

A. Requires Transactions.

23. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 257

Sequence: 104
Discussion ID: 130210
Source URL: https://www.examtopics.com/discussions/google/view/130210-exam-professional-data-engineer-topic-1-question-257/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 5:09 p.m.

Question

You are planning to use Cloud Storage as part of your data lake solution. The Cloud Storage bucket will contain objects ingested from external systems. Each object will be ingested once, and the access patterns of individual objects will be random. You want to minimize the cost of storing and retrieving these objects. You want to ensure that any cost optimization efforts are transparent to the users and applications. What should you do?

A. Create a Cloud Storage bucket with Autoclass enabled.
B. Create a Cloud Storage bucket with an Object Lifecycle Management policy to transition objects from Standard to Coldline storage class if an object age reaches 30 days.
C. Create a Cloud Storage bucket with an Object Lifecycle Management policy to transition objects from Standard to Coldline storage class if an object is not live.
D. Create two Cloud Storage buckets. Use the Standard storage class for the first bucket, and use the Coldline storage class for the second bucket. Migrate objects from the first bucket to the second bucket after 30 days.

Community Answer Votes

A: 10 most voted

Comments 6 comments Click to expand

Comment 1

ID: 1410147 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Tue 25 Mar 2025 20:39 Selected Answer: A Upvotes: 1

By chance is this a repeat question?

Comment 2

ID: 1259971 User: iooj Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Fri 02 Aug 2024 19:36 Selected Answer: A Upvotes: 2

Thanks to you guys, I found out about this feature :D

The feature was released on November 3, 2023. Note that enabling Autoclass on an existing bucket incurs additional charges.

Comment 3

ID: 1154491 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 05:42 Selected Answer: A Upvotes: 1

A. Autoclass

Comment 4

ID: 1121740 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 15:33 Selected Answer: A Upvotes: 1

Option A

Comment 5

ID: 1114556 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 15:21 Selected Answer: A Upvotes: 4

- Autoclass automatically analyzes access patterns of objects and automatically transitions them to the most cost-effective storage class within Standard, Nearline, Coldline, or Archive.
- This eliminates the need for manual intervention or setting specific age thresholds.
- No user or application interaction is required, ensuring transparency.

Comment 6

ID: 1112918 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 17:09 Selected Answer: A Upvotes: 1

A. Create a Cloud Storage bucket with Autoclass enabled.

24. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 214

Sequence: 108
Discussion ID: 129861
Source URL: https://www.examtopics.com/discussions/google/view/129861-exam-professional-data-engineer-topic-1-question-214/
Posted By: e70ea9e
Posted At: Dec. 30, 2023, 9:39 a.m.

Question

You have a Standard Tier Memorystore for Redis instance deployed in a production environment. You need to simulate a Redis instance failover in the most accurate disaster recovery situation, and ensure that the failover has no impact on production data. What should you do?

A. Create a Standard Tier Memorystore for Redis instance in the development environment. Initiate a manual failover by using the limited-data-loss data protection mode.
B. Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode.
C. Increase one replica to Redis instance in production environment. Initiate a manual failover by using the force-data-loss data protection mode.
D. Initiate a manual failover by using the limited-data-loss data protection mode to the Memorystore for Redis instance in the production environment.

Community Answer Votes

B: 18 most voted
C: 2
D: 1
A: 1

Comments 14 comments Click to expand

Comment 1

ID: 1116120 User: MaxNRG Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 20:18 Selected Answer: B Upvotes: 13

The best option is B - Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode.
The key points are:
• The failover should be tested in a separate development environment, not production, to avoid impacting real data.
• The force-data-loss mode will simulate a full failover and restart, which is the most accurate test of disaster recovery.
• Limited-data-loss mode only fails over reads which does not fully test write capabilities.
• Increasing replicas in production and failing over (C) risks losing real production data.
• Failing over production (D) also risks impacting real data and traffic.
So option B isolates the test from production and uses the most rigorous failover mode to fully validate disaster recovery capabilities.

Comment 2

ID: 1401982 User: desertlotus1211 Badges: Most Recent Relative Date: 11 months, 3 weeks ago Absolute Date: Sat 22 Mar 2025 18:11 Selected Answer: - Upvotes: 1

Answer A is best suited for an 'accurate disaster recovery' scenario...

limited-data-loss mode is Google’s recommended failover simulation mode:

Promotes the replica.

Attempts to minimize data loss (vs. force failover).

Mimics a realistic Redis failover due to a zonal outage or instance crash.

Comment 3

ID: 1398960 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 18:46 Selected Answer: D Upvotes: 1

Initiate a manual failover by using the limited-data-loss data protection mode to the Memorystore for Redis instance in the production environment.

Comment 3.1

ID: 1401979 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sat 22 Mar 2025 18:09 Selected Answer: - Upvotes: 1

did you read this part? ' ensure that the failover has no impact on production data'....
Answer D is wrong.

Comment 4

ID: 1304507 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 16:11 Selected Answer: B Upvotes: 1

The question says "no impact on production data" Thus, the best practice is about simulating in a different environment. force-data-loss mode covers the most accurate disaster recovery situation. (https://cloud.google.com/memorystore/docs/redis/about-manual-failover)

Comment 5

ID: 1300766 User: mi_yulai Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 07:42 Selected Answer: - Upvotes: 1

D:
"A standard tier Memorystore for Redis instance uses a replica node to back up the primary node. A normal failover occurs when the primary node becomes unhealthy, causing the replica to be designated as the new primary. A manual failover differs from a normal failover because you initiate it yourself."
The limited-data-loss mode minimizes data loss by verifying that the difference in data between the primary and replica is below 30 MB before initiating the failover. The offset on the primary is incremented for each byte of data that must be synchronized to its replicas.

Comment 6

ID: 1275371 User: mayankazyour Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 31 Aug 2024 06:41 Selected Answer: A Upvotes: 1

We are trying to simulate the disaster recovery on a redis Instance and we want minimum data loss.
Therefore, Option A - create a test Standard Tier Memorystore for Redis instance in Dev Environment and use the limited data loss data protection mode, seems to be the correct option here.

Comment 7

ID: 1245005 User: anyone_99 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 19:13 Selected Answer: - Upvotes: 1

D seems correct. We are required to simulate and not test in a different environment.
"How data protection modes work
The limited-data-loss mode minimizes data loss by verifying that the difference in data between the primary and replica is below 30 MB before initiating the failover. The offset on the primary is incremented for each byte of data that must be synchronized to its replicas. In the limited-data-loss mode, the failover will abort if the greatest offset delta between the primary and each replica is 30MB or greater. If you can tolerate more data loss and want to aggressively execute the failover, try setting the data protection mode to force-data-loss.

The force-data-loss mode employs a chain of failover strategies to aggressively execute the failover. It does not check the offset delta between the primary and replicas before initiating the failover; you can potentially lose more than 30MB of data changes."

Comment 8

ID: 1124865 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 10:56 Selected Answer: B Upvotes: 2

https://cloud.google.com/memorystore/docs/redis/about-manual-failover

Comment 9

ID: 1123238 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 11:16 Selected Answer: B Upvotes: 1

B. Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode

Comment 10

ID: 1121480 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 11:00 Selected Answer: B Upvotes: 1

Best option is B - no impact on production env and forces a full failover

Comment 11

ID: 1112355 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 00:38 Selected Answer: C Upvotes: 1

Increasing the number of replicas in a Redis instance in a production environment means that we will have additional copies of the same data and thats why failover will not impact the production data

Comment 11.1

ID: 1116122 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 20:19 Selected Answer: - Upvotes: 1

"no impact on production data" - not C nor D

Comment 12

ID: 1109539 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:39 Selected Answer: C Upvotes: 1

Separate Development Environment:

Isolates testing from production, preventing any impact on live data or services.
Provides a safe and controlled environment for simulating failover scenarios.

25. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 93

Sequence: 114
Discussion ID: 79773
Source URL: https://www.examtopics.com/discussions/google/view/79773-exam-professional-data-engineer-topic-1-question-93/
Posted By: AWSandeep
Posted At: Sept. 3, 2022, 2:01 p.m.

Question

You're using Bigtable for a real-time application, and you have a heavy load that is a mix of read and writes. You've recently identified an additional use case and need to perform hourly an analytical job to calculate certain statistics across the whole database. You need to ensure both the reliability of your production application as well as the analytical workload.
What should you do?

A. Export Bigtable dump to GCS and run your analytical job on top of the exported files.
B. Add a second cluster to an existing instance with a multi-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
C. Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
D. Increase the size of your existing cluster twice and execute your analytics workload on your new resized cluster.

Community Answer Votes

C: 28 most voted
B: 16

Comments 22 comments Click to expand

Comment 1

ID: 951486 User: aewis Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Fri 14 Jul 2023 13:46 Selected Answer: C Upvotes: 5

It was actually illustrated here
https://cloud.google.com/bigtable/docs/replication-settings#batch-vs-serve

Comment 2

ID: 1400890 User: oussama7 Badges: Most Recent Relative Date: 11 months, 4 weeks ago Absolute Date: Thu 20 Mar 2025 02:55 Selected Answer: B Upvotes: 1

Option C offers single-cluster routing, meaning that each query is directed to a single specific cluster. This does not protect the transactional workload from heavy analytical loads, which can lead to performance degradation. With multi-cluster routing (option B), Bigtable can automatically distribute the load and avoid congestion on a single cluster.

Comment 3

ID: 1399467 User: desertlotus1211 Badges: - Relative Date: 12 months ago Absolute Date: Mon 17 Mar 2025 01:11 Selected Answer: B Upvotes: 1

Multi-cluster satisfy reliability aspect of the question. single cluster may and will cause contention for resources...

Comment 4

ID: 1288934 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 10:42 Selected Answer: C Upvotes: 1

To answer some confusion - "single cluster routing" is routing to one cluster per profile, rather than having failover options per profile. So we have two clusters, but it's not multicluster, because we have two profiles, so it's one cluster per profile, so "single cluster routing". We COULD use multicluster, but none of the answers give the steps required to do so, so the assumption as to be that we're using single.

Comment 4.1

ID: 1288935 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 10:43 Selected Answer: - Upvotes: 1

(an example of multicluster in this case would be 4 clusters, 2 for the transactional load and 2 for the analytical load)

Comment 5

ID: 1239585 User: 47767f9 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 11:22 Selected Answer: B Upvotes: 1

B better than C. Multi-cluster routing to handle failovers automatically. Reference: https://cloud.google.com/bigtable/docs/replication-settings#regional-failover

Comment 6

ID: 1184013 User: opt_sub Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 27 Mar 2024 12:02 Selected Answer: B Upvotes: 2

B is correct.
Two different job profiles to redirect trafiic to two different cluster. C is incorrect because there is no tpoint in creating app profile for two different workloads in the same cluster. One cluster handles writes and another handle reads.

Comment 7

ID: 1091501 User: carbino Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 09 Dec 2023 08:53 Selected Answer: C Upvotes: 3

IIt is C:
"Workload isolation:
Using separate app profiles lets you use different routing policies for different purposes. For example, consider a situation when you want to prevent a batch read job (workload A) from increasing CPU usage on clusters that handle an application's steady reads and writes (workload B). You can create an app profile for workload B that routes to a cluster group that excludes one cluster. Then you create an app profile for workload A that specifies single-cluster routing to the cluster that workload B doesn't send requests to.

You can change the settings for one application or function without affecting other applications that connect to the same data."
https://cloud.google.com/bigtable/docs/app-profiles

Comment 8

ID: 870326 User: DevShah Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 14 Apr 2023 18:20 Selected Answer: C Upvotes: 3

https://cloud.google.com/bigtable/docs/replication-settings#batch-vs-serve

Comment 8.1

ID: 889282 User: A4M Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 04 May 2023 10:07 Selected Answer: - Upvotes: 1

I see what you say on C but the question states high availability how do you handle that with option C when you have a single region cluster hence answer needs to be with multi-region cluster - To configure your instance for a high availability (HA) use case, create a new app profile that uses multi-cluster routing, or update the default app profile to use multi-cluster routing.

Comment 8.1.1

ID: 889284 User: A4M Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 04 May 2023 10:08 Selected Answer: - Upvotes: 1

i meant single-cluster routing

Comment 8.1.2

ID: 1201232 User: zevexWM Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Wed 24 Apr 2024 11:18 Selected Answer: - Upvotes: 1

It actually addresses the issue of High availability in that same link if you scroll down a bit more.
https://cloud.google.com/bigtable/docs/replication-settings#high-availability

Comment 9

ID: 849579 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 24 Mar 2023 20:31 Selected Answer: C Upvotes: 3

C. This is exactly the example in the documentation.
https://cloud.google.com/bigtable/docs/replication-settings#batch-vs-serve

Comment 9.1

ID: 870324 User: DevShah Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 14 Apr 2023 18:18 Selected Answer: - Upvotes: 1

Correct
2 jobs >> 2 cluster
3 jobs >> 3 cluster
app profiles with single-cluster routing used to route to specific cluster
Job1 >> Cluster 1
Job2 >> Cluster 2 .....

Comment 10

ID: 809656 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Wed 15 Feb 2023 15:45 Selected Answer: - Upvotes: 1

Answer B:
reason 1: If you don' t have any cost constraint use multi-cluster routing,
reason 2: Single cluster is less scalable as we need high scalability i would go with B

Comment 11

ID: 799090 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 05 Feb 2023 19:36 Selected Answer: C Upvotes: 1

I am going for C?

Comment 12

ID: 751943 User: slade_wilson Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 21 Dec 2022 07:46 Selected Answer: C Upvotes: 2

When you use a single cluster to run a batch analytics job that performs numerous large reads alongside an application that performs a mix of reads and writes, the large batch job can slow things down for the application's users. With replication, you can use app profiles with single-cluster routing to route batch analytics jobs and application traffic to different clusters, so that batch jobs don't affect your applications' users.

Single cluster routing - You can use single-cluster routing for this use case if you don't want your Bigtable cluster to automatically fail over if a zone or region becomes unavailable.

Multi-cluster routing - If you want Bigtable to automatically fail over to one region if your application cannot reach the other region, use multi-cluster routing.

Comment 13

ID: 734571 User: Siant_137 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 03 Dec 2022 17:44 Selected Answer: - Upvotes: 2

Answer is C

"When you use a single cluster to run a batch analytics job that performs numerous large reads alongside an application that performs a mix of reads and writes, the large batch job can slow things down for the application's users. With replication, you can use app profiles with single-cluster routing to route batch analytics jobs and application traffic to different clusters, so that batch jobs don't affect your applications' users."

https://cloud.google.com/bigtable/docs/replication-overview#batch-vs-serve

Comment 14

ID: 725845 User: sfsdeniso Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 24 Nov 2022 14:06 Selected Answer: - Upvotes: 1

Answer is C

Comment 15

ID: 724127 User: dish11dish Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Tue 22 Nov 2022 07:44 Selected Answer: B Upvotes: 3

Option B is correct

An app profile specifies the routing policy that Bigtable should use for each request.

Single-cluster routing routes all requests to 1 cluster in your instance. If that cluster becomes unavailable, you must manually fail over to another cluster.

Multi-cluster routing automatically routes requests to the nearest cluster in an instance. If the cluster becomes unavailable, traffic automatically fails over to the nearest cluster that is available. Bigtable considers clusters in a single region to be equidistant, even though they are in different zones. You can configure an app profile to route to any cluster in an instance, or you can specify a cluster group that tells the app profile to route to only some of the clusters in the instance.

Cluster group routing sends requests to the nearest available cluster within a cluster group that you specify in the app profile settings.

Reference:-https://cloud.google.com/bigtable/docs/app-profiles#routing

Comment 16

ID: 723493 User: piotrpiskorski Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 21 Nov 2022 13:49 Selected Answer: C Upvotes: 2

https://cloud.google.com/bigtable/docs/replication-settings#batch-vs-serve

"When you use a single cluster to run a batch analytics job that performs numerous large reads alongside an application that performs a mix of reads and writes, the large batch job can slow things down for the application's users. With replication, you can use app profiles with single-cluster routing to route batch analytics jobs and application traffic to different clusters, so that batch jobs don't affect your applications' users."

It is C.

Comment 17

ID: 721211 User: gudiking Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 18 Nov 2022 10:56 Selected Answer: C Upvotes: 1

C - "With replication, you can use app profiles with single-cluster routing to route batch analytics jobs and application traffic to different clusters, so that batch jobs don't affect your applications' users." - https://cloud.google.com/bigtable/docs/replication-overview#batch-vs-serve

26. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 307

Sequence: 118
Discussion ID: 130320
Source URL: https://www.examtopics.com/discussions/google/view/130320-exam-professional-data-engineer-topic-1-question-307/
Posted By: scaenruy
Posted At: Jan. 4, 2024, 1:13 p.m.

Question

You need to connect multiple applications with dynamic public IP addresses to a Cloud SQL instance. You configured users with strong passwords and enforced the SSL connection to your Cloud SQL instance. You want to use Cloud SQL public IP and ensure that you have secured connections. What should you do?

A. Add CIDR 0.0.0.0/0 network to Authorized Network. Use Identity and Access Management (IAM) to add users.
B. Add all application networks to Authorized Network and regularly update them.
C. Leave the Authorized Network empty. Use Cloud SQL Auth proxy on all applications.
D. Add CIDR 0.0.0.0/0 network to Authorized Network. Use Cloud SQL Auth proxy on all applications.

Community Answer Votes

C: 16 most voted
D: 1

Comments 10 comments Click to expand

Comment 1

ID: 1120004 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 17:21 Selected Answer: C Upvotes: 10

- Using the Cloud SQL Auth proxy is a recommended method for secure connections, especially when dealing with dynamic IP addresses.
- The Auth proxy provides secure access to your Cloud SQL instance without the need for Authorized Networks or managing IP addresses.
- It works by encapsulating database traffic and forwarding it through a secure tunnel, using Google's IAM for authentication.
- Leaving the Authorized Networks empty means you're not allowing any direct connections based on IP addresses, relying entirely on the Auth proxy for secure connectivity. This is a secure and flexible solution, especially for applications with dynamic IPs.

Comment 1.1

ID: 1399241 User: FreshMind Badges: - Relative Date: 12 months ago Absolute Date: Sun 16 Mar 2025 14:14 Selected Answer: - Upvotes: 1

In question "You want to use Cloud SQL public IP", how this could be if "Leaving the Authorized Networks empty means you're not allowing any direct connections based on IP addresses" ?

Comment 2

ID: 1156312 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 10:53 Selected Answer: C Upvotes: 1

Option C

Comment 3

ID: 1126002 User: Pukapuiz Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 18 Jul 2024 15:31 Selected Answer: C Upvotes: 3

The Cloud SQL Auth Proxy is a Cloud SQL connector that provides secure access to your instances without a need for Authorized networks or for configuring SSL.
https://cloud.google.com/sql/docs/mysql/sql-proxy

Comment 4

ID: 1120864 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 12 Jul 2024 14:57 Selected Answer: C Upvotes: 1

always use Cloud SQL Auth proxy if possible

Comment 5

ID: 1119723 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 12:38 Selected Answer: - Upvotes: 4

https://stackoverflow.com/questions/27759356/how-to-authorize-my-dynamic-ip-network-address-in-google-cloud-sql
https://stackoverflow.com/questions/24749810/how-to-make-a-google-cloud-sql-instance-accessible-for-any-ip-address

Comment 5.1

ID: 1156311 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Thu 22 Aug 2024 10:52 Selected Answer: - Upvotes: 4

Links also say not to go with option D.
0.0.0.0/0 which includes all possible IP Addresses is not recommended for security reasons. You have to keep access as restricted as possible.

Comment 6

ID: 1119721 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 11 Jul 2024 12:37 Selected Answer: D Upvotes: 1

As for me, after reading documentation, option D looks appropriate

Comment 6.1

ID: 1194601 User: BennyXu Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sun 13 Oct 2024 03:27 Selected Answer: - Upvotes: 1

Save your shxx answer in your dxxb head.

Comment 7

ID: 1113655 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 04 Jul 2024 12:13 Selected Answer: C Upvotes: 1

C. Leave the Authorized Network empty. Use Cloud SQL Auth proxy on all applications.

27. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 95

Sequence: 122
Discussion ID: 16842
Source URL: https://www.examtopics.com/discussions/google/view/16842-exam-professional-data-engineer-topic-1-question-95/
Posted By: rickywck
Posted At: March 17, 2020, 9:48 a.m.

Question

You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? (Choose two.)

A. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.
B. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.
C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
E. Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster of read operations take longer than 100 ms.

Community Answer Votes

CD: 11 most voted
BC: 3
AD: 2

Comments 25 comments Click to expand

Comment 1

ID: 66229 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Sun 20 Sep 2020 10:06 Selected Answer: - Upvotes: 53

Answer is C & D.
C –> Adding more nodes to a cluster (not replication) can improve the write performance https://cloud.google.com/bigtable/docs/performance
D –> since Google recommends adding nodes when storage utilization is > 70% https://cloud.google.com/bigtable/docs/modifying-instance#nodes

Comment 1.1

ID: 457495 User: sergio6 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 05 Apr 2022 07:11 Selected Answer: - Upvotes: 1

Adding nodes to the cluster In Bigtable scales linearly the performances both read and write
https://cloud.google.com/bigtable/docs/performance#typical-workloads

Comment 1.2

ID: 194350 User: dabrat Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Tue 06 Apr 2021 16:07 Selected Answer: - Upvotes: 4

Storage utilization (% max)
The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster.

In general, do not use more than 70% of the hard limit on total storage, so you have room to add more data. If you do not plan to add significant amounts of data to your instance, you can use up to 100% of the hard limit.

Important: If any cluster in an instance exceeds the hard limit on the amount of storage per node, writes to all clusters in that instance will fail until you add nodes to each cluster that is over the limit. Also, if you try to remove nodes from a cluster, and the change would cause the cluster to exceed the hard limit on storage, Cloud Bigtable will deny the request.
If you are using more than the recommended percentage of the storage limit, add nodes to the cluster. You can also delete existing data, but deleted data takes up more space, not less, until a compaction occurs.

Comment 1.2.1

ID: 194351 User: dabrat Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Tue 06 Apr 2021 16:07 Selected Answer: - Upvotes: 3

https://cloud.google.com/bigtable/docs/monitoring-instance

Comment 2

ID: 76160 User: Barniyah Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Sun 18 Oct 2020 21:20 Selected Answer: - Upvotes: 10

Key visualizer is bigtable metric , So A and B incorrect
storage utilization also bigtable metric , So D incorrect
The question want you to monitor pipeline metrics (which is dataflow metrics) , in our case we can only monitor latency .
The answer will be : C & E

Comment 2.1

ID: 115536 User: ch3n6 Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Mon 21 Dec 2020 14:55 Selected Answer: - Upvotes: 14

No. it is C, D. "You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys."
why are you monitoring read anyway? you are just writing.

Comment 3

ID: 1398913 User: Parandhaman_Margan Badges: Most Recent Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 16:28 Selected Answer: BC Upvotes: 2

Correct answers are **B** (Write pressure) and **C** (latency)

Comment 4

ID: 1214197 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Wed 20 Nov 2024 10:53 Selected Answer: BC Upvotes: 1

The question focus is on writing. BC is correct. when the writing pressure is above 100, it is time to increase. same logic with C

Comment 5

ID: 809677 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Tue 15 Aug 2023 15:18 Selected Answer: - Upvotes: 1

why not B ?

Comment 5.1

ID: 820594 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 14:24 Selected Answer: - Upvotes: 2

i am feeling to go with B and D. In option C, when latency is low, latency can be low for write operation for other reason.
but in option B, its showing clearly when write pressure more than 100. But why no one is talking about B

Comment 6

ID: 783910 User: RoshanAshraf Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 03:37 Selected Answer: CD Upvotes: 3

Key visualizer is Metrics for Performance issues. Ruled out
Storage and Write Operations ; C and D

Comment 7

ID: 669649 User: John_Pongthorn Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Wed 15 Mar 2023 11:16 Selected Answer: CD Upvotes: 2

Well-designed row key : A B are not nessary
Write : CD both are involved in the question the most.

Comment 8

ID: 626898 User: Fezo Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 10:41 Selected Answer: - Upvotes: 2

Answer: CD
https://cloud.google.com/bigtable/docs/scaling

Comment 9

ID: 518490 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 06 Jul 2022 19:01 Selected Answer: CD Upvotes: 2

as explained by MaxNRG

Comment 10

ID: 511418 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 28 Jun 2022 17:55 Selected Answer: CD Upvotes: 3

D: In general, do not use more than 70% of the hard limit on total storage, so you have room to add more data. If you do not plan to add significant amounts of data to your instance, you can use up to 100% of the hard limit
C: If this value is frequently at 100%, you might experience increased latency. Add nodes to the cluster to reduce the disk load percentage.
The key visualizer metrics options, suggest other things other than increase the cluster size.
https://cloud.google.com/bigtable/docs/monitoring-instance

Comment 11

ID: 504078 User: hendrixlives Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 18 Jun 2022 07:41 Selected Answer: CD Upvotes: 1

CD.

I agree with jvg637

Comment 12

ID: 490029 User: StefanoG Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Sun 29 May 2022 16:18 Selected Answer: AD Upvotes: 2

from https://cloud.google.com/bigtable/docs/monitoring-instance
Disk load - If this value is frequently at 100%, you might experience increased latency. Add nodes to the cluster to reduce the disk load percentage.
Storage utilization (% max) - In general, do not use more than 70% of the hard limit on total storage, so you have room to add more data. If you do not plan to add significant amounts of data to your instance, you can use up to 100% of the hard limit.

Comment 13

ID: 475912 User: KokkiKumar Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Wed 11 May 2022 01:39 Selected Answer: - Upvotes: 2

I am Voting for CD

Comment 14

ID: 458307 User: u_t_s Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 06 Apr 2022 15:59 Selected Answer: - Upvotes: 1

Answer should be D & E

Comment 14.1

ID: 582336 User: tavva_prudhvi Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Fri 07 Oct 2022 11:30 Selected Answer: - Upvotes: 1

Why are you monitoring read operations, when youre supposed to write? why E?

Comment 15

ID: 457489 User: sergio6 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Tue 05 Apr 2022 07:03 Selected Answer: - Upvotes: 2

D--> 70% is the the recommended percentage of the cluster's storage capacity that is being used, If you are using more than 70% storage limit, add nodes to the cluster
https://cloud.google.com/bigtable/quotas#storage-per-node
https://cloud.google.com/bigtable/docs/monitoring-instance#disk
E--> 100 ms is an order of magnitude lower latency than Google claimed (<10ms)
https://cloud.google.com/bigtable/docs/performance#typical-workloads

Comment 16

ID: 427949 User: hauhau Badges: - Relative Date: 4 years ago Absolute Date: Sun 20 Feb 2022 10:11 Selected Answer: - Upvotes: 1

BC
D:you can just add node, not cluster
The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster.(https://cloud.google.com/bigtable/docs/monitoring-instance)
After you create a Cloud Bigtable instance, you can update any of the following settings without any downtime:

(The number of nodes in each cluster)
https://cloud.google.com/bigtable/docs/modifying-instance

Comment 17

ID: 396027 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 01 Jan 2022 16:57 Selected Answer: - Upvotes: 2

B, C , D - all three looks okay to me

Comment 17.1

ID: 396028 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 01 Jan 2022 16:58 Selected Answer: - Upvotes: 3

Vote for C & D,
Option B eliminated, as Row are well defined (as per question) - so no need of key-visualizer

Comment 17.2

ID: 463783 User: squishy_fishy Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 18 Apr 2022 02:12 Selected Answer: - Upvotes: 2

Answer is C, D.
B is not correct, because B is Key Visualizer, it means the row key needs re-design again.

28. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 218

Sequence: 131
Discussion ID: 129865
Source URL: https://www.examtopics.com/discussions/google/view/129865-exam-professional-data-engineer-topic-1-question-218/
Posted By: e70ea9e
Posted At: Dec. 30, 2023, 9:43 a.m.

Question

You have a Cloud SQL for PostgreSQL instance in Region’ with one read replica in Region2 and another read replica in Region3. An unexpected event in Region’ requires that you perform disaster recovery by promoting a read replica in Region2. You need to ensure that your application has the same database capacity available before you switch over the connections. What should you do?

A. Enable zonal high availability on the primary instance. Create a new read replica in a new region.
B. Create a cascading read replica from the existing read replica in Region3.
C. Create two new read replicas from the new primary instance, one in Region3 and one in a new region.
D. Create a new read replica in Region1, promote the new read replica to be the primary instance, and enable zonal high availability.

Community Answer Votes

C: 16 most voted
D: 1

Comments 10 comments Click to expand

Comment 1

ID: 1113228 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 00:50 Selected Answer: C Upvotes: 5

After promoting the read replica in Region2 to be the new primary instance, creating additional read replicas from it can help distribute the read load and maintain or increase the database's total capacity.

Comment 2

ID: 1365592 User: skhaire Badges: Most Recent Relative Date: 1 year ago Absolute Date: Wed 05 Mar 2025 22:01 Selected Answer: C Upvotes: 1

Corrected answer: C

If the primary instance (db-a-0) becomes unavailable, you can promote the replica in region B to become the primary. To again have additional replicas in regions A and C, delete the old instances (the former primary instance in A, and the replica in C), and create new read replicas from the new primary instance in B.

Comment 3

ID: 1365591 User: skhaire Badges: - Relative Date: 1 year ago Absolute Date: Wed 05 Mar 2025 21:54 Selected Answer: D Upvotes: 1

Question is flawed but the closest answer would be D since C will result in 2 read replicas on Region 3 (original one and new now)
Option C- Create two new read replicas from the new primary instance - contradicts with the requirements - You need to ensure that your application has the same database capacity available 'before you switch over the connections.'
Option D- Create a new read replica in Region1, promote the new read replica to be the primary instance - contradicts with the requirement - 'requires that you perform disaster
recovery by promoting a read replica in Region2.' How does this affect the answer choices?

Comment 4

ID: 1213428 User: josech Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 18 May 2024 20:22 Selected Answer: C Upvotes: 3

https://cloud.google.com/sql/docs/mysql/replication#cross-region-read-replicas

Comment 4.1

ID: 1272196 User: nadavw Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sun 25 Aug 2024 17:41 Selected Answer: - Upvotes: 2

requires 2 new read replicas as the read replica that wasn't promoted isn't capable to be a replica any more as the primary isa gone

Comment 5

ID: 1191195 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 07 Apr 2024 22:44 Selected Answer: C Upvotes: 4

The best option here is C. Create two new read replicas from the new primary instance, one in Region3 and one in a new region.

Here's the breakdown:

Capacity Restoration: Promoting the Region2 replica makes it the new primary. You need to replicate from this new primary to maintain redundancy and capacity. Creating two replicas (Region3, new region) accomplishes this.
Geographic Distribution: Distributing replicas across regions ensures availability if another regional event occurs.
Speed: Creating new replicas from the promoted primary is likely faster than promoting another existing replica.

Comment 5.1

ID: 1194151 User: BigDataBB Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Fri 12 Apr 2024 07:03 Selected Answer: - Upvotes: 1

Who said that i can use a 4° region? If have constraint that i can't go out from that 3 regions?
By My opinion will be a right solution if the new replica will in another zona of the region 1 or 3.
may be the best solution is the case D

Comment 5.1.1

ID: 1194153 User: BigDataBB Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Fri 12 Apr 2024 07:11 Selected Answer: - Upvotes: 1

https://cloud.google.com/sql/docs/postgres/replication/cross-region-replicas?hl=en

Comment 6

ID: 1152527 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 13:08 Selected Answer: C Upvotes: 1

Option C

Comment 7

ID: 1109544 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:43 Selected Answer: C Upvotes: 2

Immediate Failover:

Promoting the read replica in Region2 quickly restores database operations in a different region, aligning with disaster recovery goals.
Capacity Restoration:

Creates two new read replicas from the promoted primary instance (formerly the read replica in Region2).
This replaces the lost capacity in Region1 and adds a read replica in a new region for further redundancy.

29. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 137

Sequence: 137
Discussion ID: 79674
Source URL: https://www.examtopics.com/discussions/google/view/79674-exam-professional-data-engineer-topic-1-question-137/
Posted By: ducc
Posted At: Sept. 3, 2022, 6:36 a.m.

Question

You have a data pipeline with a Dataflow job that aggregates and writes time series metrics to Bigtable. You notice that data is slow to update in Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.)

A. Configure your Dataflow pipeline to use local execution
B. Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions
C. Increase the number of nodes in the Bigtable cluster
D. Modify your Dataflow pipeline to use the Flatten transform before writing to Bigtable
E. Modify your Dataflow pipeline to use the CoGroupByKey transform before writing to Bigtable

Community Answer Votes

BC: 28 most voted
CD: 2
DE: 1
BE: 1
CE: 1

Comments 16 comments Click to expand

Comment 1

ID: 661069 User: arpitagrawal Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Tue 06 Sep 2022 12:10 Selected Answer: BC Upvotes: 9

It should be B and C

Comment 2

ID: 658069 User: ducc Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 06:36 Selected Answer: BC Upvotes: 7

BC is correct

Why the comments is deleted?

Comment 3

ID: 1346436 User: loki82 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 25 Jan 2025 13:28 Selected Answer: CE Upvotes: 1

If there's a write speed bottleneck on bigtable, more dataflow workers won't make a difference. If I add more bigtable nodes, or group my writes together, I can increase update throughput.

https://cloud.google.com/dataflow/docs/guides/write-to-bigtable#best-practices

Comment 4

ID: 1288320 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 23:44 Selected Answer: BC Upvotes: 1

the goal is to reduce the write latency not to improve data flow code

Comment 5

ID: 1069707 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 13 Nov 2023 21:03 Selected Answer: - Upvotes: 2

The "Correct Answers" are just put in with a random generator :-) B and C

Comment 6

ID: 1054501 User: BlehMaks Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 26 Oct 2023 13:32 Selected Answer: BC Upvotes: 4

B - opportunity to parallelise the process
C - increase throughput

Comment 7

ID: 1022910 User: Bahubali1988 Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 02 Oct 2023 09:21 Selected Answer: - Upvotes: 1

Exactly opposite answers in the discussions

Comment 8

ID: 1015432 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 04:34 Selected Answer: BC Upvotes: 4

B. Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions:
Increasing the number of Dataflow workers can help parallelize the processing of your data, which can result in faster data updates to Bigtable and improved concurrency. You can set maxNumWorkers to a higher value to achieve this.

C. Increase the number of nodes in the Bigtable cluster:
Increasing the number of nodes in your Bigtable cluster can improve the overall throughput and reduce latency when writing data. It allows Bigtable to handle a higher rate of data ingestion and queries, which is essential for supporting additional concurrent users.

Comment 9

ID: 1012192 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Wed 20 Sep 2023 12:17 Selected Answer: CD Upvotes: 2

C definetely is correct, as it improves the read and write performance of Bigtable.

However, I do think that the second option is actually D instead of B, because the question specifically states that the pipeline aggregates data. Flatten merges multiple PCollection objects into a single logical PCollection, allowing for faster aggregation of time series data.

Comment 10

ID: 972280 User: NewDE2023 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 04 Aug 2023 17:22 Selected Answer: BE Upvotes: 1

B - I believe it is consensus.
D - The question mentions "a Dataflow job that "aggregates" and writes time series metrics to Bigtable". So CoGroupByKey performs a shuffle (grouping) operation to distribute data across workers.

https://cloud.google.com/dataflow/docs/guides/develop-and-test-pipelines

Comment 11

ID: 917483 User: WillemHendr Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 20:01 Selected Answer: DE Upvotes: 1

I read this question as: BigTable Write operations are all over the place (key-wise), and BigTable doesn't like that. When creating groups (batch writes), of similar keys (close to each other), BigTable is happy again, which I loosely translate into DE.

Comment 12

ID: 889408 User: vaga1 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Thu 04 May 2023 16:04 Selected Answer: - Upvotes: 1

B is correct. But I don't see how you increase the write throughput of Bigtable increasing its cluster size. It should be dataflow instance resources that have to be increased

Comment 13

ID: 846144 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 21 Mar 2023 17:10 Selected Answer: BC Upvotes: 1

BC make sense

Comment 14

ID: 791559 User: NamitSehgal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 29 Jan 2023 11:51 Selected Answer: - Upvotes: 1

BC only makes sense here , no mention of data, no mention of keeping cost low

Comment 15

ID: 762722 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 17:59 Selected Answer: - Upvotes: 1

B. Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions Most Voted
C. Increase the number of nodes in the Bigtable cluster

Comment 16

ID: 724907 User: ovokpus Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 23 Nov 2022 06:29 Selected Answer: BC Upvotes: 2

Increase max num of workers increases pipeline performance in Dataflow
Increase number of nodes in Bigtable increases write throughput

30. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 101

Sequence: 141
Discussion ID: 17204
Source URL: https://www.examtopics.com/discussions/google/view/17204-exam-professional-data-engineer-topic-1-question-101/
Posted By: Rajokkiyam
Posted At: March 22, 2020, 6:35 a.m.

Question

You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?

A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
B. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
C. Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
D. Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.

Community Answer Votes

B: 35 most voted
A: 33
D: 3

Comments 31 comments Click to expand

Comment 1

ID: 76731 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Mon 20 Apr 2020 06:37 Selected Answer: - Upvotes: 63

You are transferring sensitive patient information, so C & D are ruled out. Choice comes down to A & B. Here it gets tricky. How to choose Transfer Appliance: (https://cloud.google.com/transfer-appliance/docs/2.0/overview)
Without knowing the bandwidth, it is not possible to determine whether the upload can be completed within 7 days, as recommended by Google. So the safest and most performant way is to use Transfer Appliance.
Therefore my choice is B.

Comment 1.1

ID: 134228 User: tprashanth Badges: - Relative Date: 5 years, 8 months ago Absolute Date: Mon 13 Jul 2020 20:43 Selected Answer: - Upvotes: 5

https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets
The table shows for 1Gbps, it takes 30 hrs for 10 TB. Generally, corporate internet speeds are over 1Gbps. I'm inclined to pick A

Comment 1.1.1

ID: 493845 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sat 04 Dec 2021 18:23 Selected Answer: - Upvotes: 3

SAY MY NAME!
You need to Transfer Sensitive Patient information, over public ISP you shouldn't do that.

Comment 1.1.1.1

ID: 1342574 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 16:44 Selected Answer: - Upvotes: 1

Security is not a concern, the data is encrypted at rest as well as in transit

Comment 1.1.2

ID: 911310 User: forepick Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 31 May 2023 15:12 Selected Answer: - Upvotes: 2

If you transfer 10TBs over the wire, your network will be blocked for the entire transfer time. This isn't something a company would be happy to swallow.

Comment 1.2

ID: 189583 User: TNT87 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Tue 29 Sep 2020 11:52 Selected Answer: - Upvotes: 19

Answer is B,gsutil has a limit of 1TBaccording to Google documentation,if data is morethan 1TBthen we have to use Transfer Appliance.

Comment 1.2.1

ID: 419809 User: Yiouk Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Wed 04 Aug 2021 18:01 Selected Answer: - Upvotes: 9

The answer is clearly seen here: https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#transfer-options

Comment 1.2.2

ID: 1342579 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 16:50 Selected Answer: - Upvotes: 1

According to Google documentation, for files of size > 1TB STS is to be used

Comment 1.2.3

ID: 1342570 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 16:37 Selected Answer: - Upvotes: 2

While gsutil itself doesn't have a strict data limit, the underlying Google Cloud Storage service does, allowing for individual object sizes up to 5 Terabytes (TiB). This means that when using gsutil to transfer data, the maximum file size you can upload or download is 5 TiB

Comment 1.3

ID: 762435 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 03:17 Selected Answer: - Upvotes: 4

B is right answer

Comment 1.4

ID: 1342578 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 16:48 Selected Answer: - Upvotes: 1

With reasonable network connectivity (for example, 1 Gbps), transferring 100 TB of data online takes over 10 days to complete. If this rate is acceptable, an online transfer is likely a good solution for your needs. If you only have a 100 Mbps connection (or worse from a remote location), the same transfer takes over 100 days. At this point, it's worth considering an offline-transfer option such as Transfer Appliance.

Comment 2

ID: 128500 User: SSV Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Tue 07 Jul 2020 03:48 Selected Answer: - Upvotes: 8

Answer should be B: A is also correct but it has its own limit. It allows only 5TB data upload at a time to cloud storage.
https://cloud.google.com/storage/quotas
I will go with B

Comment 2.1

ID: 262487 User: VASI Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Fri 08 Jan 2021 11:40 Selected Answer: - Upvotes: 2

5Tb "for individual objects". Create smaller AVRO files.

Comment 2.2

ID: 262489 User: VASI Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Fri 08 Jan 2021 11:45 Selected Answer: - Upvotes: 3

AVRO compression can reduce file size to a tenth

Comment 3

ID: 1342589 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 17:11 Selected Answer: A Upvotes: 1

C and D are ruled out owing to limitations of Cloud Storage with regards to max file size of 5 TB ( Refer - https://cloud.google.com/storage-transfer/docs/known-limitations-transfer#:~:text=performance%20uploads%3A%206GiB-,Scaling%20limitations,of%20Gbps%20in%20transfer%20speed ). The answer should be between A & B. if we were to assume bandwidth of 1 GBPS, it would only take about 1 day. Even if the bandwidth were a tenth of 1 GBPS, it would take about 10 days to transfer 10 TB. With a transfer appliance, it would take minimum of 25 days. I would go with A.

Comment 4

ID: 1327001 User: clouditis Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 15 Dec 2024 19:10 Selected Answer: B Upvotes: 2

B - only because it says sensitive data, you can never be sure uploading over internet!

Comment 5

ID: 1288800 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 00:35 Selected Answer: B Upvotes: 1

Also,
For your scenario with 10 TB of data in Cloud SQL, if you export to Avro without specifying compression, you can expect the resulting Avro file to be around the same size, potentially slightly smaller depending on the data characteristics. Here in this question, there is no mentioning about compression.

So let's not assume that the data being used in Avro format will get compressed.

If Google cloud storage itself, can't handle an object of size greater than 5 TB, there is no point of using gsutil

Comment 5.1

ID: 1342581 User: grshankar9 Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 16:52 Selected Answer: - Upvotes: 1

gsutil is deprecated and has been replaced by gcloud storage. Gcloud storage is faster and requires less manual optimization for uploads and downloads

Comment 6

ID: 1288797 User: Preetmehta1234 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 00:28 Selected Answer: B Upvotes: 1

https://cloud.google.com/storage-transfer/docs/known-limitations-transfer

Cloud Storage 5TiB object size limit
Cloud Storage supports a maximum single-object size up 5 tebibytes. If you have objects larger than 5TiB, the object transfer fails for those objects for either Cloud Storage or Storage Transfer Service.

Comment 7

ID: 1239275 User: hussain.sain Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 29 Jun 2024 15:13 Selected Answer: D Upvotes: 1

while Option A is feasible and could work depending on specific requirements and security measures implemented, Option D (exporting as Avro, using Storage Transfer Service, and then loading into BigQuery) generally offers a more secure, efficient, and managed approach for transferring sensitive patient records into BigQuery from a relational database.Avro files uploaded to GCS will need to be secured. While GCS itself offers security features like IAM policies and access controls, using a public URL (as suggested in Option A) introduces additional security concerns.

Comment 8

ID: 1205335 User: Naresh_4u Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Thu 02 May 2024 09:05 Selected Answer: B Upvotes: 1

to securely transfer data and looking at the size of data B is the correct option.

Comment 9

ID: 1138496 User: GCanteiro Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 02 Feb 2024 13:00 Selected Answer: A Upvotes: 1

IMO "A" is the most suitable option since the transfer appliance could take 25 days to get the appliance and then 25 days to ship it back and have the data available.
https://cloud.google.com/transfer-appliance/docs/4.0/overview#transfer-speeds

Comment 10

ID: 1102562 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 21 Dec 2023 14:36 Selected Answer: B Upvotes: 1

Given the sensitivity of the patient records and the large size of the data, using Google's Transfer Appliance is a secure and efficient method. The Transfer Appliance is a hardware solution provided by Google for transferring large amounts of data. It enables you to securely transfer data without exposing it over the internet.

Comment 11

ID: 1088148 User: rocky48 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 05 Dec 2023 03:31 Selected Answer: B Upvotes: 1

Option B combines security, efficiency, and ease of use, making it a suitable choice for transferring sensitive patient records to BigQuery.

Comment 12

ID: 1065275 User: spicebits Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 08 Nov 2023 02:30 Selected Answer: A Upvotes: 5

10 TB is nothing. With a single 10 GB interconnect you could transfer the data in 3 hours or even with a 1 GB speeds without interconnect you could transfer it in one weekend. The transfer appliance will take 25 days to get the appliance and then 25 days while you wait for the data to be available that is not "time-efficient" at all. I go with A instead of B.

Comment 12.1

ID: 1065278 User: spicebits Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 08 Nov 2023 02:35 Selected Answer: - Upvotes: 2

I got the 25 days + 25 days from here: https://cloud.google.com/transfer-appliance/docs/4.0/overview#transfer-speeds

Comment 13

ID: 1014679 User: A_Nasser Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 23 Sep 2023 07:50 Selected Answer: A Upvotes: 3

transfer appliance will take time more than gsutil. and we did not mention yet if the location of the organization has google data centre

Comment 14

ID: 986153 User: DineshVarma Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 03:24 Selected Answer: D Upvotes: 2

As per Google recommendation above 1TB of transfer from onprem or from Google cloud or other cloud storage like s3 etc we need to use storage transfer service.

Comment 15

ID: 985448 User: arien_chen Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 20 Aug 2023 00:16 Selected Answer: - Upvotes: 1

Transfer Appliance would take 20 days for epected turnaround time. https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#expected%20turnaround:~:text=The%20expected%20turnaround%20time%20for%20a%20network%20appliance%20to%20be%20shipped%2C%20loaded%20with%20your%20data%2C%20shipped%20back%2C%20and%20rehydrated%20on%20Google%20Cloud%20is%2020%20days.

The best answer would be A.
If gsutil consume/leverage 100MB it would take 12 days and more time-efficient than B.
This is a reasonable assumption.
https://cloud.google.com/static/architecture/images/big-data-transfer-how-to-get-started-transfer-size-and-speed.png

Comment 16

ID: 983420 User: Colourseun Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 17 Aug 2023 11:20 Selected Answer: - Upvotes: 1

I will go with " A" because of the transition time to take transfer appliance to Google and that also depends in the organisation location. gsutil works anywhere internet is available.

Comment 17

ID: 951519 User: aewis Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 14 Jul 2023 14:26 Selected Answer: B Upvotes: 1

A will take crazy time if the organization didnt have a dedicated link

31. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 158

Sequence: 154
Discussion ID: 16899
Source URL: https://www.examtopics.com/discussions/google/view/16899-exam-professional-data-engineer-topic-1-question-158/
Posted By: rickywck
Posted At: March 18, 2020, 1:46 a.m.

Question

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet so public initialization actions cannot fetch resources. What should you do?

A. Deploy the Cloud SQL Proxy on the Cloud Dataproc master
B. Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
C. Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
D. Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role

Community Answer Votes

C: 9 most voted

Comments 16 comments Click to expand

Comment 1

ID: 68171 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Fri 25 Sep 2020 16:39 Selected Answer: - Upvotes: 39

Correct: C

If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Without access to the Internet, you can enable Private Google Access, and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.

Comment 1.1

ID: 762820 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 18:27 Selected Answer: - Upvotes: 1

Thank you for detailed explanation. C is right

Comment 2

ID: 65403 User: rickywck Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Fri 18 Sep 2020 00:46 Selected Answer: - Upvotes: 12

Should be C:

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions

Comment 3

ID: 1335507 User: b3e59c2 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Thu 02 Jan 2025 11:40 Selected Answer: C Upvotes: 1

A: Incorrect, Proxy allows for connection to a Cloud SQL instance, which unless you have the dependencies stored there (doesn't seem viable or smart), would achieve nothing.

B: Incorrect, will allow for a connection to the internet to be made for installing the dependencies, however this goes against the companies security policies so should not be considered.

C: Correct, only one that makes sense and is best practise (see https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions)

D: Incorrect, provides access to a shared VPC network, however doesn't necessarily provide a way to access the dependencies. And even if it did, would go against company security policy.

Comment 4

ID: 1176345 User: gcpdataeng Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 18 Sep 2024 08:32 Selected Answer: C Upvotes: 1

c looks good

Comment 5

ID: 1015876 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 17:16 Selected Answer: C Upvotes: 3

Security Compliance: This option aligns with your company's security policies, which prohibit public Internet access from Cloud Dataproc nodes. Placing the dependencies in a Cloud Storage bucket within your VPC security perimeter ensures that the data remains within your private network.

VPC Security: By placing the dependencies within your VPC security perimeter, you maintain control over network access and can restrict access to the necessary nodes only.

Dataproc Initialization Action: You can use a custom initialization action or script to fetch and install the dependencies from the secure Cloud Storage bucket to the Dataproc cluster nodes during startup.

By copying the dependencies to a secure Cloud Storage bucket and using an initialization action to install them on the Dataproc nodes, you can meet your security requirements while providing the necessary dependencies to your cluster.

Comment 6

ID: 973660 User: knith66 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 06 Feb 2024 11:49 Selected Answer: C Upvotes: 1

C is correct

Comment 7

ID: 838239 User: charline Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 13 Sep 2023 19:49 Selected Answer: C Upvotes: 1

C seems good

Comment 8

ID: 812962 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 12:56 Selected Answer: - Upvotes: 2

Answer C,
It needs practical experience to understand this question. You create cluster with some package/software i.e dependencies such as python packages that you store in .zip file, then you save a jar file to run the cluster as an application such as you need java while running spark session. and some config yaml file.
These dependencies you can save in bucket and can use to configure cluster from external window , sdk or api. without going into UI.
Then you need to use VPC to access these files

Comment 9

ID: 634398 User: DataEngineer_WideOps Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 21 Jan 2023 09:01 Selected Answer: - Upvotes: 1

Without access to the internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.

Comment 10

ID: 520141 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 11:03 Selected Answer: C Upvotes: 2

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address_only

Comment 11

ID: 507447 User: Prabusankar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Thu 23 Jun 2022 00:17 Selected Answer: - Upvotes: 3

When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run

Comment 12

ID: 486435 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 04:23 Selected Answer: - Upvotes: 1

Correct: C

Comment 13

ID: 152791 User: clouditis Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Mon 08 Feb 2021 04:00 Selected Answer: - Upvotes: 2

c it is!

Comment 14

ID: 70615 User: Rajokkiyam Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 06:08 Selected Answer: - Upvotes: 2

Should be C

Comment 15

ID: 65524 User: jvg637 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Fri 18 Sep 2020 08:17 Selected Answer: - Upvotes: 4

I think the correct answer might be C instead, due to https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address_only

32. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 140

Sequence: 155
Discussion ID: 80270
Source URL: https://www.examtopics.com/discussions/google/view/80270-exam-professional-data-engineer-topic-1-question-140/
Posted By: jsree236
Posted At: Sept. 5, 2022, 10:23 a.m.

Question

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?

A. The current epoch time
B. A concatenation of the product name and the current epoch time
C. A random universally unique identifier number (version 4 UUID)
D. The original order identification number from the sales system, which is a monotonically increasing integer

Community Answer Votes

C: 24 most voted
B: 2

Comments 12 comments Click to expand

Comment 1

ID: 668094 User: Remi2021 Badges: Highly Voted Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 16:46 Selected Answer: C Upvotes: 9

According to the documentation:
Use a Universally Unique Identifier (UUID)
You can use a Universally Unique Identifier (UUID) as defined by RFC 4122 as the primary key. Version 4 UUID is recommended, because it uses random values in the bit sequence. Version 1 UUID stores the timestamp in the high order bits and is not recommended.

https://cloud.google.com/spanner/docs/schema-design

Comment 1.1

ID: 762729 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 17:03 Selected Answer: - Upvotes: 1

Agree with C

Comment 2

ID: 1015439 User: barnac1es Badges: Most Recent Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 05:43 Selected Answer: C Upvotes: 4

For a transaction table in Cloud Spanner that stores product sales data, from a performance perspective, it is generally recommended to choose a primary key that allows for even distribution of data across nodes and minimizes hotspots. Therefore, option C, which suggests using a random universally unique identifier number (version 4 UUID), is the preferred choice.

Comment 3

ID: 985590 User: arien_chen Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 09:43 Selected Answer: C Upvotes: 1

For a RDB I would choice D.

But for Google Spanner, Google says:
https://cloud.google.com/spanner/docs/schema-and-data-model#:~:text=monotonically%20increasing%20integer

Comment 4

ID: 893157 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 17:08 Selected Answer: C Upvotes: 2

B might work if you say timestamp instead than epoch. PK of sales should contain the exact purchase date or timestamp, not the time when the transaction was processed. I personally associate the term epoch in this context to the process timestamp instead than the purchase timestamp.

Comment 5

ID: 837731 User: midgoo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 13 Sep 2023 07:42 Selected Answer: C Upvotes: 2

B may cause error if same product ID came at the same time (same id + same epoch)
So C is the correct answer here

Comment 5.1

ID: 1217116 User: NickNtaken Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sun 24 Nov 2024 02:54 Selected Answer: - Upvotes: 1

Agreed. Additionally, using the product name can lead to unbalanced distribution if some products are sold more frequently than others.

Comment 6

ID: 750832 User: jkhong Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 20 Jun 2023 12:09 Selected Answer: C Upvotes: 2

A and D are invalid because they monotonically increases.
B would work, but in terms of pure performance UUID 4 is the fastest because it virtually will not cause hotspots

Comment 7

ID: 739033 User: odacir Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 12:33 Selected Answer: C Upvotes: 3

A and D are not valid, because they monotonically increase.
C avoid hotspots for sure, but It's nor relate with querys. So for writing performance it's perfect that the reason for chose this: “You need to create a new transaction table in Cloud Spanner that stores product sales data”. They only ask you to store product data, its a writing ops.
If the question had spoken about query the info or hard performance read, the best option would be B, because it has the balance of writing/reading best practices.
There are a few disadvantages to using a UUID:

They are slightly large, using 16 bytes or more. Other options for primary keys don't use this much storage.
They carry no information about the record. For example, a primary key of SingerId and AlbumId has an inherent meaning, while a UUID does not.
You lose locality between records that are related, which is why using a UUID eliminates hotspots.

https://cloud.google.com/spanner/docs/schema-design#uuid_primary_key

Comment 8

ID: 660028 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 13:09 Selected Answer: C Upvotes: 1

C. A random universally unique identifier number (version 4 UUID)

From https://cloud.google.com/spanner/docs/schema-and-data-model

There are techniques that can spread the load across multiple servers and avoid hotspots:

Hash the key and store it in a column. Use the hash column (or the hash column and the unique key columns together) as the primary key.
Swap the order of the columns in the primary key.
Use a Universally Unique Identifier (UUID). Version 4 UUID is recommended, because it uses random values in the high-order bits. Don't use a UUID algorithm (such as version 1 UUID) that stores the timestamp in the high order bits.
Bit-reverse sequential values.

Comment 9

ID: 659916 User: jsree236 Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 11:23 Selected Answer: B Upvotes: 2

Answer should be B as in all the other options hotspotting is possible. According to proper schema design guideline..
Schema design best practice #1: Do not choose a column whose value monotonically increases or decreases as the first key part for a high write rate table.

Supporting link:
https://cloud.google.com/spanner/docs/schema-design#primary-key-prevent-hotspots

Comment 9.1

ID: 1334445 User: LP_PDE Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 30 Dec 2024 23:17 Selected Answer: - Upvotes: 1

Potential Skew: If there are a limited number of product names, this could still lead to uneven data distribution and potential hotspots.
Increased Key Size: Concatenating strings can result in larger primary keys, which can slightly impact storage and performance.

33. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 229

Sequence: 164
Discussion ID: 129876
Source URL: https://www.examtopics.com/discussions/google/view/129876-exam-professional-data-engineer-topic-1-question-229/
Posted By: e70ea9e
Posted At: Dec. 30, 2023, 9:55 a.m.

Question

You currently use a SQL-based tool to visualize your data stored in BigQuery. The data visualizations require the use of outer joins and analytic functions. Visualizations must be based on data that is no less than 4 hours old. Business users are complaining that the visualizations are too slow to generate. You want to improve the performance of the visualization queries while minimizing the maintenance overhead of the data preparation pipeline. What should you do?

A. Create materialized views with the allow_non_incremental_definition option set to true for the visualization queries. Specify the max_staleness parameter to 4 hours and the enable_refresh parameter to true. Reference the materialized views in the data visualization tool.
B. Create views for the visualization queries. Reference the views in the data visualization tool.
C. Create a Cloud Function instance to export the visualization query results as parquet files to a Cloud Storage bucket. Use Cloud Scheduler to trigger the Cloud Function every 4 hours. Reference the parquet files in the data visualization tool.
D. Create materialized views for the visualization queries. Use the incremental updates capability of BigQuery materialized views to handle changed data automatically. Reference the materialized views in the data visualization tool.

Community Answer Votes

A: 12 most voted
B: 2

Comments 9 comments Click to expand

Comment 1

ID: 1293399 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 05 Oct 2024 10:34 Selected Answer: A Upvotes: 1

Just a note, the question saying "data no less than 4 hours old" presumably means "no more than 4 hours old"

Comment 2

ID: 1265400 User: JamesKarianis Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 14 Aug 2024 00:09 Selected Answer: B Upvotes: 2

Unfortunately the correct answer is B due to the limitations of materialized views, doesn't support any other join than inner and no analytical function is supported

Comment 2.1

ID: 1330319 User: AWSandeep Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Sun 22 Dec 2024 10:05 Selected Answer: - Upvotes: 1

Yes, they do if they are non-incremental.

Comment 3

ID: 1172880 User: ricardovazz Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Wed 13 Mar 2024 21:19 Selected Answer: A Upvotes: 3

A

https://cloud.google.com/bigquery/docs/materialized-views-create#non-incremental

In scenarios where data staleness is acceptable, for example for batch data processing or reporting, non-incremental materialized views can improve query performance and reduce cost.

allow_non_incremental_definition option. This option must be accompanied by the max_staleness option. To ensure a periodic refresh of the materialized view, you should also configure a refresh policy.

Comment 4

ID: 1121543 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:14 Selected Answer: A Upvotes: 2

Option A is better than D, since it accounts for data staleness and is better suited for heavy querying, thanks to the allow_non_incremental_definition

Comment 5

ID: 1114827 User: Jordan18 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 23:37 Selected Answer: - Upvotes: 4

A seems right but whats wrong with option D, can anybody please explain?

Comment 5.1

ID: 1123648 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 15 Jan 2024 21:29 Selected Answer: - Upvotes: 2

Seems like materialiazed views can use incremental updates only if data was not delated or updated in original table. Here the data changes so I think thats the reason why its not correct answer
https://cloud.google.com/bigquery/docs/materialized-views-use#incremental_updates
"BigQuery combines the cached view's data with new data to provide consistent query results while still using the materialized view. For single-table materialized views, this is possible if the base table is unchanged since the last refresh, or if only new data was added. For multi-table views, no more than one table can have appended data. If more than one of a multi-table view's base tables has changed, then the view cannot be incrementally updated."

Comment 6

ID: 1113839 User: raaad Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 17:00 Selected Answer: A Upvotes: 3

- Materialized views in BigQuery precompute and store the result of a base query, which can speed up data retrieval for complex queries used in visualizations.
- The max_staleness parameter allows us to specify how old the data can be, ensuring that the visualizations are based on data no less than 4 hours old.
- The enable_refresh parameter ensures that the materialized view is periodically refreshed.
- The allow_non_incremental_definition is used for enabling the creation of non-incrementally refreshable materialized views.

Comment 7

ID: 1109557 User: e70ea9e Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 09:55 Selected Answer: A Upvotes: 3

Precomputed Results: Materialized views store precomputed results of complex queries, significantly accelerating subsequent query performance, addressing the slow visualization issue.
Allow Non-Incremental Views: Using allow_non_incremental_definition circumvents the limitation of incremental updates for outer joins and analytic functions, ensuring views can be created for the specified queries.
Near-Real-Time Data: Setting max_staleness to 4 hours guarantees data freshness within the acceptable latency for visualizations.
Automatic Refresh: Enabling refresh with enable_refresh maintains view consistency with minimal maintenance overhead.
Minimal Overhead: Materialized views automatically update as underlying data changes, reducing maintenance compared to manual exports or view definitions.

34. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 50

Sequence: 165
Discussion ID: 16661
Source URL: https://www.examtopics.com/discussions/google/view/16661-exam-professional-data-engineer-topic-1-question-50/
Posted By: jvg637
Posted At: March 15, 2020, 1:59 p.m.

Question

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100
TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID).
However, high availability and low latency are required.
You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

A. Redis
B. HBase
C. MySQL
D. MongoDB
E. Cassandra
F. HDFS with Hive

Community Answer Votes

BDE: 17 most voted
BEF: 1

Comments 24 comments Click to expand

Comment 1

ID: 64269 User: jvg637 Badges: Highly Voted Relative Date: 4 years, 12 months ago Absolute Date: Mon 15 Mar 2021 13:59 Selected Answer: - Upvotes: 39

BDE. Hive is not for NoSQL

Comment 1.1

ID: 453474 User: sergio6 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 16:52 Selected Answer: - Upvotes: 2

Redis is also NoSQL

Comment 1.1.1

ID: 459269 User: vholti Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Sat 08 Oct 2022 16:20 Selected Answer: - Upvotes: 3

Redis is limited to 1 TB capacity quota per region. So it doesn't satisfy the requirement.
https://cloud.google.com/memorystore/docs/redis/quotas

Comment 1.1.1.1

ID: 1012986 User: ckanaar Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 21 Sep 2024 12:25 Selected Answer: - Upvotes: 1

Memorystore, Google's managed Redis service is. But OS Redis is not. Though it is hard to find a 100GB RAM machine

Comment 2

ID: 398565 User: awssp12345 Badges: Highly Voted Relative Date: 3 years, 8 months ago Absolute Date: Mon 04 Jul 2022 19:59 Selected Answer: - Upvotes: 32

Answer is BDE -
A. Redis - Redis is an in-memory non-relational key-value store. Redis is a great choice for implementing a highly available in-memory cache to decrease data access latency, increase throughput, and ease the load off your relational or NoSQL database and application. Since the question does not ask cache, A is discarded.
B. HBase - Meets reqs
C. MySQL - they do not need ACID, so not needed.
D. MongoDB - Meets reqs
E. Cassandra - Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
F. HDFS with Hive - Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. HIVE IS NOT A DATABSE.

Comment 3

ID: 1329511 User: sravi1200 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Fri 20 Dec 2024 15:49 Selected Answer: BDE Upvotes: 1

Option A: Redis cannot handle large scale data it is NOSQL db to store small amount of key value pairs,
Option B: HBase NOSQL db built on Hadoop does not support ACID Properties. Correct answer
Option C: Mysql Does not store telemetry IOT data. Mysql is a relational database structured data only stored.
Option D, E: NOSQL Databases, Option F: HDFS with hive used for batch processing not real time streaming data.
Option

Comment 4

ID: 808280 User: musumusu Badges: - Relative Date: 2 years ago Absolute Date: Wed 14 Feb 2024 11:31 Selected Answer: - Upvotes: 1

BDE
Faster Database are NoSql db than SQL, Cassandra is the fastest one in market now than Hbase and then others, in given list MongoBD

Comment 5

ID: 652304 User: MisuLava Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 26 Aug 2023 20:09 Selected Answer: - Upvotes: 1

"Which three databases meet your requirements? "
Hive is not a database server.
HBase, Mongo and Cassandra are and meet the criteria.
BDE is the right answer

Comment 6

ID: 523699 User: sraakesh95 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 14 Jan 2023 20:35 Selected Answer: BDE Upvotes: 1

@hendrixlives

Comment 7

ID: 516718 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 15:32 Selected Answer: BDE Upvotes: 1

as explained by hendrixlives

Comment 8

ID: 503477 User: hendrixlives Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 17 Dec 2022 09:36 Selected Answer: BDE Upvotes: 14

BDE:
A. Redis is a key-value store (and in many cases used as in-memory and non persistent cache). It is not designed for "100TB per year" of highly available storage.
B. HBase is similar to Google Bigtable, fits the requirements perfectly: highly available, scalable and with very low latency.
C. MySQL is a relational DB, designed precisely for ACID transactions and not for the stated requirements. Also, growth may be an issue.
D. MongoDB is a document-db used for high volume data and maintains currently used data in RAM, so performance is usually really good. Should also fit the requirements well.
E. Cassandra is designed precisely for highly available massive datasets, and a fine tuned cluster may offer low latency in reads. Fits the requirements.
F. HDFS with Hive is great for OLAP and data-warehouse scenarios, allowing to solve map-reduce problems using an SQL subset, but the latency is usually really high (we may talk about seconds, not milliseconds, when obtaining results), so this does not complies with the requirements.

Comment 9

ID: 489048 User: MaxNRG Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 28 Nov 2022 11:09 Selected Answer: BEF Upvotes: 1

Very strange question, seems outdated and irrelevant to me as it doesn't contain any GCP products :)

Anyway, I would choose BEF.
Redis is in-memory key value, not good
HBase yes, excelent case for linear growth and a column-oriented database
mysql not good, too big and no need for transactionality
Mongodb, document db with flexible schema ??
Yes Cassandra, good use case
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.
https://www.wikiwand.com/en/Apache_Hive

Comment 9.1

ID: 503460 User: hendrixlives Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 17 Dec 2022 09:01 Selected Answer: - Upvotes: 2

Latency in Hive is usually quite high, and one of the requirements is "low latency"

Comment 9.1.1

ID: 532296 User: MaxNRG Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Wed 25 Jan 2023 18:27 Selected Answer: - Upvotes: 2

agreed on BDE

Comment 9.1.2

ID: 529979 User: MaxNRG Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 22 Jan 2023 17:41 Selected Answer: - Upvotes: 1

good point!

Comment 10

ID: 462779 User: anji007 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 15 Oct 2022 21:16 Selected Answer: - Upvotes: 2

Ans: B, D and E

Comment 11

ID: 392208 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 27 Jun 2022 17:59 Selected Answer: - Upvotes: 2

vote for BDE

Comment 12

ID: 309318 User: BhupiSG Badges: - Relative Date: 4 years ago Absolute Date: Sun 13 Mar 2022 02:08 Selected Answer: - Upvotes: 2

BEF
B: HBASE is based upon BigTable
E: Cassandra is low latency columnar distributed database like BigTable
F: HDFS is low latency distributed file system and Hive will help with running the queries

Comment 12.1

ID: 336586 User: Manue Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 15 Apr 2022 22:39 Selected Answer: - Upvotes: 5

Hive is not for low latency queries. It is for analytics.

Comment 13

ID: 307377 User: daghayeghi Badges: - Relative Date: 4 years ago Absolute Date: Thu 10 Mar 2022 19:42 Selected Answer: - Upvotes: 2

BDE:
These are NoSQL DB, Hive is not for NoSQL.

Comment 14

ID: 297224 User: Rayleigh Badges: - Relative Date: 4 years ago Absolute Date: Wed 23 Feb 2022 08:31 Selected Answer: - Upvotes: 1

The answer is ADE, the statement says they require a NoSQL with high availability and low latency, they do not require consistency.
C. it is not NoSQL.
F. it is not NoSQL.
B. it is NoSQL but focused on strong consistency and based on HDFS, you need HDFS for Hbase.
Therefore the answer is ADE

Comment 15

ID: 290613 User: daghayeghi Badges: - Relative Date: 4 years ago Absolute Date: Tue 15 Feb 2022 02:52 Selected Answer: - Upvotes: 1

BDF:
Redis and Cassandra have only Rowkey and couldn't be indexed, and MySQL isn't NoSQL, Then B D and E is correct answer.

Comment 16

ID: 285684 User: naga Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 07 Feb 2022 18:55 Selected Answer: - Upvotes: 3

Correct BDE

Comment 17

ID: 255432 User: apnu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 30 Dec 2021 08:37 Selected Answer: - Upvotes: 3

it should be BDE because Hive is a sql based datawarehouse , it is not a nosql DB

35. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 119

Sequence: 166
Discussion ID: 17244
Source URL: https://www.examtopics.com/discussions/google/view/17244-exam-professional-data-engineer-topic-1-question-119/
Posted By: -
Posted At: March 22, 2020, 12:41 p.m.

Question

You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?

A. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
B. Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
C. Change the data pipeline to use BigQuery for storing stock trades, and update your application.
D. Use Cloud Dataflow to write a summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.

Community Answer Votes

A: 6 most voted
B: 1

Comments 25 comments Click to expand

Comment 1

ID: 73540 User: kichukonr Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Sun 12 Apr 2020 07:13 Selected Answer: - Upvotes: 13

Stock symbol will be similar for most of the records, so it's better to start with random number.. Answer should be B

Comment 1.1

ID: 81139 User: taepyung Badges: - Relative Date: 5 years, 10 months ago Absolute Date: Wed 29 Apr 2020 08:01 Selected Answer: - Upvotes: 3

I agree with u

Comment 1.2

ID: 294820 User: karthik89 Badges: - Relative Date: 5 years ago Absolute Date: Sat 20 Feb 2021 08:20 Selected Answer: - Upvotes: 6

it can start with stock symbol concated with timestamp can be a good row key design

Comment 1.2.1

ID: 510001 User: Yonghai Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Mon 27 Dec 2021 05:01 Selected Answer: - Upvotes: 3

for a given company, the data poits starts with the same stock symbol. The dataset is not distrubuted. It is not a good option.

Comment 1.3

ID: 476195 User: Abhi16820 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Thu 11 Nov 2021 13:24 Selected Answer: - Upvotes: 13

You never use something called random number in bigtable rowkey because it gives you no use in querying possibilities, since we can't run sql querys in bigtable we should not randomise rowkeys in bigtable.
Don't confuse the above point with the hotspot logic, both are different if you think so.

And another thing is, what you said can be good choice if we are using cloud spanner and trying to comeup with primary key situation, since there we can always run sql query.

I think you got the point now.

Comment 2

ID: 820774 User: musumusu Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Fri 24 Feb 2023 18:27 Selected Answer: - Upvotes: 12

Answer A:
Trick to remember: Row-key adjustment always be like in decending order.
#<<Least value>>#<<Lesser value>>
For example:
1. #<<Earth>>#<<continents>>#<<countries>>#<<cities>> and so on..
2. #<<Stock>>#<<users>>#timestamp..
in 99% cases timestamp will be in the end, as its smallest division...

Comment 2.1

ID: 982752 User: piyush7777 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 16 Aug 2023 18:11 Selected Answer: - Upvotes: 1

Awesome!

Comment 3

ID: 1327049 User: clouditis Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Sun 15 Dec 2024 21:29 Selected Answer: B Upvotes: 1

the most plausible option to pick here is B, A can introduce hot-spotting

Comment 4

ID: 1271224 User: Vineet_Mor Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Fri 23 Aug 2024 12:13 Selected Answer: - Upvotes: 2

B is correct, By introducing a random number or a hash at the beginning of the row key, you distribute the writes and reads more evenly across the Bigtable cluster, thereby improving performance under heavy load.

WHY NOT A?
This might still cause hotspots if certain stocks are more popular than others. It could lead to uneven load distribution, which wouldn't solve the performance degradation problem.

Comment 5

ID: 1116033 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 18:15 Selected Answer: A Upvotes: 2

Answer is A.

Comment 6

ID: 517703 User: MaxNRG Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 05 Jan 2022 18:50 Selected Answer: A Upvotes: 4

A: https://cloud.google.com/bigtable/docs/schema-design-time-series#prefer_rows_to_column_versions

Comment 7

ID: 487101 User: JG123 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Fri 26 Nov 2021 06:59 Selected Answer: - Upvotes: 1

Correct: A

Comment 8

ID: 475153 User: JayZeeLee Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Wed 10 Nov 2021 02:24 Selected Answer: - Upvotes: 1

A and B would both work, since both would distribute the work. This question is not framed properly.

Comment 9

ID: 397159 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Fri 02 Jul 2021 22:34 Selected Answer: - Upvotes: 3

Vote for A

Comment 10

ID: 302712 User: Jay3244 Badges: - Relative Date: 5 years ago Absolute Date: Wed 03 Mar 2021 16:44 Selected Answer: - Upvotes: 5

Option A.
Below document explains
Having EXCHANGE and SYMBOL in the leading positions in the row key will naturally distribute activity.
https://cloud.google.com/bigtable/docs/schema-design-time-series

Comment 11

ID: 222885 User: arghya13 Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Thu 19 Nov 2020 16:17 Selected Answer: - Upvotes: 2

I think A

Comment 12

ID: 221791 User: kavs Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Wed 18 Nov 2020 11:45 Selected Answer: - Upvotes: 1

Catch here is current Rowley starts with timestamp which should not be in the starting or end position so symbolmshould be prefixed before timestamp

Comment 13

ID: 216039 User: Cloud_Enthusiast Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Mon 09 Nov 2020 16:26 Selected Answer: - Upvotes: 6

A is correct..A Good ROW KEY has to be an ID followed by timestamp. Stock symbol in this case works as an ID

Comment 14

ID: 189675 User: kino2020 Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Tue 29 Sep 2020 14:57 Selected Answer: - Upvotes: 2

A.
You can find an example in Google's introductory guide.
https://cloud.google.com/bigtable/docs/schema-design-time-series?hl=ja#financial_market_data

Comment 15

ID: 181383 User: Diqtator Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Fri 18 Sep 2020 07:02 Selected Answer: - Upvotes: 3

I think A would be best practice. Adding random numbers as start of rowkey doesn't help with troubleshooting

Comment 16

ID: 179059 User: Tanmoyk Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Mon 14 Sep 2020 06:12 Selected Answer: - Upvotes: 1

B should be the answer as adding random numbers in the beginning of the rowkey will distributes data across multiple nodes

Comment 17

ID: 163109 User: haroldbenites Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Fri 21 Aug 2020 20:00 Selected Answer: - Upvotes: 1

B is correct.
A is incorrect. The docuemntation don´t recommend constants in the row key because the balance is not efficiente. The are 2 methods to avoid hotspoting. Promotion Field (use a UserId BEFORE the timestamp) and Salting (use timestamp-hash divided by 3 y put it before the timestamp)

Comment 17.1

ID: 163988 User: atnafu2020 Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Sun 23 Aug 2020 01:23 Selected Answer: - Upvotes: 2

Aggree with promotion field and salting. But, there is no constant. A stock symbol is a unique series of letters assigned to security for trading purposes.

Comment 17.2

ID: 293392 User: daghayeghi Badges: - Relative Date: 5 years ago Absolute Date: Thu 18 Feb 2021 14:37 Selected Answer: - Upvotes: 1

A is correct:
You deny your sentence, User ID means Stock Symbol, then B is correct.

Comment 17.2.1

ID: 293394 User: daghayeghi Badges: - Relative Date: 5 years ago Absolute Date: Thu 18 Feb 2021 14:38 Selected Answer: - Upvotes: 1

, then A is correct.

36. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 161

Sequence: 168
Discussion ID: 16689
Source URL: https://www.examtopics.com/discussions/google/view/16689-exam-professional-data-engineer-topic-1-question-161/
Posted By: madhu1171
Posted At: March 15, 2020, 7:41 p.m.

Question

You need to choose a database to store time series CPU and memory usage for millions of computers. You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?

A. Create a table in BigQuery, and append the new samples for CPU and memory to the table
B. Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second
C. Create a narrow table in Bigtable with a row key that combines the Computer Engine computer identifier with the sample time at each second
D. Create a wide table in Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.

Community Answer Votes

C: 8 most voted
D: 4

Comments 20 comments Click to expand

Comment 1

ID: 81750 User: psu Badges: Highly Voted Relative Date: 5 years, 10 months ago Absolute Date: Thu 30 Apr 2020 16:58 Selected Answer: - Upvotes: 35

Answer C

A tall and narrow table has a small number of events per row, which could be just one event, whereas a short and wide table has a large number of events per row. As explained in a moment, tall and narrow tables are best suited for time-series data.

For time series, you should generally use tall and narrow tables. This is for two reasons: Storing one event per row makes it easier to run queries against your data. Storing many events per row makes it more likely that the total row size will exceed the recommended maximum (see Rows can be big but are not infinite).

https://cloud.google.com/bigtable/docs/schema-design-time-series#patterns_for_row_key_design

Comment 1.1

ID: 762827 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sat 31 Dec 2022 19:35 Selected Answer: - Upvotes: 1

C. Create a narrow table in Bigtable with a row key that combines the Computer Engine computer identifier with the sample time at each second

Comment 1.2

ID: 612213 User: nadavw Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Mon 06 Jun 2022 08:32 Selected Answer: - Upvotes: 1

there is a limit of 60 columns per row according to question. in addition in D the cost will be a lower which is a requirement. so D seems more suitable.

Comment 2

ID: 64416 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 19:41 Selected Answer: - Upvotes: 19

C correct answer

Comment 3

ID: 1328923 User: shangning007 Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Thu 19 Dec 2024 10:41 Selected Answer: C Upvotes: 1

Even though I am sure the answer is C, but I am not sure why it will help avoid being charged for every query?

Comment 4

ID: 1303873 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 28 Oct 2024 09:02 Selected Answer: C Upvotes: 1

just like psu said. Additionally for a time-series data, it it usually the best practice to combine the identifier + time, not the values.

Comment 5

ID: 1165807 User: mothkuri Badges: - Relative Date: 2 years ago Absolute Date: Mon 04 Mar 2024 17:40 Selected Answer: C Upvotes: 1

Option C is correct answer. Narrow table is good for time series data.

Comment 6

ID: 1016260 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Mon 25 Sep 2023 02:51 Selected Answer: C Upvotes: 1

Scalability: Bigtable can handle large-scale data efficiently, making it suitable for storing time series data for millions of computers.

Low Latency: Bigtable provides low-latency access to data, which is crucial for real-time analytics.

Flexible Schema: The narrow table design allows you to efficiently store and query time series data without specifying all possible columns in advance, providing flexibility for future growth.

Column Families: Bigtable supports column families, allowing you to organize data logically.

Row Key Design: Combining the computer identifier with the sample time at each second in the row key allows for efficient retrieval of data for specific computers and time intervals.

Analytics: While Bigtable does not support SQL directly, it allows for efficient data retrieval and can be integrated with other tools for analytics.

Comment 7

ID: 917830 User: WillemHendr Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 07:23 Selected Answer: D Upvotes: 4

"..and ensure that the schema design will allow for future growth of the dataset":

https://cloud.google.com/bigtable/docs/schema-design-time-series#time-buckets

"Data stored in this way is compressed more efficiently than data in tall, narrow tables."

I read the "future growth" as a sign to be effective in storage, and go for the Time-Buckets.

Comment 8

ID: 664918 User: Remi2021 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 09 Sep 2022 21:49 Selected Answer: C Upvotes: 2

time series = narrow table

Comment 9

ID: 588446 User: _8008_ Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Wed 20 Apr 2022 07:14 Selected Answer: - Upvotes: 3

What about "avoid being charged for every query executed"? Nothing on this topic in here https://cloud.google.com/bigtable/docs/schema-design-time-series can anyone comment?

Comment 10

ID: 520146 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sun 09 Jan 2022 12:15 Selected Answer: C Upvotes: 2

Narrow and tall table for a single event and good for time-series data
Short and Wide table for data over a month, multiple events

Comment 11

ID: 486450 User: JG123 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Thu 25 Nov 2021 06:09 Selected Answer: - Upvotes: 2

Correct: C

Comment 12

ID: 462828 User: squishy_fishy Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 16 Oct 2021 00:54 Selected Answer: - Upvotes: 3

Answer is C.
Bigtable is best suited to the following scenarios: time-series data (e.g. CPU and memory usage over time for multiple servers), financial data (e.g. transaction histories, stock prices, and currency exchange rates), and IoT (Internet of Things) use cases.
https://www.xplenty.com/blog/bigtable-vs-bigquery/

Comment 13

ID: 423557 User: safiyu Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Thu 12 Aug 2021 12:57 Selected Answer: - Upvotes: 5

C is the correct answer. If you consider wide table, then 60 columns for cpu usage and 60 columns for memory usage. in future, if you need to add a new kpi to the table, then the schema changes. you will have to add 60 more columns for the new feature. this is not so future proof.. so D is out of the picture.

Comment 14

ID: 418172 User: DeepakS227 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Sun 01 Aug 2021 10:24 Selected Answer: - Upvotes: 2

BQ is optimized for large-scale, ad-hoc SQL-based analysis. i Think it should be A

Comment 15

ID: 377857 User: koupayio Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Wed 09 Jun 2021 01:44 Selected Answer: - Upvotes: 1

First C & D won't cause hotspoting as computer_identifier is first part of row key
I prefer D because "ensure that the schema design will allow for future growth of the dataset."
C is too tall and narrow I cannot see schema design grow in the future

Comment 16

ID: 364303 User: crslake Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sun 23 May 2021 10:36 Selected Answer: - Upvotes: 1

D, Better overall, harder to implement (but that is not stated as a constraint)
https://cloud.google.com/bigtable/docs/schema-design-time-series#time-buckets

Comment 16.1

ID: 375375 User: lollo1234 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Sat 05 Jun 2021 20:06 Selected Answer: - Upvotes: 2

How do you store both CPU and memory usage though? Two sets of 60 columns per row? I am wondering if that goes along with "the schema design will allow for future growth"...what if by future growth they mean monitoring N more metrics. That would imply N*60 columns, right?

Comment 17

ID: 325313 User: Sumanth09 Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Wed 31 Mar 2021 21:15 Selected Answer: - Upvotes: 8

Should be A
question did not talk about latency
without query cost -- BigQuery Cache
flexible schema - BigQuery (nested and repeated)

37. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 241

Sequence: 170
Discussion ID: 130184
Source URL: https://www.examtopics.com/discussions/google/view/130184-exam-professional-data-engineer-topic-1-question-241/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 2:06 p.m.

Question

You are designing the architecture of your application to store data in Cloud Storage. Your application consists of pipelines that read data from a Cloud Storage bucket that contains raw data, and write the data to a second bucket after processing. You want to design an architecture with Cloud Storage resources that are capable of being resilient if a Google Cloud regional failure occurs. You want to minimize the recovery point objective (RPO) if a failure occurs, with no impact on applications that use the stored data. What should you do?

A. Adopt multi-regional Cloud Storage buckets in your architecture.
B. Adopt two regional Cloud Storage buckets, and update your application to write the output on both buckets.
C. Adopt a dual-region Cloud Storage bucket, and enable turbo replication in your architecture.
D. Adopt two regional Cloud Storage buckets, and create a daily task to copy from one bucket to the other.

Community Answer Votes

C: 29 most voted
A: 7

Comments 15 comments Click to expand

Comment 1

ID: 1114072 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 22:08 Selected Answer: C Upvotes: 12

- Dual-region buckets are a specific type of storage that automatically replicates data between two geographically distinct regions.
- Turbo replication is an enhanced feature that provides faster replication between the two regions, thus minimizing RPO.
- This option ensures that your data is resilient to regional failures and is replicated quickly, meeting the needs for low RPO and no impact on application performance.

Comment 2

ID: 1115150 User: therealsohail Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Sat 06 Jan 2024 12:49 Selected Answer: C Upvotes: 5

Turbo replication provides faster redundancy across regions for data in your dual-region buckets, which reduces the risk of data loss exposure and helps support uninterrupted service following a regional outage.

Comment 3

ID: 1316715 User: petulda Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sat 23 Nov 2024 16:49 Selected Answer: - Upvotes: 1

Why not A
https://cloud.google.com/storage/docs/locations
multi regional location has cross region redundancy

Comment 3.1

ID: 1316718 User: petulda Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 23 Nov 2024 16:58 Selected Answer: - Upvotes: 1

Sorry, it is about minimizing RPO, where Turbo replication is a factor..

Comment 4

ID: 1191255 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 08 Apr 2024 01:39 Selected Answer: A Upvotes: 5

A. Adopt multi-regional Cloud Storage buckets in your architecture.

Why A is the best choice:

Automatic Cross-Region Replication: Multi-regional buckets automatically replicate data across multiple geographically separated regions within a selected multi-region location (e.g., us). This ensures data redundancy and availability even if one region experiences an outage.
Minimal RPO: Data written to a multi-regional bucket is synchronously replicated to at least two regions. This means that in the event of a regional failure, the RPO is essentially zero, as the data is already available in other regions.
No Application Changes: Applications can continue reading and writing data to the multi-regional bucket without any modifications, as the cross-region replication is handled transparently by Cloud Storage

Comment 4.1

ID: 1328562 User: mdell Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 18 Dec 2024 15:44 Selected Answer: - Upvotes: 1

Minimal yet, but not minimized as stated in the question. That's why C is correct.
Turbo replication provides faster redundancy across regions for data in your dual-region buckets, which reduces the risk of data loss exposure and helps support uninterrupted service following a regional outage. When enabled, turbo replication is designed to replicate 100% of newly written objects to the two regions that constitute a dual-region within the recovery point objective of 15 minutes, regardless of object size.

Note that even for default replication, most objects finish replication within minutes.

https://cloud.google.com/storage/docs/availability-durability#turbo-replication

Comment 5

ID: 1179552 User: hanoverquay Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 20:30 Selected Answer: C Upvotes: 2

vote c

Comment 6

ID: 1173316 User: ricardovazz Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Thu 14 Mar 2024 11:48 Selected Answer: C Upvotes: 3

https://cloud.google.com/storage/docs/availability-durability#turbo-replication

"Default replication in Cloud Storage is designed to provide redundancy across regions for 99.9% of newly written objects within a target of one hour and 100% of newly written objects within a target of 12 hours"

"When enabled, turbo replication is designed to replicate 100% of newly written objects to both regions that constitute the dual-region within the recovery point objective of 15 minutes, regardless of object size."

Thus, since they want to minimize RPO, should use turbo replication

Comment 7

ID: 1154436 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 03:35 Selected Answer: C Upvotes: 2

Option C

Comment 8

ID: 1121677 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 14:23 Selected Answer: C Upvotes: 5

Option C: https://cloud.google.com/storage/docs/dual-regions + https://cloud.google.com/storage/docs/managing-turbo-replication

Comment 9

ID: 1112771 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 14:06 Selected Answer: A Upvotes: 2

A. Adopt multi-regional Cloud Storage buckets in your architecture.

Comment 9.1

ID: 1124073 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 11:04 Selected Answer: - Upvotes: 3

It wont be a correct answer. Correct is C. It is required "no impact on applications that use the stored data"

Comment 9.1.1

ID: 1157453 User: ashdam Badges: - Relative Date: 2 years ago Absolute Date: Fri 23 Feb 2024 21:02 Selected Answer: - Upvotes: 1

But multi-region is completely transparent for the application if one fails. it would need to fail all EU or US regions. I dont undertand why multi-region would have impact on that

Comment 9.1.2

ID: 1124075 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 11:04 Selected Answer: - Upvotes: 4

Whereas with multi-region " it can also introduce unpredictable latency into the response time and higher network egress charges for cloud workloads when multi-region data is read from remote regions"
https://cloud.google.com/blog/products/storage-data-transfer/choose-between-regional-dual-region-and-multi-region-cloud-storage

Comment 9.1.2.1

ID: 1157466 User: ashdam Badges: - Relative Date: 2 years ago Absolute Date: Fri 23 Feb 2024 21:11 Selected Answer: - Upvotes: 2

There is no requirment on latency, just RPO which it would be 0 since multi-region.

38. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 123

Sequence: 177
Discussion ID: 17023
Source URL: https://www.examtopics.com/discussions/google/view/17023-exam-professional-data-engineer-topic-1-question-123/
Posted By: rickywck
Posted At: March 20, 2020, 4:21 a.m.

Question

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis.
Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

A. Denormalize the data as must as possible.
B. Preserve the structure of the data as much as possible.
C. Use BigQuery UPDATE to further reduce the size of the dataset.
D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
E. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.

Community Answer Votes

AD: 14 most voted
BD: 3
AC: 1

Comments 30 comments Click to expand

Comment 1

ID: 66177 User: rickywck Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Mon 20 Sep 2021 03:21 Selected Answer: - Upvotes: 40

I think AD is the answer. E will not improve performance.

Comment 2

ID: 68820 User: [Removed] Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Tue 28 Sep 2021 10:35 Selected Answer: - Upvotes: 20

Answer: A, D
Description: Denormalization will help in performance by reducing query time, update are not good with bigquery

Comment 2.1

ID: 402320 User: awssp12345 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 09 Jan 2023 03:41 Selected Answer: - Upvotes: 3

My guess is append has better performance than update.

Comment 3

ID: 847792 User: midgoo Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 04:26 Selected Answer: BD Upvotes: 3

If we denormalize the data, the Data Science team will shout at us. Preserving it is the way to go

Comment 3.1

ID: 917292 User: vaga1 Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sat 07 Dec 2024 17:21 Selected Answer: - Upvotes: 1

Denormalization is just a best practice when using BQ.

Comment 3.2

ID: 915989 User: WillemHendr Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Fri 06 Dec 2024 10:01 Selected Answer: - Upvotes: 5

Shouting data-science teams are not part of question, this is more about what is exam correct, not what it the best for your own situation

Comment 4

ID: 738811 User: odacir Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 08 Jun 2024 09:20 Selected Answer: AD Upvotes: 7

A and D:
A- Improve performance
D- Is better for DS have all the history and not the last update...

Comment 5

ID: 710980 User: NicolasN Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Sat 04 May 2024 07:57 Selected Answer: - Upvotes: 2

The criteria for selecting a strategy are the performance and usability for the data science team. This team performs the analysis by querying stored data. So we don't care for performance related with data ingestion. According to this point of view:
A: YES - undisputedly favours query performance
B: YES - Keeping the structure unchanged promotes usability (the team won't need to update queries or ML models)
C: Questionable - Updating the status of a row instead of appending newer versions is keeping the size smaller. But does this affect significantly the analysis performance? Even if it does, creating materialized views to keep the most recent status per row eliminates it
D: NO - has nothing to do with DS team's tasks, affects ingestion performance
E: NO - demotes usability

Comment 5.1

ID: 710982 User: NicolasN Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Sat 04 May 2024 08:01 Selected Answer: - Upvotes: 1

(mistakenly voted AC instead of AB)

Comment 5.2

ID: 747130 User: jkhong Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 16 Jun 2024 11:57 Selected Answer: - Upvotes: 1

For B there is no mention that the current data structure is being used (...data science team WILL build machine learning models based on this data.) ... We're developing a new data model to be used by them in the future

Comment 6

ID: 674844 User: DerickTW Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 09:16 Selected Answer: AC Upvotes: 1

The DML quota limit is removed since 2020, I think C is better than D now.

Comment 6.1

ID: 696335 User: devaid Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 16 Apr 2024 17:28 Selected Answer: - Upvotes: 5

Is not about the quota. You should avoid using UPDATE because it makes a big scan of the table, and is not efficient or high performant. Usually prefer appends and merges instead, and using the optimized schema approach of Big Query that denormalizes the table to avoid joins and leverages nested and repeated fields.

Comment 7

ID: 520081 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 09 Jul 2023 10:00 Selected Answer: AD Upvotes: 4

A: Denormalization increases query speed for tables with billions of rows because BigQuery's performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don't have to use JOINs, since all of the data has been combined into one table.
Denormalization also makes queries simpler because you do not have to use JOIN clauses.
https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
D: BigQuery append

Comment 8

ID: 519485 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 08 Jul 2023 11:59 Selected Answer: AD Upvotes: 3

requirements are -> performance and usability.

Denormalization will help in performance by reducing query time, update is not good with big query.

And append has better performance than Update.

Comment 9

ID: 477936 User: doninakula Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 14 May 2023 04:26 Selected Answer: - Upvotes: 1

I think AD. E is not valid because it use external table which is not good for performance

Comment 10

ID: 397474 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 03 Jan 2023 12:56 Selected Answer: - Upvotes: 5

A - correct (denormlization will help)
B - data already heavily structured (no use and no impact)
C - more than 1500 Updates not possible
D - Not sure..(because appending will increase size and cost)
E - Does not look good (increase cost..also we are storing for all days....again for query we need to issue mutiple query for all days....)

So, A & D (left out of 5)

Comment 11

ID: 388338 User: Jeysolomon Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 23 Dec 2022 01:44 Selected Answer: - Upvotes: 3

Correct Answer: AE
A – Denormalisation helps improve performance.
B, C - Not helping to address the problem.
D – Append will increase the db size and cost involved for storage and also for large number of records to scan for queries by data science team which is costlier.
E - Addresses the problem of maximising the usability of the data science team and the data. They can anayse the data exported to cloud storage instead of reading from bigquery which is expensive and impact performance considerably.

Comment 11.1

ID: 463343 User: Chelseajcole Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 17 Apr 2023 03:38 Selected Answer: - Upvotes: 1

It didn't mention cost is a concern

Comment 11.2

ID: 454223 User: retep007 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 29 Mar 2023 16:34 Selected Answer: - Upvotes: 2

E is wrong, you've been asked to use bigquery and reading files from storage in bq is significantly more time consuming

Comment 12

ID: 308968 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 15:29 Selected Answer: - Upvotes: 4

A, D:
Using BigQuery as an OLTP store is considered an anti-pattern. Because OLTP stores have a high volume of updates and deletes, they are a mismatch for the data warehouse use case. To decide which storage option best fits your use case, review the Cloud storage products table.
BigQuery is built for scale and can scale out as the size of the warehouse grows, so there is no need to delete older data. By keeping the entire history, you can deliver more insight on your business. If the storage cost is a concern, you can take advantage of BigQuery's long term storage pricing by archiving older data and using it for special analysis when the need arises. If you still have good reasons for dropping older data, you can use BigQuery's native support for date-partitioned tables and partition expiration. In other words, BigQuery can automatically delete older data.
https://cloud.google.com/solutions/bigquery-data-warehouse#handling_change

Comment 13

ID: 304956 User: Hithesh Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Wed 07 Sep 2022 06:00 Selected Answer: - Upvotes: 2

should be AC.. "Every hour, thousands of transactions are updated with a new status" if we append how we will handle the new status change..

Comment 13.1

ID: 397472 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 03 Jan 2023 12:51 Selected Answer: - Upvotes: 1

C not possible, maximum 1500 updates possible in a day

Comment 13.1.1

ID: 413265 User: raf2121 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 18:34 Selected Answer: - Upvotes: 1

DML without limits now in BQ (below blog says March 2020, Not sure whether these questions were prepared before or after March 2020)

https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery

Comment 13.1.1.1

ID: 415445 User: hdmi_switch Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 27 Jan 2023 15:11 Selected Answer: - Upvotes: 4

There is no more hard limit, but UPDATES are queued:
"BigQuery runs up to 2 of them concurrently, after which up to 20 are queued as PENDING. When a previously running job finishes, the next pending job is dequeued and run. Currently, queued mutating DML statements share a per-table queue with maximum length 20. Additional statements past the maximum queue length for each table fail."

With thousands of updates per hour, this doesn't seem feasible. I would assume the question is marked as outdated anyway or the answers are update in the actual exam.

Comment 14

ID: 293598 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 18 Aug 2022 17:50 Selected Answer: - Upvotes: 1

AC:
the problem is exactly about Updating and preserving size of database as much as possible, then denormalization and using UPDATE function from DML will address the issue. they don't want to update faster. then A & C is correct.
https://cloud.google.com/solutions/bigquery-data-warehouse

Comment 14.1

ID: 308972 User: daghayeghi Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 15:37 Selected Answer: - Upvotes: 3

A, D:
it was my mistake, we should decrease update as Bigquery is not design for update.

https://cloud.google.com/solutions/bigquery-data-warehouse#handling_change

Comment 14.2

ID: 294843 User: karthik89 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 20 Aug 2022 08:15 Selected Answer: - Upvotes: 3

you can update bigquery 1500 times in a day

Comment 15

ID: 227444 User: Nams_139 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 11:08 Selected Answer: - Upvotes: 5

A,D Since the requirements are both performance and usability.

Comment 16

ID: 223024 User: federicohi Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Thu 19 May 2022 19:03 Selected Answer: - Upvotes: 3

i tink may be ita AC becuase appending its worst to increase dataset size. THe question seems to put like a problem the size of dataset and performance to datascience so inserting more rows decrease performace for them.

Comment 17

ID: 216923 User: Ram459 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Tue 10 May 2022 20:09 Selected Answer: - Upvotes: 4

AD looks like a good fit

39. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 132

Sequence: 178
Discussion ID: 17232
Source URL: https://www.examtopics.com/discussions/google/view/17232-exam-professional-data-engineer-topic-1-question-132/
Posted By: -
Posted At: March 22, 2020, 10:15 a.m.

Question

Your United States-based company has created an application for assessing and responding to user actions. The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
✑ Single global endpoint
✑ ANSI SQL support
✑ Consistent access to the most up-to-date data
What should you do?

A. Implement BigQuery with no region selected for storage or processing.
B. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
C. Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
D. Implement Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.

Community Answer Votes

B: 15 most voted
A: 2

Comments 17 comments Click to expand

Comment 1

ID: 397806 User: sumanshu Badges: Highly Voted Relative Date: 3 years, 2 months ago Absolute Date: Tue 03 Jan 2023 22:17 Selected Answer: - Upvotes: 24

A - BigQuery with NO Region ? (Looks wrong)
B - Spanner (SQL support and Scalable and have replicas ) - Looks correct
C - SQL (can't store so many records) (wrong)
D - Bigtable - NO SQL (wrong)

Vote for B

Comment 2

ID: 180707 User: Tanmoyk Badges: Highly Voted Relative Date: 3 years, 12 months ago Absolute Date: Thu 17 Mar 2022 08:29 Selected Answer: - Upvotes: 8

B is correct, Bigquery cannot support 250K data ingestion/second , as ANSI SQL support is required , no other options left except Spanner.

Comment 3

ID: 918036 User: vaga1 Badges: Most Recent Relative Date: 1 year, 3 months ago Absolute Date: Sun 08 Dec 2024 11:49 Selected Answer: B Upvotes: 3

A - NO - BigQuery with must have a selected regional or multi-regional file storage
B - YES - Spanner is specifically designed for this high and consistent throughput
C - NO - I am not sure about what many said in this discussion as Cloud SQL can store this amount of records if u have just a few columns. Anyway, for sure Spanner is better and it is a GCP product.
D - Bigtable - it's a NoSQL solution, no ANSI

Comment 4

ID: 762707 User: AzureDP900 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 16:44 Selected Answer: - Upvotes: 1

B is the answer

Comment 5

ID: 668063 User: Remi2021 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Wed 13 Mar 2024 16:28 Selected Answer: B Upvotes: 4

Guys, read documentation well. A is wrong, BigQuery has Maximum rows per request (50,000).
https://cloud.google.com/bigquery/quotas

It is B

Comment 6

ID: 633696 User: JamesKarianis Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 19 Jan 2024 20:56 Selected Answer: B Upvotes: 2

Spanner is globally available and meets all the requirements

Comment 7

ID: 582122 User: devric Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 04:33 Selected Answer: A Upvotes: 2

Correct is A. There's no sense to having read replicas outside of US considering than the company is US based.

If you generate a dataset without specifing the Data Location it's gonna be stored in "US Multiregion" by default

Comment 8

ID: 520318 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 09 Jul 2023 16:14 Selected Answer: B Upvotes: 6

B: Cloud Spanner is the first scalable, enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.
https://cloud.google.com/spanner/
Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
https://cloud.google.com/spanner/docs/
https://cloud.google.com/spanner/docs/instances#available-configurations-multi-region

Comment 9

ID: 163167 User: haroldbenites Badges: - Relative Date: 4 years ago Absolute Date: Mon 21 Feb 2022 23:33 Selected Answer: - Upvotes: 2

B is correct

Comment 10

ID: 148333 User: Archy Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Tue 01 Feb 2022 05:59 Selected Answer: - Upvotes: 6

B, as Cloud Spanner has three types of replicas: read-write replicas, read-only replicas, and witness replicas.

Comment 11

ID: 141014 User: VishalB Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sat 22 Jan 2022 12:10 Selected Answer: - Upvotes: 2

Correct Answer : C
Explanation:-
B -> This option is incorrect, as we do not have option to configure read-replica in Cloud Spanner, Multi-region instance configurations use a combination of all three types’ read-write replicas, read-only replicas, and witness replicas
C -> This is correct option, In Cloud Sql we have option to create a master node for read-write replicas and read-only replicas in other regions
D -> This option is incorrect, as Bigtable do not support ANSI SQL

Comment 11.1

ID: 160949 User: saurabh1805 Badges: - Relative Date: 4 years ago Absolute Date: Fri 18 Feb 2022 17:28 Selected Answer: - Upvotes: 1

but bigquery does so why not A?

Comment 11.1.1

ID: 283246 User: WizzzardLlama Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Thu 04 Aug 2022 07:20 Selected Answer: - Upvotes: 3

You can't create a BigQuery instance without region selected.
I'm wondering about these read replicas, why read only replicas? It seems arbitrary, as the question does not state that API should be read-only, so there's no reason why those should be read-only replicas...

Comment 11.2

ID: 175415 User: mAbreu Badges: - Relative Date: 4 years ago Absolute Date: Tue 08 Mar 2022 00:24 Selected Answer: - Upvotes: 3

wrong, cloud spanner can have read-only replicas
https://cloud.google.com/spanner/docs/replication?hl=pt-br

Comment 12

ID: 124334 User: norwayping Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 01 Jan 2022 18:36 Selected Answer: - Upvotes: 2

I was wrong Bigtable doesn not support ansi SQL. B instead

Comment 13

ID: 121998 User: norwayping Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 28 Dec 2021 18:50 Selected Answer: - Upvotes: 2

I think it is D. THere is limitation of QPS of Cloud Spanner of 2000 qps,
https://cloud.google.com/spanner/docs/instances#multi-region-performance

Comment 14

ID: 70302 User: Rajokkiyam Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 02 Oct 2021 03:14 Selected Answer: - Upvotes: 4

Answer B

40. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 59

Sequence: 181
Discussion ID: 80950
Source URL: https://www.examtopics.com/discussions/google/view/80950-exam-professional-data-engineer-topic-1-question-59/
Posted By: Remi2021
Posted At: Sept. 7, 2022, 5:35 p.m.

Question

An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?

A. BigQuery
B. Cloud SQL
C. Cloud BigTable
D. Cloud Datastore

Community Answer Votes

B: 30 most voted
A: 14
C: 1

Comments 25 comments Click to expand

Comment 1

ID: 788849 User: PolyMoe Badges: Highly Voted Relative Date: 3 years, 1 month ago Absolute Date: Thu 26 Jan 2023 16:20 Selected Answer: B Upvotes: 11

B. Cloud SQL would be the most appropriate choice for the online retailer in this scenario. Cloud SQL is a fully-managed relational database service that allows for easy management and analysis of data using SQL. It is well-suited for applications built on Google App Engine and can handle the transactional workload of an e-commerce application, as well as the analytical workload of a BI tool.

Comment 2

ID: 832040 User: Aaronn14 Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 16:06 Selected Answer: - Upvotes: 5

A. "They want to use only a single database for this purpose" is a key requirement. You can use BigQuery for transactions, though it is not efficient. You can not use CloudSQL for analytics. So it is probably BQ.

Comment 2.1

ID: 1287781 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sun 22 Sep 2024 16:34 Selected Answer: - Upvotes: 1

Yeah my thinking was the same, but actually cloud SQL is fine to connect BI tools to, which is specified in this question.

Comment 3

ID: 959535 User: Mathew106 Badges: Most Recent Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 15:11 Selected Answer: B Upvotes: 3

Cloud SQL seems to fit the best. It supports transactions and can be used to run queries and do analytics.

BigQuery is good for the analysis part but it's not good for managing transactions. If the question needed a database just to store the data for analysis it would be ok. But if we want to update single transactions or add them row by row, then it's not good. BigQuery is not made to support an application. It's a DW.

BigTable is can not carry transactions over multiple rows and is better for large scale analytics jobs. Also we should pick it for use-cases with high throughput/low latency requirements. Seems redundant.

Comment 4

ID: 872022 User: Siddhesh05 Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 16 Apr 2023 20:14 Selected Answer: A Upvotes: 5

Big Query because of analysis

Comment 5

ID: 867685 User: izekc Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 11 Apr 2023 23:29 Selected Answer: C Upvotes: 1

Should be bigtable

Comment 5.1

ID: 1287780 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sun 22 Sep 2024 16:33 Selected Answer: - Upvotes: 2

I can't really see that. Bigtable is only ever the right choice for noSql at vast scale.

Comment 6

ID: 843123 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 18 Mar 2023 20:56 Selected Answer: A Upvotes: 4

I think BigQuery makes sense here. It works for transactions too.

Comment 6.1

ID: 856800 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 31 Mar 2023 10:35 Selected Answer: - Upvotes: 2

I just did a session with an official trainer from Google that said BigTable is better.

Comment 6.1.1

ID: 1052934 User: Fotofilico Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 24 Oct 2023 16:33 Selected Answer: - Upvotes: 2

I'm an official trainer from Google and I can say that my best two options for this scenario would be Cloud SQL and BigQuery in that order.
Also we can consider datastore since we're using it with a web app, but it's another topic.

Comment 6.1.1.1

ID: 1322351 User: certs4pk Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Thu 05 Dec 2024 13:21 Selected Answer: - Upvotes: 1

but, how to analyze 'combined data from multiple datasets' in cloud sql?

Comment 7

ID: 827752 User: ninjatech Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 09:07 Selected Answer: - Upvotes: 2

Transactional Data need to written first by application before it could be analysed so cloudsql.

Comment 8

ID: 784891 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 03:51 Selected Answer: - Upvotes: 2

Both BigQuery and Cloud Bigtable are valid options for this use case, but BigQuery is better suited for this specific scenario where the retailer needs to manage and analyze large amounts of data from multiple datasets using a BI tool.

BigQuery is a fully-managed, cloud-native data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure. It can handle large, complex datasets and is well-suited for both transactional and analytical workloads. It can also handle data from multiple datasets and can be integrated with other Google Cloud services, such as Dataflow, Dataproc and Looker for BI analysis.

While Cloud Bigtable is also a good option for this use case as it is a highly scalable and performant NoSQL database that is well-suited for handling large amounts of data and high-write loads. It is not as good as BigQuery for analytical workloads and it may not be as well-suited for this specific scenario where the retailer needs to manage and analyze large amounts of data from multiple datasets using a BI tool.

Comment 8.1

ID: 824486 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 08:03 Selected Answer: - Upvotes: 1

Bigquery is a OLAP. So it could be not a answer I think.

Comment 8.2

ID: 784892 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 03:52 Selected Answer: - Upvotes: 2

Cloud SQL and Cloud Datastore are also good options for certain use cases, but they may not be as well-suited for this specific scenario where the retailer needs to manage and analyze large amounts of data from multiple datasets using a BI tool.

Comment 9

ID: 774795 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 13 Jan 2023 20:21 Selected Answer: - Upvotes: 3

The Community is choosing Answer B - Cloud SQL, as per the question.
However when they explain - they're speaking about BQ[????]

So is it BigQuery or Cloud SQL?

Comment 10

ID: 747163 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:20 Selected Answer: B Upvotes: 1

https://cloud.google.com/bigquery/docs/partitioned-tables

Comment 11

ID: 747158 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:17 Selected Answer: B Upvotes: 4

It needs support for transaction so cloud sql is the choice of database and with Bigquery we can still analyze cloud sql data via federated queries https://cloud.google.com/bigquery/docs/reference/legacy-sql

Comment 12

ID: 745469 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 14 Dec 2022 22:36 Selected Answer: B Upvotes: 1

Most important part of question is transaction (RDBMS) strong ACID property database. Second part analysis of data, yes possible using any BI tool its possible with RDBMS db.

Comment 13

ID: 737618 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 09:55 Selected Answer: B Upvotes: 2

C and D are not able to work with BI directly, so discard.
A: It's the best option for BI for awful for transactions
B: it's the best option for transaction, and works for BI, so this must be the answer

Comment 14

ID: 727366 User: Leeeeee Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 08:30 Selected Answer: B Upvotes: 1

BigQuery for Analytics and BI

Comment 15

ID: 711226 User: Leelas Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Fri 04 Nov 2022 15:53 Selected Answer: B Upvotes: 2

Cloud Sql is Used to store Transactional Data and supports Sql Transactions. Where as Big Query is used for Analytics.

Comment 16

ID: 709451 User: Zion0722 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Tue 01 Nov 2022 23:29 Selected Answer: - Upvotes: 2

Cloud SQL supports transactions as well as analysis through a BI tool. Firestore/Datastore does not support SQL syntax typically needed to do analysis done by a BI tool. BigQuery is not suitable for transactional use case. BigTable does not support SQL.
It's A.

Comment 17

ID: 708053 User: MisuLava Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 30 Oct 2022 22:34 Selected Answer: B Upvotes: 1

it is obvious.

Comment 17.1

ID: 708054 User: MisuLava Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 30 Oct 2022 22:34 Selected Answer: - Upvotes: 1

I meant to choose A, BugQuery :)

41. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 125

Sequence: 196
Discussion ID: 17243
Source URL: https://www.examtopics.com/discussions/google/view/17243-exam-professional-data-engineer-topic-1-question-125/
Posted By: -
Posted At: March 22, 2020, 12:07 p.m.

Question

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

A. Store and process the entire dataset in BigQuery.
B. Store and process the entire dataset in Bigtable.
C. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
D. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

Community Answer Votes

C: 12 most voted
B: 2
D: 2

Comments 22 comments Click to expand

Comment 1

ID: 69733 User: Rajokkiyam Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Wed 31 Mar 2021 07:24 Selected Answer: - Upvotes: 34

Answer C.

Comment 2

ID: 116043 User: AJKumar Badges: Highly Voted Relative Date: 4 years, 8 months ago Absolute Date: Tue 22 Jun 2021 07:27 Selected Answer: - Upvotes: 25

A and B can be eliminated right away as they do not talk about providing for other cloud providers. between C and D. The question says nothing about warm or cold data-rather that data should be made available for other providers--C--can fulfill this condition. Answer C.

Comment 2.1

ID: 762452 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 31 Dec 2023 04:03 Selected Answer: - Upvotes: 1

Agree with C

Comment 3

ID: 1009942 User: zbyszek1 Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Tue 17 Sep 2024 18:22 Selected Answer: - Upvotes: 1

For me A. I can use export from BQ to Cloud Storage. There is no need to store two copies of data.

Comment 3.1

ID: 1065309 User: spicebits Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Fri 08 Nov 2024 05:05 Selected Answer: - Upvotes: 4

If you export data from BQ to GCS then you will have two copies and you will be in the same architecture as answer C.

Comment 4

ID: 964391 User: vamgcp Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 27 Jul 2024 07:48 Selected Answer: B Upvotes: 2

It can be C or D , but I will go with C as storing the full dataset in BigQuery and a compressed copy of the data in Cloud Storage is a good way to balance performance and cost.

Comment 5

ID: 911770 User: forepick Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 01 Jun 2024 07:51 Selected Answer: C Upvotes: 2

Best answer is C, although BQ can query gzipped files stored on GCS directly.
Maybe this double storage makes it a bit more highly available.

Comment 6

ID: 891038 User: izekc Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 06 May 2024 23:52 Selected Answer: D Upvotes: 1

D is much more accurate.

Comment 7

ID: 747855 User: jkhong Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 17 Dec 2023 08:57 Selected Answer: C Upvotes: 1

D → does not guarantee 100% queryable or accessible/available

Comment 8

ID: 630328 User: Smaks Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 12 Jul 2023 08:25 Selected Answer: - Upvotes: 1

You can read streaming data from Pub/Sub, and you can write streaming data to Pub/Sub or BigQuery.
Thus Cloud Storage is not a proper sink for streaming pipeline.
I vote for B, since it is possible to convert unstructured data and store in BQ

Comment 8.1

ID: 630331 User: Smaks Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 12 Jul 2023 08:30 Selected Answer: - Upvotes: 10

ignore this comment, please

Comment 9

ID: 558726 User: Aslkdup Badges: - Relative Date: 3 years ago Absolute Date: Wed 01 Mar 2023 13:26 Selected Answer: - Upvotes: 1

BQ can reach files at google storage as external table. so my answer is D. (If data was smaller than this, I would choose C)

Comment 10

ID: 525945 User: Bhawantha Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 17 Jan 2023 17:38 Selected Answer: C Upvotes: 2

both requirements are full filled.

Comment 11

ID: 520087 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 09 Jan 2023 11:08 Selected Answer: D Upvotes: 1

D: BigQuery + Cloud Storage

Comment 11.1

ID: 747854 User: jkhong Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 17 Dec 2023 08:57 Selected Answer: - Upvotes: 2

D → does not guarantee 100% queryable or accessible/available

Comment 12

ID: 519497 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 08 Jan 2023 13:08 Selected Answer: C Upvotes: 7

"You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers?"
Analytics -> BQ
Exposing -> GCS

Comment 13

ID: 487109 User: JG123 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 26 Nov 2022 07:15 Selected Answer: - Upvotes: 2

Correct: C

Comment 14

ID: 420277 User: xiaofeng_0226 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Fri 05 Aug 2022 14:34 Selected Answer: - Upvotes: 3

vote for C

Comment 15

ID: 397488 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sun 03 Jul 2022 12:19 Selected Answer: - Upvotes: 8

Vote for 'C'

A - Only Half requirement fulfil, expose as a file not getting fulfiled
B - Not a warehouse
C. Both requirements fulfiled...Bigquery and GCS
D. Both requirement fulfiled...but what if other cloud provider wants to analysis on rest 80% of the data. -

So out of 4 options, C looks okay

Comment 16

ID: 301561 User: gcper Badges: - Relative Date: 4 years ago Absolute Date: Tue 01 Mar 2022 20:19 Selected Answer: - Upvotes: 3

C

BigQuery for analytics processing and Cloud Storage for exposing the data as files

Comment 17

ID: 293710 User: daghayeghi Badges: - Relative Date: 4 years ago Absolute Date: Fri 18 Feb 2022 21:54 Selected Answer: - Upvotes: 4

answer A:
with BigQuery Omni it is now possible to read data from other cloud providers without transferring the data to GCP and thereby by saving on egress charges.
https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-omni

Comment 17.1

ID: 345893 User: salsabilsf Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sat 30 Apr 2022 07:05 Selected Answer: - Upvotes: 2

the question says :
"expose the dataset as files" wich means Cloud Storage

42. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 80

Sequence: 197
Discussion ID: 17119
Source URL: https://www.examtopics.com/discussions/google/view/17119-exam-professional-data-engineer-topic-1-question-80/
Posted By: -
Posted At: March 21, 2020, 6:19 p.m.

Question

MJTelco Case Study -

Company Overview -
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background -
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe creating many-to-many relationship between data consumers and provides in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept -
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments `" development/test, staging, and production `" to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements -
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Technical Requirements -
Ensure secure and efficient transport and storage of telemetry data
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.

CEO Statement -
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement -
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement -
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco is building a custom interface to share data. They have these requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?

A. Cloud Datastore and Cloud Bigtable
B. Cloud Bigtable and Cloud SQL
C. BigQuery and Cloud Bigtable
D. BigQuery and Cloud Storage

Community Answer Votes

C: 2 most voted

Comments 11 comments Click to expand

Comment 1

ID: 147824 User: atnafu2020 Badges: Highly Voted Relative Date: 5 years, 7 months ago Absolute Date: Fri 31 Jul 2020 07:17 Selected Answer: - Upvotes: 9

C
Bigquery and Big table =PB storage capacity
Bigtable=to read scan rows Big query select row to read

Comment 2

ID: 308105 User: daghayeghi Badges: Highly Voted Relative Date: 5 years ago Absolute Date: Thu 11 Mar 2021 18:05 Selected Answer: - Upvotes: 6

C:
They need to do aggregations over their petabyte-scale datasets: Bigquery
They need to scan specific time range rows with a very fast response time (milliseconds): Bigtable

Comment 3

ID: 1307549 User: 09878d5 Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 05 Nov 2024 21:13 Selected Answer: - Upvotes: 1

Why not A? Can someone please explain

Comment 4

ID: 926813 User: baht Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 19:03 Selected Answer: C Upvotes: 1

Response C => Bigquery and bigtable

Comment 5

ID: 885484 User: ga8our Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 30 Apr 2023 19:45 Selected Answer: - Upvotes: 5

Why not A? If we're already using Bigtable, what's the use of another, slower analytic solution, like BigQuery? Wouldn't Datastore be more useful to store our data than BigQuery?

Comment 6

ID: 783769 User: dconesoko Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 21 Jan 2023 22:55 Selected Answer: C Upvotes: 1

bigquery and bigtable

Comment 7

ID: 421772 User: sandipk91 Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Sun 08 Aug 2021 20:01 Selected Answer: - Upvotes: 1

C is correct, no doubt

Comment 8

ID: 394281 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Wed 30 Jun 2021 02:12 Selected Answer: - Upvotes: 2

Vote for C

Comment 9

ID: 209563 User: Insane7 Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Fri 30 Oct 2020 21:26 Selected Answer: - Upvotes: 1

Why not D? Biqquery and GCS.
Also Big Table is no sql, where as BQ is SQL

Comment 9.1

ID: 249451 User: Gcpyspark Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Mon 21 Dec 2020 16:39 Selected Answer: - Upvotes: 5

With GCS you can only scan the rows from BigQuery using External federated Datasources, with that millisecond latency is not possible. Also "scan specific time range rows with a very fast response time" is a natural fit use case for Cloud Bigtable.

Comment 10

ID: 161865 User: haroldbenites Badges: - Relative Date: 5 years, 6 months ago Absolute Date: Thu 20 Aug 2020 01:49 Selected Answer: - Upvotes: 4

C is correct.

43. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 72

Sequence: 199
Discussion ID: 16832
Source URL: https://www.examtopics.com/discussions/google/view/16832-exam-professional-data-engineer-topic-1-question-72/
Posted By: rickywck
Posted At: March 17, 2020, 8:06 a.m.

Question

You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.

Community Answer Votes

C: 12 most voted

Comments 24 comments Click to expand

Comment 1

ID: 307908 User: daghayeghi Badges: Highly Voted Relative Date: 4 years ago Absolute Date: Fri 11 Mar 2022 15:02 Selected Answer: - Upvotes: 54

answer C:
BigQuery can access data in external sources, known as federated sources. Instead of first
loading data into BigQuery, you can create a reference to an external source. External
sources can be Cloud Bigtable, Cloud Storage, and Google Drive.
When accessing external data, you can create either permanent or temporary external
tables. Permanent tables are those that are created in a dataset and linked to an external
source. Dataset-level access controls can be applied to these tables. When you are using a
temporary table, a table is created in a special dataset and will be available for approxi-
mately 24 hours. Temporary tables are useful for one-time operations, such as loading data
into a data warehouse.
"Dan Sullivan" Book

Comment 2

ID: 65061 User: rickywck Badges: Highly Voted Relative Date: 4 years, 12 months ago Absolute Date: Wed 17 Mar 2021 08:06 Selected Answer: - Upvotes: 11

Why not C?

Comment 3

ID: 1060032 User: emmylou Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Fri 01 Nov 2024 20:57 Selected Answer: - Upvotes: 1

On so many of these questions, how do you actually know if you're correct. I said C but the correct answer was A. Honestly, it's driving me crazy.

Comment 4

ID: 960222 User: Mathew106 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 23 Jul 2024 10:11 Selected Answer: C Upvotes: 1

For the ones saying BigTable is cheaper, BigTable in eu-north1 costs $0.748/hour per node. So if you were to run the node 24/7 you would pay more than 500$ per month. Querying 1TB of data in BigQuery is 7.5$. With smart querying and good database design you can minimize the bytes processed by BQ. So even though BigTable does not directly charge for querying, it charges for running the cluster and the overall price does not make sense. And as far as I know, it's not possible to spin up and shut down BigTable automatically.

Also, since the table is an external table to BigQuery, we incur no cost for storing that data in BigQuery and paying 300$ per month for storage.

Comment 5

ID: 785832 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 22:05 Selected Answer: - Upvotes: 2

C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.

Cloud Storage is a highly durable and cost-effective object storage service that can be used to store large amounts of text files. By storing the input data in CSV format in Cloud Storage, you can minimize costs while still being able to query the data using BigQuery.

BigQuery is a fully-managed, highly-scalable data warehouse that allows you to perform fast SQL-like queries on large datasets. By linking the Cloud Storage data as permanent tables in BigQuery, you can enable multiple users to query the data using multiple engines without the need for additional compute resources. This approach would be the most cost-effective for querying aggregate values for multiple users, as BigQuery charges based on the amount of data scanned per query, so the more data you store in BigQuery the less you pay per query.

Comment 5.1

ID: 785833 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 22:05 Selected Answer: - Upvotes: 1

Option D, using Cloud Storage for storage and linking as temporary tables in BigQuery for query, would not be the best choice because temporary tables only exist for the duration of a user session or query and you would need to create and delete them each time a user queries the data, which would add additional cost and complexity to the process.

Option A, Using Cloud Bigtable for storage, and installing the HBase shell on a Compute Engine instance to query the data, is not a cost-effective solution as Cloud Bigtable is a managed NoSQL database service which is more expensive than storing in Cloud Storage and querying in BigQuery.

Option B, Using Cloud Bigtable for storage, and linking as permanent tables in BigQuery for query, is not a cost-effective solution as Cloud Bigtable is a managed NoSQL database service which is more expensive than storing in Cloud Storage and querying in BigQuery.

Comment 6

ID: 772990 User: RoshanAshraf Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 12 Jan 2024 01:27 Selected Answer: C Upvotes: 3

CSV files - Cloud Storage
BigQuery - Aggregate, multiple users
Permanent table - multiple users
External Tables is Easy to implement, cost effective

Comment 7

ID: 742146 User: rivua Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 11 Dec 2023 23:12 Selected Answer: - Upvotes: 7

The 'correct' answers on this platform are ridiculous

Comment 8

ID: 545843 User: VishalBule Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Feb 2023 13:51 Selected Answer: - Upvotes: 1

Answer is C Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.

BigQuery can access data in external sources, known as federated sources. Instead of first loading data into BigQuery, you can create a reference to an external source. External sources can be Cloud Bigtable, Cloud Storage, and Google Drive.

When accessing external data, you can create either permanent or temporary external tables. Permanent tables are those that are created in a dataset and linked to an external source. Dataset-level access controls can be applied to these tables. When you are using a temporary table, a table is created in a special dataset and will be available for approxi- mately 24 hours. Temporary tables are useful for one-time operations, such as loading data into a data warehouse

Comment 9

ID: 517628 User: medeis_jar Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 05 Jan 2023 16:31 Selected Answer: C Upvotes: 3

Bigtable is expensive. So Cloud Storage for storing and BigQuery with permanent table for linking and querying.

Comment 10

ID: 507267 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 22 Dec 2022 18:31 Selected Answer: C Upvotes: 5

Not A or B
Big table is expensive, que initial data is in csv format, besides, if others are going to query data with multiple engines… GCS is the storage. Between c and D is all about permanent or temorary.
Permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
I think is C.

Comment 10.1

ID: 507268 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 22 Dec 2022 18:31 Selected Answer: - Upvotes: 3

https://cloud.google.com/blog/products/gcp/accessing-external-federated-data-sources-with-bigquerys-data-access-layer.
Permanent table—You create a table in a BigQuery dataset that is linked to your external data source. This allows you to use BigQuery dataset-level IAM roles to share the table with others who may have access to the underlying external data source. Use permanent tables when you need to share the table with others.
Temporary table—You submit a command that includes a query and creates a non-permanent table linked to the external data source. With this approach you do not create a table in one of your BigQuery datasets, so make sure to give consideration towards sharing the query or table. Consider using a temporary table for one-time, ad-hoc queries, or for one time extract, transform, or load (ETL) workflows

Comment 11

ID: 492118 User: maurodipa Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Fri 02 Dec 2022 04:38 Selected Answer: - Upvotes: 2

Answer is A. While C seems the most reasonable answer there are 2 points to notics: a) load jobs are limited to 15 TB across all input files in BigQuery (https://cloud.google.com/bigquery/quotas); b) It is requested to minimize the cost of querying and queries in BigTable are free, while queries in BigQuery are charged per byte (https://cloud.google.com/bigquery/pricing)

Comment 12

ID: 485776 User: Abhi16820 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 24 Nov 2022 09:40 Selected Answer: - Upvotes: 1

https://cloud.google.com/bigquery/external-data-bigtable#:~:text=shared%20with%20others.-,Querying%20an%20external%20data%20source%20using%20a%20temporary%20table%20is%20useful%20for%20one%2Dtime%2C%20ad%2Dhoc%20queries%20over%20external%20data%2C%20or%20for%20extract%2C%20transform%2C%20and%20load%20(ETL)%20processes.,-Querying%20Cloud%20Bigtable

Comment 13

ID: 464880 User: tsoetan001 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 03:46 Selected Answer: - Upvotes: 1

C is the answer.

Comment 14

ID: 453345 User: Ysance_AGS Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 13:53 Selected Answer: - Upvotes: 3

A is correct since the question asks "You want to minimize the cost of querying aggregate values" => Big Table is free when querying data.

Comment 15

ID: 445480 User: nguyenmoon Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 15 Sep 2022 23:49 Selected Answer: - Upvotes: 1

Vote for C

Comment 16

ID: 399601 User: gcp_learner Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 06 Jul 2022 04:04 Selected Answer: - Upvotes: 4

Interesting options. For me, A & B ruled out because BigTable doesn’t fit this use case, leaves us with C & D. C will incur additional cost of storing data in GCS & BigQuery because it mentions linking.

So I would go with D ie store the data in GCS and create external tables in BigQuery.

Comment 16.1

ID: 419295 User: Yiouk Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Wed 03 Aug 2022 18:11 Selected Answer: - Upvotes: 1

https://cloud.google.com/bigquery/docs/writing-results#temporary_and_permanent_tables

Comment 16.2

ID: 432545 User: triipinbee Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 26 Aug 2022 21:38 Selected Answer: - Upvotes: 1

Storage cost of data for BQ is the same as standard cloud storage, actually less for long term storage as it automatically moves to nearline storage.

https://cloud.google.com/bigquery/pricing#storage
https://cloud.google.com/storage#section-10

Comment 17

ID: 393744 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 13:57 Selected Answer: - Upvotes: 2

Vote for C

Comment 17.1

ID: 402242 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 23:30 Selected Answer: - Upvotes: 1

Vote for 'D'

BigQuery uses temporary tables to cache query results.
https://cloud.google.com/bigquery/docs/writing-results

Comment 17.1.1

ID: 402246 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 23:32 Selected Answer: - Upvotes: 1

There are no storage costs for temporary tables, but if you write query results to a permanent table, you are charged for storing the data.

Comment 17.1.1.1

ID: 432544 User: triipinbee Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Fri 26 Aug 2022 21:35 Selected Answer: - Upvotes: 1

even if it's a temp table, the storage cost for active storage in cloud bucket and the standard storage in BQ is the same. So no point creating temp tables.

44. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 236

Sequence: 205
Discussion ID: 130179
Source URL: https://www.examtopics.com/discussions/google/view/130179-exam-professional-data-engineer-topic-1-question-236/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 1:41 p.m.

Question

You are deploying a MySQL database workload onto Cloud SQL. The database must be able to scale up to support several readers from various geographic regions. The database must be highly available and meet low RTO and RPO requirements, even in the event of a regional outage. You need to ensure that interruptions to the readers are minimal during a database failover. What should you do?

A. Create a highly available Cloud SQL instance in region Create a highly available read replica in region B. Scale up read workloads by creating cascading read replicas in multiple regions. Backup the Cloud SQL instances to a multi-regional Cloud Storage bucket. Restore the Cloud SQL backup to a new instance in another region when Region A is down.
B. Create a highly available Cloud SQL instance in region A. Scale up read workloads by creating read replicas in multiple regions. Promote one of the read replicas when region A is down.
C. Create a highly available Cloud SQL instance in region A. Create a highly available read replica in region B. Scale up read workloads by creating cascading read replicas in multiple regions. Promote the read replica in region B when region A is down.
D. Create a highly available Cloud SQL instance in region A. Scale up read workloads by creating read replicas in the same region. Failover to the standby Cloud SQL instance when the primary instance fails.

Community Answer Votes

C: 15 most voted
B: 2
A: 1

Comments 10 comments Click to expand

Comment 1

ID: 1138953 User: rohan.sahi Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sat 03 Feb 2024 03:02 Selected Answer: C Upvotes: 10

Option C: Because HA read replica in multiple regions.
NotA: Coz restore from back up is time taking
NotB: No HA in Multiple regions read replica
Not D: Only one region mentioned.

Comment 2

ID: 1113991 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 20:01 Selected Answer: C Upvotes: 5

- Combines high availability with geographic distribution of read workloads.
- Promoting a highly available read replica can provide a quick failover solution, potentially meeting low RTO and RPO requirements.

=====
Why not A:
Restoring from backup to a new instance in another region during a regional outage might not meet low RTO and RPO requirements due to the time it takes to perform a restore.

Comment 2.1

ID: 1122228 User: AllenChen123 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 14 Jan 2024 04:14 Selected Answer: - Upvotes: 1

Why not B?

Comment 2.1.1

ID: 1124028 User: datapassionate Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 09:51 Selected Answer: - Upvotes: 5

Why not B:
While B option scales up read workloads across multiple regions, it doesn't specify high availability for the read replica in another region. In the event of a regional outage, promoting a non-highly available read replica might not provide the desired uptime and reliability.

Comment 3

ID: 1304362 User: mi_yulai Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 29 Oct 2024 10:11 Selected Answer: - Upvotes: 1

Why C? Is it possible to have HA enable in different regions? How the synchronization in disk will wokr for HA?

Comment 4

ID: 1128443 User: tibuenoc Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 22 Jan 2024 09:49 Selected Answer: B Upvotes: 1

https://cloud.google.com/sql/docs/mysql/replication

This option involves having read replicas in multiple regions, allowing you to promote one of them in the event of a failure in region A. While there may still be a brief interruption during the failover, it is likely to be less than the time required for the synchronization of cascading read replicas.

Comment 5

ID: 1121570 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:41 Selected Answer: B Upvotes: 1

To me, it's B. it provides:
High availability: The highly available Cloud SQL instance in region A will ensure that the database remains accessible even if one of the zones in the region becomes unavailable.
Scalability: The read replicas in multiple regions will enable you to scale up the read capacity of the database to support the demands of readers from various geographic regions.
Minimal interruptions: When region A is down, one of the read replicas in another region will be promoted to become the new primary instance. This will ensure that there is no interruption to the readers.

Comment 5.1

ID: 1121571 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:42 Selected Answer: - Upvotes: 1

Why not others:
Approach A: This approach requires you to restore a backup from a different region, which could take some time. This could result in a significant RPO (Recovery Point Objective) for the database. Additionally, the restored instance may not be physically located in the same region as the readers, which could impact performance.
Approach C: This approach requires you to promote the read replica in region B, which could result in a temporary interruption to the readers while the promotion is taking place. Additionally, the read replica in region B may not be able to handle the same level of read traffic as the primary instance in region A.
Approach D: This approach does not provide the same level of scalability as the other approaches, as you are limited to read replicas in the same region. Additionally, failover to the standby instance could result in a temporary interruption to the readers.

Comment 5.1.1

ID: 1121575 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 12:46 Selected Answer: - Upvotes: 3

Ignore my previous messages, it's C :D

Comment 6

ID: 1112748 User: scaenruy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 13:41 Selected Answer: A Upvotes: 1

A.
Create a highly available Cloud SQL instance in region Create a highly available read replica in region B. Scale up read workloads by creating cascading read replicas in multiple regions. Backup the Cloud SQL instances to a multi-regional Cloud Storage bucket. Restore the Cloud SQL backup to a new instance in another region when Region A is down.

45. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 249

Sequence: 207
Discussion ID: 130353
Source URL: https://www.examtopics.com/discussions/google/view/130353-exam-professional-data-engineer-topic-1-question-249/
Posted By: raaad
Posted At: Jan. 5, 2024, 12:36 a.m.

Question

Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day and your billing department is worried about the increasing cost of storing old data. The current business requirements are:

• The old data can be deleted anytime.
• There is no predefined access pattern of the old data.
• The old data should be available instantly when accessed.
• There should not be any charges for data retrieval.

What should you do to optimize for cost?

A. Create the bucket with the Autoclass storage class feature.
B. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.
C. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline, and 365 days to archive storage class. Delete old data as needed.
D. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 45 days to coldline, and 60 days to archive storage class. Delete old data as needed.

Community Answer Votes

A: 25 most voted
B: 8

Comments 9 comments Click to expand

Comment 1

ID: 1115763 User: Smakyel79 Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Sun 07 Jan 2024 12:30 Selected Answer: A Upvotes: 10

https://cloud.google.com/storage/docs/autoclass

Comment 2

ID: 1114126 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Fri 05 Jan 2024 00:36 Selected Answer: A Upvotes: 8

- Autoclass automatically moves objects between storage classes without impacting performance or availability, nor incurring retrieval costs.
- It continuously optimizes storage costs based on access patterns without the need to set specific lifecycle management policies.

Comment 3

ID: 1305114 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Wed 30 Oct 2024 19:50 Selected Answer: A Upvotes: 1

From the documentation https://cloud.google.com/storage/docs/autoclass

Comment 4

ID: 1248391 User: hussain.sain Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 15 Jul 2024 15:45 Selected Answer: B Upvotes: 1

the question clearly specifies there should not be any retrieval charges. so enabling autoclass is not recommended because we have to pay one time fees while retrieving the data. and usually soft delete is enable.

Comment 4.1

ID: 1272592 User: nadavw Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Mon 26 Aug 2024 11:13 Selected Answer: - Upvotes: 2

A one-time pay isn't considered a retrieval charge. A is correct

Comment 5

ID: 1191275 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 08 Apr 2024 02:22 Selected Answer: B Upvotes: 4

Why B is the best choice:

Cost Optimization: This option leverages Cloud Storage's different storage classes to significantly reduce costs for storing older data. Nearline, coldline, and archive storage classes are progressively cheaper than the standard storage class, with trade-offs in availability and retrieval times.
Meets Requirements:
Old data deletion: You can manually delete old data whenever needed, fulfilling the first requirement.
No predefined access pattern: The policy automatically transitions data to cheaper storage classes based on age, regardless of access patterns.
Instant availability: Nearline storage provides immediate access to data, meeting the third requirement.
No retrieval charges: While there are retrieval charges for coldline and archive storage, nearline storage has no retrieval fees, satisfying the fourth requirement.

Comment 6

ID: 1117331 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 10:19 Selected Answer: A Upvotes: 4

For sure A, read the documentation

Comment 7

ID: 1117123 User: GCP001 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 01:03 Selected Answer: A Upvotes: 2

autoclass is the correct way to handle all business cases

Comment 8

ID: 1115155 User: therealsohail Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 06 Jan 2024 13:02 Selected Answer: B Upvotes: 3

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.

46. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 122

Sequence: 221
Discussion ID: 16858
Source URL: https://www.examtopics.com/discussions/google/view/16858-exam-professional-data-engineer-topic-1-question-122/
Posted By: rickywck
Posted At: March 17, 2020, 12:49 p.m.

Question

You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this?
(Choose two.)

A. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
B. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
C. Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
D. Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
E. Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.

Community Answer Votes

AB: 19 most voted
AD: 3

Comments 29 comments Click to expand

Comment 1

ID: 74230 User: Ganshank Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 13 Oct 2020 21:50 Selected Answer: - Upvotes: 38

A,B
https://cloud.google.com/datastore/docs/export-import-entities

Comment 1.1

ID: 345885 User: salsabilsf Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sat 30 Oct 2021 06:53 Selected Answer: - Upvotes: 6

"while keeping the costs"

should be A,D

Comment 1.1.1

ID: 421258 User: MrCastro Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 07 Feb 2022 18:00 Selected Answer: - Upvotes: 9

Big query streaming inserts ARE NOT cheap

Comment 1.1.1.1

ID: 455245 User: hellofrnds Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 01 Apr 2022 04:05 Selected Answer: - Upvotes: 4

If you use B , not D , how can we do "point in time" recovery? is it possible?
Point in time recovery needs export along with timestamp, so that we can recover for a particular timestamp.

Comment 2

ID: 167150 User: atnafu2020 Badges: Highly Voted Relative Date: 5 years ago Absolute Date: Sat 27 Feb 2021 01:41 Selected Answer: - Upvotes: 23

AC
https://cloud.google.com/datastore/docs/export-import-entities
C: To import only a subset of entities or to import data into BigQuery, you must specify an entity filter in your export.
B: Not correct since you want to store in a different environment than Datastore. Tho this statment is true: Data exported from one Datastore mode database can be imported into another Datastore mode database, even one in another project.
A is correct
Billing and pricing for managed exports and imports in Datastore
Output files stored in Cloud Storage count towards your Cloud Storage data storage costs.
Steps to Export all the entities
1. Go to the Datastore Entities Export page in the Google Cloud Console.
2. Go to the Datastore Export page
2. Set the Namespace field to All Namespaces, and set the Kind field to All Kinds.
3. Below Destination, enter the name of your "Cloud Storage bucket".
4. Click Export.

Comment 2.1

ID: 419857 User: Yiouk Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Fri 04 Feb 2022 20:30 Selected Answer: - Upvotes: 1

C is valid because of table snapshots. Else standard time travel is valid only for 7 days
https://cloud.google.com/bigquery/docs/table-snapshots-intro#table_snapshots
https://cloud.google.com/bigquery/docs/time-travel#limitation

Comment 2.1.1

ID: 455055 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 30 Mar 2022 19:30 Selected Answer: - Upvotes: 1

you wanna say invalid?

Comment 2.2

ID: 576056 User: tavva_prudhvi Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 27 Sep 2022 09:47 Selected Answer: - Upvotes: 2

As you've mentioned in B, does the environment meant to be a project or a resource? As, we can clone a copy of the data in a datastore even in another project!? Then, it's B.

Also, in point C they didn't mention any entity filter hence we eliminate C how can you support your own statement with a different answer?

Comment 2.3

ID: 762448 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 02:55 Selected Answer: - Upvotes: 1

A, B is perfect

Comment 2.4

ID: 474749 User: aparna4387 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 09 May 2022 11:03 Selected Answer: - Upvotes: 6

https://cloud.google.com/datastore/docs/export-import-entities#import-into-bigquery
Data exported without specifying an entity filter cannot be loaded into BigQuery. This is not mentioned explicitly. Safe to assume there is no filter on the exports. So options are AB

Comment 3

ID: 1189539 User: CGS22 Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Fri 04 Oct 2024 22:57 Selected Answer: AB Upvotes: 1

https://cloud.google.com/datastore/docs/export-import-entities

Comment 4

ID: 996975 User: kskssk Badges: - Relative Date: 2 years ago Absolute Date: Sat 02 Mar 2024 18:29 Selected Answer: - Upvotes: 5

AB chatgpt
A. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class:

Managed export is a feature provided by Cloud Datastore to export your data.
Storing the data in a Cloud Storage bucket, especially using Nearline or Coldline storage classes, helps keep storage costs low while allowing you to retain the snapshots for a long time.
B. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export:

This method allows you to create snapshots by exporting data from Cloud Datastore (using managed export) and then importing it into a separate project under a unique namespace.
By importing into a separate project, you can keep a copy of the data in a different environment, which is useful for point-in-time recovery or creating clones of the data.

Comment 5

ID: 733189 User: NicolasN Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 01 Jun 2023 23:09 Selected Answer: AB Upvotes: 11

A rather complicated question, of a kind I wish I won't face in the exam. My opinion:
✅ [A] A valid and cost-effective solution satisfying the requirement for PIT recovery
✅ [B] A valid solution but far from ideal for archiving. It satisfies the requirement part "you can … clone a copy of the data for Cloud Datastore in a different environment" (an objection to the word "namespace", I think it should be just "name")

Comment 5.1

ID: 733191 User: NicolasN Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 01 Jun 2023 23:09 Selected Answer: - Upvotes: 11

❌[C] There is the limitation "Data exported without specifying an entity filter cannot be loaded into BigQuery". The entity filter for this case should contain all the kinds of entities but there is another limitation of "100 entity filter combinations". We have no knowledge of the kinds or the namespaces of the entities.
Sources:
🔗 https://cloud.google.com/datastore/docs/export-import-entities#import-into-bigquery
🔗 https://cloud.google.com/datastore/docs/export-import-entities#exporting_specific_kinds_or_namespaces
❌ [D] seems a detailed candidate solution but it violates the limitation "You cannot append Datastore export data to an existing table."
🔗 https://cloud.google.com/bigquery/docs/loading-data-cloud-datastore#appending_to_or_overwriting_a_table_with_cloud_datastore_data
❌ [E] Cloud Source Repositories are for source code and not a suitable storage for this case.

Comment 6

ID: 683682 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 30 Mar 2023 16:50 Selected Answer: AB Upvotes: 1

https://cloud.google.com/datastore/docs/export-import-entities

Comment 7

ID: 668040 User: John_Pongthorn Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 16:09 Selected Answer: AB Upvotes: 1

https://cloud.google.com/datastore/docs/export-import-entities

Comment 8

ID: 667773 User: John_Pongthorn Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 11:10 Selected Answer: AB Upvotes: 1

The answer is nothing to do with bigquery , so you can skip what mention to bigquery.

A B is the final answer

Comment 9

ID: 634033 User: DataEngineer_WideOps Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 20 Jan 2023 15:27 Selected Answer: - Upvotes: 2

A,B

For those who say using BQ as archival, How can we achieve that while datastore are NO-SQL whereas BQ are SQL , will that work? also BQ are not created for achieving purposes.

Comment 10

ID: 618360 User: AmirN Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 18 Dec 2022 20:32 Selected Answer: - Upvotes: 1

Option B is 36 times more expensive than C

Comment 11

ID: 524882 User: Nico1310 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sat 16 Jul 2022 11:21 Selected Answer: AB Upvotes: 2

AB. for sure streaming to BQ its quite expensive!

Comment 12

ID: 520080 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 09:56 Selected Answer: AD Upvotes: 3

A - Cloud Storage (long-term data + costs low)
D - BigQuery (timestamp for point-in-time (PIT) recovery)

Comment 12.1

ID: 589848 User: tavva_prudhvi Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sat 22 Oct 2022 10:53 Selected Answer: - Upvotes: 2

D is wrong, BQ Streaming inserts costs are high!

Comment 12.1.1

ID: 1099011 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 14:55 Selected Answer: - Upvotes: 1

Agreed, AB
https://cloud.google.com/datastore/docs/export-import-entities

Comment 13

ID: 519480 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 11:50 Selected Answer: AB Upvotes: 2

Option A; Cheap storage and it is a supported method https://cloud.google.com/datastore/docs/export-import-entities
Option B; Rationale - "Data exported from one Datastore mode database can be imported into another Datastore mode database, even one in another project." <https://cloud.google.com/datastore/docs/export-import-entities>

Comment 14

ID: 463583 User: squishy_fishy Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sun 17 Apr 2022 17:01 Selected Answer: - Upvotes: 1

Answer is A, B.
https://cloud.google.com/datastore/docs/export-import-entities#exporting_specific_kinds_or_namespaces

Comment 15

ID: 459031 User: sergio6 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 08 Apr 2022 07:27 Selected Answer: - Upvotes: 4

A, D
A: Option for storage system that will account for the long-term data growth
D: Option for snapshots, PIT recovery, copy of the data for Cloud Datastore in a different environment and, above all, archive snapshots for a long time
B: not a good solution for archiving snapshots for a long time
C: to import data into BigQuery, you must specify an entity filter
E: Cloud Source Repositories is for code
One note: E --> would be my second choice if there was Cloud Storage instead of Source Repositories (typo?)

Comment 16

ID: 446713 User: Chelseajcole Badges: - Relative Date: 3 years, 12 months ago Absolute Date: Thu 17 Mar 2022 19:56 Selected Answer: - Upvotes: 1

Vote A B . What’s the purpose load into bigquery?

Comment 16.1

ID: 455052 User: Chelseajcole Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 30 Mar 2022 19:26 Selected Answer: - Upvotes: 3

https://cloud.google.com/datastore/docs/export-import-entities#import-into-bigquery
Importing into BigQuery
To import data from a managed export into BigQuery, see Loading Datastore export service data.

Data exported without specifying an entity filter cannot be loaded into BigQuery. If you want to import data into BigQuery, your export request must include one or more kind names in the entity filter.

You have to specify an entity fliter before you can load from datastore to BQ. It didn't mention that at all. So C is incorrect

Comment 17

ID: 426845 User: fire558787 Badges: - Relative Date: 4 years ago Absolute Date: Fri 18 Feb 2022 15:34 Selected Answer: - Upvotes: 7

A for sure. Then I was undecided between B and C; B has high costs and C has low costs (storage is more expensive in Datastore). However the question says that you want data to be used for Datastore. There is no native way to export data from BigQuery to Datastore, hence the only two options that allow data to be restored to Datastore are A and B.

47. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 283

Sequence: 232
Discussion ID: 130510
Source URL: https://www.examtopics.com/discussions/google/view/130510-exam-professional-data-engineer-topic-1-question-283/
Posted By: GCP001
Posted At: Jan. 7, 2024, 4:17 p.m.

Question

You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries. What should you do?

A. Change the storage class of the Hive partitioned data objects from Coldline to Standard.
B. Create an individual external table for each Hive partition by using a common table name prefix. Use wildcard table queries to reference the partitioned data.
C. Upgrade the external table to a BigLake table. Enable metadata caching for the table.
D. Migrate the Hive partitioned data objects to a multi-region Cloud Storage bucket.

Community Answer Votes

C: 12 most voted

Comments 8 comments Click to expand

Comment 1

ID: 1117905 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 23:01 Selected Answer: C Upvotes: 8

- BigLake Table: BigLake allows for more efficient querying of data lakes stored in Cloud Storage. It can handle large datasets more effectively than standard external tables.
- Metadata Caching: Enabling metadata caching can significantly improve query performance by reducing the time taken to read and process metadata from a large number of files.

Comment 1.1

ID: 1127666 User: AllenChen123 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sun 21 Jul 2024 06:33 Selected Answer: - Upvotes: 4

Agree. https://cloud.google.com/bigquery/docs/biglake-intro#metadata_caching_for_performance

Comment 1.1.1

ID: 1131351 User: AllenChen123 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 06:08 Selected Answer: - Upvotes: 5

And https://cloud.google.com/bigquery/docs/external-data-cloud-storage#upgrade-external-tables-to-biglake-tables

Comment 2

ID: 1174374 User: hanoverquay Badges: Most Recent Relative Date: 1 year, 5 months ago Absolute Date: Sun 15 Sep 2024 16:15 Selected Answer: C Upvotes: 1

vote C

Comment 3

ID: 1155425 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 10:13 Selected Answer: C Upvotes: 1

Option C

Comment 4

ID: 1121863 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 16:43 Selected Answer: C Upvotes: 1

Option C

Comment 5

ID: 1118362 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 10 Jul 2024 08:17 Selected Answer: C Upvotes: 1

agree with C

Comment 6

ID: 1115952 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 15:17 Selected Answer: - Upvotes: 2

C. Upgrade the external table to a BigLake table. Enable metadata caching for the table.
Check ref - https://cloud.google.com/bigquery/docs/biglake-intro

48. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 152

Sequence: 235
Discussion ID: 17216
Source URL: https://www.examtopics.com/discussions/google/view/17216-exam-professional-data-engineer-topic-1-question-152/
Posted By: -
Posted At: March 22, 2020, 8:05 a.m.

Question

You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

A. BigQuery
B. Cloud Bigtable
C. Cloud Datastore
D. Cloud SQL for PostgreSQL

Community Answer Votes

A: 4 most voted

Comments 13 comments Click to expand

Comment 1

ID: 1165484 User: moumou Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Wed 04 Sep 2024 09:49 Selected Answer: A Upvotes: 1

[Removed] Highly Voted 3 years, 11 months ago
Answer: A

Comment 2

ID: 1163905 User: mothkuri Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Mon 02 Sep 2024 03:59 Selected Answer: - Upvotes: 1

Answer : A
Statement "You want to have a dashboard that shows how many and which ships are likely to cause delays within a region" means we run analytical queries using ML. So BigQuery is Correct answer and it can able to store large volume of data

Comment 3

ID: 1015481 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 07:04 Selected Answer: A Upvotes: 3

Here's why BigQuery is a good choice:

Scalable Data Storage: BigQuery is a fully managed, highly scalable data warehouse that can handle large volumes of data, including your 40 TB dataset. It allows you to store and manage your data efficiently.

SQL for Predictive Analytics: BigQuery supports standard SQL and has built-in machine learning capabilities through BigQuery ML. You can easily build predictive models using SQL queries, which aligns with your goal of predicting ship delays.

Geospatial Processing: BigQuery has robust support for geospatial data processing. It provides functions for working with GeoJSON and geospatial data types, making it suitable for your ship telemetry data and geospatial analysis.

Integration with Dashboards: BigQuery can be easily integrated with visualization tools like Google Data Studio or other BI tools. You can create interactive dashboards to monitor ship delays based on your model's predictions.

Comment 4

ID: 812904 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 11:43 Selected Answer: - Upvotes: 1

Answer B: BigTable,
Catchup words: Telemetry (sensor- semi structured data) as data is bigger than 500GB, datastore is not a good option.
GEOJSON , bigquery has geospatical capabilites but still not quick enough for semi structure geojson data.
Prediction for delay of ships <<likely to>> For me its time crucial and almost real time requirement. BigQuery is not suitable for it.
Best solution for this case is: Use BigTable for storage, create a datflow pipeline / google cloud AI platform for time senstive prediction.

Comment 4.1

ID: 820939 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 20:59 Selected Answer: - Upvotes: 4

answer A: You are just looking for a storage solution not a workflow

Comment 5

ID: 717773 User: Atnafu Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 14 May 2023 07:50 Selected Answer: - Upvotes: 1

A
Geospatial analytics let you analyze and visualize geospatial data in BigQuery by using geography data types and Google Standard SQL geography functions. + BigqueryML

Comment 6

ID: 486421 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 04:01 Selected Answer: - Upvotes: 1

Answer: C

Comment 7

ID: 480760 User: Chihhanyu Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 18 May 2022 14:01 Selected Answer: - Upvotes: 3

GeoJson + Native functionality for prediction -> BigQuery

Comment 8

ID: 476765 User: singh_payal_1404 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Thu 12 May 2022 07:44 Selected Answer: - Upvotes: 1

Answer : A

Comment 9

ID: 458624 User: PM17 Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Thu 07 Apr 2022 10:58 Selected Answer: - Upvotes: 3

This is more of a question that an answer but: How much data can Bigquery handle?

40TB seems to be a lot and bigtable can handle that, but of course Bigquery is better when it comes to ML and GIS.

Comment 10

ID: 163751 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 19:28 Selected Answer: - Upvotes: 3

A is correct

Comment 11

ID: 157742 User: FARR Badges: - Relative Date: 5 years ago Absolute Date: Sun 14 Feb 2021 05:11 Selected Answer: - Upvotes: 3

A
https://cloud.google.com/bigquery/docs/gis-intro

Comment 12

ID: 70573 User: Rajokkiyam Badges: - Relative Date: 5 years, 5 months ago Absolute Date: Sat 03 Oct 2020 02:30 Selected Answer: - Upvotes: 2

Answer A

49. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 154

Sequence: 240
Discussion ID: 16688
Source URL: https://www.examtopics.com/discussions/google/view/16688-exam-professional-data-engineer-topic-1-question-154/
Posted By: madhu1171
Posted At: March 15, 2020, 7:17 p.m.

Question

You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?

A. Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
B. Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
C. Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
D. Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.

Community Answer Votes

A: 17 most voted
B: 12

Comments 20 comments Click to expand

Comment 1

ID: 64410 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Tue 15 Sep 2020 18:17 Selected Answer: - Upvotes: 32

A should be correct answer

Comment 1.1

ID: 504309 User: tycho Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 18 Jun 2022 15:52 Selected Answer: - Upvotes: 5

yes A is correct, whe creating ne cloud sql instance there is an option
"Multiple zones (Highly available)
Automatic failover to another zone within your selected region. Recommended for production instances. Increases cost."

Comment 2

ID: 416071 User: hdmi_switch Badges: Highly Voted Relative Date: 4 years, 1 month ago Absolute Date: Fri 28 Jan 2022 11:11 Selected Answer: - Upvotes: 7

Seems to depend on the date the question was published.

A) is mentioned in legacy config "The legacy configuration for high availability used a failover replica instance." https://cloud.google.com/sql/docs/mysql/configure-legacy-ha
B) Read replica is mentioned here https://cloud.google.com/sql/docs/mysql/high-availability scroll down to the diagrams
C) external read replica seems to be wrong, never heard of it
D) Aligns which Google steps, but automatic backup is not mentioned for HA https://cloud.google.com/sql/docs/mysql/configure-ha#ha-create just regional instance mentioned

I would pick D now, since regional instance is described as HA, although the automatic backup seems like a bonus.

Comment 3

ID: 1163908 User: mothkuri Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Mon 02 Sep 2024 04:05 Selected Answer: - Upvotes: 2

Answer : A
Question is about high availability in the event of zone failure. So create Fail over replica in another zone in same region.

Comment 4

ID: 1100414 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 07:33 Selected Answer: A Upvotes: 6

A (failover replicas) as this is an old question:

In a legacy HA configuration, a Cloud SQL for MySQL instance uses a failover replica to add high availability to the instance. This functionality isn't available in Google Cloud console.

The new configuration doesn't use failover replicas. Instead, it uses Google's regional persistent disks, which synchronously replicate data at the block-level between two zones in a region.
https://cloud.google.com/sql/docs/mysql/configure-legacy-ha

Comment 5

ID: 1076632 User: pss111423 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 20:16 Selected Answer: - Upvotes: 1

Option A is good fro leagacy soultion
Note: Cloud SQL plans to discontinue support for legacy HA instances in the future and will soon be announcing a date to do so. Currently, legacy HA instances are still covered by the Cloud SQL SLA. We recommend you upgrade your existing legacy HA instances to regional persistent disk HA instances and create new instances using regional persistent disk HA as soon as possible
Option C makes more sense in this regrard

Comment 6

ID: 1070731 User: emmylou Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 14 May 2024 18:29 Selected Answer: - Upvotes: 1

A - Although it is legacy and will be deprecated. The correct answer is not an option--
"The legacy configuration for high availability used a failover replica instance. The new configuration does not use a failover replica. Instead, it uses Google's regional persistent disks, which synchronously replicate data at the block level between two zones in a region."

Comment 7

ID: 1015843 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 16:46 Selected Answer: A Upvotes: 2

Failover Replica: By creating a failover replica in another zone within the same region, you establish a high-availability configuration. The failover replica is kept in sync with the primary instance, and it can quickly take over in case of a failure of the primary instance.

Same Region: Placing the failover replica in the same region ensures minimal latency and data consistency. In the event of a zone failure, the failover can happen within the same region, reducing potential downtime.

Zone Resilience: Google Cloud's regional design ensures that zones within a region are independent of each other, which adds resilience to zone failures.

Automatic Failover: In case of a primary instance failure, Cloud SQL will automatically promote the failover replica to become the new primary instance, minimizing downtime.

Comment 8

ID: 989325 User: samstar4180 Badges: - Relative Date: 2 years ago Absolute Date: Sat 24 Feb 2024 19:05 Selected Answer: - Upvotes: 1

Per latest Google cloud document, B is the correct answer.

Comment 9

ID: 953254 User: wan2three Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 16 Jan 2024 14:23 Selected Answer: B Upvotes: 3

Cross-region read replicas
Cross-region replication lets you create a read replica in a different region from the primary instance. You create a cross-region read replica the same way as you create an in-region replica.

Cross-region replicas:

Improve read performance by making replicas available closer to your application's region.
Provide additional disaster recovery capability to guard against a regional failure.
Let you migrate data from one region to another.
https://cloud.google.com/sql/docs/mysql/replication#cross-region-read-replicas:~:text=memory%20(OOM)%20events.-,Cross%2Dregion%20read%20replicas,Let%20you%20migrate%20data%20from%20one%20region%20to%20another.,-See%20Promoting%20replicas

Comment 10

ID: 946572 User: MoeHaydar Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Mon 08 Jan 2024 17:57 Selected Answer: B Upvotes: 2

The legacy process for adding high availability to MySQL instances uses a failover replica. The legacy functionality isn't available in the Google Cloud console. See Legacy configuration: Creating a new instance configured for high availability or Legacy configuration: Configuring an existing instance for high availability.

Comment 11

ID: 941387 User: KK0202 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 03 Jan 2024 05:48 Selected Answer: B Upvotes: 4

The correct answer is most probably B as this his scenario has an update(As of July 2023). Failover replicas are not available anymore. Same region different zone read replicas are used in case of a failover or if primary zone is not available

Comment 12

ID: 902712 User: MBRSDG Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 20 Nov 2023 20:59 Selected Answer: B Upvotes: 2

The answer is B, the failover replica is a legacy feature.
See here: https://cloud.google.com/sql/docs/mysql/high-availability#legacy_mysql_high_availability_option

Comment 12.1

ID: 912040 User: forepick Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 01 Dec 2023 14:54 Selected Answer: - Upvotes: 1

Read replica isn't an alternative to the standby instance

Comment 13

ID: 893843 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 14:03 Selected Answer: A Upvotes: 3

read replica (B) and external read replica (C) doesn't make sense here, since we potentially need all the functionalities. Using Cloud SQL in a region combined with Cloud Storage backup may not be the best choice (D) thinking about compliance reasons starting from what has been asked, it seems also "too much" compared with A that fullfills the request with simpler actions. Also, compliance is required at the regional level, so then A fits.

Comment 14

ID: 832990 User: wjtb Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 08 Sep 2023 13:21 Selected Answer: - Upvotes: 6

Failover replica's are a legacy feature. This question is outdated: https://cloud.google.com/sql/docs/mysql/configure-ha

Comment 15

ID: 812933 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 12:17 Selected Answer: - Upvotes: 2

Answer A, key words to remember, High Scale use extra read replica. High availablity use extra failure replica. Both should be in different zone but in same region.

Comment 16

ID: 789198 User: desertlotus1211 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 00:39 Selected Answer: - Upvotes: 2

Answer is B: https://cloud.google.com/sql/docs/mysql/replication#read-replicas

'As a best practice, put read replicas in a different zone than the primary instance when you use HA on your primary instance'

Comment 16.1

ID: 789199 User: desertlotus1211 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 00:40 Selected Answer: - Upvotes: 1

The questions asks to ensure high availability in the event of a zone failure

Comment 17

ID: 701560 User: louisgcpde Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 22 Apr 2023 15:49 Selected Answer: A Upvotes: 2

A should be the answer as the question is asking HA in event of a zone failure.
”Read Replicas CAN be promoted to master nodes in the case of DR. However, there is downtime entailed.
Failover Replicas are designed to automatically become master nodes.”
https://googlecloudarchitect.us/read-replica-versus-failover-replica-in-cloud-sql/

50. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 96

Sequence: 241
Discussion ID: 16843
Source URL: https://www.examtopics.com/discussions/google/view/16843-exam-professional-data-engineer-topic-1-question-96/
Posted By: rickywck
Posted At: March 17, 2020, 9:51 a.m.

Question

You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
✑ You will batch-load the posts once per day and run them through the Cloud Natural Language API.
✑ You will extract topics and sentiment from the posts.
✑ You must store the raw posts for archiving and reprocessing.
✑ You will create dashboards to be shared with people both inside and outside your organization.
You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for historical archiving. What should you do?

A. Store the social media posts and the data extracted from the API in BigQuery.
B. Store the social media posts and the data extracted from the API in Cloud SQL.
C. Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
D. Feed to social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.

Community Answer Votes

C: 13 most voted

Comments 17 comments Click to expand

Comment 1

ID: 74202 User: psu Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Wed 13 Oct 2021 20:45 Selected Answer: - Upvotes: 17

Answer should be C, becose they ask you to save a copy of the raw posts for archival, which may not be possible if you directly feed the posts to the API.

Comment 2

ID: 675310 User: sedado77 Badges: Highly Voted Relative Date: 1 year, 11 months ago Absolute Date: Thu 21 Mar 2024 19:22 Selected Answer: C Upvotes: 5

I got this question on sept 2022. Answer is C

Comment 3

ID: 825575 User: itz_me_sudhir Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Sun 01 Sep 2024 06:40 Selected Answer: - Upvotes: 2

can any one help me with the rest of question from 101 to 209 as i dont have a contributor access

Comment 4

ID: 664593 User: Erso Badges: - Relative Date: 2 years ago Absolute Date: Sat 09 Mar 2024 16:09 Selected Answer: C Upvotes: 1

C is the correct one

Comment 5

ID: 518491 User: medeis_jar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 06 Jul 2023 19:02 Selected Answer: C Upvotes: 2

Only C make sense.

Comment 6

ID: 513459 User: MaxNRG Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 14:50 Selected Answer: C Upvotes: 3

You must store the raw posts for archiving and reprocessing, Store the raw social media posts in Cloud Storage.
B is expensive
D is not valid since you have to store the raw posts for archiving
Between A and C I’s say C, since we’re going to make dashboards and Data Studio will connect well with big query.
and besides A would probably be more expensive.

Comment 7

ID: 493831 User: BigQuery Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 04 Jun 2023 17:08 Selected Answer: - Upvotes: 4

SAY MY NAME!

Comment 8

ID: 490524 User: StefanoG Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 08:32 Selected Answer: C Upvotes: 2

Analysis BQ
Storage GCS

Comment 9

ID: 426221 User: fire558787 Badges: - Relative Date: 3 years ago Absolute Date: Fri 17 Feb 2023 12:11 Selected Answer: - Upvotes: 1

I believe the API accesses data only from GCS Buckets not BigQuery (but I'm not entirely sure)

Comment 10

ID: 396124 User: sumanshu Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 01 Jan 2023 18:34 Selected Answer: - Upvotes: 2

Vote for C

Comment 11

ID: 269205 User: DPonly Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sun 17 Jul 2022 01:01 Selected Answer: - Upvotes: 2

Answer should be C because we need to consider storage archival

Comment 12

ID: 221719 User: arghya13 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 18 May 2022 09:32 Selected Answer: - Upvotes: 2

I'll go with option C

Comment 13

ID: 215873 User: Alasmindas Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Mon 09 May 2022 10:56 Selected Answer: - Upvotes: 3

I will go with Option C, because of the following reasons:-
a) Social media posts are "raw" - which means - it can be of any format (blob/object storage) is preferred.
b) The output from the application (assuming the application is Cloud NLP) is to be future stored for archival purpose - and hence again Google Cloud storage is the best option - so option C
Option A &C - Incorrect, although Option D fulfils the requirement of "fewest step" but storing data in big query for archival purpose is not a google recommended approach
Option B : Cloud SQL rules out as it does not solve either for archival storage or for analytics purpose.

Comment 14

ID: 186543 User: singhkrishna Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 25 Mar 2022 00:25 Selected Answer: - Upvotes: 1

cost of long term storing is almost same in GCS and BQ, so answer D makes sense from that angle..

Comment 15

ID: 175782 User: Tanmoyk Badges: - Relative Date: 4 years ago Absolute Date: Tue 08 Mar 2022 12:50 Selected Answer: - Upvotes: 2

The job is supposed to run in batch process once in a day , so there is no requirement of stream data. The best economical and less complex steps is answer C

Comment 16

ID: 163509 User: Ravivarma4786 Badges: - Relative Date: 4 years ago Absolute Date: Tue 22 Feb 2022 12:54 Selected Answer: - Upvotes: 2

Bigquery is suitable for social media post ans should be C

Comment 17

ID: 134207 User: tprashanth Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Thu 13 Jan 2022 21:27 Selected Answer: - Upvotes: 3

C.
First store the data on GCS, then extract only the relavant info for analysis and load into BQ. This way, huge data ie., audio, videos can stay on GCS (not lost). BQ cannot store audio/video. And note that Cloud Natural Language API is used for analysis which uses Text as it's source

51. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 92

Sequence: 244
Discussion ID: 17257
Source URL: https://www.examtopics.com/discussions/google/view/17257-exam-professional-data-engineer-topic-1-question-92/
Posted By: -
Posted At: March 22, 2020, 4:39 p.m.

Question

You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern.
Which service do you select for storing and serving your data?

A. Cloud Spanner
B. Cloud Bigtable
C. Cloud Firestore
D. Cloud SQL

Community Answer Votes

D: 4 most voted

Comments 16 comments Click to expand

Comment 1

ID: 826623 User: midgoo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 02 Sep 2023 07:53 Selected Answer: D Upvotes: 2

Cloud SQL: max storage for shared core = 3TB and for dedicated core = up to 64TB

Only use Spanner if we need autoscale (Note that Cloud SQL could scale too but not automatic yet) or the size is too big (as above) or 4/5 9s HA (Cloud SQL is only 99.95)

Comment 2

ID: 693119 User: Nirca Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 12 Apr 2023 15:29 Selected Answer: D Upvotes: 1

Cloud SQL is relational DB (pg mssql, mysql)

Comment 3

ID: 614930 User: Dhass Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 11 Dec 2022 12:55 Selected Answer: - Upvotes: 1

Answer - D

Comment 4

ID: 600278 User: homaj Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Fri 11 Nov 2022 21:35 Selected Answer: D Upvotes: 1

answer D

Comment 5

ID: 395963 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 01 Jan 2022 15:49 Selected Answer: - Upvotes: 3

Vote for D

Comment 6

ID: 308357 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Sat 11 Sep 2021 21:31 Selected Answer: - Upvotes: 1

D:
https://cloud.google.com/sql/docs/features

Comment 7

ID: 246191 User: GypsyMonkey Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 17 Jun 2021 04:32 Selected Answer: - Upvotes: 3

D, cloud SQL is a relational database; if > 10tb, then choose spanner

Comment 8

ID: 163383 User: atnafu2020 Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 09:20 Selected Answer: - Upvotes: 3

D
Cloud SQL supports MySQL 5.6 or 5.7, and provides up to 624 GB of RAM and 30 TB of data storage, with the option to automatically increase the storage size as needed.

Comment 8.1

ID: 485592 User: Abhi16820 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Tue 24 May 2022 04:01 Selected Answer: - Upvotes: 5

64TB AS OF TODAY

Comment 9

ID: 162553 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Sun 21 Feb 2021 02:57 Selected Answer: - Upvotes: 3

D is correct. Obviously

Comment 10

ID: 76130 User: Barniyah Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Sun 18 Oct 2020 19:28 Selected Answer: - Upvotes: 3

But cloud SQL storage is limited to several hundreds of GB's for all instances
and we need 2TB.
So, Cloud spanner is much closer to this, with the exception of the cost

Comment 10.1

ID: 80602 User: taepyung Badges: - Relative Date: 5 years, 4 months ago Absolute Date: Wed 28 Oct 2020 06:34 Selected Answer: - Upvotes: 13

At this moment, Cloud SQL is providing up to 30,720GB(about 30TB)
So I think it's D.

Comment 10.2

ID: 1158078 User: Preetmehta1234 Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 24 Aug 2024 17:55 Selected Answer: - Upvotes: 1

Nope. Now, the Dedicated core is Up to 64 TB
https://cloud.google.com/sql/docs/quotas

Comment 10.3

ID: 93784 User: Barniyah Badges: - Relative Date: 5 years, 3 months ago Absolute Date: Sun 22 Nov 2020 10:11 Selected Answer: - Upvotes: 5

Sorry , I think it's D
https://cloud.google.com/sql/docs/features
(Cloud SQL supports MySQL 5.6 or 5.7, and provides up to 416 GB of RAM and 30 TB of data storage, with the option to automatically increase the storage size as needed.)

Comment 10.4

ID: 241007 User: xrun Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Fri 11 Jun 2021 14:47 Selected Answer: - Upvotes: 5

Another consideration is that Cloud SQL uses standard databases like MySQL, PostgreSQL and now MS SQL. Cloud Spanner is a proprietary product of Google and does some things differently than typical databases (no stored procedures and triggers). So migrating to Cloud Spanner makes application refactoring necessary. So Cloud SQL is the answer.

Comment 10.4.1

ID: 1099094 User: LaxmanTiwari Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 17 Jun 2024 16:59 Selected Answer: - Upvotes: 1

Well explained I can confirmed.

52. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 136

Sequence: 246
Discussion ID: 16932
Source URL: https://www.examtopics.com/discussions/google/view/16932-exam-professional-data-engineer-topic-1-question-136/
Posted By: jvg637
Posted At: March 18, 2020, 4:39 p.m.

Question

You are running a pipeline in Dataflow that receives messages from a Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose two.)

A. Increase the number of max workers
B. Use a larger instance type for your Dataflow workers
C. Change the zone of your Dataflow pipeline to run in us-central1
D. Create a temporary table in Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Bigtable to BigQuery
E. Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery

Community Answer Votes

AB: 17 most voted

Comments 21 comments Click to expand

Comment 1

ID: 65686 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Fri 18 Sep 2020 15:39 Selected Answer: - Upvotes: 50

A & B
instance n1-standard-1 is low configuration and hence need to be larger configuration, definitely B should be one of the option.
Increase max workers will increase parallelism and hence will be able to process faster given larger CPU size and multi core processor instance type is chosen. Option A can be a better step.

Comment 1.1

ID: 762720 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 16:58 Selected Answer: - Upvotes: 2

Agreed

Comment 2

ID: 398143 User: sumanshu Badges: Highly Voted Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 11:39 Selected Answer: - Upvotes: 14

A & B.

With autoscaling enabled, the Dataflow service does not allow user control of the exact number of worker instances allocated to your job. You might still cap the number of workers by specifying the --max_num_workers option when you run your pipeline. Here as per question CAP is 3, So we can change that CAP.

For batch jobs, the default machine type is n1-standard-1. For streaming jobs, the default machine type for Streaming Engine-enabled jobs is n1-standard-2 and the default machine type for non-Streaming Engine jobs is n1-standard-4. When using the default machine types, the Dataflow service can therefore allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.

Comment 3

ID: 1155275 User: et2137 Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Wed 21 Aug 2024 05:48 Selected Answer: AB Upvotes: 1

A & B is correct

Comment 4

ID: 1022734 User: kcl10 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Tue 02 Apr 2024 04:03 Selected Answer: AB Upvotes: 1

A & B is correct

Comment 5

ID: 1016200 User: juliorevk Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 25 Mar 2024 00:55 Selected Answer: AB Upvotes: 1

A because more workers improves performance through parallel work
B because the current instance size is too small

Comment 6

ID: 1015429 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 05:27 Selected Answer: AB Upvotes: 3

A. Increase the number of max workers:
By increasing the number of maximum workers, you allow Dataflow to allocate more computing resources to handle the peak load of incoming data. This can help improve processing speed and reduce CPU utilization per worker.

B. Use a larger instance type for your Dataflow workers:
Using a larger instance type with more CPU and memory resources can help your Dataflow workers handle a higher volume of data and processing tasks more efficiently. It can address CPU bottlenecks during peak periods.

Comment 7

ID: 717228 User: mbacelar Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 13 May 2023 10:05 Selected Answer: AB Upvotes: 1

Scale in and Scale Out

Comment 8

ID: 612748 User: FrankT2L Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 15:18 Selected Answer: AB Upvotes: 2

maximum of 3 workers: Increase the number of max workers (A)
instance type n1-standard-1: Use a larger instance type for your Cloud Dataflow workers (B)

Comment 9

ID: 520364 User: MaxNRG Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 17:19 Selected Answer: AB Upvotes: 4

A & B, other options don't make sense

Comment 10

ID: 519527 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 13:05 Selected Answer: AB Upvotes: 2

Only A & B make sense for improving pipeline performance.

Comment 11

ID: 510507 User: Mjvsj Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 27 Jun 2022 17:23 Selected Answer: AB Upvotes: 2

Should be A & B

Comment 12

ID: 293951 User: daghayeghi Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Thu 19 Aug 2021 05:05 Selected Answer: - Upvotes: 2

B, E:
B: Dataflow manage number of worker automatically, then we only can define machine type worker.
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline
E: and adding a horizontally scale-able database like cloud spanner will reduce pressure on dataflow as it don't have to move data to specific zone and can be remain in same zone of EU, then E is correct.

Comment 12.1

ID: 368642 User: Vasu_1 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sun 28 Nov 2021 13:07 Selected Answer: - Upvotes: 3

A & B is the right answer: You can set disable auto-scaling by setting the option --numWorkers (default is 3) and select the machine type by setting --workerMachineType at the time of creation of the pipeline (this applies to both auto and manual scaling)

Comment 13

ID: 222013 User: kavs Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Tue 18 May 2021 15:25 Selected Answer: - Upvotes: 3

Dataset is in EU so data can't be moved outside EU due to privacy law so zone option is ruled out. AB is Ok but intermediate table will boost perf apanee ruled out not sure of bigtable

Comment 14

ID: 217204 User: Alasmindas Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Tue 11 May 2021 10:12 Selected Answer: - Upvotes: 3

Option A and B for sure,
Option C : Changing Zone has nothing to do in improving performance
Option D and E : Adding BQ and BT is waste of many and does not solve the purpose of the question.

Comment 15

ID: 185845 User: SureshKotla Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Wed 24 Mar 2021 06:44 Selected Answer: - Upvotes: 2

B & D
DF will automatically take care of increasing workers. Developers won't need to access the settings . https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling

Comment 15.1

ID: 185854 User: SureshKotla Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Wed 24 Mar 2021 06:49 Selected Answer: - Upvotes: 2

On second thought, A B is looking right

Comment 15.2

ID: 398142 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 04 Jan 2022 11:36 Selected Answer: - Upvotes: 1

automatically taking care of workers up to 3 (as the maximum worker is 3 set as per questions)

Comment 16

ID: 167627 User: atnafu2020 Badges: - Relative Date: 5 years ago Absolute Date: Sat 27 Feb 2021 17:51 Selected Answer: - Upvotes: 2

AB
is correct

Comment 17

ID: 163190 User: haroldbenites Badges: - Relative Date: 5 years ago Absolute Date: Mon 22 Feb 2021 00:52 Selected Answer: - Upvotes: 5

A , E is correct

53. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 215

Sequence: 255
Discussion ID: 129862
Source URL: https://www.examtopics.com/discussions/google/view/129862-exam-professional-data-engineer-topic-1-question-215/
Posted By: e70ea9e
Posted At: Dec. 30, 2023, 9:40 a.m.

Question

You are administering a BigQuery dataset that uses a customer-managed encryption key (CMEK). You need to share the dataset with a partner organization that does not have access to your CMEK. What should you do?

A. Provide the partner organization a copy of your CMEKs to decrypt the data.
B. Export the tables to parquet files to a Cloud Storage bucket and grant the storageinsights.viewer role on the bucket to the partner organization.
C. Copy the tables you need to share to a dataset without CMEKs. Create an Analytics Hub listing for this dataset.
D. Create an authorized view that contains the CMEK to decrypt the data when accessed.

Community Answer Votes

C: 7 most voted

Comments 3 comments Click to expand

Comment 1

ID: 1152525 User: JyoGCP Badges: - Relative Date: 1 year, 6 months ago Absolute Date: Sat 17 Aug 2024 12:06 Selected Answer: C Upvotes: 2

Analytics Hub

Comment 2

ID: 1113213 User: raaad Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 23:24 Selected Answer: C Upvotes: 3

- Create a copy of the necessary tables into a new dataset that doesn't use CMEK, ensuring the data is accessible without requiring the partner to have access to the encryption key.
- Analytics Hub can then be used to share this data securely and efficiently with the partner organization, maintaining control and governance over the shared data.

Comment 3

ID: 1109540 User: e70ea9e Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sun 30 Jun 2024 08:40 Selected Answer: C Upvotes: 2

Preserves Key Confidentiality:

Avoids sharing your CMEK with the partner, upholding key security and control.

54. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 251

Sequence: 259
Discussion ID: 130202
Source URL: https://www.examtopics.com/discussions/google/view/130202-exam-professional-data-engineer-topic-1-question-251/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 4:16 p.m.

Question

You have important legal hold documents in a Cloud Storage bucket. You need to ensure that these documents are not deleted or modified. What should you do?

A. Set a retention policy. Lock the retention policy.
B. Set a retention policy. Set the default storage class to Archive for long-term digital preservation.
C. Enable the Object Versioning feature. Add a lifecycle rule.
D. Enable the Object Versioning feature. Create a copy in a bucket in a different region.

Community Answer Votes

A: 11 most voted

Comments 5 comments Click to expand

Comment 1

ID: 1114138 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 00:12 Selected Answer: A Upvotes: 6

- Setting a retention policy on a Cloud Storage bucket prevents objects from being deleted for the duration of the retention period.
- Locking the policy makes it immutable, meaning that the retention period cannot be reduced or removed, thus ensuring that the documents cannot be deleted or overwritten until the retention period expires.

Comment 1.1

ID: 1124983 User: AllenChen123 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 17 Jul 2024 13:10 Selected Answer: - Upvotes: 2

Agree. https://cloud.google.com/storage/docs/bucket-lock#overview

Comment 2

ID: 1154479 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 04:07 Selected Answer: A Upvotes: 2

Option A

Comment 3

ID: 1121704 User: Matt_108 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 14:04 Selected Answer: A Upvotes: 2

Option A - set retention policy to prevent deletion, lock it to make it immutable (not subject to edits)

Comment 4

ID: 1112882 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 15:16 Selected Answer: A Upvotes: 1

A. Set a retention policy. Lock the retention policy.

55. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 252

Sequence: 260
Discussion ID: 130203
Source URL: https://www.examtopics.com/discussions/google/view/130203-exam-professional-data-engineer-topic-1-question-252/
Posted By: scaenruy
Posted At: Jan. 3, 2024, 4:20 p.m.

Question

You are designing a data warehouse in BigQuery to analyze sales data for a telecommunication service provider. You need to create a data model for customers, products, and subscriptions. All customers, products, and subscriptions can be updated monthly, but you must maintain a historical record of all data. You plan to use the visualization layer for current and historical reporting. You need to ensure that the data model is simple, easy-to-use, and cost-effective. What should you do?

A. Create a normalized model with tables for each entity. Use snapshots before updates to track historical data.
B. Create a normalized model with tables for each entity. Keep all input files in a Cloud Storage bucket to track historical data.
C. Create a denormalized model with nested and repeated fields. Update the table and use snapshots to track historical data.
D. Create a denormalized, append-only model with nested and repeated fields. Use the ingestion timestamp to track historical data.

Community Answer Votes

D: 6 most voted

Comments 6 comments Click to expand

Comment 1

ID: 1114139 User: raaad Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 00:18 Selected Answer: - Upvotes: 11

- A denormalized, append-only model simplifies query complexity by eliminating the need for joins.
- Adding data with an ingestion timestamp allows for easy retrieval of both current and historical states.
- Instead of updating records, new records are appended, which maintains historical information without the need to create separate snapshots.

Comment 2

ID: 1154484 User: JyoGCP Badges: Most Recent Relative Date: 1 year, 6 months ago Absolute Date: Tue 20 Aug 2024 04:14 Selected Answer: D Upvotes: 1

Option D

Comment 3

ID: 1124792 User: JimmyBK Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Wed 17 Jul 2024 08:33 Selected Answer: D Upvotes: 1

Straight forward, good for costs

Comment 4

ID: 1117347 User: Sofiia98 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 09 Jul 2024 09:37 Selected Answer: D Upvotes: 1

D looks logical

Comment 5

ID: 1117059 User: GCP001 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Mon 08 Jul 2024 22:14 Selected Answer: D Upvotes: 1

Easy, cost effective and no cpmpexity

Comment 6

ID: 1112884 User: scaenruy Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Wed 03 Jul 2024 15:20 Selected Answer: D Upvotes: 2

D. Create a denormalized, append-only model with nested and repeated fields. Use the ingestion timestamp to track historical data.

56. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 139

Sequence: 264
Discussion ID: 79676
Source URL: https://www.examtopics.com/discussions/google/view/79676-exam-professional-data-engineer-topic-1-question-139/
Posted By: ducc
Posted At: Sept. 3, 2022, 6:40 a.m.

Question

You are building a new data pipeline to share data between two different types of applications: jobs generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?

A. Create an API using App Engine to receive and send messages to the applications
B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
C. Create a table on Cloud SQL, and insert and delete rows with the job information
D. Create a table on Cloud Spanner, and insert and delete rows with the job information

Community Answer Votes

B: 19 most voted

Comments 10 comments Click to expand

Comment 1

ID: 738070 User: jkhong Badges: Highly Voted Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 15:52 Selected Answer: B Upvotes: 9

Job generators (they would be the publishers).
Job runners = subscribers

Question mentions that it must scale (of which push subscription has automatic scaling) and can accommodate additional new applications (this can be solved by having multiple subscriptions, with each relating to a unique application) to a central topic

Comment 1.1

ID: 762727 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 17:02 Selected Answer: - Upvotes: 4

Yes it is
B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them

Comment 2

ID: 1146312 User: srivastavas08 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 12:47 Selected Answer: - Upvotes: 1

A. App Engine API: While scalable, it introduces a central point of failure and might not be as performant as Pub/Sub for high-volume data.
C. Cloud SQL: Not designed for real-time data sharing and continuous updates, leading to potential bottlenecks and performance issues.
D. Cloud Spanner: Offers strong consistency and global distribution, but its pricing model might be less suitable for high-volume, cost-sensitive workloads compared to Pub/Sub.

Comment 3

ID: 1016203 User: juliorevk Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 25 Mar 2024 01:14 Selected Answer: B Upvotes: 1

B to decouple jobs being generated and run. Pub/Sub also scales seamlessly

Comment 4

ID: 1015436 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 05:39 Selected Answer: B Upvotes: 2

B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them.

Scalability: Cloud Pub/Sub is a highly scalable messaging service that can handle a significant volume of messages and subscribers. It can easily accommodate increases in usage as your data pipeline scales.

Decoupling: Using Pub/Sub decouples the job generators from the job runners, which is a good architectural choice for flexibility and scalability. Job generators publish messages to a topic, and job runners subscribe to that topic to execute jobs when they are available.

Adding New Applications: With Cloud Pub/Sub, adding new applications (new publishers or subscribers) is straightforward. You can simply create new publishers to send jobs or new subscribers to consume jobs without impacting existing components.

Comment 5

ID: 812349 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 17 Aug 2023 20:43 Selected Answer: - Upvotes: 2

key words here: job generators (pushlish message on pub/sub) and job runners(subscribe message for further analysis). You may add as much as pushlishing job and subscribing job to same topic. So Answer B.
Using API , app engine is also good approach but its more complex than pub/sub.

Comment 6

ID: 716836 User: Atnafu Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Fri 12 May 2023 17:32 Selected Answer: - Upvotes: 1

A
Since it's application i will go with

Comment 7

ID: 661074 User: arpitagrawal Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 13:14 Selected Answer: B Upvotes: 3

use pubsub

Comment 8

ID: 660015 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 12:56 Selected Answer: B Upvotes: 2

I would tend to think B , one of the use of pub/sub is decoupling app

Comment 9

ID: 658071 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 07:40 Selected Answer: B Upvotes: 2

I choose B

57. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 76

Sequence: 269
Discussion ID: 17114
Source URL: https://www.examtopics.com/discussions/google/view/17114-exam-professional-data-engineer-topic-1-question-76/
Posted By: -
Posted At: March 21, 2020, 6:01 p.m.

Question

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability.
D. In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket.

Community Answer Votes

B: 34 most voted
D: 22
A: 1

Comments 28 comments Click to expand

Comment 1

ID: 315800 User: Mitra123 Badges: Highly Voted Relative Date: 4 years, 5 months ago Absolute Date: Mon 20 Sep 2021 18:27 Selected Answer: - Upvotes: 50

Keywords here are
1. "Archived": Immutable and hence, BQ and Cloud SQL are ruled out
2. "Auditable": Means track any changes done.
Only D can provide the audibility piece!
I will go with D

Comment 1.1

ID: 888751 User: Jarek7 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 19:41 Selected Answer: - Upvotes: 11

I have no idea why so many upvotes on this answer:
1) archived doesn't mean immutable and cloud storage is not immutable too.
2) auditable means viewable for authorized personel - and in this case not changes need to be monitored but any access.
3) with option D it is easy to go around logging - you can add another access to the bucket read the data remove the access and no one will ever know that you accessed the data.
4) option D is much more difficult - you need to application on AppEngine to log the data and provide access for users.
5) option D doesn;t explain where and how it stores the audit data - it could be accessed and modified from some side app/service.

Comment 2

ID: 68738 User: [Removed] Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Mon 28 Sep 2020 03:57 Selected Answer: - Upvotes: 23

Answer: B
Description: Bigquery is used to analyse access logs, data access logs capture the details of the user that accessed the data

Comment 2.1

ID: 399184 User: awssp12345 Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 05 Jan 2022 15:54 Selected Answer: - Upvotes: 12

The question has no mention of ANALYZE.. BQ is not correct. I would go with D.

Comment 2.2

ID: 524528 User: sraakesh95 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 16 Jul 2022 00:42 Selected Answer: - Upvotes: 1

There is no option for archiving with BQ

Comment 2.2.1

ID: 586811 User: tavva_prudhvi Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 16 Oct 2022 16:23 Selected Answer: - Upvotes: 5

You dont need to archive the expiring logs, you have to archive the un-archived data here! See the question, it says "Assuming that all expiring logs will be archived correctly", which means they are already stored somewhere like in GCS!!! Hence, better to store the remaining un-archived data in BQ.

Comment 2.2.1.1

ID: 631923 User: vartiklis Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 15 Jan 2023 23:15 Selected Answer: - Upvotes: 3

The question is about where to store the _data_ for which the logs will be generated.

The bit you quoted is about the _logs_ that will be generated when accesssing data. The “archived correctly” implies that proper retention policies will be set up if you choose GCS.

Comment 3

ID: 1142965 User: philli1011 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Wed 07 Aug 2024 05:15 Selected Answer: - Upvotes: 1

In recent GCP, we have cloud audit.

Comment 4

ID: 1086201 User: Nandababy Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sun 02 Jun 2024 13:21 Selected Answer: - Upvotes: 1

Option B is valid only when analytics to be performed over logs, which is not mentioned anywhere

Comment 5

ID: 1084836 User: rocky48 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 31 May 2024 23:38 Selected Answer: B Upvotes: 3

For maintaining an auditable record of access to certain types of data, especially when government regulations are in place, the most suitable option would be:

B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.

Storing the data in a BigQuery dataset with restricted access ensures control over who can view the data, and utilizing Data Access logs provides a comprehensive audit trail for compliance purposes. This option aligns well with the need for maintaining an auditable record as mandated by government regulations.

Comment 6

ID: 888760 User: Jarek7 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 19:53 Selected Answer: B Upvotes: 9

If you are going for option D, why do you eliminate option B? The only REAL difference is that for opption D you need to develop an app for storing log data and providing bucket link and in option B you have it all done BETTER by GCP. You might also pay a bit more for BQ storage, but the question never mentions about cost optimization.
BTW in the D option the bucket is accessible only by AppEngine service, so what will the user do with the provided link? he has no access anyway... And if he even has the access to this link what stops him form using the same link many times? How the AppEngine get and store the information what specific data he accessed and how?

Comment 6.1

ID: 899427 User: Kiroo Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 16 Nov 2023 20:27 Selected Answer: - Upvotes: 3

That was my thought, either B or D could work but D it’s a little bit odd create an app to do something that could be achieved natively gcp

Comment 6.2

ID: 915534 User: phidelics Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 05 Dec 2023 18:34 Selected Answer: - Upvotes: 2

I was about to say the same thing. Why go through that stress?

Comment 7

ID: 875677 User: Rodrigo4N Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 20 Oct 2023 16:02 Selected Answer: D Upvotes: 2

D amongus

Comment 8

ID: 848578 User: juliobs Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 23 Sep 2023 19:06 Selected Answer: B Upvotes: 3

They want to know where you can store **data** in a way that every access is logged in an auditable way.

Both BQ and GCS have audit logs, except that in alternative D you're circumventing it by creating your own logs. I doubt Google would recommend that.

By types of data you can understand "financial type", "marketing type", etc.

Comment 9

ID: 826470 User: midgoo Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 02 Sep 2023 03:27 Selected Answer: D Upvotes: 4

I was thinking it should be A. However, 'data' in this question is too vague. It does not say anywhere that the data could fit in BigQuery tables. It could be unstructure data such as videos or images
Option D seems to involve more setup but it is the only viable option for this scenario. Note that GCS do have Cloud Audit logs. That should be the best option. Maybe this question was asked when Cloud Audit log is not yet available for GCS.

Comment 10

ID: 791581 User: aleixfc96 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 29 Jul 2023 11:24 Selected Answer: - Upvotes: 1

It is so clear that is B lol

Comment 11

ID: 791361 User: NamitSehgal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 29 Jul 2023 03:31 Selected Answer: - Upvotes: 1

B bigquery for a record set store

Comment 12

ID: 789456 User: PolyMoe Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 09:12 Selected Answer: B Upvotes: 2

B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.

BigQuery provides built-in logging of all data access, including the user's identity, the specific query run and the time of the query. This log can be used to provide an auditable record of access to the data. Additionally, BigQuery allows you to control access to the dataset using Identity and Access Management (IAM) roles, so you can ensure that only authorized personnel can view the dataset.

Comment 13

ID: 786094 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 02:52 Selected Answer: - Upvotes: 3

B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.

BigQuery provides built-in logging of all data access, including the user's identity, the specific query run and the time of the query. This log can be used to provide an auditable record of access to the data. Additionally, BigQuery allows you to control access to the dataset using Identity and Access Management (IAM) roles, so you can ensure that only authorized personnel can view the dataset.

Comment 13.1

ID: 786095 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 02:52 Selected Answer: - Upvotes: 2

A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user. is a good option for data security but it does not provide an auditable record of access to the data.

C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability. is also a good option for data security but it does not provide an auditable record of access to the data.

D. In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket. is also a good option but it requires additional setup and maintenance of the AppEngine service, and it may not provide an auditable record of access to the data.

Comment 13.2

ID: 880113 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 10:08 Selected Answer: - Upvotes: 2

gpt: You are correct that option A does not provide an auditable record of access to the data, as it only addresses data security through encryption. Option C provides auditability through Cloud SQL Admin activity logs, but it may not be the best option as it requires additional setup and management.

Option D is a feasible solution, but as you mentioned, it requires additional setup and maintenance of the AppEngine service. It also may not provide a comprehensive audit log of all data access.

Option B, storing the data in a BigQuery dataset that is viewable only by authorized personnel and using the Data Access log to provide auditability, is the most appropriate option as it provides built-in logging of all data access and allows you to control access to the dataset using IAM roles. Therefore, it provides both data security and auditable access to the data. /// ok let it be B

Comment 13.2.1

ID: 882564 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 27 Oct 2023 13:08 Selected Answer: - Upvotes: 1

OR MAYBE D....

Comment 13.2.1.1

ID: 894198 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 20:05 Selected Answer: - Upvotes: 1

!!! confused. Give 69% confidence to B, as user Jarek7 explained

Comment 14

ID: 782112 User: GCPpro Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 20 Jul 2023 10:32 Selected Answer: - Upvotes: 1

D is the correct answer

Comment 15

ID: 773001 User: RoshanAshraf Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 12 Jul 2023 00:46 Selected Answer: D Upvotes: 1

Keys
TYPES of data --> Cloud Storage not BQ
Archival --> Cloud Storage
Access --> No decryption keys to all users

Comment 16

ID: 759377 User: PrashantGupta1616 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 28 Jun 2023 05:12 Selected Answer: D Upvotes: 1

I will go with D

Comment 17

ID: 745647 User: DGames Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 15 Jun 2023 02:44 Selected Answer: D Upvotes: 1

Keyword, Archiver , certain type of data, auditable, GCS is better option . Durability 11 time 9 to store log immutable for long time.

58. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 32

Sequence: 277
Discussion ID: 17052
Source URL: https://www.examtopics.com/discussions/google/view/17052-exam-professional-data-engineer-topic-1-question-32/
Posted By: -
Posted At: March 20, 2020, 2:15 p.m.

Question

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

A. Redefine the schema by evenly distributing reads and writes across the row space of the table.
B. The performance issue should be resolved over time as the site of the BigDate cluster is increased.
C. Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
D. Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Community Answer Votes

A: 17 most voted

Comments 18 comments Click to expand

Comment 1

ID: 179287 User: IsaB Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Mon 14 Sep 2020 14:32 Selected Answer: - Upvotes: 54

I hate it when I read the question, than I think oh easy and I KNOW the answer, then I look at the choices and the answer I thought of is just not there at all... and I realize I absolutely have no idea :'D

Comment 2

ID: 478168 User: MaxNRG Badges: Highly Voted Relative Date: 4 years, 3 months ago Absolute Date: Sun 14 Nov 2021 15:33 Selected Answer: - Upvotes: 11

A as the schema needs to be redesigned to distribute the reads and writes evenly across each table.
Refer GCP documentation - Bigtable Performance:
https://cloud.google.com/bigtable/docs/performance
The table's schema is not designed correctly. To get good performance from Cloud Bigtable, it's essential to design a schema that makes it possible to distribute reads and writes evenly across each table. See Designing Your Schema for more information.
https://cloud.google.com/bigtable/docs/schema-design
Option B is wrong as increasing the size of cluster would increase the cost.
Option C is wrong as single row key for frequently updated identifiers reduces performance
Option D is wrong as sequential IDs would degrade the performance.
A safer approach is to use a reversed version of the user's numeric ID, which spreads traffic more evenly across all of the nodes for your Cloud Bigtable table.

Comment 3

ID: 1255191 User: 09878d5 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Thu 25 Jul 2024 21:02 Selected Answer: A Upvotes: 1

B is a Lie
C and D are actually not recommended
A is correct as it will help in even distribution of load and avoid hotspots

Comment 4

ID: 1087723 User: JOKKUNO Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 15:46 Selected Answer: - Upvotes: 3

Improving performance in Google Cloud Bigtable involves optimizing the schema design to distribute the load efficiently across the clusters. Given the scenario, the best option would be:

A. Redefine the schema by evenly distributing reads and writes across the row space of the table.

Explanation:

Distributing reads and writes evenly across the row space helps prevent hotspots and ensures that the load is spread evenly, avoiding performance bottlenecks.
Google Cloud Bigtable's performance is influenced by how well the data is distributed across the tablet servers, and evenly distributing the load can lead to better performance.
This approach aligns with best practices for designing scalable and performant Bigtable schemas.

Comment 5

ID: 1076453 User: axantroff Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 17:32 Selected Answer: A Upvotes: 1

The comment from hilel_eth totally makes sense to me. I would go with A

Comment 6

ID: 980413 User: hkris909 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 14 Aug 2023 05:23 Selected Answer: - Upvotes: 7

Guys, how relevant are these questions, as of Aug 14, 2023 Could we clear the PDE exam with these set of questions?

Comment 6.1

ID: 1093371 User: roty Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 11 Dec 2023 11:25 Selected Answer: - Upvotes: 2

HEY DID U CLEAR THE EXAM

Comment 7

ID: 966263 User: FP77 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 29 Jul 2023 11:13 Selected Answer: A Upvotes: 1

A is the only one that makes sense and is correct

Comment 8

ID: 961315 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 10:08 Selected Answer: - Upvotes: 1

I understand why it could be A. But why not B also? Is it because of the typo saying BigDate instead of BigTable?

Comment 9

ID: 866660 User: Adswerve Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Tue 11 Apr 2023 00:29 Selected Answer: A Upvotes: 2

A to avoid hot-spotting https://cloud.google.com/bigtable/docs/schema-design

Comment 10

ID: 755903 User: Brillianttyagi Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Sun 25 Dec 2022 19:23 Selected Answer: A Upvotes: 6

A -
Make sure you're reading and writing many different rows in your table. Bigtable performs best when reads and writes are evenly distributed throughout your table, which helps Bigtable distribute the workload across all of the nodes in your cluster. If reads and writes cannot be spread across all of your Bigtable nodes, performance will suffer.

https://cloud.google.com/bigtable/docs/performance#troubleshooting

Comment 11

ID: 741262 User: hilel_eth Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sat 10 Dec 2022 23:00 Selected Answer: A Upvotes: 3

A good way to improve read and write performance in a database system like Google Cloud Bigtable is to redefine the schema of the table so that reads and writes are evenly distributed across the row space of the table. This can help reduce bottlenecks in processing capacity and improve efficiency in table management. In addition, by evenly distributing read and write operations, it can prevent the accumulation of operations in one part of the table, which can improve the overall performance of the system.

Comment 12

ID: 560182 User: Arkon88 Badges: - Relative Date: 4 years ago Absolute Date: Thu 03 Mar 2022 16:43 Selected Answer: A Upvotes: 3

A is correct
https://cloud.google.com/bigtable/docs/performance#troubleshooting

If you find that you're reading and writing only a small number of rows, you might need to redesign your schema so that reads and writes are more evenly distributed.

Comment 13

ID: 530925 User: samdhimal Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Mon 24 Jan 2022 01:21 Selected Answer: - Upvotes: 2

correct answer -> Redefine the schema by evenly distributing reads and writes across the row space of the table.

Make sure you're reading and writing many different rows in your table. Bigtable performs best when reads and writes are evenly distributed throughout your table, which helps Bigtable distribute the workload across all of the nodes in your cluster. If reads and writes cannot be spread across all of your Bigtable nodes, performance will suffer.
If you find that you're reading and writing only a small number of rows, you might need to redesign your schema so that reads and writes are more evenly distributed.

Reference: https://cloud.google.com/bigtable/docs/performance#troubleshooting

Comment 14

ID: 461142 User: anji007 Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 12 Oct 2021 17:52 Selected Answer: - Upvotes: 1

Ans: A

Comment 15

ID: 401888 User: sumanshu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Thu 08 Jul 2021 14:03 Selected Answer: - Upvotes: 3

Vote for A

Comment 16

ID: 398223 User: timolo Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Sun 04 Jul 2021 12:41 Selected Answer: - Upvotes: 4

Correct is A: https://cloud.google.com/bigtable/docs/performance#troubleshooting

Make sure you're reading and writing many different rows in your table. Bigtable performs best when reads and writes are evenly distributed throughout your table, which helps Bigtable distribute the workload across all of the nodes in your cluster. If reads and writes cannot be spread across all of your Bigtable nodes, performance will suffer.

Comment 17

ID: 285611 User: naga Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Sun 07 Feb 2021 17:00 Selected Answer: - Upvotes: 4

Correct A

59. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 201

Sequence: 280
Discussion ID: 79651
Source URL: https://www.examtopics.com/discussions/google/view/79651-exam-professional-data-engineer-topic-1-question-201/
Posted By: ducc
Posted At: Sept. 3, 2022, 4:12 a.m.

Question

You need to migrate a Redis database from an on-premises data center to a Memorystore for Redis instance. You want to follow Google-recommended practices and perform the migration for minimal cost, time and effort. What should you do?

A. Make an RDB backup of the Redis database, use the gsutil utility to copy the RDB file into a Cloud Storage bucket, and then import the RDB file into the Memorystore for Redis instance.
B. Make a secondary instance of the Redis database on a Compute Engine instance and then perform a live cutover.
C. Create a Dataflow job to read the Redis database from the on-premises data center and write the data to a Memorystore for Redis instance.
D. Write a shell script to migrate the Redis data and create a new Memorystore for Redis instance.

Community Answer Votes

A: 19 most voted

Comments 6 comments Click to expand

Comment 1

ID: 658066 User: AWSandeep Badges: Highly Voted Relative Date: 2 years, 6 months ago Absolute Date: Sun 03 Sep 2023 06:32 Selected Answer: A Upvotes: 14

A. Make an RDB backup of the Redis database, use the gsutil utility to copy the RDB file into a Cloud Storage bucket, and then import the RDB file into the Memorystore for Redis instance.

The import and export feature uses the native RDB snapshot feature of Redis to import data into or export data out of a Memorystore for Redis instance. The use of the native RDB format prevents lock-in and makes it very easy to move data within Google Cloud or outside of Google Cloud. Import and export uses Cloud Storage buckets to store RDB files.

Reference:
https://cloud.google.com/memorystore/docs/redis/import-export-overview

Comment 2

ID: 960808 User: vamgcp Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Tue 23 Jul 2024 22:26 Selected Answer: A Upvotes: 2

Option A

Comment 3

ID: 763427 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 02 Jan 2024 02:09 Selected Answer: - Upvotes: 1

A. Make an RDB backup of the Redis database, use the gsutil utility to copy the RDB file into a Cloud Storage bucket, and then import the RDB file into the Memorystore for Redis instance.

Comment 4

ID: 725723 User: gudiking Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 24 Nov 2023 11:04 Selected Answer: A Upvotes: 1

A
https://cloud.google.com/memorystore/docs/redis/import-data

Comment 5

ID: 725634 User: Atnafu Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 24 Nov 2023 08:39 Selected Answer: - Upvotes: 2

A
Import and export uses Cloud Storage buckets to store RDB files.
https://cloud.google.com/memorystore/docs/redis/about-importing-exporting#:~:text=Import%20and%20export%20uses%20Cloud%20Storage%20buckets%20to%20store%20RDB%20files.

Comment 6

ID: 657978 User: ducc Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 03 Sep 2023 04:12 Selected Answer: A Upvotes: 2

A
https://cloud.google.com/memorystore/docs/redis/general-best-practices

60. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 67

Sequence: 283
Discussion ID: 17110
Source URL: https://www.examtopics.com/discussions/google/view/17110-exam-professional-data-engineer-topic-1-question-67/
Posted By: -
Posted At: March 21, 2020, 4:25 p.m.

Question

You are developing an application that uses a recommendation engine on Google Cloud. Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data. What should you do?

A. Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Cloud Dataproc. Call the model from your application.
B. Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Cloud Dataproc. Call the models from your application.
C. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user's viewing history to generate preferences.
D. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud SQL, and join and filter the predicted labels to match the user's viewing history to generate preferences.

Community Answer Votes

C: 5 most voted

Comments 21 comments Click to expand

Comment 1

ID: 307799 User: daghayeghi Badges: Highly Voted Relative Date: 4 years ago Absolute Date: Fri 11 Mar 2022 12:23 Selected Answer: - Upvotes: 7

answer C:
If we presume that use label of video as a rowkey, Bigtable will be the best option. because it can store several TB, but Cloud SQL is limited to 30TB.

Comment 2

ID: 219034 User: Alasmindas Badges: Highly Voted Relative Date: 4 years, 3 months ago Absolute Date: Sun 14 Nov 2021 10:42 Selected Answer: - Upvotes: 7

Option C is the correct answer.
1. Rather than building a new model - it is better to use Google provide APIs, here - Google Video Intelligence.
So option A and B rules out
2) Between SQL and Bigtable - Bigtable is the better option as Bigtable support row-key filtering. Joining the filters is not required.

Comment 3

ID: 959634 User: Mathew106 Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Mon 22 Jul 2024 16:52 Selected Answer: C Upvotes: 2

I don't even know if MLLib has out-of-the-box Computer Vision models. Developing this in Dataproc would be a nightmare.

Using the computer vision API on the other hand makes perfect sense.

The fact that the filtering must happen very fast and that this is a customer facing application points to BigTable so that there is very little latency and high availability. BigTable is eventually consistent but that doesn't really matter for this application.

CloudSQL will ensure strong consistency which we don't really need but it is slower and supports max 64 TB. The description mentions multiple TBs. Not really sure what several means here, but Cloud SQL doesn't have a high cap.

Comment 4

ID: 943407 User: euro202 Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 05 Jul 2024 09:15 Selected Answer: C Upvotes: 1

We need a model that extracts labels from videos, so Vision API could be used.
Then we need a DB very fast and that can handle several TB of data, so BigTable is the best choice.
Answer is C.

Comment 5

ID: 783858 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Mon 22 Jan 2024 02:35 Selected Answer: - Upvotes: 2

Option C is the correct choice because it utilizes the Cloud Video Intelligence API to generate labels for the entities in the videos, which would save time and resources compared to building and training a model from scratch. Additionally, by storing the data in Cloud Bigtable, it allows for fast and efficient filtering of the predicted labels based on the user's viewing history and preferences. This is a more efficient and cost-effective approach than storing the data in Cloud SQL and performing joins and filters.

Comment 6

ID: 766095 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 22:08 Selected Answer: - Upvotes: 1

Answer is C
Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user's viewing history to generate preferences.

1. Rather than building a new model - it is better to use Google provide APIs, here - Google Video Intelligence. So option A and B rules out
2. Between SQL and Bigtable - Bigtable is the better option as Bigtable support row-key filtering. Joining the filters is not required.

Reference:
https://cloud.google.com/video-intelligence/docs/feature-label-detection

Comment 7

ID: 506283 User: MaxNRG Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 21 Dec 2022 17:52 Selected Answer: C Upvotes: 2

C.
The cloud video intelillence api does the label generation without the need of building any model, A and B are excluded. Now, the bbdd most suitable for this is bigtable and not SQL (this big joins would be anything but fast).
https://cloud.google.com/video-intelligence/docs/feature-label-detection

Comment 8

ID: 393299 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 00:52 Selected Answer: - Upvotes: 4

Vote for C

Comment 9

ID: 318511 User: timolo Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 23 Mar 2022 22:30 Selected Answer: - Upvotes: 2

Answer: C
Reference https://cloud.google.com/video-intelligence/docs/feature-label-detection

Comment 10

ID: 246152 User: NamitSehgal Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 17 Dec 2021 03:58 Selected Answer: - Upvotes: 3

Answer: C

Comment 11

ID: 185359 User: SureshKotla Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Thu 23 Sep 2021 15:59 Selected Answer: - Upvotes: 2

Answer is D : BigTable doesnt support JOIN and not built for transactions - https://cloud.google.com/bigtable/docs/overview

Comment 11.1

ID: 207156 User: Surjit24 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Wed 27 Oct 2021 16:09 Selected Answer: - Upvotes: 4

There are no joins but filtering based on condition.

Comment 11.1.1

ID: 289947 User: karthik89 Badges: - Relative Date: 4 years ago Absolute Date: Mon 14 Feb 2022 03:50 Selected Answer: - Upvotes: 2

but the requirement involves join as well, it is stated in the problem.

Comment 11.1.1.1

ID: 402212 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 22:20 Selected Answer: - Upvotes: 1

Where? Though it's mention - " very fast filtering suggestions" - which means something like dictionary in python OR Key: Value (which is Bigtable)

Comment 11.1.1.1.1

ID: 524378 User: sraakesh95 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 15 Jan 2023 20:07 Selected Answer: - Upvotes: 1

I think "based on other customer preferences" from the questions requires a join before a filter is applied for collaborative filtering

Comment 11.1.1.1.1.1

ID: 579739 User: Deepakd Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 02 Apr 2023 08:30 Selected Answer: - Upvotes: 1

Recommendation based on other customer”s views cannot be achieved through simple joins. A class pf machine learning algorithms called collaborative filtering is required for that. You need big table to run these algorithms.

Comment 12

ID: 161820 User: haroldbenites Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Fri 20 Aug 2021 00:32 Selected Answer: - Upvotes: 2

Correct C

Comment 13

ID: 127966 User: dg63 Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Tue 06 Jul 2021 18:14 Selected Answer: - Upvotes: 2

I doubt if C can be an answer. Will Bigtable allow filtering on labels?

Comment 13.1

ID: 134000 User: tprashanth Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Tue 13 Jul 2021 16:21 Selected Answer: - Upvotes: 3

Yes, if its part of the rowkey

Comment 14

ID: 126835 User: Rajuuu Badges: - Relative Date: 4 years, 8 months ago Absolute Date: Mon 05 Jul 2021 14:27 Selected Answer: - Upvotes: 4

Answer is C.

Comment 15

ID: 73186 User: Ganshank Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Sun 11 Apr 2021 08:41 Selected Answer: - Upvotes: 7

C.
The recommendation requires filtering based on several TB of data, therefore BigTable is the recommended option vs Cloud SQL which is limited to 10TB.

61. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 73

Sequence: 284
Discussion ID: 17112
Source URL: https://www.examtopics.com/discussions/google/view/17112-exam-professional-data-engineer-topic-1-question-73/
Posted By: -
Posted At: March 21, 2020, 4:52 p.m.

Question

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally.
You also want to optimize data for range queries on non-key columns. What should you do?

A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
B. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
D. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.

Community Answer Votes

C: 4 most voted

Comments 13 comments Click to expand

Comment 1

ID: 870169 User: nhanhoangle Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 14 Apr 2024 14:28 Selected Answer: C Upvotes: 1

Correct: C

Comment 2

ID: 789093 User: PolyMoe Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 21:30 Selected Answer: C Upvotes: 2

Cloud Spanner is a fully-managed, horizontally scalable relational database service that supports transactions and allows you to optimize data for range queries on non-key columns. By using Cloud Spanner for storage, you can ensure that your database can scale horizontally to meet the needs of your application.
To optimize data for range queries on non-key columns, you can add secondary indexes, this will allow you to perform range scans on non-key columns, which can improve the performance of queries that filter on non-key columns.

Comment 3

ID: 785839 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 22:09 Selected Answer: - Upvotes: 3

C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.

Cloud Spanner is a fully-managed, horizontally scalable relational database service that supports transactions and allows you to optimize data for range queries on non-key columns. By using Cloud Spanner for storage, you can ensure that your database can scale horizontally to meet the needs of your application.
To optimize data for range queries on non-key columns, you can add secondary indexes, this will allow you to perform range scans on non-key columns, which can improve the performance of queries that filter on non-key columns.

Comment 3.1

ID: 785840 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 22:09 Selected Answer: - Upvotes: 2

- Option A, Using Cloud SQL for storage and adding secondary indexes to support query patterns, may not be the best option as Cloud SQL is a relational database service that does not support horizontal scaling and may not be able to handle the large amount of data and the number of queries required by your application.

Comment 3.1.1

ID: 960225 User: Mathew106 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Tue 23 Jul 2024 10:16 Selected Answer: - Upvotes: 2

Cloud SQL does support replicas to increase availability. Why is that not considered horizontal scaling?

Comment 3.1.2

ID: 785841 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 22:09 Selected Answer: - Upvotes: 2

- Option B, Using Cloud SQL for storage and using Cloud Dataflow to transform data to support query patterns, may not be the best option as Cloud SQL is a relational database service that does not support horizontal scaling and may not be able to handle the large amount of data and the number of queries required by your application. Additionally, Cloud Dataflow is a data processing service and not a storage service, so it may not be the best fit for this use case.

- Option D, Using Cloud Spanner for storage and using Cloud Dataflow to transform data to support query patterns, is not necessary as Cloud Spanner provides the ability to optimize data for range queries on non-key columns by adding secondary indexes. Cloud Spanner also supports transactional consistency, which is a feature that allows you to perform multiple operations that must be performed together in a single transaction. Additionally, Cloud Dataflow is a data processing service and not a storage service, so it may not be the best fit for this use case.

Comment 4

ID: 668443 User: sedado77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 14 Sep 2023 00:15 Selected Answer: C Upvotes: 1

As sumanshu said

Comment 5

ID: 464881 User: tsoetan001 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 03:47 Selected Answer: - Upvotes: 1

Answer: C

Comment 6

ID: 393747 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 13:59 Selected Answer: - Upvotes: 4

Vote for C

Comment 6.1

ID: 402248 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Fri 08 Jul 2022 23:33 Selected Answer: - Upvotes: 8

A is not correct because Cloud SQL does not natively scale horizontally.
B is not correct because Cloud SQL does not natively scale horizontally.
C is correct because Cloud Spanner scales horizontally, and you can create secondary indexes for the range queries that are required.
D is not correct because Dataflow is a data pipelining tool to move and transform data, but the use case is centered around querying.

Comment 7

ID: 318533 User: timolo Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Wed 23 Mar 2022 23:11 Selected Answer: - Upvotes: 2

Answer: C
https://cloud.google.com/spanner/docs/secondary-indexes

Comment 8

ID: 249578 User: Nileshk611 Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Tue 21 Dec 2021 19:42 Selected Answer: - Upvotes: 3

Correct: C

Comment 9

ID: 219765 User: arghya13 Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Mon 15 Nov 2021 16:26 Selected Answer: - Upvotes: 2

Correct answers is C

62. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 197

Sequence: 287
Discussion ID: 79647
Source URL: https://www.examtopics.com/discussions/google/view/79647-exam-professional-data-engineer-topic-1-question-197/
Posted By: ducc
Posted At: Sept. 3, 2022, 3:57 a.m.

Question

You are designing a system that requires an ACID-compliant database. You must ensure that the system requires minimal human intervention in case of a failure.
What should you do?

A. Configure a Cloud SQL for MySQL instance with point-in-time recovery enabled.
B. Configure a Cloud SQL for PostgreSQL instance with high availability enabled.
C. Configure a Bigtable instance with more than one cluster.
D. Configure a BigQuery table with a multi-region configuration.

Community Answer Votes

B: 54 most voted
D: 2

Comments 15 comments Click to expand

Comment 1

ID: 713071 User: NicolasN Badges: Highly Voted Relative Date: 3 years, 4 months ago Absolute Date: Mon 07 Nov 2022 14:49 Selected Answer: B Upvotes: 43

We exclude [C[ as non ACID and [D] for being invalid (location is configured on Dataset level, not Table).
Then, let's focus on "minimal human intervention in case of a failure" requirement in order to eliminate one answer among [A] and [B].
Basically, we have to compare point-in-time recovery with high availability. It doesn't matter whether it's about MySQL or PostgreSQL since both databases support those features.
- Point-in-time recovery logs are created automatically, but restoring an instance in case of failure requires manual steps (described here: https://cloud.google.com/sql/docs/mysql/backup-recovery/pitr#perform-pitr)
- High availability, in case of failure requires no human intervention: "If an HA-configured instance becomes unresponsive, Cloud SQL automatically switches to serving data from the standby instance." (from https://cloud.google.com/sql/docs/postgres/high-availability#failover-overview)
So answer [B] wins.

Comment 1.1

ID: 714993 User: Mcloudgirl Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 10 Nov 2022 08:09 Selected Answer: - Upvotes: 2

Your explanation is perfect, thanks

Comment 1.2

ID: 1052282 User: squishy_fishy Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 23 Oct 2023 23:52 Selected Answer: - Upvotes: 2

Will you change your answer if the answer D says dataset instead of table?

Comment 2

ID: 1250594 User: edre Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Thu 18 Jul 2024 19:39 Selected Answer: B Upvotes: 1

Its B because of HA
cant be A because point in time recovery still requires human intervention

Comment 3

ID: 1102912 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 21 Dec 2023 20:44 Selected Answer: B Upvotes: 2

The best option to meet the ACID compliance and minimal human intervention requirements is to configure a Cloud SQL for PostgreSQL instance with high availability enabled.

Key reasons:

Cloud SQL for PostgreSQL provides full ACID compliance, unlike Bigtable which provides only atomicity and consistency guarantees.
Enabling high availability removes the need for manual failover as Cloud SQL will automatically failover to a standby replica if the leader instance goes down.
Point-in-time recovery in MySQL requires manual intervention to restore data if needed.
BigQuery does not provide transactional guarantees required for an ACID database.
Therefore, a Cloud SQL for PostgreSQL instance with high availability meets the ACID and minimal intervention requirements best. The automatic failover will ensure availability and uptime without administrative effort.

Comment 4

ID: 976427 User: [Removed] Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Wed 09 Aug 2023 11:36 Selected Answer: D Upvotes: 2

I vote for D - BigQuery with multi region configuration.
According to https://cloud.google.com/bigquery/docs/introduction , BigQuery support ACID and automatically replicated for high availability.
"""BigQuery stores data using a columnar storage format that is optimized for analytical queries. BigQuery presents data in tables, rows, and columns and provides full support for database transaction semantics (ACID). BigQuery storage is automatically replicated across multiple locations to provide high availability."""

Comment 5

ID: 961460 User: vamgcp Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 24 Jul 2023 12:22 Selected Answer: B Upvotes: 1

Option B

Comment 6

ID: 814121 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Sun 19 Feb 2023 14:35 Selected Answer: - Upvotes: 2

Answer B,
ACID -compliant database are Spanner and CloudSQL
Option A could be the answer if they setup a secondary or failure replicas and auto maintenance window that could trigger in non business hours.
Option B, does not explain about extra replica but in postgresql Highavailablity option means the same extra replicas instances are available for emergency.

Comment 7

ID: 763423 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 02 Jan 2023 02:02 Selected Answer: - Upvotes: 1

B. Configure a Cloud SQL for PostgreSQL instance with high availability enabled.

Comment 8

ID: 725409 User: samirzubair Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 23 Nov 2022 23:05 Selected Answer: - Upvotes: 1

I voted for B

Comment 9

ID: 680477 User: John_Pongthorn Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 27 Sep 2022 09:18 Selected Answer: B Upvotes: 1

B it is exact anwer.

Comment 10

ID: 666700 User: TNT87 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Mon 12 Sep 2022 09:28 Selected Answer: B Upvotes: 2

Ans B
Postgres is highly ACID compliant as compared to Mysql

Comment 11

ID: 665378 User: Remi2021 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 10 Sep 2022 14:05 Selected Answer: B Upvotes: 2

cloud sql with high availability enabled is enough

Comment 12

ID: 658054 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 06:20 Selected Answer: B Upvotes: 1

B. Configure a Cloud SQL for PostgreSQL instance with high availability enabled.

Comment 13

ID: 657972 User: ducc Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 03:57 Selected Answer: B Upvotes: 1

I voted for B

63. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 190

Sequence: 303
Discussion ID: 79609
Source URL: https://www.examtopics.com/discussions/google/view/79609-exam-professional-data-engineer-topic-1-question-190/
Posted By: AWSandeep
Posted At: Sept. 2, 2022, 11:15 p.m.

Question

You are loading CSV files from Cloud Storage to BigQuery. The files have known data quality issues, including mismatched data types, such as STRINGs and
INT64s in the same column, and inconsistent formatting of values such as phone numbers or addresses. You need to create the data pipeline to maintain data quality and perform the required cleansing and transformation. What should you do?

A. Use Data Fusion to transform the data before loading it into BigQuery.
B. Use Data Fusion to convert the CSV files to a self-describing data format, such as AVRO, before loading the data to BigQuery.
C. Load the CSV files into a staging table with the desired schema, perform the transformations with SQL, and then write the results to the final destination table.
D. Create a table with the desired schema, load the CSV files into the table, and perform the transformations in place using SQL.

Community Answer Votes

A: 23 most voted
C: 4

Comments 18 comments Click to expand

Comment 1

ID: 748741 User: saurabhsingh4k Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 10:36 Selected Answer: A Upvotes: 6

I'm kinda inclined towards C as SQL seems a powerful option to treat this kind of use case.

Also, I didn't get how the transformations mentioned on this page will help to clean the data (https://cloud.google.com/data-fusion/docs/concepts/transformation-pushdown#supported_transformations)

But I guess using Wrangler plugin, this kind of stuff can be done on DataFusion, also the question talks about an pipeline, so A is the final choice.

Comment 2

ID: 1102387 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 11:16 Selected Answer: A Upvotes: 3

Data Fusion's advantages:

Visual interface: Offers a user-friendly interface for designing data pipelines without extensive coding, making it accessible to a wider range of users.
Built-in transformations: Includes a wide range of pre-built transformations to handle common data quality issues, such as:
Data type conversions
Data cleansing (e.g., removing invalid characters, correcting formatting)
Data validation (e.g., checking for missing values, enforcing constraints)
Data enrichment (e.g., adding derived fields, joining with other datasets)
Custom transformations: Allows for custom transformations using SQL or Java code for more complex cleaning tasks.
Scalability: Can handle large datasets efficiently, making it suitable for processing CSV files with potential data quality issues.
Integration with BigQuery: Integrates seamlessly with BigQuery, allowing for direct loading of transformed data.

Comment 2.1

ID: 1102388 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 11:16 Selected Answer: - Upvotes: 1

Why other options are less suitable:

B. Converting to AVRO: While AVRO is a self-describing format, it doesn't inherently address data quality issues. Transformations would still be needed, and Data Fusion provides a more comprehensive environment for this.
C. Staging table: Requires manual SQL transformations, which can be time-consuming and error-prone for large datasets with complex data quality issues.
D. Transformations in place: Modifying data directly in the destination table can lead to data loss or corruption if errors occur. It's generally safer to keep raw data intact and perform transformations separately.
By using Data Fusion, you can create a robust and efficient data pipeline that addresses data quality issues upfront, ensuring that only clean and consistent data is loaded into BigQuery for accurate analysis and insights.

Comment 3

ID: 1051814 User: squishy_fishy Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 23 Apr 2024 14:20 Selected Answer: - Upvotes: 4

The answer is C. That is what we do at work. We have landing/staging table, sort table and deliver table,

Comment 3.1

ID: 1051818 User: squishy_fishy Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 23 Apr 2024 14:25 Selected Answer: - Upvotes: 4

Okay, second thought, it is asking for a pipeline, so the answer should be A. At work, we use dataflow inside the composer to build a pipeline injecting data into landing/staging table, then transform/clean data in the sort table, then send the cleaned data to deliver table.

Comment 4

ID: 920836 User: phidelics Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 11 Dec 2023 19:23 Selected Answer: A Upvotes: 4

Keyword: Data Pipeline

Comment 5

ID: 891170 User: mialll Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 08:18 Selected Answer: A Upvotes: 2

same as @saurabhsingh4k

Comment 6

ID: 872217 User: Adswerve Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 17 Oct 2023 01:23 Selected Answer: C Upvotes: 4

C is the right answer. Do ELT in BigQuery. Data Fusion is not the right too for this job.

Comment 7

ID: 814099 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 19 Aug 2023 13:12 Selected Answer: - Upvotes: 4

Answer C,
Datafusion is costly and current transformation is just a cast transformation in a column.
I guess no one wanna pay for datafusion for this little transformation and Staging table processing handles such minor cleaning.

Comment 8

ID: 788172 User: maci_f Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 25 Jul 2023 22:12 Selected Answer: A Upvotes: 4

Data Fusion enables changing the data type directly as shown in this lab: https://www.cloudskillsboost.google/focuses/25335?parent=catalog
Wrangler is the feature to enable that, as already mentioned: https://stackoverflow.com/questions/58699872/google-cloud-data-fusion-how-to-change-datatype-from-string-to-date

Comment 9

ID: 763134 User: Mike422 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 01 Jul 2023 09:59 Selected Answer: - Upvotes: 2

Apparently chatGPT thinks C is the correct answer just sayin (for the same reason that @saurabhsingh4k wrote)

Comment 10

ID: 746987 User: Atnafu Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 16 Jun 2023 09:00 Selected Answer: - Upvotes: 1

A
https://cloud.google.com/data-fusion/docs/concepts/overview#:~:text=The%20Cloud%20Data%20Fusion%20web%20UI%20lets%20you%20to%20build%20scalable%20data%20integration%20solutions%20to%20clean%2C%20prepare%2C%20blend%2C%20transfer%2C%20and%20transform%20data%2C%20without%20having%20to%20manage%20the%20infrastructure.

Comment 11

ID: 725425 User: samirzubair Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 22:37 Selected Answer: - Upvotes: 2

The Correct Ans is C

Comment 11.1

ID: 748196 User: jkhong Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 17 Jun 2023 16:16 Selected Answer: - Upvotes: 1

although this is my preferred answer. this doesn’t satisfy how this becomes a pipeline.

Comment 12

ID: 723527 User: hiromi Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 21 May 2023 13:13 Selected Answer: A Upvotes: 1

Data Fusion

Comment 13

ID: 667696 User: TNT87 Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Mon 13 Mar 2023 09:00 Selected Answer: A Upvotes: 1

Ans A
https://cloud.google.com/data-fusion/docs/concepts/transformation-pushdown#supported_transformations

Comment 14

ID: 657959 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 04:43 Selected Answer: A Upvotes: 1

A is correct for me

Comment 15

ID: 657846 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 00:15 Selected Answer: A Upvotes: 1

A. Use Data Fusion to transform the data before loading it into BigQuery.

64. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 195

Sequence: 306
Discussion ID: 79645
Source URL: https://www.examtopics.com/discussions/google/view/79645-exam-professional-data-engineer-topic-1-question-195/
Posted By: ducc
Posted At: Sept. 3, 2022, 3:56 a.m.

Question

Your company wants to be able to retrieve large result sets of medical information from your current system, which has over 10 TBs in the database, and store the data in new tables for further query. The database must have a low-maintenance architecture and be accessible via SQL. You need to implement a cost-effective solution that can support data analytics for large result sets. What should you do?

A. Use Cloud SQL, but first organize the data into tables. Use JOIN in queries to retrieve data.
B. Use BigQuery as a data warehouse. Set output destinations for caching large queries.
C. Use a MySQL cluster installed on a Compute Engine managed instance group for scalability.
D. Use Cloud Spanner to replicate the data across regions. Normalize the data in a series of tables.

Community Answer Votes

B: 13 most voted

Comments 5 comments Click to expand

Comment 1

ID: 658048 User: AWSandeep Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 07:01 Selected Answer: B Upvotes: 8

B. Use BigQuery as a data warehouse. Set output destinations for caching large queries.

Comment 2

ID: 1102826 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 17:57 Selected Answer: B Upvotes: 3

Option B is the best approach - use BigQuery as a data warehouse, and set output destinations for caching large queries.

The key reasons why BigQuery fits the requirements:

It is a fully managed data warehouse built to scale to handle massive datasets and perform fast SQL analytics
It has a low maintenance architecture with no infrastructure to manage
SQL capabilities allow easy querying of the medical data
Output destinations allow configurable caching for fast retrieval of large result sets
It provides a very cost-effective solution for these large scale analytics use cases
In contrast, Cloud Spanner and Cloud SQL would not scale as cost effectively for 10TB+ data volumes. Self-managed MySQL on Compute Engine also requires more maintenance. Hence, leveraging BigQuery as a fully managed data warehouse is the optimal solution here.

Comment 3

ID: 763421 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 02 Jul 2023 00:58 Selected Answer: - Upvotes: 2

B. Use BigQuery as a data warehouse. Set output destinations for caching large queries. Most Voted

Comment 4

ID: 666721 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Sun 12 Mar 2023 10:57 Selected Answer: - Upvotes: 4

Answer B.
https://cloud.google.com/bigquery/docs/query-overview

Comment 5

ID: 657969 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 04:56 Selected Answer: B Upvotes: 2

B is correct

65. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 165

Sequence: 309
Discussion ID: 79627
Source URL: https://www.examtopics.com/discussions/google/view/79627-exam-professional-data-engineer-topic-1-question-165/
Posted By: ducc
Posted At: Sept. 3, 2022, 1:19 a.m.

Question

You work for a large bank that operates in locations throughout North America. You are setting up a data storage system that will handle bank account transactions. You require ACID compliance and the ability to access data with SQL. Which solution is appropriate?

A. Store transaction data in Cloud Spanner. Enable stale reads to reduce latency.
B. Store transaction in Cloud Spanner. Use locking read-write transactions.
C. Store transaction data in BigQuery. Disabled the query cache to ensure consistency.
D. Store transaction data in Cloud SQL. Use a federated query BigQuery for analysis.

Community Answer Votes

B: 39 most voted
D: 9
C: 1

Comments 24 comments Click to expand

Comment 1

ID: 686487 User: devaid Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Wed 05 Apr 2023 01:31 Selected Answer: B Upvotes: 12

I'd say B as the documentation primarily says ACID compliance for Spanner, not Cloud SQL.
https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
Also, spanner supports read-write transactions for use cases, as handling bank transactions:
https://cloud.google.com/spanner/docs/transactions#read-write_transactions

Comment 1.1

ID: 762849 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 18:44 Selected Answer: - Upvotes: 1

B is right

Comment 1.2

ID: 723175 User: Jay_Krish Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 21 May 2023 05:20 Selected Answer: - Upvotes: 14

I wonder if you understood the meaning of ACID. This is an inherent property of any relational DB. Cloud SQL is fully ACID complaint

Comment 2

ID: 847033 User: juliobs Badges: Highly Voted Relative Date: 2 years, 5 months ago Absolute Date: Fri 22 Sep 2023 12:15 Selected Answer: B Upvotes: 11

"locations throughout North America" implies multi-region (northamerica-northeast1, us-central1, us-south1, us-west4, us-east5, etc.)
Cloud SQL can only do read replicas in other regions.

Comment 2.1

ID: 982866 User: FP77 Badges: - Relative Date: 2 years ago Absolute Date: Fri 16 Feb 2024 21:24 Selected Answer: - Upvotes: 4

Read replicas are enough to make Cloud SQL work as a multi-region service. That's not the point. The point is that the answer introduces the use of Bigquery when it's not needed for the use case. That's why B is right.

Comment 3

ID: 1100915 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 18:59 Selected Answer: B Upvotes: 4

B. Store transaction in Cloud Spanner. Use locking read-write transactions.

Since the banking transaction system requires ACID compliance and SQL access to the data, Cloud Spanner is the most appropriate solution. Unlike Cloud SQL, Cloud Spanner natively provides ACID transactions and horizontal scalability.

Enabling stale reads in Spanner (option A) would reduce data consistency, violating the ACID compliance requirement of banking transactions.

BigQuery (option C) does not natively support ACID transactions or SQL writes which are necessary for a banking transactions system.

Cloud SQL (option D) provides ACID compliance but does not scale horizontally like Cloud Spanner can to handle large transaction volumes.

By using Cloud Spanner and specifically locking read-write transactions, ACID compliance is ensured while providing fast, horizontally scalable SQL processing of banking transactions.

Comment 4

ID: 1096259 User: Aman47 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 09:06 Selected Answer: B Upvotes: 1

Spanner is an enterprise level resource which Banks require, Cloud SQL is limited to 30TB of storage. And Banking transactions should be read write locked.

Comment 5

ID: 1016275 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Mon 25 Mar 2024 04:19 Selected Answer: B Upvotes: 3

ACID Compliance: Cloud Spanner is a globally distributed, strongly consistent database service that offers ACID compliance, making it a suitable choice for handling bank account transactions where data consistency and integrity are crucial.

SQL Access: Cloud Spanner supports SQL queries, which align with your requirement to access data with SQL. You can use standard SQL to interact with the data stored in Cloud Spanner.

Locking Read-Write Transactions: Cloud Spanner allows you to perform locking read-write transactions, ensuring that transactions are executed in a serializable and consistent manner. This is essential for financial transactions to prevent conflicts and maintain data integrity.

Comment 6

ID: 966880 User: NeoNitin Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 08:27 Selected Answer: - Upvotes: 1

B. Store transaction data in Cloud Spanner. Use locking read-write transactions.

Here's why:

ACID Compliance: ACID stands for Atomicity, Consistency, Isolation, and Durability. Cloud Spanner is a fully managed, globally distributed database that provides strong consistency and ACID compliance. This ensures that bank account transactions are processed reliably and accurately, avoiding issues like data corruption or incomplete transactions.

Ability to access data with SQL: Cloud Spanner supports SQL, which allows you to perform standard SQL queries on the data. This means that you can use familiar SQL commands to access, retrieve, and manipulate transaction data easily.

Comment 7

ID: 872177 User: Adswerve Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Mon 16 Oct 2023 23:01 Selected Answer: D Upvotes: 1

I initially selected B. However, it might be D.

https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
Cloud Spanner: Cloud Spanner is an enterprise-grade, globally-distributed, and strongly-consistent database that offers up to 99.999% availability, built specifically to combine the benefits of relational database structure with non-relational horizontal scale. It is a unique database that combines ACID transactions, SQL queries, and relational structure with the scalability that you typically associate with non-relational or NoSQL databases. As a result, Spanner is best used for applications such as gaming, payment solutions, global financial ledgers, retail banking and inventory management that require ability to scale limitlessly with strong-consistency and high-availability.

Comment 8

ID: 820972 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 24 Aug 2023 21:38 Selected Answer: - Upvotes: 2

Answer B:
locking read-write = for data accuracy
state read = for speed up or latency

Comment 9

ID: 813394 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 18:43 Selected Answer: - Upvotes: 2

Answer B: Spanner
It's incomplete question, what do you assume by large bank, until we are not sure about size and scale. Region is north america, that can be managed by cloud sql. but
i am going for spanner, as its large bank and transaction data.

Comment 10

ID: 799296 User: cajica Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 06 Aug 2023 00:54 Selected Answer: - Upvotes: 5

This is definitely a tricky question because both B and D are "appropriate" as the question suggests, of course we can make assumptions with the "large bank" sentence but there are other questions here where making assumptions is not accepted by the community so I wonder when can we make assumptions and when we can't. I think the real problem here is the ambiguous question. This is one of the few questions where the community accept that both (B and D) answers are appropriate but some comments (and I agree) argue the BEST approach is B. I really think some questions can be written in a better and non-ambiguous way, it's just about thinking a little bit more and not conforming when a poor spelling.

Comment 11

ID: 747867 User: jkhong Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 17 Jun 2023 08:16 Selected Answer: B Upvotes: 6

The question is hinting a requirement for global consistency, i.e. being available for NA region, which does not just include US but also Mexico, Argentina etc.

Large bank = priority over consistency over read-write

Comment 11.1

ID: 789242 User: desertlotus1211 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 01:23 Selected Answer: - Upvotes: 5

Argentina is South America...

Comment 11.2

ID: 1012303 User: ckanaar Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Wed 20 Mar 2024 15:26 Selected Answer: - Upvotes: 1

Good catch, definetely Spanner in that case.

Comment 12

ID: 737688 User: NicolasN Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 09:45 Selected Answer: B Upvotes: 1

Finally, it's [B].
There is no measurable requirement that rules out [D] (Cloud SQL) and this fact made me to select it as a preferrable answer.
But since we are talking about a large bank (which normally implies massive reads/writes per sec.) and nobody has posed any cost limitation, in a real case I would definitely prefer the advantages of Spanner.

Comment 13

ID: 714338 User: NicolasN Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 09 May 2023 06:40 Selected Answer: D Upvotes: 2

[A] No - Stale reads not accepted for bank account transactions. "A stale read is read at a timestamp in the past. If your application is latency sensitive but tolerant of stale data, then stale reads can provide performance benefits."
[B] Yes - Fulfills all requirements
[C] No - BigQuery is ACID-compliant, but it is too much to use it for such a case (mainly a CRUD app)
[D] Yes+ - Fulfills all requirements. The BigQuery part may seem redundant, but it states a true fact that doesn't violate the "access data with SQL" requirement.

So, when SQL Cloud and SQL Spanner fit both, there is no reason to prefer the second.
And the question doesn't mention any obvious fact for which should we prefer the expensive SQL Spanner:
- We don't know if we have to deal with a big amount of data and thousands of writes per second.
- We don't know the database size.
- There is no need for multi-regional writes that would exclude SQL Cloud as an alternative. Is it a coincidence that the question limits the problem to the single region of North America?

Comment 13.1

ID: 737687 User: NicolasN Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 07 Jun 2023 09:44 Selected Answer: - Upvotes: 1

I changed my mind to [B] since I underestimated the given of a "large bank" where the cost difference for a single region Spanner wouldn't matter.

Comment 13.2

ID: 830849 User: SuperVee Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 06 Sep 2023 13:44 Selected Answer: - Upvotes: 1

Also, correct me if I am wrong, Bigquery cannot query Cloud SQL directly, only when Cloud SQL is exported into GCS, then BQ can connect to GCS using federated queries.

Comment 14

ID: 712673 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 22:13 Selected Answer: B Upvotes: 1

I'd go for B.
The only other somewhat valid option is D, but there's no requirement for analytics in the question.

Comment 15

ID: 701581 User: mattab1627 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 22 Apr 2023 16:30 Selected Answer: - Upvotes: 1

Surely its B, transactional data at a large US based bank would surely be massive in size and probably too much for CloudSQL? There is also no mention of a requirement for analytics

Comment 16

ID: 664747 User: MounicaN Badges: - Relative Date: 3 years ago Absolute Date: Thu 09 Mar 2023 18:41 Selected Answer: - Upvotes: 2

why not spanner?

Comment 17

ID: 662282 User: pluiedust Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 12:09 Selected Answer: D Upvotes: 1

D is correct

66. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 74

Sequence: 321
Discussion ID: 79320
Source URL: https://www.examtopics.com/discussions/google/view/79320-exam-professional-data-engineer-topic-1-question-74/
Posted By: YorelNation
Posted At: Sept. 2, 2022, 9:42 a.m.

Question

Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?

A. Cloud Bigtable
B. Google BigQuery
C. Google Cloud Storage
D. Google Cloud Datastore

Community Answer Votes

A: 19 most voted
C: 7

Comments 23 comments Click to expand

Comment 1

ID: 1142955 User: philli1011 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 07 Feb 2024 05:58 Selected Answer: - Upvotes: 4

Every time you hear financial, time series, fast reads and write data, Any of that combinations, think Big Table first.
So A.

Comment 2

ID: 960231 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 10:26 Selected Answer: A Upvotes: 2

At first I thought that GCS was the answer but the question does mention that the data is updated frequently. Thereby, it has to be BigTable since we talk about a large amount of data, a streaming application and many individual updates. Storing the data in BigQuery and having to make individual updates doesn't make sense, and neither does running Apache jobs.

If the requirement for updates was not there I would not see any issue with GCS. GCS could serve as a replacement to HDFS and run Hadoop jobs from Dataproc.

Comment 3

ID: 927168 User: KC_go_reply Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Mon 19 Jun 2023 07:28 Selected Answer: A Upvotes: 1

This scenario screams for BigTable.

It's not B) BigQuery or C) Cloud Storage because both aren't supposed to contain data that is updated frequently. Then, we have to decide between A) BigTable and D) Datastore.

It is A) BigTable because
- it is the most suited for real-time / high-frequency updates
- it is similar to HBase, which is commonly used in Hadoop ecosystem stacks to store streaming / time-series data.

Comment 4

ID: 906478 User: AmmarFasih Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 25 May 2023 09:28 Selected Answer: A Upvotes: 1

Many here also selected Cloud Storage. But the way I see it BigTable is specifically for low latency, high throughput, mission critical streaming data (financial data is one of them). Also the mentioning of Hadoop that points to HBase functionality if BigTable clarifies the choice more.

Comment 5

ID: 885962 User: Hisayuki Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 01 May 2023 09:26 Selected Answer: A Upvotes: 2

BigTable - a No-SQL database but does not support SQL Querying
Apache HBase - Based on Google's BigTable on top of HDFS and you can migrate Hadoop Apps to Cloud BigTable with the HBase API

Comment 6

ID: 869863 User: izekc Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 14 Apr 2023 02:44 Selected Answer: A Upvotes: 1

A. time series data

Comment 7

ID: 824363 User: midgoo Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 04:44 Selected Answer: A Upvotes: 1

Please note that there is Connector for Bigtable for Hadoop
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-bigtable

Comment 8

ID: 786031 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 02:22 Selected Answer: - Upvotes: 1

Why not Biquery?

Google BigQuery would be the best option for storing and analyzing large amounts of financial time-series data that is frequently updated and streamed in real-time. It is a fully managed, cloud-native data warehouse that allows you to analyze large datasets using SQL-like queries, and it can handle streaming data as well as batch data. Additionally, it can easily integrate with Apache Hadoop to allow your company to run their existing Hadoop jobs in the cloud and gain insights into the data.

Comment 8.1

ID: 786032 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 02:22 Selected Answer: - Upvotes: 2

A. Google Bigtable is a fully managed, NoSQL, wide-column database that is designed for large scale, low-latency workloads. It is well suited for use cases such as real-time analytics, IoT, and gaming, but it may not be the best fit for storing and analyzing large amounts of financial time-series data that is frequently updated and streamed in real-time. It lacks built-in support for SQL-like queries, which is a standard way of analyzing data in Data Warehousing and Business Intelligence. It is more focused on handling high-performance low-latency workloads, while BigQuery is focused on providing an easy and cost-effective way to analyze large amounts of data using SQL-like queries. Additionally, Bigtable doesn't provide built-in support for running Apache Hadoop jobs, and it would require additional work to integrate it with Hadoop and set it up for data warehousing and Business Intelligence use cases.

Comment 8.1.1

ID: 786033 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 02:23 Selected Answer: - Upvotes: 2

C. Google Cloud Storage is an object storage service that allows you to store and retrieve large amounts of unstructured data, such as video, audio, images and other files. It is not a data warehouse and does not provide built-in support for SQL-like queries, which is a standard way of analyzing data in Data Warehousing and Business Intelligence. It would not be suitable for storing and analyzing large amounts of financial time-series data that is frequently updated and streamed in real-time.

D. Google Cloud Datastore is a fully-managed, NoSQL document database that allows you to store, retrieve, and query data. It is not a data warehouse and does not provide built-in support for SQL-like queries, which is a standard way of analyzing data in Data Warehousing and Business Intelligence. It would not be suitable for storing and analyzing large amounts of financial time-series data that is frequently updated and streamed in real-time.

Comment 8.1.1.1

ID: 786034 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 24 Jan 2023 02:23 Selected Answer: - Upvotes: 1

Can someone clarify why Bigtable and Not Bigquery? Super Confused.

Comment 8.1.1.1.1

ID: 880089 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 09:46 Selected Answer: - Upvotes: 1

Yes, it is possible to analyze data in Bigtable. Bigtable is a distributed NoSQL database that is designed to handle large volumes of structured data with high read and write throughput. While Bigtable itself does not provide analysis tools, it is often used in combination with other tools and technologies to perform analysis on the stored data.

Comment 9

ID: 779219 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 17 Jan 2023 19:45 Selected Answer: - Upvotes: 2

https://cloud.google.com/bigtable/docs/schema-design-time-series

Comment 10

ID: 723438 User: Yazar97 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Mon 21 Nov 2022 12:54 Selected Answer: - Upvotes: 3

Time series data = Bigtable... So it's A

Comment 11

ID: 722947 User: Jay_Krish Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 20 Nov 2022 20:47 Selected Answer: A Upvotes: 1

Option A seems right

Comment 12

ID: 722844 User: drunk_goat82 Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 20 Nov 2022 18:50 Selected Answer: A Upvotes: 1

Big Table has a HBase compliant API and is transactional unlike GCS.

Comment 13

ID: 720453 User: solar_maker Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 17 Nov 2022 13:31 Selected Answer: A Upvotes: 1

BigTable can take in data from dataproc, spark and hadoop
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-bigtable#using_with

Comment 14

ID: 712546 User: cloudmon Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 06 Nov 2022 19:08 Selected Answer: C Upvotes: 3

It must be C because of the existing Hadoop jobs

Comment 14.1

ID: 714135 User: cloudmon Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Tue 08 Nov 2022 22:56 Selected Answer: - Upvotes: 6

On 2nd thought, it’s Bigtable: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-bigtable

Comment 15

ID: 686447 User: pluiedust Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Tue 04 Oct 2022 23:43 Selected Answer: C Upvotes: 2

I think it is C

Comment 16

ID: 681829 User: maia01 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 28 Sep 2022 17:02 Selected Answer: C Upvotes: 2

Use Datarproc with Cloud Storage in combo with HDFS
https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs

Comment 16.1

ID: 945684 User: euro202 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 07 Jul 2023 14:52 Selected Answer: - Upvotes: 1

Answer is A: Hadoop doesn't mean Dataproc + HDFS. This scenario is about time series that is a use-case for BigTable. Coincidentally BigTable is the best solution for migration of HBase...

Comment 17

ID: 658398 User: AWSandeep Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Sat 03 Sep 2022 13:52 Selected Answer: A Upvotes: 4

A. Cloud Bigtable

67. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 79

Sequence: 322
Discussion ID: 17118
Source URL: https://www.examtopics.com/discussions/google/view/17118-exam-professional-data-engineer-topic-1-question-79/
Posted By: -
Posted At: March 21, 2020, 6:17 p.m.

Question

Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud
Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?

A. Increase the CPU size on your server.
B. Increase the size of the Google Persistent Disk on your server.
C. Increase your network bandwidth from your datacenter to GCP.
D. Increase your network bandwidth from Compute Engine to Cloud Storage.

Community Answer Votes

C: 6 most voted
A: 3

Comments 21 comments Click to expand

Comment 1

ID: 960294 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 12:28 Selected Answer: C Upvotes: 1

We are talking about transfer speed. Network transfer speed does not increase with CPU, but with bandwidth. Since there is no other extra information about what the issue, we have to assume that they imply network transfer speed.

Comment 2

ID: 899444 User: Kiroo Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Thu 16 Nov 2023 20:36 Selected Answer: C Upvotes: 3

To be honest this question is incomplete, I would go increasing the bandwidth, but first I would analyze why it’s taking long time maybe I’m uploading many files so I could compress and agregate then and upload just one, maybe the target cpu is overloaded at the time of the upload, maybe the target disk reaching the max iops,

Comment 3

ID: 888775 User: Jarek7 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 20:13 Selected Answer: C Upvotes: 1

Even if transfer server is deployed on the slowest machine available in GCP there is no way it is bottleneck for simple data transfer without any data processing.

Comment 4

ID: 882567 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 27 Oct 2023 13:15 Selected Answer: A Upvotes: 1

GPT: Option A, increasing the CPU size on the data transfer server, could potentially increase the transfer speeds if the bottleneck in the data transfer process is the processing power of the server. By increasing the CPU size, the server may be able to process data more quickly, leading to faster transfers.
Option C, increasing the network bandwidth from the datacenter to GCP, could potentially improve the transfer speeds, but it may not be feasible or cost-effective depending on the current infrastructure and network limitations.

Comment 4.1

ID: 888776 User: Jarek7 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 20:14 Selected Answer: - Upvotes: 4

Please stop using GPT as knowledge source. v3.5 is usually wrong even in simple cases. v4 is much better, but it is not designed to be knowledge source. Looking at the answer you must have used v3.5. The question says nothing about cost-effectivness. The issue is data transfer. No any data processing is done on the data while it is transferred. Simple transfer doesn't need much processing power - the real bottleneck even on slowest machines available on GCP must be data transfer - it is obvious.
BTW for me GPT3.5 said it is C.

Comment 4.1.1

ID: 889929 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 05 Nov 2023 12:54 Selected Answer: - Upvotes: 1

yea, i know it can make mistakes. Thank you.
That`s why i always mark "GPT" at the start of my answer.

Comment 4.1.2

ID: 889932 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sun 05 Nov 2023 12:59 Selected Answer: - Upvotes: 1

it should be C, for real, bcz nothing said about cost restrictions in the question. And the user "snamburi3 " found docs.

Comment 5

ID: 879763 User: izekc Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 25 Oct 2023 01:14 Selected Answer: A Upvotes: 1

it's refer to data transfer server slow here. not transfer data to cloud slow.
100% A

Comment 6

ID: 829038 User: jonathanthezombieboy Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 04 Sep 2023 14:27 Selected Answer: C Upvotes: 1

Answer is C

Comment 7

ID: 825475 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 01 Sep 2023 02:08 Selected Answer: - Upvotes: 1

This question makes people confused only. there is no refer to network or size of data or something could be referred. the answer could be A or C

Comment 8

ID: 816553 User: mahdiaqim Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 12:31 Selected Answer: A Upvotes: 1

Very confusing question. I selected A because I assume increasing the CPU size on the cloud server is easier to change, as a data engineer, than the bandwidth.

Comment 9

ID: 789753 User: samdhimal Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 15:57 Selected Answer: - Upvotes: 3

C. Increase your network bandwidth from your datacenter to GCP.
This will likely have the most impact on transfer speeds as it addresses the bottleneck in the transfer between your data center and GCP. Increasing the CPU size or the size of the Google Persistent Disk on the server may help with processing the data once it has been transferred, but will not address the bottleneck in the transfer itself. Increasing the network bandwidth from Compute Engine to Cloud Storage would also help with processing the data once it has been transferred but will not address the bottleneck in the transfer itself as well.

Comment 10

ID: 692767 User: Nirca Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Wed 12 Apr 2023 08:31 Selected Answer: - Upvotes: 1

A bit unprofessional question, having performance issues should be addressed by analyzing and looking for saturation in the system and understanding "wait-events". Only than adding more resources.

Comment 11

ID: 624740 User: rr4444 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 29 Dec 2022 18:25 Selected Answer: - Upvotes: 3

"The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. "

This makes zero sense. Is it to GCS or GCE? Question had to make up its mind. Nonsense, literally.

Comment 12

ID: 477357 User: Thierry_1 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Fri 13 May 2022 08:59 Selected Answer: - Upvotes: 1

Vote for C. Mostly because other options seems useless

Comment 13

ID: 394277 User: sumanshu Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 30 Dec 2021 03:04 Selected Answer: - Upvotes: 3

Vote for 'C'

You want to maximize transfer speeds.

Comment 14

ID: 287763 User: Tolberic Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Tue 10 Aug 2021 17:51 Selected Answer: - Upvotes: 2

https://cloud.google.com/compute/docs/machine-types#n1_machine_types

Comment 15

ID: 222029 User: snamburi3 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Tue 18 May 2021 15:51 Selected Answer: - Upvotes: 1

"The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP."
This is confusing, as the upload is through a data transfer server on GCP - not directly to Storage. In this case, maybe A?

Comment 15.1

ID: 222031 User: snamburi3 Badges: - Relative Date: 4 years, 9 months ago Absolute Date: Tue 18 May 2021 15:55 Selected Answer: - Upvotes: 4

I am going with C as I found a doc: https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets#increasing_network_bandwidth. still a confusing question...

Comment 16

ID: 185414 User: SureshKotla Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Tue 23 Mar 2021 17:48 Selected Answer: - Upvotes: 1

I guess option A because, increasing processing power for parallel uploads is the first thing to try. And if it doesn't work, then go to bandwidth issues.

Comment 17

ID: 138627 User: bbozz_ Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Tue 19 Jan 2021 14:25 Selected Answer: - Upvotes: 3

Answer: CA

Google Professional Data Engineer Storage and Data Modeling

Question

Suggested Answer

Community Answer Votes

Question

Suggested Answer

Community Answer Votes

Question

Suggested Answer

Community Answer Votes

Question

Suggested Answer

Community Answer Votes

Question

Suggested Answer

Community Answer Votes