Google Professional Data Engineer Machine Learning and AI

Use for TensorFlow, Vertex AI, AutoML, Dialogflow, model training and evaluation, feature engineering, and ML-oriented product decisions.

Exam: PROFESSIONAL-DATA-ENGINEER
Questions: 21
Comments: 374

1. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 14

Sequence: 10
Discussion ID: 16281
Source URL: https://www.examtopics.com/discussions/google/view/16281-exam-professional-data-engineer-topic-1-question-14/
Posted By: jvg637
Posted At: March 11, 2020, 6:24 p.m.

Question

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

  • A. There are very few occurrences of mutations relative to normal samples.
  • B. There are roughly equal occurrences of both normal and mutated samples in the database.
  • C. You expect future mutations to have different features from the mutated samples in the database.
  • D. You expect future mutations to have similar features to the mutated samples in the database.
  • E. You already have labels for which samples are mutated and which are normal in the database.

Suggested Answer

AC


Comments: 21

Comment 1

ID: 64229 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Sun 15 Mar 2020 12:28 Selected Answer: - Upvotes: 74

I think that AD makes more sense. D is the explanation you gave. For the rest, A makes more sense: any anomaly detection algorithm assumes a priori that you have many more "normal" samples than mutated ones, so that you can model normal patterns and detect patterns that are "off" from that normal pattern. For that you will always need the number of normal samples to be much bigger than the number of mutated samples.

Comment 1.1

ID: 494902 User: BigQuery Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Mon 06 Dec 2021 05:55 Selected Answer: - Upvotes: 23

Guys, it's A & C.
Anomaly detection has two basic assumptions:
-> Anomalies only occur very rarely in the data. (A)
-> Their features differ from the normal instances significantly. (C)

link -> https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Unsupervised%20Anomaly%20Detection%20for%20Univariate%20%26%20Multivariate%20Data.&text=Anomaly%20detection%20has%20two%20basic,from%20the%20normal%20instances%20significantly.

Comment 1.1.1

ID: 502480 User: szefco Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Wed 15 Dec 2021 21:59 Selected Answer: - Upvotes: 29

I don't agree with C. Anomaly detection assumes "Their features differ from the NORMAL INSTANCES significantly", and in option C you have:
"You expect future mutations to have different features from the MUTATED SAMPLES IN THE DATABASE".

IMHO Answer D fits better: "D. You expect future mutations to have similar features to the mutated samples in the database." - in other words: Expect future anomalies to be similar to the anomalies that we already have in database

Comment 1.2

ID: 1326075 User: AmitK121981 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 13 Dec 2024 11:03 Selected Answer: - Upvotes: 1

As per ChatGPT, it can be different (C) - that's how unsupervised anomaly detection works - as long as they are different from "normal" tissues, they would be detected.

Comment 2

ID: 62579 User: jvg637 Badges: Highly Voted Relative Date: 6 years ago Absolute Date: Wed 11 Mar 2020 18:24 Selected Answer: - Upvotes: 21

A instead of B:
"anomaly detection (also outlier detection[1]) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data."

Comment 3

ID: 1705788 User: assiafella Badges: Most Recent Relative Date: 2 months ago Absolute Date: Sat 10 Jan 2026 18:13 Selected Answer: AC Upvotes: 1

I think it's A & C: the anomaly detection algorithm trains on "normal" data and anything that deviates from that is classified as "mutated", so there are no common characteristics or patterns for mutated data; this is why statement C is correct.

Comment 4

ID: 1561837 User: vosang5299 Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Sat 19 Apr 2025 05:58 Selected Answer: AD Upvotes: 1

A. There are very few occurrences of mutations relative to normal samples. This is a strong characteristic supporting unsupervised anomaly detection. Anomaly detection methods often work well when the "normal" class is the majority and anomalies are rare outliers. The mutated samples, being few, would likely appear as anomalies relative to the cluster of normal samples.
D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic supports unsupervised anomaly detection. If future mutations are expected to share similar characteristics with the existing mutated samples (even if they are rare), an anomaly detection method could potentially learn the pattern of the normal samples and flag anything significantly different (including the mutated samples) as an anomaly.

Comment 5

ID: 1398847 User: Parandhaman_Margan Badges: - Relative Date: 12 months ago Absolute Date: Sat 15 Mar 2025 14:43 Selected Answer: AC Upvotes: 2

Unsupervised anomaly detection. Its characteristics are few anomalies and future mutations that differ. So A and C.

Comment 6

ID: 1346613 User: LP_PDE Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 25 Jan 2025 20:12 Selected Answer: AC Upvotes: 3

AC. Unsupervised anomaly detection methods are particularly useful when you don't have labeled examples of the anomalies you're trying to detect. Why not D? If future mutations are similar to existing ones, a supervised model trained on labeled examples of known mutations would likely be more accurate in classifying new samples.

Comment 7

ID: 1342431 User: cqrm3n Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 07:13 Selected Answer: AC Upvotes: 2

The answer should be A and C.

Unsupervised anomaly detection is useful when labels are unavailable, or when anomalies are rare and distinct.

Hence A is definitely correct because anomaly detection excels when anomalies are rare compared to normal data.

I think C is correct because by adding new mutation data that is similar to the existing mutation data, the model will learn in a broader sense what constitutes a 'mutation', which leads to better generalization. If the new data is too similar to the existing mutation data (answer D), the model might overfit to those specific examples. However, the new data should still share some fundamental characteristics with the existing data so that the model can recognize them as belonging to the same anomaly category.

Comment 8

ID: 1331265 User: kumar34 Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Wed 25 Dec 2024 00:07 Selected Answer: AC Upvotes: 2

I think it's A & C. For an anomaly detection model, the ratio of normal to abnormal samples is expected to be high. C because the model is expected to be adaptive, meaning it detects abnormal features that can differ from the abnormal features it is currently trained on.

Comment 9

ID: 1301180 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 19:04 Selected Answer: AC Upvotes: 2

The keyword is unsupervised anomaly detection, so A is correct. We assume, and should ensure, that the majority of the data represents 'normal'. Unsupervised methods are good at detecting unknown patterns, thus C could be correct.

Comment 9.1

ID: 1301201 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 19:25 Selected Answer: - Upvotes: 1

I correct my answer: AD should be better. An unsupervised method is usually used for grouping the data. So, if the future mutations have similar features to the mutated samples, our trained model should group them into the anomalies even though no label exists.

Comment 10

ID: 503327 User: hendrixlives Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: AD Upvotes: 4

AD: to use unsupervised anomaly detection, the anomalies a) must be rare and b) must differ from the NORMAL samples. So...
A: mutated samples must be scarce compared to normal tissue.
D: yes, we expect the future mutated samples to have similar features to the mutated samples currently in the database.
Why not C? If I train my model on mutated samples with specific characteristics, I do not expect it to find different mutations. In the future, when new mutations appear, I would retrain my model including those new samples.

Comment 11

ID: 528611 User: MaxNRG Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: AD Upvotes: 4

Anomaly detection has two basic assumptions:
*Anomalies only occur very rarely in the data.
*Their features differ from the normal instances significantly.
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called “normal” instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection.
The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them.
https://www.science.gov/topicpages/u/unsupervised+anomaly+detection
A because “Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal”, B is for Supervised anomaly detection https://en.wikipedia.org/wiki/Anomaly_detection

Comment 12

ID: 719602 User: gudiking Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: AD Upvotes: 2

A - anomaly detection is used for detecting rare events, meaning it is expected that there are much less of those than of normal ones.
D - you expect the future mutations to be similar to the mutations you already have, so that you can detect them (pattern recognition)

Comment 13

ID: 741533 User: jkhong Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: AD Upvotes: 4

A makes sense

C and D compare future mutations to mutated samples in the database.

The question is pretty badly worded… If we were to run a full unsupervised anomaly detection over the entire dataset, C and D will be true, since some future mutations may be similar to current mutations and some will be significantly different to current mutations.

The question is suggesting "labelling" tissue samples using unsupervised anomaly detection, and subsequently using the labels with a supervised algorithm to classify future samples. If this interpretation of the question is correct, then D makes sense

Comment 14

ID: 768244 User: korntewin Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: AD Upvotes: 2

The answer should be AD.

A: anomalies should be few; if there are many such samples then we should do classification instead, because an unsupervised method will give a lot of false positives.

D: the future anomalies should come from the same distribution as the present anomalies, or else our anomaly detection will not generalize to future features.

Comment 15

ID: 799159 User: samdhimal Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:22 Selected Answer: - Upvotes: 2

A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated.

D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.

Comment 16

ID: 949537 User: azmiozgen Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:21 Selected Answer: AD Upvotes: 5

D should be correct. You expect future samples will correlate with the training samples. That's the whole point of learning procedure. If you do not expect that they have similar features, then why would you use features in the training samples in the first place? A is also correct, since anomaly labels would be seen rarely.

Comment 17

ID: 1062185 User: rocky48 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:21 Selected Answer: AD Upvotes: 2

A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated.

D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.
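The two assumptions debated throughout this thread (anomalies are rare, and they differ from the bulk of the data) can be illustrated with a toy z-score detector. Everything below is a hypothetical sketch, not exam material: the 2-D feature values, sample counts, and 3-sigma threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption A: anomalies are rare -- 990 "normal" samples, 10 "mutated".
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
mutated = rng.normal(loc=6.0, scale=1.0, size=(10, 2))
samples = np.vstack([normal, mutated])

# Unsupervised: fit statistics on the unlabeled data. Because anomalies
# are rare, the mean and std are dominated by the normal cluster.
mu = samples.mean(axis=0)
sigma = samples.std(axis=0)

# Flag anything far from the bulk of the data as an anomaly.
z = np.abs((samples - mu) / sigma).max(axis=1)
flagged = z > 3.0

print(flagged.sum())  # roughly the 10 mutated samples, plus a few normal tails
```

If the dataset were balanced (option B), the fitted mean and std would sit between the two clusters and neither would look anomalous, which is exactly why A is one of the supporting characteristics.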

2. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 91

Sequence: 22
Discussion ID: 17256
Source URL: https://www.examtopics.com/discussions/google/view/17256-exam-professional-data-engineer-topic-1-question-91/
Posted By: -
Posted At: March 22, 2020, 4:37 p.m.

Question

You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether these applications have defaulted. You have been asked to train a model to predict default rates for credit applicants.
What should you do?

  • A. Increase the size of the dataset by collecting additional data.
  • B. Train a linear regression to predict a credit default risk score.
  • C. Remove the bias from the data and collect applications that have been declined loans.
  • D. Match loan applicants with their social profiles to enable feature engineering.

Suggested Answer

B


Comments: 25

Comment 1

ID: 171799 User: GHN74 Badges: Highly Voted Relative Date: 5 years, 6 months ago Absolute Date: Wed 02 Sep 2020 08:55 Selected Answer: - Upvotes: 40

A is incorrect, as you need to work with the data you have available.
C is an optimisation, not a solution.
D is ethically incorrect and an invasion of privacy; there could be several legal implications.
B, although oversimplified, is a workable solution.

Comment 1.1

ID: 457000 User: sergio6 Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Mon 04 Oct 2021 11:38 Selected Answer: - Upvotes: 1

Information in social profiles is public

Comment 1.1.1

ID: 457002 User: sergio6 Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Mon 04 Oct 2021 11:40 Selected Answer: - Upvotes: 1

according to the privacy settings and shareable information

Comment 2

ID: 329871 User: sumanshu Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Tue 06 Apr 2021 20:58 Selected Answer: - Upvotes: 21

We have labelled data that indicates whether a loan application was accepted or defaulted - so classification-problem data.

We need to predict default rates for applicants - I think whether an application will be granted or will default - so binary classification.

No option matches that answer. If we mark 'B', it should be Logistic Regression instead of Linear Regression.

Comment 2.1

ID: 504131 User: szefco Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Sat 18 Dec 2021 10:24 Selected Answer: - Upvotes: 23

Question says: "to predict default RATES for credit applicants".
It is not binary classification, so Linear Regression would work here.
I think B is the correct answer.

Comment 2.1.1

ID: 900636 User: cchen8181 Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 18 May 2023 02:59 Selected Answer: - Upvotes: 1

Correct approach is to use logistic regression to predict default/not default, and then take the confidence/probability of the outcome as the "default rate". Linear regression doesn't make sense since we are not given a default rate label in our data, we are just given the labels default vs no default.

Comment 2.1.2

ID: 832204 User: Aaronn14 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 19:52 Selected Answer: - Upvotes: 2

You cannot predict a rate; you predict a realization, which is either default or not. This question is terribly written.
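The logistic-vs-linear point in this sub-thread can be made concrete with a toy sketch: train a logistic model on binary default/no-default labels, then read its output as a default probability (the "rate"). The single feature, data-generating process, and hyperparameters below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one feature (think of a debt-to-income ratio in [0, 1])
# and a binary label (1 = defaulted, 0 = repaid). Purely synthetic.
n = 500
x = rng.uniform(0, 1, size=n)
y = (rng.uniform(0, 1, size=n) < x).astype(float)  # higher ratio -> more defaults

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit w, b by gradient descent on the logistic (cross-entropy) loss.
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    p = sigmoid(w * x + b)
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

# The model outputs a probability of default -- the "rate" the question
# asks about -- rather than a hard default/no-default class.
p_high = float(sigmoid(w * 0.9 + b))  # risky applicant: high probability
p_low = float(sigmoid(w * 0.1 + b))   # safe applicant: low probability
print(p_high, p_low)
```

This is the approach cchen8181 describes: classify with logistic regression, then use the predicted probability as the score; linear regression has no label to regress against here.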

Comment 3

ID: 1699383 User: 50336e5 Badges: Most Recent Relative Date: 2 months, 4 weeks ago Absolute Date: Sun 14 Dec 2025 16:02 Selected Answer: C Upvotes: 1

It's not B because it's not a linear regression problem, it's classification!

Comment 4

ID: 1338848 User: Skyw4lk3r Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Fri 10 Jan 2025 16:08 Selected Answer: A Upvotes: 1

A. Because the dataset only covers the business's granted loans, a greater dataset is needed to train the model in order to get acceptable results.

Comment 5

ID: 1297319 User: nairoh Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 14 Oct 2024 10:01 Selected Answer: - Upvotes: 1

"Social" does not mean social media. It could be linked to demographic data, so it could improve the score.

Comment 6

ID: 1102412 User: TVH_Data_Engineer Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 21 Dec 2023 13:06 Selected Answer: B Upvotes: 5

To predict default rates for credit applicants using the labeled dataset of granted loan applications, the most appropriate course of action would be:

B. Train a linear regression to predict a credit default risk score.

Here's the rationale for this approach:

Appropriate Model for Prediction: Linear regression is a common statistical method used for predictive modeling, particularly when the outcome variable (in this case, the likelihood of default) is continuous. In the context of credit scoring, linear regression can be used to predict a risk score that represents the probability of default.

Utilization of Labeled Data: Since you already have a labeled dataset containing information on loans that have been granted and whether they have defaulted, you can use this data to train the regression model. This historical data provides the model with examples of borrower characteristics and their corresponding default outcomes.

Comment 7

ID: 1087644 User: rocky48 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 13:56 Selected Answer: B Upvotes: 3

B. Train a linear regression to predict a credit default risk score.

Comment 8

ID: 984141 User: gaurav0480 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 05:10 Selected Answer: - Upvotes: 1

What would be the target variable if B is correct, i.e., training a linear regression model? Default/No-Default is a categorical variable; one cannot train a linear regression model with this target variable.

Comment 9

ID: 965539 User: FDS1993 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Fri 28 Jul 2023 13:42 Selected Answer: C Upvotes: 1

C - it is a typical approach in credit lending.
Keeping only the accepted loans leads to bias in the model.

Comment 10

ID: 960319 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 23 Jul 2023 12:04 Selected Answer: B Upvotes: 2

Linear regression is not a good way to solve such a problem, but you can totally apply linear regression to a classification problem: just set the labels to numeric values 0 and 1, and linear regression will predict a value in between; round to the closest label (0 or 1).

Totally not the way to go about it, but it's actually possible.
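Mathew106's 0/1-labels trick can be demonstrated on toy data. This sketch, with an invented single feature and a deliberately separable label, is an illustration of the hack, not an endorsement:

```python
import numpy as np

rng = np.random.default_rng(7)

# Binary labels encoded as 0/1, fit with ordinary least squares,
# then threshold the continuous prediction at 0.5.
n = 200
x = rng.uniform(0, 1, size=n)
y = (x > 0.5).astype(float)  # toy label, perfectly separable on purpose

# Least-squares fit of y ≈ w*x + b.
A = np.column_stack([x, np.ones(n)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# Turn the regression output into a class by rounding at 0.5.
pred = (w * x + b > 0.5).astype(float)
accuracy = (pred == y).mean()
print(accuracy)
```

It works on this easy example, but unlike logistic regression the raw output w*x + b is not a probability (it can fall outside [0, 1]), which is why the thread leans toward logistic regression for default prediction.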

Comment 11

ID: 849572 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Fri 24 Mar 2023 20:11 Selected Answer: D Upvotes: 1

Cannot be B. This is logistic regression, not linear regression.

D is the only acceptable option.
A social profile can include things like high or low income, for example.
When you apply for credit you usually have to give this information, so it's totally legal.

Comment 11.1

ID: 880468 User: Oleksandr0501 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 16:19 Selected Answer: - Upvotes: 1

D. Matching loan applicants with their social profiles to enable feature engineering is not recommended as it raises privacy concerns and may not be legal in some jurisdictions. Additionally, social profiles may not be a good indicator of creditworthiness, and relying on them may introduce bias or discrimination.

Comment 11.2

ID: 1288929 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 10:11 Selected Answer: - Upvotes: 1

Credit scores are numbers, so this is a regression. Whether or not a client defaults could be a classification, but the option specifies the use of scores, which is fine.

Comment 12

ID: 826439 User: jin0 Badges: - Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 03:59 Selected Answer: - Upvotes: 1

Because there is no way to know what the dataset schema is, even though B fits this question's purpose, nobody can confidently select B. So there is no clean answer.

Comment 13

ID: 809532 User: musumusu Badges: - Relative Date: 3 years ago Absolute Date: Wed 15 Feb 2023 13:48 Selected Answer: - Upvotes: 1

All options are flawed; still in favour of B.
A: of course it's good to have more data, but it's not clear how much data we have.
B: linear regression can be a workable approach, though the current situation doesn't really call for a linear model; decision trees, random forests, etc. could be good for it.
C: data should be unbiased; removing bias is negative for training.

Comment 14

ID: 784375 User: Besss Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sun 22 Jan 2023 15:07 Selected Answer: B Upvotes: 2

default rates can be predicted with linear regression.

Comment 14.1

ID: 1087692 User: 21c17b3 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 04 Dec 2023 15:09 Selected Answer: - Upvotes: 1

A default rate is a classification probability.

Comment 15

ID: 756146 User: Whoswho Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 26 Dec 2022 03:43 Selected Answer: - Upvotes: 2

Answer should actually be a logistic Regression model

Comment 16

ID: 732583 User: ladistar Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 01 Dec 2022 13:18 Selected Answer: - Upvotes: 2

The question asks about default RATES, as in you are predicting a continuous variable, not a discrete one (classification). This is a regression problem, so choice B.

Comment 17

ID: 713903 User: woyaolai Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Tue 08 Nov 2022 15:39 Selected Answer: - Upvotes: 8

I used to be a Credit Risk modeler and I think this question is stupid.

3. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 1

Sequence: 30
Discussion ID: 79414
Source URL: https://www.examtopics.com/discussions/google/view/79414-exam-professional-data-engineer-topic-1-question-1/
Posted By: henriksoder24
Posted At: Sept. 2, 2022, 2:46 p.m.

Question

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

  • A. Threading
  • B. Serialization
  • C. Dropout Methods
  • D. Dimensionality Reduction

Suggested Answer

C


Comments: 19

Comment 1

ID: 657434 User: henriksoder24 Badges: Highly Voted Relative Date: 3 years, 6 months ago Absolute Date: Fri 02 Sep 2022 14:46 Selected Answer: - Upvotes: 32

Answer is C.
Bad performance of a model is either due to lack of relationship between dependent and independent variables used, or just overfit due to having used too many features and/or bad features.

A: Threading parallelisation can reduce training time, but if the selected features are the same then the resulting performance won't have changed
B: Serialization is only changing data into byte streams. This won't be useful.
C: This can show which features are bad. E.g. if it is one feature causing bad performance, then the dropout method will show it, so you can remove it from the model and retrain it.
D: This would become clear if the model did not fit the training data well. But the question says that the model fits the training data well, so D is not the answer.

Comment 2

ID: 784777 User: samdhimal Badges: Highly Voted Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:37 Selected Answer: - Upvotes: 7

C. Dropout Methods

Dropout is a regularization technique that can be used to prevent overfitting of the model to the training data. It works by randomly dropping out a certain percentage of neurons during training, which helps to reduce the complexity of the model and prevent it from memorizing the training data. This can improve the model's ability to generalize to new data and reduce the risk of poor performance when tested against new data.
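The mechanism described above (randomly zeroing neurons during training, a no-op at inference) can be sketched in a few lines of numpy. This "inverted dropout" variant, which rescales the surviving activations by 1/(1 - rate), is what Keras's `Dropout` layer implements during training; the shapes and rate here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of units during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations  # at inference time, dropout is a no-op
    keep_prob = 1.0 - rate
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

layer_out = np.ones((4, 10))  # stand-in for hidden-layer activations
dropped = dropout(layer_out, rate=0.5)
print(dropped)  # roughly half the units are zeroed; survivors are rescaled to 2.0
```

Because each forward pass sees a different random subnetwork, no single neuron can be relied on too heavily, which is the regularizing effect that combats overfitting.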

Comment 2.1

ID: 784778 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:37 Selected Answer: - Upvotes: 7

A. Threading: not a method to address overfitting; it's a technique to speed up training by parallelizing computations across multiple threads.

B. Serialization: a technique to save the model's architecture and trained parameters to a file. It's helpful when you want to reuse the model later, but it doesn't address the overfitting problem.

D. Dimensionality Reduction: a technique to reduce the number of features in the data. It's helpful when the data contains redundant or irrelevant features, but it doesn't directly address the overfitting problem.

Comment 3

ID: 1618123 User: 3244fd8 Badges: Most Recent Relative Date: 4 months, 3 weeks ago Absolute Date: Sun 19 Oct 2025 09:49 Selected Answer: C Upvotes: 1

The model fits the training data well but performs poorly on new data → this is overfitting, especially with a large/deep network.
Dropout is a regularization technique that randomly deactivates neurons during training, reduces co-adaptation, and improves generalization.
Threading only affects runtime, Serialization is for saving models, and Dimensionality Reduction isn’t the primary fix here since the model already fits the training set well.
Therefore: C. Dropout Methods.

Comment 4

ID: 1606794 User: gunnerski Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Sat 06 Sep 2025 22:49 Selected Answer: C Upvotes: 1

Only option which fix overfitting

Comment 5

ID: 1605909 User: israndroid Badges: - Relative Date: 6 months, 1 week ago Absolute Date: Wed 03 Sep 2025 18:17 Selected Answer: C Upvotes: 1

Answer is C.
For neural networks, the Dropout regularization technique helps prevent overfitting of the model; the main idea is to remove/turn off a random % of neurons during training.

Comment 6

ID: 1572521 User: MikeFR Badges: - Relative Date: 9 months, 2 weeks ago Absolute Date: Mon 26 May 2025 20:12 Selected Answer: C Upvotes: 1

Dropout method

Comment 7

ID: 1562345 User: su787 Badges: - Relative Date: 10 months, 3 weeks ago Absolute Date: Mon 21 Apr 2025 00:23 Selected Answer: C Upvotes: 1

Dropout method

Comment 8

ID: 1366619 User: monyu Badges: - Relative Date: 1 year ago Absolute Date: Sat 08 Mar 2025 16:10 Selected Answer: C Upvotes: 1

Correct answer is C.
A - It is not Threading, because threading is used to accelerate training and reduce training time.
B - It is not Serialization, because that only transforms (serializes) data into bytes and does not increase or change its original nature.
D - It is not Dimensionality Reduction, because the model already fits the training data.

Comment 9

ID: 1362309 User: Ahamada Badges: - Relative Date: 1 year ago Absolute Date: Wed 26 Feb 2025 22:34 Selected Answer: C Upvotes: 2

Dropout methods is the solution here to resolve overfitting issue

Comment 9.1

ID: 1579075 User: Git3 Badges: - Relative Date: 8 months, 3 weeks ago Absolute Date: Fri 20 Jun 2025 07:02 Selected Answer: - Upvotes: 1

okay, like in linear overfitting?

Comment 10

ID: 1343636 User: onschamekh Badges: - Relative Date: 1 year, 1 month ago Absolute Date: Mon 20 Jan 2025 14:53 Selected Answer: C Upvotes: 1

Dropout is a specific technique to prevent overfitting by randomly disabling a certain percentage of neurons during training. This helps the network avoid relying too heavily on a subset of neurons, thereby improving its ability to generalize to new data.

Comment 11

ID: 1334170 User: jithinlife Badges: - Relative Date: 1 year, 2 months ago Absolute Date: Mon 30 Dec 2024 14:44 Selected Answer: C Upvotes: 1

Can we expect similar questions like this in GCP exam as well?

Comment 12

ID: 1300752 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 06:34 Selected Answer: C Upvotes: 2

This is an overfitting problem. The general idea is to simplify the model; a GENERALIZATION-related method should be used.

Comment 13

ID: 1050457 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:01 Selected Answer: C Upvotes: 1

C. Dropout Methods
Dropout is a regularization technique commonly used in neural networks to prevent overfitting. It helps improve the generalization of the model by randomly setting a fraction of the neurons to zero during each training iteration, which prevents the network from relying too heavily on specific neurons. This, in turn, can lead to better performance on new, unseen data.

Comment 14

ID: 1060864 User: rocky48 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:01 Selected Answer: C Upvotes: 1

A: Threading parallelisation can reduce training time, but if the selected features are the same then the resulting performance won't have changed
B: Serialization is only changing data into byte streams. This won't be useful.
C: This can show which features are bad. E.g. if it is one feature causing bad performance, then the dropout method will show it, so you can remove it from the model and retrain it.
D: This would become clear if the model did not fit the training data well. But the question says that the model fits the training data well.

So, C is the answer.

Comment 15

ID: 1207631 User: trashbox Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Tue 07 May 2024 02:12 Selected Answer: C Upvotes: 1

Dropout Methods are useful to prevent a TensorFlow model from overfitting

Comment 16

ID: 948846 User: azmiozgen Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 11 Jul 2023 11:53 Selected Answer: C Upvotes: 1

Answer is C. Dropout methods are used to mitigate overfitting. They are applied during the training phase and improve test-time performance.

Comment 17

ID: 919153 User: dgteixeira Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 09 Jun 2023 11:48 Selected Answer: C Upvotes: 1

Answer is C

4. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 148

Sequence
70
Discussion ID
16677
Source URL
https://www.examtopics.com/discussions/google/view/16677-exam-professional-data-engineer-topic-1-question-148/
Posted By
madhu1171
Posted At
March 15, 2020, 5:16 p.m.

Question

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

  • A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
  • B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
  • C. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
  • D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

Suggested Answer

B

Answer Description


Community Answer Votes

Comments (21)

Comment 1

ID: 120080 User: dambilwa Badges: Highly Voted Relative Date: 5 years, 2 months ago Absolute Date: Sat 26 Dec 2020 04:52 Selected Answer: - Upvotes: 10

Option [B] - looks to be correct

Comment 2

ID: 178685 User: nehaxlpb Badges: Highly Voted Relative Date: 4 years, 12 months ago Absolute Date: Sat 13 Mar 2021 14:33 Selected Answer: - Upvotes: 8

Answer is B.
Cloud Vision API detects many things, but not damage. The definition of damage can differ for each business, so we need to train a model with our own training and test data to capture our definition of damage. That requires custom ML capabilities, so the answer is B, AutoML.

Comment 3

ID: 1581592 User: 2fbe820 Badges: Most Recent Relative Date: 8 months, 2 weeks ago Absolute Date: Sun 29 Jun 2025 13:01 Selected Answer: B Upvotes: 1

why not C - as per Gemini: The pre-trained Cloud Vision API provides general-purpose image analysis (e.g., object detection for common objects, landmark detection, text detection, safe search). It is not designed or trained to specifically detect "damage" on arbitrary packages. "Damage" is a highly specific and custom visual pattern that would require a custom model. The API would not inherently know what "damage" looks like on a package.
Fit: Incorrect tool for custom damage detection.

Comment 4

ID: 1108770 User: enthGCP Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Sat 29 Jun 2024 14:22 Selected Answer: - Upvotes: 2

As per ChatGPT, one of the features of Cloud Vision API is damage detection, which can be used to identify and classify various types of damage in images, such as cracks, dents, scratches, and stains.

Comment 5

ID: 1099858 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 16:26 Selected Answer: B Upvotes: 1

For this scenario, where you need to automate the detection of damaged packages in real time while they are in transit, the most suitable solution among the provided options would be B.

Here's why this option is the most appropriate:

Real-Time Analysis: AutoML provides the capability to train a custom model specifically tailored to recognize patterns of damage in packages. This model can process images in real-time, which is essential in your scenario.

Integration with Existing Systems: By building an API around the AutoML model, you can seamlessly integrate this solution with your existing package tracking applications. This ensures that the system can flag damaged packages for human review efficiently.

Customization and Accuracy: Since the model is trained on your specific corpus of images, it can be more accurate in detecting damages relevant to your use case compared to pre-trained models.

Comment 5.1

ID: 1099859 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Tue 18 Jun 2024 16:27 Selected Answer: - Upvotes: 1

Let's briefly consider why the other options are less suitable:

A. Use BigQuery machine learning: BigQuery is great for handling large-scale data analytics but is not optimized for real-time image processing or complex image recognition tasks like damage detection on packages.

C. Use the Cloud Vision API: While the Cloud Vision API is powerful for general image analysis, it might not be as effective for the specific task of detecting damage on packages, which requires a more customized approach.

D. Use TensorFlow in Cloud Datalab: While this is a viable option for creating a custom model, it might be more complex and time-consuming compared to using AutoML. Additionally, setting up a real-time analysis system through a Python notebook might not be as straightforward as an API integration.

Comment 6

ID: 1092267 User: juliorevk Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Mon 10 Jun 2024 03:16 Selected Answer: B Upvotes: 3

I was leaning towards C but tested out uploading some damaged boxes to Vision API. It seems to have a lot of trouble detecting damaged boxes. It mislabeled boxes as a tire or toy. Also, there is no part of the API that seems to be able to detect damage. So I'll have to go with B. You should train a model to accomplish this then integrate with your app.

Comment 7

ID: 1015471 User: barnac1es Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sun 24 Mar 2024 06:48 Selected Answer: B Upvotes: 2

AutoML for Custom Models: AutoML (Automated Machine Learning) is designed to simplify the process of training custom machine learning models, including image classification models. It allows you to leverage Google Cloud's pre-built AutoML Vision service to train a model specifically for detecting package damage based on your corpus of images. This ensures accurate and customized results.
Real-time API Integration: After training the AutoML model, you can create an API endpoint that integrates seamlessly with your package tracking applications. This means that as packages move on the delivery lines, you can send images in real-time to the API for immediate analysis.
Scalability: AutoML Vision is built to scale, so it can handle the analysis of images in real-time, even as packages move continuously on the delivery lines.
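The real-time integration pattern described above can be sketched as follows. This is a hedged illustration only: `predict_damage_score` is a hypothetical stand-in (a stub replaces the deployed model's prediction endpoint), and the 0.8 threshold is an assumed tuning parameter, not anything from Google Cloud's API.

```python
def predict_damage_score(image_bytes):
    """Stub standing in for a deployed image model's prediction endpoint.

    In a real system this would send the camera frame to the model
    serving API; here a fake score keeps the sketch self-contained.
    """
    return 0.91 if b"dent" in image_bytes else 0.05

def flag_for_review(image_bytes, threshold=0.8):
    """Flag a package frame for human review when the model's damage
    score exceeds the confidence threshold."""
    score = predict_damage_score(image_bytes)
    return {"damaged": score >= threshold, "score": score}

result = flag_for_review(b"frame-with-dent")
```

The package tracking application would call `flag_for_review` for each camera frame as packages pass on the delivery line, routing flagged packages to a human reviewer.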

Comment 8

ID: 985654 User: arien_chen Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 11:50 Selected Answer: C Upvotes: 5

Keywords: realtime, camera streaming

https://cloud.google.com/vision#:~:text=where%20you%20are-,Vertex%20AI%20Vision,-Vertex%C2%A0AI%20Vision

Option B (AutoML) would be too complex and not time-efficient.

Using Vision AI (Vertex AI Vision) first + AutoML.
Option D is better than B (just AutoML).

Comment 8.1

ID: 985656 User: arien_chen Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 11:53 Selected Answer: - Upvotes: 3

typo: Option C is better than B.

Comment 9

ID: 983588 User: piyush7777 Badges: - Relative Date: 2 years ago Absolute Date: Sat 17 Feb 2024 15:11 Selected Answer: - Upvotes: 1

B
https://www.cloudskillsboost.google/focuses/22020?parent=catalog

Comment 10

ID: 963279 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 05:46 Selected Answer: B Upvotes: 1

Option B - AutoML

Comment 11

ID: 893747 User: Oleksandr0501 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 11:21 Selected Answer: - Upvotes: 3

I will stay with B. It might be more reliable and accurate.
As many say in the discussion, the Vision API does not claim to have defect detection.
I remember Vertex AI labs with AutoML, where models were trained.

Comment 12

ID: 761617 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 02:21 Selected Answer: - Upvotes: 2

B is right

Comment 13

ID: 727272 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Fri 26 May 2023 05:14 Selected Answer: - Upvotes: 3

B
C is not the answer.
The Vision API feature list does not include damage detection:
https://cloud.google.com/vision/docs/features-list#:~:text=Vision%20API%20currently%20allows%20you%20to%20use%20the%20following%20features%3A

Comment 14

ID: 712668 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 21:53 Selected Answer: B Upvotes: 3

It looks like B is the only valid option, with the assumption that you have a corpus of images (the question does not say that you do not).
It would not be Cloud Vision API because that does not do damage detection (https://cloud.google.com/vision/docs/features-list).

Comment 15

ID: 677284 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Thu 23 Mar 2023 18:50 Selected Answer: - Upvotes: 2

https://cloud.google.com/solutions/visual-inspection-ai#all-features

Comment 15.1

ID: 712667 User: cloudmon Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 06 May 2023 21:50 Selected Answer: - Upvotes: 2

This is how you would do it nowadays, but the question is not referring to this solution. It only refers to "Cloud Vision API" (not Visual Inspection API). Cloud Vision API does not do damage detection (https://cloud.google.com/vision/docs/features-list) so you would need to do AutoML. It looks like they assume that you have your own corpus of images.

Comment 16

ID: 543522 User: Deepakd Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 09 Aug 2022 04:17 Selected Answer: - Upvotes: 1

Here it is mentioned that the company is planning to implement a camera system, so it does not have the training data yet. Without training data, the only option left is to use pre-trained models through a cloud API. C is the answer. B is wrong because you don't have data to train the model.

Comment 16.1

ID: 577201 User: Deepakd Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Thu 29 Sep 2022 05:15 Selected Answer: - Upvotes: 3

I would correct myself and go for B. I did not find any mention of cloud vision api being used for object detection.

Comment 17

ID: 530078 User: sraakesh95 Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Fri 22 Jul 2022 21:22 Selected Answer: B Upvotes: 2

AutoML is used to train a model to do damage detection. The Vision API is a pre-trained service used to detect objects in images.

5. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 127

Sequence
98
Discussion ID
17236
Source URL
https://www.examtopics.com/discussions/google/view/17236-exam-professional-data-engineer-topic-1-question-127/
Posted By
-
Posted At
March 22, 2020, 11:01 a.m.

Question

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

  • A. Use Cloud TPUs without any additional adjustment to your code.
  • B. Use Cloud TPUs after implementing GPU kernel support for your custom ops.
  • C. Use Cloud GPUs after implementing GPU kernel support for your custom ops.
  • D. Stay on CPUs, and increase the size of the cluster you're training your model on.

Suggested Answer

C

Answer Description


Community Answer Votes

Comments (24)

Comment 1

ID: 70205 User: dhs227 Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Wed 01 Apr 2020 20:00 Selected Answer: - Upvotes: 70

The correct answer is C.
TPUs do not support custom C++ TensorFlow ops:
https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus

Comment 1.1

ID: 1053168 User: ffggrre Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 24 Oct 2023 23:06 Selected Answer: - Upvotes: 1

the link doesn't say TPU does not support custom C++ tensorflow ops

Comment 1.1.1

ID: 1107933 User: Helinia Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 28 Dec 2023 16:53 Selected Answer: - Upvotes: 1

It does. TPU is good for "Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop".

Comment 2

ID: 70171 User: aiguy Badges: Highly Voted Relative Date: 5 years, 11 months ago Absolute Date: Wed 01 Apr 2020 18:05 Selected Answer: - Upvotes: 44

D:
Cloud TPUs are not suited to the following workloads: [...] Neural network workloads that contain custom TensorFlow operations written in C++. Specifically, custom operations in the body of the main training loop are not suitable for TPUs.

Comment 2.1

ID: 576078 User: tavva_prudhvi Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Sun 27 Mar 2022 10:44 Selected Answer: - Upvotes: 1

But the question also says we have to decrease the time significantly. If you use CPUs, it will take more time to train, right?

Comment 2.1.1

ID: 909957 User: cetanx Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 08:03 Selected Answer: - Upvotes: 1

Chat GPT says C
Option D is not the most cost-effective or efficient solution. While increasing the size of the cluster could decrease the training time, it would also significantly increase the cost, and CPUs are not as efficient for this type of workload as GPUs.

Comment 2.1.1.1

ID: 982643 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 16 Aug 2023 15:50 Selected Answer: - Upvotes: 3

ChatGPT will give you different answers if you ask 10 times. The correct answer is B.

Comment 2.1.1.1.1

ID: 1056229 User: squishy_fishy Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Sat 28 Oct 2023 15:26 Selected Answer: - Upvotes: 2

Totally agree. ChatGPT is garbage. It is still learning.

Comment 2.2

ID: 309431 User: gopinath_k Badges: - Relative Date: 4 years, 12 months ago Absolute Date: Sat 13 Mar 2021 05:55 Selected Answer: - Upvotes: 11

B:
1. You need to provide support for the matrix multiplication - TPU
2. You need to provide support for the Custom TF written in C++ - GPU

Comment 3

ID: 1561389 User: aaaaaaaasdasdasfs Badges: Most Recent Relative Date: 10 months, 4 weeks ago Absolute Date: Thu 17 Apr 2025 09:48 Selected Answer: C Upvotes: 1

C. Use Cloud GPUs after implementing GPU kernel support for your custom ops.
Here's why:

Your current bottleneck is in the custom C++ TensorFlow ops performing matrix multiplications, which are computationally intensive operations.
GPUs are specifically designed to efficiently handle matrix operations and would significantly speed up your training time compared to CPUs.
Since you've developed custom C++ TensorFlow ops, you'll need to implement GPU kernel support for these operations to take advantage of GPU acceleration.
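To make the bottleneck concrete: a matrix multiply is O(n^3) independent multiply-adds, which is exactly the shape of work GPUs accelerate. The naive pure-Python version below (an illustration only, not how TensorFlow implements it) shows the inner loops a custom GPU kernel would parallelize across thousands of cores.

```python
def matmul(a, b):
    """Naive O(n^3) matrix multiply: each output cell is an independent
    dot product, so a GPU kernel can compute all cells in parallel."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)  # [[19, 22], [43, 50]]
```

Because every output cell depends only on one row of `a` and one column of `b`, there are no cross-cell dependencies, which is why implementing a GPU kernel for these custom ops yields such a large speedup over CPUs.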

Comment 4

ID: 1302549 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Thu 24 Oct 2024 19:13 Selected Answer: D Upvotes: 3

According to the official documentation, models that contain many custom TensorFlow operations written in C++ should keep using CPUs.

Comment 5

ID: 1289411 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 26 Sep 2024 13:50 Selected Answer: - Upvotes: 3

I think this is D. I recently took the ML professional exam, and this is asked there too; it's always "C++ custom ops = CPU". In fact it's the only scenario for non-small models on CPU. It's written in black and white here: https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus; check out the CPU/GPU/TPU "when to use" sections.

Comment 6

ID: 1226620 User: Anudeep58 Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 08 Jun 2024 10:01 Selected Answer: C Upvotes: 2

Why Not Other Options?
A. Use Cloud TPUs without any additional adjustment to your code:

TPUs are optimized for standard TensorFlow operations and require custom TensorFlow ops to be adapted to TPU-compatible kernels, which is not trivial.
Without modifications, your custom C++ ops will not run efficiently on TPUs.
B. Use Cloud TPUs after implementing GPU kernel support for your customs ops:

Implementing GPU kernel support alone is not sufficient for running on TPUs. TPUs require specific optimizations and adaptations beyond GPU kernels.
D. Stay on CPUs, and increase the size of the cluster you're training your model on:

While increasing the CPU cluster size might reduce training time, it is not as efficient or cost-effective as using GPUs, especially for matrix multiplication tasks.

Comment 7

ID: 1224523 User: AlizCert Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Wed 05 Jun 2024 07:06 Selected Answer: C Upvotes: 2

C: TPUs are out of the picture due to the custom ops, so the next best option for accelerating matrix operations is using GPUs. Obviously the code has to be adjusted to make use of the GPU acceleration.

Comment 8

ID: 1218377 User: GCP_data_engineer Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 25 May 2024 16:38 Selected Answer: - Upvotes: 1

CPU : Simple models
GPU: Custom TensorFlow/PyTorch/JAX operations

Comment 9

ID: 1189553 User: CGS22 Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Thu 04 Apr 2024 23:24 Selected Answer: C Upvotes: 2

The best choice here is C. Use Cloud GPUs after implementing GPU kernel support for your customs ops. Here's why:

Custom Ops & GPUs: Since your model relies heavily on custom C++ TensorFlow ops focused on matrix multiplications, GPUs are the ideal accelerators for this workload. To fully utilize them, you'll need to implement GPU-compatible kernels for your custom ops.
Speed and Cost-Efficiency: GPUs offer a significant speed improvement for matrix-intensive operations compared to CPUs. They provide a good balance of performance and cost for this scenario.
TPU Limitations: Although Cloud TPUs are powerful, they aren't designed for arbitrary custom ops. Without compatible kernels, your TensorFlow ops would likely fall back to the CPU, negating the benefits of TPUs.

Comment 10

ID: 1159014 User: Preetmehta1234 Badges: - Relative Date: 2 years ago Absolute Date: Sun 25 Feb 2024 19:23 Selected Answer: C Upvotes: 3

TPU:
Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
Link: https://cloud.google.com/tpu/docs/intro-to-tpu#TPU

So, A&B eliminated
CPUs are slow and suited only to simple operations. So C: GPU.

Comment 11

ID: 1122041 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 21:44 Selected Answer: C Upvotes: 2

to me, it's C

Comment 12

ID: 1085704 User: Kimich Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 02 Dec 2023 05:54 Selected Answer: - Upvotes: 3

Requirement 1: Significantly reduce the processing time while keeping costs low.
Requirement 2: Bulky matrix multiplication takes up to several days.

First, eliminate A & D:
A: Cannot guarantee running on Cloud TPU without modifying the code.
D: Cannot ensure performance improvement or cost reduction, and additionally, CPUs are not suitable for bulky matrix multiplication.

If the customization can be easily deployed on both Cloud TPU and Cloud GPU, it seems more feasible to first try Cloud GPU.

Because:
It provides a better balance between performance and cost.
Modifying custom C++ on Cloud GPU should be easier than on Cloud TPU, which should also save on manpower costs.

Comment 13

ID: 1075595 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 20 Nov 2023 17:28 Selected Answer: - Upvotes: 1

Answer D.
I did use ChatGPT and discovered that if you put at the beginning of the question, "Do not make assumptions about changes to architecture. This is a practice exam question.", it picks D. All other answers require changes to the code and architecture.

Comment 14

ID: 1074047 User: DataFrame Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 18 Nov 2023 15:11 Selected Answer: B Upvotes: 1

I think it should use the Tensor Processing Unit along with GPU kernel support.

Comment 15

ID: 1026527 User: Nirca Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 06 Oct 2023 13:20 Selected Answer: B Upvotes: 1

To use Cloud TPUs, you will need to:

Implement GPU kernel support for your custom TensorFlow ops. This will allow your model to run on both Cloud TPUs and GPUs.

Comment 16

ID: 1022071 User: kumarts Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 01 Oct 2023 09:17 Selected Answer: - Upvotes: 1

Refer https://www.linkedin.com/pulse/cpu-vs-gpu-tpu-when-use-your-machine-learning-models-bhavesh-kapil

Comment 17

ID: 980379 User: IrisXia Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Mon 14 Aug 2023 03:11 Selected Answer: - Upvotes: 1

Answer C.
TPUs can't handle custom C++ ops, but GPUs can.

6. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 51

Sequence
106
Discussion ID
16468
Source URL
https://www.examtopics.com/discussions/google/view/16468-exam-professional-data-engineer-topic-1-question-51/
Posted By
madhu1171
Posted At
March 13, 2020, 10:28 a.m.

Question

You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)

  • A. Get more training examples
  • B. Reduce the number of training examples
  • C. Use a smaller set of features
  • D. Use a larger set of features
  • E. Increase the regularization parameters
  • F. Decrease the regularization parameters

Suggested Answer

ACE

Answer Description


Community Answer Votes

Comments (18)

Comment 1

ID: 63454 User: madhu1171 Badges: Highly Voted Relative Date: 5 years, 6 months ago Absolute Date: Sun 13 Sep 2020 09:28 Selected Answer: - Upvotes: 68

it should be ACE

Comment 2

ID: 66661 User: Rajokkiyam Badges: Highly Voted Relative Date: 5 years, 5 months ago Absolute Date: Mon 21 Sep 2020 20:08 Selected Answer: - Upvotes: 6

Answer: ACE

Comment 3

ID: 1402070 User: monyu Badges: Most Recent Relative Date: 11 months, 3 weeks ago Absolute Date: Sat 22 Mar 2025 23:02 Selected Answer: ACE Upvotes: 1

A. Getting more training samples significantly reduces the risk of overfitting, since the algorithm can learn from a more general dataset.

C. Introducing lots of features increases the risk of introducing irrelevant information, driving the model away from the truly important patterns.

E. Regularization adds a penalty term to the loss function, which discourages complex models with large coefficients, reducing overfitting.

Comment 4

ID: 1096468 User: TVH_Data_Engineer Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Fri 14 Jun 2024 12:50 Selected Answer: ACE Upvotes: 5

To address the problem of overfitting in training a spam classifier, you should consider the following three actions:

A. Get more training examples:

Why: More training examples can help the model generalize better to unseen data. A larger dataset typically reduces the chance of overfitting, as the model has more varied examples to learn from.
C. Use a smaller set of features:

Why: Reducing the number of features can help prevent the model from learning noise in the data. Overfitting often occurs when the model is too complex for the amount of data available, and having too many features can contribute to this complexity.
E. Increase the regularization parameters:

Why: Regularization techniques (like L1 or L2 regularization) add a penalty to the model for complexity. Increasing the regularization parameter will strengthen this penalty, encouraging the model to be simpler and thus reducing overfitting.
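The effect of action E can be shown with a tiny closed-form ridge (L2) regression sketch in pure Python. The data and alpha values are made up for illustration: a larger regularization parameter shrinks the learned weight toward zero, constraining model complexity.

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form 1-D ridge regression (no intercept): minimizes
    sum((y - w*x)^2) + alpha * w^2, which gives
    w = sum(x*y) / (sum(x^2) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]                    # roughly y = 2x, with noise
w_weak = ridge_weight(xs, ys, alpha=0.0)     # no penalty: fits freely
w_strong = ridge_weight(xs, ys, alpha=30.0)  # strong penalty: weight shrunk
```

With `alpha = 0` this reduces to ordinary least squares; as `alpha` grows, the denominator grows and the weight is pulled toward zero, which is exactly the "penalty for complexity" the comment describes.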

Comment 5

ID: 958543 User: Mathew106 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sun 21 Jan 2024 16:25 Selected Answer: ACE Upvotes: 2

100% ACE

We need more data because less data induces overfitting. We need fewer features to make the problem simpler to learn and avoid fitting a very complex function over thousands of features that might not apply to the test data. We also need to use regularization to keep the weights constrained.

Comment 6

ID: 954523 User: theseawillclaim Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 21:25 Selected Answer: ACE Upvotes: 1

Definitely ACE.
More training data and fewer variables can prevent the model from being too specific.

Comment 7

ID: 818897 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 23 Aug 2023 07:08 Selected Answer: - Upvotes: 1

Why is A an answer? It says "more training examples", not "a larger dataset". If the dataset stays the same and only the training split grows, then the validation and test examples would have to shrink, wouldn't they?

Comment 8

ID: 774731 User: desertlotus1211 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 13 Jul 2023 17:35 Selected Answer: - Upvotes: 1

Collect more training data: This will help the model generalize better and reduce overfitting.

Use regularization techniques: Techniques such as L1 and L2 regularization can be applied to the model's weights to prevent them from becoming too large and causing overfitting.

Use early stopping: This involves monitoring the performance of the model on a validation set during training, and stopping the training when the performance on the validation set starts to degrade. This helps to prevent the model from becoming too complex and overfitting the training data.

Comment 8.1

ID: 774736 User: desertlotus1211 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 13 Jul 2023 17:39 Selected Answer: - Upvotes: 1

Regularization is a technique that penalizes the coefficient. In an overfit model, the coefficients are generally inflated. Thus, Regularization adds penalties to the parameters and avoids them weigh heavily.

A & C are correct... the third one --- not sure on

Comment 9

ID: 770710 User: RoshanAshraf Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 09 Jul 2023 17:45 Selected Answer: ACE Upvotes: 1

A - The model is overfitting the training data, so adding training data will help.
C - A larger feature set can cause overfitting, so we have to use a smaller set or reduce features.
E - Increasing regularization is a method for fixing an overfitting model.

Comment 10

ID: 766087 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 21:02 Selected Answer: - Upvotes: 3

Answers are;
A. Get more training examples
C. Use a smaller set of features
E. Increase the regularization parameters

Prevent overfitting: fewer variables, regularization, early stopping during training.

Reference:
https://cloud.google.com/bigquery-ml/docs/preventing-overfitting

Comment 11

ID: 744634 User: DGames Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 14 Jun 2023 02:19 Selected Answer: ADE Upvotes: 1

Answer ADE

Comment 12

ID: 703964 User: MisuLava Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Tue 25 Apr 2023 16:56 Selected Answer: ACE Upvotes: 1

100% sure ACE

https://elitedatascience.com/overfitting-in-machine-learning

Comment 13

ID: 652308 User: MisuLava Badges: - Relative Date: 3 years ago Absolute Date: Sun 26 Feb 2023 21:14 Selected Answer: - Upvotes: 1

Answer is : ACE
https://www.ibm.com/cloud/learn/overfitting#:~:text=Overfitting%20is%20a%20concept%20in,unseen%20data%2C%20defeating%20its%20purpose.

Comment 14

ID: 646818 User: Noahz110 Badges: - Relative Date: 3 years ago Absolute Date: Tue 14 Feb 2023 18:41 Selected Answer: ACE Upvotes: 1

I vote for ACE.

Comment 15

ID: 642542 User: Dip1994 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Sat 04 Feb 2023 19:12 Selected Answer: - Upvotes: 1

It should be ACE

Comment 16

ID: 523706 User: sraakesh95 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Thu 14 Jul 2022 19:49 Selected Answer: ACE Upvotes: 1

@medeis_jar

Comment 17

ID: 516724 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Mon 04 Jul 2022 14:36 Selected Answer: ACE Upvotes: 4

As MaxNRG wrote:
The tools to prevent overfitting: fewer variables, regularization, early stopping during training.

- Adding more training data will increase the diversity of the training set and help with the variance problem.
- Reducing the feature set will ameliorate the overfitting and help with the variance problem.
- Increasing the regularization parameter will reduce overfitting and help with the variance problem.

7. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 245

Sequence
109
Discussion ID
129901
Source URL
https://www.examtopics.com/discussions/google/view/129901-exam-professional-data-engineer-topic-1-question-245/
Posted By
chickenwingz
Posted At
Dec. 30, 2023, 6:23 p.m.

Question

You are developing a model to identify the factors that lead to sales conversions for your customers. You have completed processing your data. You want to continue through the model development lifecycle. What should you do next?

  • A. Use your model to run predictions on fresh customer input data.
  • B. Monitor your model performance, and make any adjustments needed.
  • C. Delineate what data will be used for testing and what will be used for training the model.
  • D. Test and evaluate your model on your curated data to determine how well the model performs.

Suggested Answer

C

Answer Description


Community Answer Votes

Comments (8)

Comment 1

ID: 1114097 User: raaad Badges: Highly Voted Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 22:53 Selected Answer: C Upvotes: 7

- Before you can train a model, you need to decide how to split your dataset.

Comment 2

ID: 1121688 User: Matt_108 Badges: Highly Voted Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 14:37 Selected Answer: C Upvotes: 6

Option C - you've just concluded processing data, ending up with clean and prepared data for the model. Now you need to decide how to split the data for testing and for training. Only afterwards, you can train the model, evaluate it, fine tune it and, eventually, predict with it
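The split this comment describes can be sketched in a few lines of pure Python; the 80/20 ratio and the seed are illustrative choices, not prescribed by the question.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=7):
    """Shuffle the processed rows, then hold out the last
    `test_fraction` of them for evaluation only."""
    rng = random.Random(seed)
    shuffled = rows[:]   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```

Shuffling before the cut matters: if the rows are ordered (say, by date or by conversion outcome), a naive slice would give the model a training set that doesn't represent the test set.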

Comment 3

ID: 1402338 User: desertlotus1211 Badges: Most Recent Relative Date: 11 months, 3 weeks ago Absolute Date: Sun 23 Mar 2025 16:23 Selected Answer: D Upvotes: 1

Answer D is correct:
The next step in the machine learning lifecycle is to evaluate the model.

Delineating test/train data should have been done BEFORE or during data processing.

Comment 3.1

ID: 1402339 User: desertlotus1211 Badges: - Relative Date: 11 months, 3 weeks ago Absolute Date: Sun 23 Mar 2025 16:24 Selected Answer: - Upvotes: 1

The machine learning life cycle typically involves planning, data preparation, model engineering, model evaluation, model deployment, and monitoring/maintenance.

Comment 4

ID: 1263446 User: meh_33 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Sat 10 Aug 2024 13:35 Selected Answer: C Upvotes: 3

First time ever the ExamTopics answer matches the users' answer, yoooo hoooooo.

C

Comment 4.1

ID: 1283705 User: 1919730 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Sat 14 Sep 2024 18:05 Selected Answer: - Upvotes: 1

Yoooo hoooooo. Yep!

Comment 5

ID: 1154466 User: JyoGCP Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 04:52 Selected Answer: C Upvotes: 1

Option C

Comment 6

ID: 1109856 User: chickenwingz Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sat 30 Dec 2023 18:23 Selected Answer: C Upvotes: 6

Model doesn't seem to be trained yet

8. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 7

Sequence
110
Discussion ID
16640
Source URL
https://www.examtopics.com/discussions/google/view/16640-exam-professional-data-engineer-topic-1-question-7/
Posted By
-
Posted At
March 15, 2020, 8:43 a.m.

Question

You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?

  • A. Linear regression
  • B. Logistic classification
  • C. Recurrent neural network
  • D. Feedforward neural network

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 207480 User: Radhika7983 Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Wed 28 Oct 2020 04:28 Selected Answer: - Upvotes: 59

Correct answer is A. A tip to decide when linear regression should be used versus logistic regression: if you are forecasting, that is, the value in the column you are predicting is numeric, it is always linear regression. If you are classifying, that is, buy or no buy, yes or no, you will be using logistic regression.
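The rule of thumb above can be illustrated with a least-squares fit. The house-size and price figures below are invented, and NumPy's `lstsq` stands in for a full regression library; the point is that fitting and evaluating a linear model is cheap enough for an ordinary CPU on a small VM.

```python
# Sketch of the rule of thumb above: a numeric target (price) calls for
# linear regression. Fit y = w*x + b by least squares with NumPy;
# the house-size/price numbers are made up for illustration.
import numpy as np

sizes = np.array([50.0, 80.0, 100.0, 120.0])     # square meters
prices = np.array([150.0, 240.0, 300.0, 360.0])  # here exactly 3.0 * size

A = np.column_stack([sizes, np.ones_like(sizes)])  # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, prices, rcond=None)

predicted = w * 90.0 + b  # price estimate for a 90 m^2 house
print(predicted)
```

On this toy data the fit recovers w = 3 and b = 0, so the prediction for 90 m² lands near 270. A logistic model would instead squash the output through a sigmoid to get a yes/no class, which is why option B doesn't fit a price-prediction task.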

Comment 1.1

ID: 259880 User: Anirkent Badges: - Relative Date: 5 years, 2 months ago Absolute Date: Tue 05 Jan 2021 03:37 Selected Answer: - Upvotes: 7

Linear regression is correct, but that is only one aspect of the question: how does it relate to resource-constrained machines? Or could that be just a distraction?

Comment 1.1.1

ID: 282827 User: muzammilnxs Badges: - Relative Date: 5 years, 1 month ago Absolute Date: Wed 03 Feb 2021 17:31 Selected Answer: - Upvotes: 27

Neural networks (feedforward or recurrent) require resource-intensive machines (i.e., GPUs), whereas linear regression can be done on ordinary CPUs.

Comment 2

ID: 1399887 User: willyunger Badges: Most Recent Relative Date: 12 months ago Absolute Date: Tue 18 Mar 2025 00:14 Selected Answer: A Upvotes: 1

Linear regression to predict a numeric value at the least cost; for classifying among options, use logistic regression.

Comment 3

ID: 1301135 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Mon 21 Oct 2024 18:06 Selected Answer: A Upvotes: 2

The keyword here is running it on a single resource-constrained virtual machine. Linear regression is a simple and efficient algorithm that is well-suited for predicting continuous values.

Comment 4

ID: 1050469 User: rtcpost Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Tue 24 Sep 2024 07:08 Selected Answer: A Upvotes: 3

Linear regression is a simple and resource-efficient algorithm for predicting continuous values like housing prices. It's computationally lightweight and well-suited for single machines with limited resources. It doesn't require the extensive computational power or specialized hardware that more complex algorithms like neural networks (options C and D) might need.

Option B (Logistic classification) is used for binary classification tasks, not for predicting continuous values like housing prices, so it's not the right choice in this context.

Comment 5

ID: 1154388 User: AshishDhamu Badges: - Relative Date: 2 years ago Absolute Date: Tue 20 Feb 2024 01:50 Selected Answer: A Upvotes: 1

Linear regression is used for continuous distributions.

Comment 6

ID: 1116623 User: Fazan456 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Mon 08 Jan 2024 14:08 Selected Answer: A Upvotes: 1

Here, due to budget constraints, we're utilizing a single resource-constrained virtual machine, operating in a minimal resource environment. Linear regression emerges as the appropriate algorithm. It's a lightweight predictive model that suits our resource limitations

Comment 7

ID: 1065068 User: RT_G Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 19:15 Selected Answer: A Upvotes: 1

Linear regression will be used since the prediction requires forecasting prices involving numeric values and is computationally less resource intensive

Comment 8

ID: 1061051 User: rocky48 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 03 Nov 2023 04:20 Selected Answer: A Upvotes: 1

Correct answer is A

Comment 9

ID: 904824 User: AmmarFasih Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 23 May 2023 12:48 Selected Answer: A Upvotes: 1

Correct answer is A, since linear regression is used to predict a numeric value, while logistic regression is used to classify in a binary scenario.
Further, options C and D are advanced ML options and not cost- or resource-effective for the current situation.

Comment 10

ID: 820218 User: Zosby Badges: - Relative Date: 3 years ago Absolute Date: Fri 24 Feb 2023 08:40 Selected Answer: - Upvotes: 2

predict housing prices = linear regression

Comment 11

ID: 810092 User: JJJJim Badges: - Relative Date: 3 years ago Absolute Date: Thu 16 Feb 2023 01:48 Selected Answer: A Upvotes: 1

must be A.
Though C can do it, linear regression is the better practice.

Comment 12

ID: 741509 User: lukas_xls Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 11 Dec 2022 08:58 Selected Answer: A Upvotes: 1

Must be A

Comment 13

ID: 647411 User: rowan_ Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Tue 16 Aug 2022 04:19 Selected Answer: - Upvotes: 2

A for sure. B is for classification. Neural nets can accomplish the task but they take WAY too many resources

Comment 14

ID: 529925 User: samdhimal Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sat 22 Jan 2022 16:00 Selected Answer: - Upvotes: 4

correct answer -> Linear Regression

Linear regression is a statistical method that allows you to summarize and study relationships between two continuous (quantitative) variables: one variable, denoted X, is regarded as the independent variable; the other, denoted y, is regarded as the dependent variable. Linear regression uses the independent variable X to explain or predict the outcome of the dependent variable y.

Whenever you are told to predict some future value of a process which is currently running, you can go with a regression algorithm.

Comment 14.1

ID: 784804 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 00:57 Selected Answer: - Upvotes: 2

Linear regression is a simple and computationally efficient algorithm that can be used to predict a continuous target variable based on one or more input variables. It is particularly well-suited for resource-constrained environments, as it requires minimal computational resources and can be run on a single virtual machine.
Linear regression is a good fit for this problem as it is a supervised learning algorithm that can be used for regression problems, and it's not computationally expensive.

Option B is not recommended as Logistic classification is a supervised learning algorithm that is used for classification problems, not regression problems.

Option C and D are not recommended as Recurrent Neural Network (RNN) and Feedforward Neural Network (FNN) are computationally expensive and may require significant computational resources and memory to run on a single virtual machine.

Comment 15

ID: 474006 User: MaxNRG Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Sun 07 Nov 2021 19:29 Selected Answer: - Upvotes: 3

A as Supervised learning using Regression can help build a model to predict house prices.
Option B is wrong as Classification would not help to solve the problem.
Options C & D are wrong as they would need more resources.

Comment 16

ID: 462019 User: anji007 Badges: - Relative Date: 4 years, 4 months ago Absolute Date: Thu 14 Oct 2021 14:41 Selected Answer: - Upvotes: 1

Ans: A

Comment 17

ID: 438490 User: StefanoG Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Fri 03 Sep 2021 13:47 Selected Answer: - Upvotes: 2

OK, the right answer is A, but the question is why? Then:
- Not B, because we are forecasting, not classifying.
- Not C or D, because those solutions need more nodes, hence more VMs.
Right?

9. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 71

Sequence
127
Discussion ID
17111
Source URL
https://www.examtopics.com/discussions/google/view/17111-exam-professional-data-engineer-topic-1-question-71/
Posted By
-
Posted At
March 21, 2020, 4:45 p.m.

Question

You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?

  • A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
  • B. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
  • C. Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
  • D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 137968 User: VishalB Badges: Highly Voted Relative Date: 4 years, 7 months ago Absolute Date: Sun 18 Jul 2021 17:37 Selected Answer: - Upvotes: 36

Correct Answer : A

Entity analysis -> Identify entities within documents such as receipts, invoices, and contracts, and label them by type: date, person, contact information, organization, location, events, products, and media.

Sentiment analysis -> Understand the overall opinion, feeling, or attitude sentiment expressed in a block of text.
-- Avoid Custom models
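A hedged sketch of the "process Entity Analysis as labels" step: the response dict below is hand-written to mimic the shape of an `analyzeEntities` result (entities carrying `name`, `type`, and `salience`), so no API call or credentials are involved. The `entities_to_labels` helper and the 0.1 salience threshold are illustrative choices, not part of any Google API.

```python
# Turn an analyzeEntities-style response into subject labels by keeping
# the most salient entity names. The response here is hand-written; a
# real call would use the google-cloud-language client library.
def entities_to_labels(response, min_salience=0.1):
    """Return entity names with salience above a threshold, best first."""
    entities = sorted(response["entities"],
                      key=lambda e: e["salience"], reverse=True)
    return [e["name"] for e in entities if e["salience"] >= min_salience]

# Illustrative response for a blog post about a hiking trip.
fake_response = {
    "entities": [
        {"name": "Yosemite", "type": "LOCATION", "salience": 0.55},
        {"name": "hiking", "type": "OTHER", "salience": 0.30},
        {"name": "backpack", "type": "CONSUMER_GOOD", "salience": 0.05},
    ]
}

print(entities_to_labels(fake_response))  # ['Yosemite', 'hiking']
```

Sentiment analysis, by contrast, returns score/magnitude numbers describing attitude, which is why option B gives nothing usable as subject labels.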

Comment 1.1

ID: 766098 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 22:12 Selected Answer: - Upvotes: 1

https://cloud.google.com/natural-language/docs/analyzing-entities
https://cloud.google.com/natural-language/docs/analyzing-sentiment

Comment 2

ID: 1366581 User: Ratikl Badges: Most Recent Relative Date: 1 year ago Absolute Date: Sat 08 Mar 2025 14:46 Selected Answer: A Upvotes: 1

Call the Cloud Natural Language API from your application. Process the generated Entities Analysis as labels.

The Cloud Natural Language API is a pre-trained machine learning model that can be used for natural language processing tasks such as entity recognition, sentiment analysis, and syntax analysis. The API can be called from your application using a simple API call, and it can generate entities analysis that can be used as labels for the user's blog posts. This would be the quickest and easiest option for your team since it would not require any machine learning expertise or additional developer resources to build and train a model. Additionally, it will give you accurate and up-to-date results as the API is constantly updated by Google.

Comment 3

ID: 1021787 User: Az900Exam2021 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 30 Sep 2024 20:03 Selected Answer: - Upvotes: 3

For the first time, the answer in exam topics matches community vote :-).

Comment 4

ID: 906417 User: AmmarFasih Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Sat 25 May 2024 07:50 Selected Answer: A Upvotes: 2

Of course the answer is A, since the problem already states that you don't have time, resources, or expertise. So the best solution in this case is to utilize the available API. Also, since we need to extract labels and not the sentiment of the text, we'll go for option A and not B.

Comment 5

ID: 785802 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 23 Jan 2024 21:24 Selected Answer: - Upvotes: 1

A. Call the Cloud Natural Language API from your application. Process the generated Entities Analysis as labels.

The Cloud Natural Language API is a pre-trained machine learning model that can be used for natural language processing tasks such as entity recognition, sentiment analysis, and syntax analysis. The API can be called from your application using a simple API call, and it can generate entities analysis that can be used as labels for the user's blog posts. This would be the quickest and easiest option for your team since it would not require any machine learning expertise or additional developer resources to build and train a model. Additionally, it will give you accurate and up-to-date results as the API is constantly updated by Google.

Comment 6

ID: 766096 User: AzureDP900 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 04 Jan 2024 22:09 Selected Answer: - Upvotes: 1

Answer is A
Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.

Entity analysis -> Identify entities within documents such as receipts, invoices, and contracts, and label them by type: date, person, contact information, organization, location, events, products, and media.

Sentiment analysis -> Understand the overall opinion, feeling, or attitude sentiment expressed in a block of text.

Comment 7

ID: 713583 User: NicolasN Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Wed 08 Nov 2023 09:00 Selected Answer: - Upvotes: 2

Apparently, there is unanimity on answer [A]
What if there was another available answer in an actual exam?

E. Call the Cloud Natural Language API from your application. Process the generated Content Classification as labels

What would you choose, A or E?

My opinion is that Content Classification is more suitable for detecting subject.

Comment 8

ID: 632243 User: Remi2021 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 16 Jul 2023 17:33 Selected Answer: - Upvotes: 1

A is the right one. The doc says:
Entity analysis inspects the given text for known entities (Proper nouns such as public figures, landmarks, and so on. Common nouns such as restaurant, stadium, and so on.) and returns information about those entities. Entity analysis is performed with the analyzeEntities method.

Comment 9

ID: 599512 User: waterh2oeau Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Wed 10 May 2023 12:09 Selected Answer: A Upvotes: 1

Vote for A

Comment 10

ID: 544613 User: bury Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 10 Feb 2023 15:26 Selected Answer: A Upvotes: 1

a is correct

Comment 11

ID: 473249 User: JayZeeLee Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Sun 06 Nov 2022 00:26 Selected Answer: - Upvotes: 1

A.
CD don't work as it requires Machine Learning experience.
B - Sentiment Analysis is to analyze attitude, opinion, etc. So A.

Comment 12

ID: 393701 User: sumanshu Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Wed 29 Jun 2022 12:51 Selected Answer: - Upvotes: 3

Vote for A

Comment 13

ID: 161835 User: haroldbenites Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Fri 20 Aug 2021 01:01 Selected Answer: - Upvotes: 4

A is correct

10. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 113

Sequence
142
Discussion ID
16856
Source URL
https://www.examtopics.com/discussions/google/view/16856-exam-professional-data-engineer-topic-1-question-113/
Posted By
rickywck
Posted At
March 17, 2020, 12:12 p.m.

Question

You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solutions should you choose?

  • A. Speech-to-Text API
  • B. Cloud Natural Language API
  • C. Dialogflow Enterprise Edition
  • D. AutoML Natural Language

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 22 comments Click to expand

Comment 1

ID: 65159 User: rickywck Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Tue 17 Mar 2020 12:12 Selected Answer: - Upvotes: 26

should be C, since we need to recognize both voice and intent

Comment 1.1

ID: 762283 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 30 Dec 2022 20:59 Selected Answer: - Upvotes: 1

C. Dialogflow Enterprise Edition

Comment 2

ID: 216388 User: Alasmindas Badges: Highly Voted Relative Date: 5 years, 4 months ago Absolute Date: Tue 10 Nov 2020 07:05 Selected Answer: - Upvotes: 19

Option A - Cloud Speech-to-Text API.
The question is just asking to "interpret customer voice commands"; it does not mention anything related to sentiment analysis, so NLP is not required. Dialogflow is more of a chatbot service, typically suited for a "service desk" kind of setup where clients call a centralized helpdesk and automation is achieved through chatbot services like Google Dialogflow.

Comment 2.1

ID: 415375 User: hdmi_switch Badges: - Relative Date: 4 years, 7 months ago Absolute Date: Tue 27 Jul 2021 12:52 Selected Answer: - Upvotes: 8

Cloud Speech-to-Text API just converts speech to text. You will have text files as an output and then the requirement is to "interpret customer voice commands and issue an order to the backend systems". This is not achieved by having text files.

I would go with option C, since Dialogflow can interpret the commands (intents) and integrates other applications e.g. backend systems.

Comment 2.2

ID: 1013765 User: exnaniantwort Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 22 Sep 2023 09:36 Selected Answer: - Upvotes: 1

Should be C; the key is "interpret customer voice commands".

Comment 2.3

ID: 663024 User: HarshKothari21 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 08 Sep 2022 03:20 Selected Answer: - Upvotes: 2

The question also says "in-home assistants, such as Google Home"; the idea here is to provide assistance, which involves dialog.

I would go with option C

Comment 3

ID: 1342788 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sat 18 Jan 2025 23:43 Selected Answer: C Upvotes: 1

While both are Google Cloud services related to speech processing, the key difference is that Cloud Speech-to-Text API is solely focused on transcribing spoken language into text, while DialogFlow is a more comprehensive platform that not only transcribes speech but also interprets the meaning of the conversation, allowing you to build conversational AI applications with features like intent recognition and entity extraction

Comment 4

ID: 1289153 User: LR2023 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Wed 25 Sep 2024 21:55 Selected Answer: - Upvotes: 1

This is an actual exam question from Google's Professional Data Engineer, so C makes sense. This is not an AWS question.

Comment 5

ID: 1224380 User: AlizCert Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 04 Jun 2024 23:40 Selected Answer: A Upvotes: 1

The question clearly states "voice commands" which is a term for short (few words long at most) well-defined phrases to be recognized. No need for a dialog.
Even if I were to use Dialogflow, I would use ES instead of CX (new name for Enterprise Edition), no fancy features are required for this.

Comment 6

ID: 973028 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 14:49 Selected Answer: - Upvotes: 1

Ans: C. The main thing is that the question says "customer voice commands"; there is no need for sentiment analysis of the language, so that's why.

C. Dialogflow Enterprise Edition

Dialogflow is a powerful natural language understanding platform developed by Google. It allows you to build conversational interfaces, interpret user voice commands, and integrate with various platforms and devices like Google Home. The "Enterprise Edition" provides additional features and support for more complex use cases, making it a good choice for a retailer looking to integrate with in-home assistants and handle customer voice commands effectively.
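To make the intent-recognition point concrete, here is a toy intent matcher. It only illustrates the concept of intents and extracted parameters; real Dialogflow agents are configured with training phrases and called via `detectIntent`, and the intent names and regex patterns below are invented for this sketch.

```python
# Toy illustration of what Dialogflow's intent recognition does: match a
# transcribed voice command against patterns and extract parameters.
# This is a conceptual sketch, not the Dialogflow API.
import re

INTENTS = {
    "order.place": re.compile(r"\border (?P<qty>\d+) (?P<item>[\w ]+)"),
    "order.status": re.compile(r"\bwhere is my order\b"),
}

def detect_intent(utterance):
    """Return (intent_name, parameters) for the first matching intent."""
    for name, pattern in INTENTS.items():
        m = pattern.search(utterance.lower())
        if m:
            return name, m.groupdict()
    return "fallback", {}

intent, params = detect_intent("Please order 2 large pizzas")
print(intent, params)  # order.place {'qty': '2', 'item': 'large pizzas'}
```

Speech-to-Text (option A) would only produce the transcript string; the interpretation step, getting from "order 2 large pizzas" to a structured order for the backend, is what Dialogflow adds.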

Comment 7

ID: 843735 User: juliobs Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sun 19 Mar 2023 13:25 Selected Answer: C Upvotes: 3

Answer is C. However Google Assistant Conversational Actions will be sunsetted on June 13, 2023.

Comment 8

ID: 808974 User: techtitan Badges: - Relative Date: 3 years ago Absolute Date: Wed 15 Feb 2023 01:46 Selected Answer: C Upvotes: 2

https://cloud.google.com/dialogflow/es/docs/integrations/aog

Comment 9

ID: 781640 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 19 Jan 2023 22:44 Selected Answer: - Upvotes: 1

I think the answer is A: Speech to Text.
You want to interpret what a user says... Dialogflow is text-to-speech, not what the question asked for...

Thoughts?

Comment 10

ID: 758146 User: PrashantGupta1616 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Tue 27 Dec 2022 06:05 Selected Answer: A Upvotes: 1

The question is just asking to " interpret customer voice commands" so A is out of the box solution

Comment 11

ID: 738255 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 19:53 Selected Answer: A Upvotes: 1

Enable voice control

Implement voice commands such as “turn the volume up,” and voice search such as saying “what is the temperature in Paris?” Combine this with the Text-to-Speech API to deliver voice-enabled experiences in IoT (Internet of Things) applications.
https://cloud.google.com/speech-to-text#section-9

Comment 11.1

ID: 750100 User: odacir Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Mon 19 Dec 2022 18:54 Selected Answer: - Upvotes: 5

I changed my mind; it's C.
https://cloud.google.com/blog/products/gcp/introducing-dialogflow-enterprise-edition-a-new-way-to-build-voice-and-text-conversational-apps

Comment 12

ID: 668642 User: TNT87 Badges: - Relative Date: 3 years, 5 months ago Absolute Date: Wed 14 Sep 2022 07:55 Selected Answer: - Upvotes: 2

https://cloud.google.com/blog/products/gcp/introducing-dialogflow-enterprise-edition-a-new-way-to-build-voice-and-text-conversational-apps

Dialogflow is the answer

Comment 13

ID: 633401 User: Smaks Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Tue 19 Jul 2022 08:39 Selected Answer: C Upvotes: 3

Dialogflow provides a seamless integration with Google Assistant. This integration has the following advantages: You can use the same Dialogflow agent to power Google Assistant and other integrations. Dialogflow agents provide Google Cloud enterprise-grade security, privacy, support, and SLAs

Comment 14

ID: 589986 User: Vip777 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Fri 22 Apr 2022 14:43 Selected Answer: - Upvotes: 1

dialog

Comment 15

ID: 589965 User: Vip777 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Fri 22 Apr 2022 14:06 Selected Answer: - Upvotes: 1

speech

Comment 16

ID: 574983 User: PJG_worm Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Fri 25 Mar 2022 12:59 Selected Answer: - Upvotes: 1

It should be D. INTERPRET customer voice commands and issue an order to the backend systems. Option C is usually applied for conversation. But in this case, it is not a conversation.

Comment 17

ID: 518524 User: medeis_jar Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Thu 06 Jan 2022 20:30 Selected Answer: C Upvotes: 4

recognize voice and intent
https://cloud.google.com/blog/products/gcp/introducing-dialogflow-enterprise-edition-a-new-way-to-build-voice-and-text-conversational-apps

11. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 126

Sequence
145
Discussion ID
17235
Source URL
https://www.examtopics.com/discussions/google/view/17235-exam-professional-data-engineer-topic-1-question-126/
Posted By
-
Posted At
March 22, 2020, 10:52 a.m.

Question

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?

  • A. Use Cloud Vision AutoML with the existing dataset.
  • B. Use Cloud Vision AutoML, but reduce your dataset twice.
  • C. Use Cloud Vision API by providing custom labels as recognition hints.
  • D. Train your own image recognition model leveraging transfer learning techniques.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 23 comments Click to expand

Comment 1

ID: 114675 User: Callumr Badges: Highly Voted Relative Date: 5 years, 2 months ago Absolute Date: Sun 20 Dec 2020 13:01 Selected Answer: - Upvotes: 54

B - You only need a PoC and it has to be done quickly.

Comment 2

ID: 801628 User: techtitan Badges: Highly Voted Relative Date: 2 years, 7 months ago Absolute Date: Tue 08 Aug 2023 03:39 Selected Answer: - Upvotes: 8

A - https://cloud.google.com/vertex-ai/docs/beginner/beginners-guide Target at least 1000 examples per target

Comment 2.1

ID: 801629 User: techtitan Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Tue 08 Aug 2023 03:41 Selected Answer: - Upvotes: 1

The quick PoC part can be achieved by using AutoML instead of creating and training your own model.

Comment 3

ID: 1342801 User: grshankar9 Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Sun 19 Jan 2025 00:47 Selected Answer: A Upvotes: 1

The key difference between Google Cloud Vision AutoML and Cloud Vision API is that Cloud Vision API provides pre-trained models for basic image analysis tasks like object detection and labeling, while Cloud Vision AutoML allows you to train custom machine learning models to identify specific objects or concepts within images that are unique to your dataset, requiring you to provide labeled training data.

Comment 4

ID: 1218563 User: josech Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Mon 25 Nov 2024 23:43 Selected Answer: A Upvotes: 4

AutoML Vision has been deprecated since March 31, 2024; the question would now refer to Vertex AI AutoML. As a best practice, the recommended dataset size for each label is 1000. So, with an updated question, the answer would be A.

Comment 5

ID: 1189548 User: CGS22 Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Fri 04 Oct 2024 23:16 Selected Answer: A Upvotes: 1

A. Use Cloud Vision AutoML with the existing dataset.

Here's why this is the most suitable option:

Speed and Ease: AutoML simplifies model building. You simply upload your labeled images, and AutoML takes care of model selection, training, and evaluation.
Existing Dataset Sufficiency: Your dataset (750 components x 1000 images each) is a decent starting point for AutoML, allowing you to quickly test its effectiveness.
Minimal Custom Development: AutoML's out-of-the-box deployment options let you integrate the model into your app without extensive coding.

Comment 6

ID: 1003088 User: saado9 Badges: - Relative Date: 2 years ago Absolute Date: Sat 09 Mar 2024 13:18 Selected Answer: B Upvotes: 1

Option B is the fastest way to train a model that can be used to recognize the 750 different components.

Comment 7

ID: 811877 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 17 Aug 2023 12:57 Selected Answer: - Upvotes: 2

What's wrong with C? It's fast and cheap, and adding your 750 labels is not big work.
AutoML is good for training on big datasets but costly compared to the APIs.

Comment 7.1

ID: 963313 User: knith66 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Fri 26 Jan 2024 06:32 Selected Answer: - Upvotes: 1

It is a labeled dataset, so why would you need to label it again? So no C.

Comment 7.2

ID: 911775 User: forepick Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Fri 01 Dec 2023 08:56 Selected Answer: - Upvotes: 4

Adding custom labels to Vision API is done by training an AutoML model! That's the formal recommendation. And you don't need a big dataset for AutoML as it uses transfer learning.

Comment 8

ID: 738826 User: odacir Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 08 Jun 2023 09:29 Selected Answer: A Upvotes: 8

At first I thought of the Vision API, but that is a pre-trained model and will not recognize my labels; because you have 1000 samples per item, AutoML is perfect. B cannot be right, because it makes no sense to reduce your dataset if you already have the recommended number of examples.
https://cloud.google.com/vision/automl/docs/beginners-guide#include_enough_labeled_examples_in_each_category
The bare minimum required by AutoML Vision training is 100 image examples per category/label. The likelihood of successfully recognizing a label goes up with the number of high quality examples for each; in general, the more labeled data you can bring to the training process, the better your model will be. Target at least 1000 examples per label.

Comment 8.1

ID: 762463 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 30 Jun 2023 03:49 Selected Answer: - Upvotes: 2

A is correct

Comment 9

ID: 721388 User: gudiking Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 18 May 2023 15:59 Selected Answer: - Upvotes: 1

A - https://cloud.google.com/vision/automl/docs/beginners-guide#include_enough_labeled_examples_in_each_category

Comment 10

ID: 717339 User: MarielaYBird Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sat 13 May 2023 13:54 Selected Answer: B Upvotes: 5

Based on this:
"As a rule of thumb, we recommend to have at least 100 training samples per class if you have distinctive and few classes, and more than 200 training samples if the classes are more nuanced and you have more than 50 different classes"

750 different components = more than 50 different classes. That means we need more than 200 training samples per class. If we use 250 training samples out of the 1000 and multiply by 750 classes, we get a total of 187,500, which is the equivalent of halving the dataset twice.

https://cloud.google.com/vision/automl/object-detection/docs/prepare#how_big_does_the_dataset_need_to_be
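The per-label arithmetic debated in this thread can be checked directly. The quick computation below uses the 750-label / 1000-example figures from the question plus the 100-minimum and 1000-target guidance quoted from the AutoML docs elsewhere in this discussion.

```python
# Check the dataset-size arithmetic: 750 labels with 1000 examples each,
# against AutoML Vision's documented 100-minimum / 1000-target per label.
labels = 750
examples_per_label = 1000

total = labels * examples_per_label
print(total)  # 750000 images in the full dataset

# Halving twice (1000 -> 500 -> 250) still clears the 100-image minimum:
reduced = examples_per_label // 2 // 2
print(reduced, reduced * labels)  # 250 per label, 187500 total

assert reduced >= 100        # above the documented bare minimum
assert examples_per_label >= 1000  # full dataset already hits the target
```

So both A and B are feasible on the numbers; the disagreement in the thread is whether hitting the 1000-example target (A) or shrinking training time for a PoC (B) matters more.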

Comment 11

ID: 697420 User: josrojgra Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 17 Apr 2023 15:15 Selected Answer: A Upvotes: 3

I choose A because the Vertex AI documentation (https://cloud.google.com/vertex-ai/docs/image-data/classification/prepare-data), in its best practices for preparing image data, recommends this: "We recommend about 1000 training images per label. The minimum per label is 10. In general, it takes more examples per label to train models with multiple labels per image, and resulting scores are harder to interpret."

I know this is a PoC, but if you do it without enough accuracy, you may discard the solution because it doesn't fit your requirements. So it is better to do it with enough data to be sure whether the model is accurate enough; otherwise poor accuracy might be blamed on the quality of the data rather than the amount of it.

Comment 12

ID: 683920 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 01 Apr 2023 03:47 Selected Answer: A Upvotes: 3

https://cloud.google.com/vision/automl/docs/beginners-guide#include_enough_labeled_examples_in_each_category

The bare minimum required by AutoML Vision training is 100 image examples per category/label. The likelihood of successfully recognizing a label goes up with the number of high quality examples for each; in general, the more labeled data you can bring to the training process, the better your model will be. Target at least 1000 examples per label.
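A quick way to apply the guidance quoted above is to count examples per label before training. A minimal sketch; the label names and counts are made up for illustration:

```python
# Check a labeled image dataset against the AutoML Vision guidance quoted
# above: a hard minimum of 100 examples per label, 1000 per label as the
# recommended target. Labels and counts here are hypothetical.
from collections import Counter

MIN_PER_LABEL = 100
TARGET_PER_LABEL = 1000

labels = ["cat"] * 1200 + ["dog"] * 450 + ["bird"] * 80  # made-up counts
counts = Counter(labels)

for label, n in sorted(counts.items()):
    if n < MIN_PER_LABEL:
        status = "BELOW MINIMUM"
    elif n < TARGET_PER_LABEL:
        status = "ok, but below recommended target"
    else:
        status = "meets target"
    print(f"{label}: {n} examples ({status})")
```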

Comment 12.1

ID: 683921 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 01 Apr 2023 03:48 Selected Answer: - Upvotes: 1

The more labeled examples, the more accurate the result.

Comment 13

ID: 661901 User: changsu Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 06:07 Selected Answer: B Upvotes: 1

750 * 1000 is a lot.

Comment 14

ID: 653721 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Tue 28 Feb 2023 04:04 Selected Answer: A Upvotes: 1

It is labeled, so A is correct

Comment 15

ID: 649455 User: civilizador Badges: - Relative Date: 3 years ago Absolute Date: Mon 20 Feb 2023 16:47 Selected Answer: - Upvotes: 5

It's A.
https://cloud.google.com/vision/automl/docs/beginners-guide#data_preparation

The bare minimum required by AutoML Vision training is 100 image examples per category/label. The likelihood of successfully recognizing a label goes up with the number of high quality examples for each; in general, the more labeled data you can bring to the training process, the better your model will be. Target at least 1000 examples per label.

Comment 15.1

ID: 649456 User: civilizador Badges: - Relative Date: 3 years ago Absolute Date: Mon 20 Feb 2023 16:48 Selected Answer: - Upvotes: 1

So even for a PoC it is better to use 1000. There would be no significant time difference anyway between 500 and 1000.

Comment 16

ID: 641304 User: TheRealBsh Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Thu 02 Feb 2023 17:38 Selected Answer: - Upvotes: 3

Option A & B are quite close. Refer: https://cloud.google.com/vision/automl/docs/beginners-guide#data_preparation – Says to target at least 1000 images per label for training.

Comment 17

ID: 609481 User: czokwe Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 30 Nov 2022 02:31 Selected Answer: B Upvotes: 1

B
Can't choose A, because the model needs to pass through the dataset several times for a proof of concept; not all of the existing samples might be seen within several working days, causing over-generalization.

12. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 27

Sequence
150
Discussion ID
16969
Source URL
https://www.examtopics.com/discussions/google/view/16969-exam-professional-data-engineer-topic-1-question-27/
Posted By
-
Posted At
March 19, 2020, 10:54 a.m.

Question

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

  • A. Eliminate features that are highly correlated to the output labels.
  • B. Combine highly co-dependent features into one representative feature.
  • C. Instead of feeding in each feature individually, average their values in batches of 3.
  • D. Remove the features that have null values for more than 50% of the training records.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 20 comments Click to expand

Comment 1

ID: 461151 User: anji007 Badges: Highly Voted Relative Date: 3 years, 11 months ago Absolute Date: Tue 12 Apr 2022 18:02 Selected Answer: - Upvotes: 9

Ans: B
A: correlated with the output means that feature can contribute a lot to the model, so removing it is not a good idea.
C: you still process almost the same amount of data, but you iterate twice: once to average and a second time to feed in the averaged values.
D: removing a feature even if it is 50% nulls is not a good idea unless you prove it is not correlated with the output at all, and nothing here proves that, so we can't remove it.

Comment 2

ID: 544536 User: pamepadero Badges: Highly Voted Relative Date: 3 years, 7 months ago Absolute Date: Wed 10 Aug 2022 12:53 Selected Answer: - Upvotes: 7

Trying to find a reason why it is B and not D, found this and it seems the answer is D.
https://cloud.google.com/architecture/data-preprocessing-for-ml-with-tf-transform-pt1
Feature selection. Selecting a subset of the input features for training the model, and ignoring the irrelevant or redundant ones, using filter or wrapper methods. This can also involve simply dropping features if the features are missing a large number of values.

Comment 2.1

ID: 606417 User: Dayashankar_H_A Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 24 Nov 2022 05:12 Selected Answer: - Upvotes: 2

Yes. But nearly 50% of the non-null data still seems to be a lot to ignore.

Comment 3

ID: 1340408 User: sofiane_kihal Badges: Most Recent Relative Date: 1 year, 1 month ago Absolute Date: Tue 14 Jan 2025 16:27 Selected Answer: B Upvotes: 1

I think the best option is B.
D could be an option, but what if the feature is highly correlated with the result?

Comment 4

ID: 1212699 User: mark1223jkh Badges: - Relative Date: 1 year, 3 months ago Absolute Date: Sun 17 Nov 2024 07:43 Selected Answer: - Upvotes: 1

B: combine the dependent features. It is more like PCA (principal component analysis).
D: could be the answer, but what if that feature is very important, or how often do you get a feature with more than 50% NULL values?

Comment 5

ID: 1076366 User: axantroff Badges: - Relative Date: 1 year, 9 months ago Absolute Date: Tue 21 May 2024 14:25 Selected Answer: B Upvotes: 2

I am not into ML, to be honest, so I will rely on community opinion and choose B

Comment 6

ID: 1050535 User: rtcpost Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 22 Apr 2024 14:09 Selected Answer: B Upvotes: 6

B. Combine highly co-dependent features into one representative feature.

Combining highly correlated features into a single representative feature can reduce the dimensionality of your dataset, making the training process faster while preserving relevant information. This approach often helps eliminate redundancy in the input data.

Option A (eliminating features that are highly correlated to the output labels) can be counterproductive, as you want to maintain features that are informative for your prediction task. Removing features that are correlated with the output may reduce model accuracy.

Option C (averaging feature values in batches of 3) is not a common technique for reducing dimensionality, and it could lead to loss of important information.

Option D (removing features with null values for more than 50% of training records) can help reduce the dimensionality and may be useful if you have a large number of features with missing data, but it may not necessarily address co-dependency among features.

Comment 7

ID: 1008767 User: suku2 Badges: - Relative Date: 1 year, 12 months ago Absolute Date: Sat 16 Mar 2024 03:11 Selected Answer: B Upvotes: 1

B. Combine highly co-dependent features into one representative feature.
This is the best choice.

Comment 8

ID: 892818 User: WillemHendr Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 09:05 Selected Answer: - Upvotes: 1

"D" is wrong, and very dangerous. For instance, it might represent modern measurements only installed in <50% of weather stations, but very very precise and valuable.

Nulls are not a problem for models, out-of-the-box or with transformations models can handle nulls just fine.

Comment 9

ID: 816228 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 05:27 Selected Answer: D Upvotes: 1

wrong question. there are two answers B, D

Comment 9.1

ID: 816236 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 05:38 Selected Answer: - Upvotes: 2

B. Combine highly co-dependent features into one representative feature.
-> Explanatory features should be independent of each other, especially outside deep learning, so in this case they are normally eliminated or combined.

D. Remove the features that have null values for more than 50% of the training records.
-> that is a very large share of null data in the feature; normally such a feature is removed because it is too hard to fill in replacement values.

Comment 10

ID: 766082 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Tue 04 Jul 2023 20:57 Selected Answer: - Upvotes: 1

Answer is Combine highly co-dependent features into one representative feature.

A: correlated to output means that feature can contribute a lot to the model. so not a good idea.
C: you still process almost the same amount of data, but you iterate twice: once to average and a second time to feed in the averaged values.
D: removing a feature even if it is 50% nulls is not a good idea unless you prove it is not correlated with the output at all, and nothing here proves that, so we can't remove it.

Comment 10.1

ID: 816220 User: jin0 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 21 Aug 2023 05:17 Selected Answer: - Upvotes: 2

But if more than 50% of the values are null, the feature should be eliminated, because there are only two ways to handle it when training the model: first, remove the records containing nulls, but then too many records would be removed; second, replace the nulls with other data, but with this much missing data that is practically impossible. So a feature with this many nulls is normally removed, and I think there are two answers to this question: B and D.

Comment 11

ID: 723259 User: Thasni Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Sun 21 May 2023 08:00 Selected Answer: - Upvotes: 2

I have a doubt: instead of combining highly correlated features, why can't we remove the correlated features, which would give a much simpler dataset?

Comment 12

ID: 616947 User: noob_master Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 23:21 Selected Answer: B Upvotes: 2

Answer: B

Data that is co-dependent is highly correlated and is, in some sense, redundant information. If, for example, the features x1, x2 and x3 satisfy x2 = x1 + 1 and x3 = 2*x1, then x2 and x3 are redundant because they can be explained by the x1 feature, so they can be excluded from the model. Another option is to group these features. There are many ways to do this, but the main idea is to apply feature engineering to co-dependent features to reduce the number of features in the model.
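The redundant-feature idea in the comment above can be made concrete with a correlation check; the generated data and the 0.95 cutoff are illustrative choices, not a prescribed method:

```python
# Sketch: x2 and x3 are exact functions of x1, so a pairwise correlation
# check flags them as redundant; they can be dropped (or combined, e.g.
# via PCA) without losing information. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 1          # perfectly co-dependent with x1
x3 = 2 * x1          # perfectly co-dependent with x1
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)
# Off-diagonal correlations with x1 are 1.0: keep only the first feature.
redundant = [j for j in range(1, X.shape[1]) if abs(corr[0, j]) > 0.95]
print(redundant)  # [1, 2]
```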

Comment 13

ID: 612754 User: Ishiske Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Wed 07 Dec 2022 15:37 Selected Answer: B Upvotes: 1

This method is called feature engineering: you combine two or more values into one custom feature, which keeps the model from reading an extra column during training and will probably increase its accuracy.

Comment 14

ID: 601150 User: Yad_datatonic Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Sun 13 Nov 2022 15:23 Selected Answer: - Upvotes: 1

Answer: B

Comment 15

ID: 588599 User: alecuba16 Badges: - Relative Date: 3 years, 4 months ago Absolute Date: Thu 20 Oct 2022 13:31 Selected Answer: B Upvotes: 2

Co-dependent -> correlated -> correlated info = already present info in other variable.

Comment 16

ID: 529764 User: exnaniantwort Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Fri 22 Jul 2022 10:56 Selected Answer: B Upvotes: 4

B
Null values can have many meanings and need different approaches to handle; otherwise they cause an inaccurate model, so not D.

Comment 17

ID: 525204 User: ZIMARAKI Badges: - Relative Date: 3 years, 7 months ago Absolute Date: Sat 16 Jul 2022 19:24 Selected Answer: B Upvotes: 1

For me the best option is B

13. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 128

Sequence
153
Discussion ID
17238
Source URL
https://www.examtopics.com/discussions/google/view/17238-exam-professional-data-engineer-topic-1-question-128/
Posted By
-
Posted At
March 22, 2020, 11:11 a.m.

Question

You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

  • A. Increase the share of the test sample in the train-test split.
  • B. Try to collect more data and increase the size of your dataset.
  • C. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
  • D. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 29 comments Click to expand

Comment 1

ID: 114761 User: Callumr Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Sat 20 Jun 2020 14:12 Selected Answer: - Upvotes: 72

This is a case of underfitting - not overfitting (for over fitting the model will have extremely low training error but a high testing error) - so we need to make the model more complex - answer is D

Comment 1.1

ID: 455853 User: hellofrnds Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Sat 02 Oct 2021 06:45 Selected Answer: - Upvotes: 4

@callumr , "root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set." clearly means testing error is twice of training error. So, it is clearly overfitting. Isn't it?

Comment 1.1.1

ID: 460260 User: hellofrnds Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Mon 11 Oct 2021 01:52 Selected Answer: - Upvotes: 1

So, answer should be C

Comment 1.1.1.1

ID: 584308 User: tavva_prudhvi Badges: - Relative Date: 3 years, 11 months ago Absolute Date: Mon 11 Apr 2022 18:24 Selected Answer: - Upvotes: 2

If your training RMSE = 0.2 and testing RMSE = 0.4, and we want the RMSE to be low since it is the error, is that overfitting or underfitting? Think wisely!

Comment 1.1.1.1.1

ID: 648517 User: alecuba16 Badges: - Relative Date: 3 years, 6 months ago Absolute Date: Thu 18 Aug 2022 18:18 Selected Answer: - Upvotes: 1

It's overfitting.

Overfitting->low rmse in train / high accuracy-f1 score in train for classification.

Underfitting -> high rmse / low f1score or accuracy in train, you don't have to look into test set if there is an underfitting problem.

Comment 1.1.1.1.1.1

ID: 929114 User: jfab Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Wed 21 Jun 2023 08:24 Selected Answer: - Upvotes: 1

But the question clearly states we have higher RMSE on the train than the test. So how would it be overfitting?

Comment 1.1.2

ID: 477451 User: velliger Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sat 13 Nov 2021 13:53 Selected Answer: - Upvotes: 1

High rmse: The model is underfitting the train data. To reduce overfitting, we increase the number of layers in the model or we change the type of layer.

Comment 1.1.2.1

ID: 477452 User: velliger Badges: - Relative Date: 4 years, 3 months ago Absolute Date: Sat 13 Nov 2021 13:54 Selected Answer: - Upvotes: 2

*underfitting

Comment 1.1.3

ID: 738853 User: odacir Badges: - Relative Date: 3 years, 3 months ago Absolute Date: Thu 08 Dec 2022 10:47 Selected Answer: - Upvotes: 3

No, it's underfitting.

Comment 1.2

ID: 973119 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 17:11 Selected Answer: - Upvotes: 1

Based on the given information, this scenario indicates a case of overfitting.

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data (test data). In this case, the root-mean-squared error (RMSE) of the model is twice as high on the train set (the data used for training) compared to the test set (the data used for evaluation). This suggests that the model is fitting the training data too closely and is not generalizing well to new, unseen data.

Comment 1.2.1

ID: 1012032 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Wed 20 Sep 2023 10:22 Selected Answer: - Upvotes: 2

Wrong! This scenario indicates a case of underfitting. The RSME is twice as high on the training dataset compared to the test dataset, so the model is underfitting.

Comment 2

ID: 116075 User: AJKumar Badges: Highly Voted Relative Date: 5 years, 8 months ago Absolute Date: Mon 22 Jun 2020 08:18 Selected Answer: - Upvotes: 11

A small RMS error on train means overfitting (it fits too well), so simplify the model by dropping features.
A big RMS error on train means underfitting (not a good fit), so increase complexity by adding layers/features. Answer D.

Comment 3

ID: 1335644 User: samtestking Badges: Most Recent Relative Date: 1 year, 2 months ago Absolute Date: Thu 02 Jan 2025 16:46 Selected Answer: B Upvotes: 1

Could be B (data requirement for task is vague), but let's assume 100 million data points is enough and rule that out.

Indication of overfitting is significantly better performance on training data compared to unseen data. Here we are told that the unseen data is performing significantly better which is the opposite of what we should see if it were overfitting. Rule out C.

Symptoms of model underfitting is poor performance in BOTH training AND unseen data. While underfitting might be the issue, the more pressing concern is that the test set is clearly not representative of the overall data and could be skewed. This is further supported by the 90/10 split (academic/industry standard is 80/20 or 75/25 based on the Pareto principle: https://en.wikipedia.org/wiki/Pareto_principle). A 90/10 split would be useful if we were doing k-fold cross validation (https://machinelearningmastery.com/k-fold-cross-validation/), however there is no indication of such in the prompt.

Note: The question does not explicitly say that the model is performing poorly/errors are significantly bad, just that the error is twice as high in the training set (they could both have low error values).
So whilst it could be a case of underfitting (D), the first step taken should be addressing the obviously problematic data representation by adjusting the train-test split (option A).

Comment 4

ID: 1302551 User: SamuelTsch Badges: - Relative Date: 1 year, 4 months ago Absolute Date: Thu 24 Oct 2024 19:17 Selected Answer: D Upvotes: 2

It is an underfitting problem, which means the model used is too simple.

Comment 5

ID: 1289423 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Thu 26 Sep 2024 14:09 Selected Answer: - Upvotes: 1

This is A. The key is that 90/10 is a weirdly small test set, that stood out to me straight away (I work professionally as a machine learning engineer and have the cert). Next tip, that everyone seems to be ignoring - this is not underfit OR overfit. The model outperforms on the TEST set, this is not a miswording. Test scores higher than train. The time you might expect to see this is if your test set is too small to be a representative sample, leading to unrepresentative results. Seeing as the question already set up this conclusion with the 90/10 thing, it's definitely A. None of the others (or indeed anything else) can address Test outperforming Train, and the conclusion of others below that this is due to a poorly worded question is a bizarre conclusion.

Comment 6

ID: 1151063 User: cuadradobertolinisebastiancami Badges: - Relative Date: 2 years ago Absolute Date: Thu 15 Feb 2024 15:49 Selected Answer: D Upvotes: 2

Underfitting scenario

Comment 7

ID: 1120858 User: Sofiia98 Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 12 Jan 2024 15:49 Selected Answer: D Upvotes: 2

It is an underfitting situation - D

Comment 8

ID: 1083273 User: Kimich Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Wed 29 Nov 2023 09:30 Selected Answer: C Upvotes: 2

Should be C
C. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting:

This is a reasonable approach. Regularization techniques can help prevent overfitting, especially when the model shows a significantly higher error on the training set compared to the test set.
D. Increase the complexity of your model (e.g., introducing an additional layer or increasing the size of vocabularies or n-grams):

This could potentially exacerbate the overfitting issue. Increasing model complexity without addressing overfitting concerns may lead to poor generalization on new data.

Comment 8.1

ID: 1085691 User: Kimich Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 02 Dec 2023 05:10 Selected Answer: - Upvotes: 1

https://dooinnkim.medium.com/what-are-overfitting-and-underfitting-855d5952c0b6

Comment 9

ID: 1076541 User: hallo Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Tue 21 Nov 2023 19:27 Selected Answer: - Upvotes: 3

Are the questions in this relevant for the new exam or are these all now outdated?

Comment 10

ID: 1075856 User: pss111423 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 20 Nov 2023 22:33 Selected Answer: - Upvotes: 1

https://stats.stackexchange.com/questions/497050/how-big-a-difference-for-test-train-rmse-is-considered-as-overfit#:~:text=RMSE%20of%20test%20%3C%20RMSE%20of,is%20always%20overfit%20or%20underfit.
RMSE of test > RMSE of train => OVER FITTING of the data.
RMSE of test < RMSE of train => UNDER FITTING of the data.
so the answer is D
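The rule of thumb quoted above can be written down directly; the RMSE values below are hypothetical, chosen to mirror the question (train error twice the test error):

```python
# Toy diagnosis following the rule quoted above (hypothetical RMSE values).
def diagnose(train_rmse, test_rmse):
    if train_rmse > test_rmse:
        return "underfitting: increase model capacity (add layers/features)"
    if test_rmse > train_rmse:
        return "overfitting: regularize (dropout, batch normalization, ...)"
    return "balanced fit"

# Train RMSE twice the test RMSE, as in the question:
print(diagnose(train_rmse=0.4, test_rmse=0.2))
```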

Comment 11

ID: 1066414 User: steghe Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Thu 09 Nov 2023 13:45 Selected Answer: - Upvotes: 1

Underfitting models: In general High Train RMSE, High Test RMSE.
Overfitting models: In general Low Train RMSE, High Test RMSE.

https://daviddalpiaz.github.io/r4sl/regression-for-statistical-learning.html

Comment 12

ID: 1020904 User: ha1p Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 29 Sep 2023 17:33 Selected Answer: - Upvotes: 2

I passed the exam today. I am pretty sure it is overfitting. The answer must be C.

Comment 13

ID: 1014429 User: MULTITASKER Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Fri 22 Sep 2023 21:19 Selected Answer: D Upvotes: 3

RMSE is more on training. That means, model is not performing well on training dataset but performing well on testing dataset. This happens in the case of underfitting. So D.

Comment 14

ID: 1000173 User: pulse008 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Wed 06 Sep 2023 06:43 Selected Answer: - Upvotes: 1

chatGPT says option C

Comment 15

ID: 995170 User: stonefl Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Thu 31 Aug 2023 16:02 Selected Answer: D Upvotes: 2

"root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set" means the RMSE of the training set is two times the RMSE of the test set, which indicates training performance is not as good as test performance, hence underfitting, so D.

Comment 16

ID: 973117 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 17:10 Selected Answer: - Upvotes: 1

Based on the given information, this scenario indicates a case of overfitting.

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data (test data). In this case, the root-mean-squared error (RMSE) of the model is twice as high on the train set (the data used for training) compared to the test set (the data used for evaluation). This suggests that the model is fitting the training data too closely and is not generalizing well to new, unseen data.

So with the dropout method we can overcome the overfitting, so C is correct.

Comment 17

ID: 946529 User: MoeHaydar Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sat 08 Jul 2023 15:52 Selected Answer: D Upvotes: 2

underfitting

Comment 17.1

ID: 973121 User: NeoNitin Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 05 Aug 2023 17:11 Selected Answer: - Upvotes: 1

Based on the given information, this scenario indicates a case of overfitting.

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data (test data). In this case, the root-mean-squared error (RMSE) of the model is twice as high on the train set (the data used for training) compared to the test set (the data used for evaluation). This suggests that the model is fitting the training data too closely and is not generalizing well to new, unseen data.

14. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 203

Sequence
198
Discussion ID
89246
Source URL
https://www.examtopics.com/discussions/google/view/89246-exam-professional-data-engineer-topic-1-question-203/
Posted By
gudiking
Posted At
Nov. 29, 2022, 2:15 p.m.

Question

A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?

  • A. Change the VM type to n2-highmem-32.
  • B. Change the VM type to e2-standard-32.
  • C. Train the model using a VM with a GPU hardware accelerator.
  • D. Train the model using a VM with a TPU hardware accelerator.

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 9 comments Click to expand

Comment 1

ID: 748626 User: jkhong Badges: Highly Voted Relative Date: 2 years, 8 months ago Absolute Date: Sun 18 Jun 2023 05:48 Selected Answer: C Upvotes: 5

Cost effective - among the choices, it is cheaper to have a temporary accelerator instead of increasing our VM cost for an indefinite amount of time
D -> TPU accelerator cannot support custom operations

Comment 2

ID: 1103573 User: MaxNRG Badges: Highly Voted Relative Date: 1 year, 8 months ago Absolute Date: Sat 22 Jun 2024 17:32 Selected Answer: C Upvotes: 5

The best way to reduce the TensorFlow training time in a cost-effective manner is to use a VM with a GPU hardware accelerator. TensorFlow can take advantage of GPUs to significantly speed up training time for many models.

Specifically, option C is the best choice.

Changing the VM to another standard type like n2-highmem-32 or e2-standard-32 (options A and B) may provide some improvement, but likely not a significant speedup.

Using a TPU (option D) could speed up training, but TPUs are more costly than GPUs. For a cost-effective solution, GPU acceleration provides the best performance per dollar.

Since the model must run partially on CPUs, a VM instance with GPUs added will allow TensorFlow to offload appropriate operations to the GPUs while keeping CPU-specific operations on the CPU. This can provide a significant reduction in training time for many common TensorFlow models while keeping costs reasonable
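For reference, attaching a GPU when creating the training VM looks roughly like the following; the zone, machine type, accelerator model, and image family/project are placeholder assumptions, not recommendations from the question (note that most GPU models attach to N1 or accelerator-optimized machine types rather than N2):

```shell
# Illustrative sketch only -- every flag value here is a placeholder;
# check the current gcloud documentation before running.
gcloud compute instances create tf-train-gpu \
    --zone=us-central1-a \
    --machine-type=n1-standard-16 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE \
    --image-family=tf-latest-gpu \
    --image-project=deeplearning-platform-release
```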

Comment 3

ID: 1206845 User: wences Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 05 Nov 2024 13:48 Selected Answer: C Upvotes: 1

The key phrase is "run partially on a CPU": https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus points to a GPU for that case.

Comment 4

ID: 1064295 User: spicebits Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 06 May 2024 21:57 Selected Answer: C Upvotes: 4

https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus

Comment 5

ID: 763429 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 02 Jul 2023 01:12 Selected Answer: - Upvotes: 1

C. Train the model using a VM with a GPU hardware accelerator.

Comment 6

ID: 732031 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Tue 30 May 2023 22:05 Selected Answer: - Upvotes: 1

C
https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus:~:text=Models%20with%20a%20significant%20number%20of%20custom%20TensorFlow%20operations%20that%20must%20run%20at%20least%20partially%20on%20CPUs

Comment 6.1

ID: 746942 User: Atnafu Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 16 Jun 2023 08:08 Selected Answer: - Upvotes: 3

"The model has custom TensorFlow operations that must run partially on a CPU" is the key phrase pointing to GPU.

Comment 7

ID: 730424 User: gudiking Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Mon 29 May 2023 13:15 Selected Answer: C Upvotes: 1

I agree with C, for choosing a GPU one of the cases says:
"Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs"
https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus

Comment 7.1

ID: 732462 User: gudiking Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Thu 01 Jun 2023 09:49 Selected Answer: - Upvotes: 1

C is not cost-effective, so I stand corrected. I do not know the answer.

15. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 63

Sequence
217
Discussion ID
16745
Source URL
https://www.examtopics.com/discussions/google/view/16745-exam-professional-data-engineer-topic-1-question-63/
Posted By
jvg637
Posted At
March 16, 2020, 2:39 p.m.

Question

You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?
(image of the data not shown)

  • A. X2+Y2
  • B. X2
  • C. Y2
  • D. cos(X)

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 64726 User: jvg637 Badges: Highly Voted Relative Date: 5 years, 12 months ago Absolute Date: Mon 16 Mar 2020 14:39 Selected Answer: - Upvotes: 41

For fitting a linear classifier when the data is in a circle use A.
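To see why, here is a small self-contained sketch with synthetic data: points inside a circle versus points on a surrounding ring are not linearly separable in (x, y), but a single threshold on the added feature x^2 + y^2 separates them exactly (all numbers are illustrative):

```python
# Synthetic circular data: class 0 inside the unit circle, class 1 on a
# surrounding ring. A linear rule on r2 = x^2 + y^2 classifies perfectly.
import math
import random

random.seed(0)

def sample_ring(r_lo, r_hi, n):
    """Sample n points with radius in [r_lo, r_hi], random angle."""
    pts = []
    for _ in range(n):
        r = random.uniform(r_lo, r_hi)
        theta = random.uniform(0.0, 2.0 * math.pi)
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts

inner = sample_ring(0.0, 1.0, 200)  # class 0: inside the circle
outer = sample_ring(2.0, 3.0, 200)  # class 1: surrounding ring

# Single linear threshold on the synthetic feature r2 = x^2 + y^2.
def predict(x, y):
    return 1 if x * x + y * y > 2.25 else 0

correct = (sum(predict(x, y) == 0 for x, y in inner)
           + sum(predict(x, y) == 1 for x, y in outer))
print(correct / 400)  # 1.0
```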

Comment 2

ID: 315571 User: xs91 Badges: Highly Voted Relative Date: 4 years, 11 months ago Absolute Date: Sat 20 Mar 2021 12:18 Selected Answer: - Upvotes: 7

I think it's A, as explained here: https://medium.com/@sachinkun21/using-a-linear-model-to-deal-with-nonlinear-dataset-c6ed0f7f3f51

Comment 3

ID: 1301704 User: SamuelTsch Badges: Most Recent Relative Date: 1 year, 4 months ago Absolute Date: Tue 22 Oct 2024 22:08 Selected Answer: A Upvotes: 2

I think A should be x^2+y^2. We need a circle to classify the data.

Comment 4

ID: 1287996 User: baimus Badges: - Relative Date: 1 year, 5 months ago Absolute Date: Mon 23 Sep 2024 07:28 Selected Answer: A Upvotes: 1

Just a note: they are using X2 and Y2 to mean X squared and Y squared. This is a circle in the form X^2 + Y^2 = k, so for a given k it will split the dataset nicely.

Comment 5

ID: 959561 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 15:33 Selected Answer: - Upvotes: 1

It's not obvious to me it is A.

As others said, cos(X) does ignore the Y value. But answer A does not seem good either. The differences seem minimal.

If you do A then you have the following issues. If you take elements in the bottom right or the top left of the circle, they will all have the same value, ZERO. Not only that, they will actually have the same value as the elements in the middle of the circle, which are completely black. Moreover, elements on the extreme left and extreme right will have different values (-x_max and +x_max).

However, if you use a cos(x) then the elements in the beginning

Comment 5.1

ID: 959627 User: Mathew106 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sat 22 Jul 2023 16:36 Selected Answer: - Upvotes: 3

Never mind, I did not understand that X2 and Y2 meant X^2 and Y^2. The answer is A because that gives the distance from the circle's center. Circle radius = sqrt(X^2 + Y^2). So even though it's not a perfect answer, it makes sense.

Comment 6

ID: 784910 User: samdhimal Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Mon 23 Jan 2023 04:08 Selected Answer: - Upvotes: 3

A. X2+Y2

The synthetic feature that should be added in this case is the squared value of the distance from the origin (0,0). This is equivalent to X2+Y2. By adding this feature, the classifier will be able to make more accurate predictions by taking into account the distance of each data point from the origin.

X2 and Y2 alone will not give enough information to classify the data because they do not take into account the relationship between X and Y.

D. cos(X) is not a suitable option because it does not take into account the Y coordinate.
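The geometric argument above can be checked with a tiny pure-Python sketch (the dataset, threshold value, and function names are illustrative, not from the exam):

```python
import math
import random

random.seed(0)

# Synthetic "circle" dataset: class 0 inside radius 0.8, class 1 in a ring outside.
def sample_point(label):
    r = random.uniform(0.0, 0.8) if label == 0 else random.uniform(1.2, 2.0)
    theta = random.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta), label)

data = [sample_point(i % 2) for i in range(200)]

# Raw x or y alone cannot separate the classes: both classes cover the same
# range of x values. The synthetic feature z = x^2 + y^2 (squared distance
# from the origin, option A) makes them separable by one linear threshold on z.
def accuracy_with_radius_feature(points, threshold=1.0):
    correct = 0
    for x, y, label in points:
        z = x * x + y * y                  # the feature cross from option A
        predicted = 1 if z > threshold else 0
        correct += (predicted == label)
    return correct / len(points)

print(accuracy_with_radius_feature(data))  # 1.0 on this cleanly separated data
```

With z as the single input, a plain linear classifier reduces to exactly this one threshold, which is why option A works where cos(X) (which ignores Y entirely) does not.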

Comment 7

ID: 781913 User: GCPpro Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Fri 20 Jan 2023 06:58 Selected Answer: - Upvotes: 2

A is the correct answer as graph of circle is x^2 + y^2

Comment 8

ID: 778378 User: desertlotus1211 Badges: - Relative Date: 3 years, 1 month ago Absolute Date: Tue 17 Jan 2023 00:39 Selected Answer: - Upvotes: 1

Answer is A:
The answer reflects 'x' to the 2nd power + 'y' the 2nd power.
I guess they can't use carots in the exam answers!

Comment 9

ID: 766092 User: AzureDP900 Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Wed 04 Jan 2023 22:05 Selected Answer: - Upvotes: 1

A is right
Reference:
https://medium.com/@sachinkun21/using-a-linear-model-to-deal-with-nonlinear-dataset-c6ed0f7f3f51

Comment 10

ID: 747174 User: DipT Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Fri 16 Dec 2022 13:35 Selected Answer: A Upvotes: 1

https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture

Comment 11

ID: 745522 User: DGames Badges: - Relative Date: 3 years, 2 months ago Absolute Date: Thu 15 Dec 2022 00:18 Selected Answer: A Upvotes: 2

linear circle X2+Y2 https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html

Comment 12

ID: 619889 User: mvww11 Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Tue 21 Jun 2022 16:48 Selected Answer: - Upvotes: 2

If the shape were a circle, it would be (x^2 + y^2). But I think a quadratic curve will do a better job of separating the two classes, so it would be (x^2)

Comment 13

ID: 598155 User: gabrysave Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sat 07 May 2022 16:50 Selected Answer: - Upvotes: 1

Answer: A.
X^2+Y^2 is the equation of a circle.

Comment 14

ID: 595387 User: diagniste Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Sun 01 May 2022 03:35 Selected Answer: A Upvotes: 2

It's A.

Comment 15

ID: 548745 User: Tanzu Badges: - Relative Date: 4 years ago Absolute Date: Wed 16 Feb 2022 17:12 Selected Answer: A Upvotes: 1

Only A draws a circle.

Comment 16

ID: 524345 User: sraakesh95 Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Sat 15 Jan 2022 19:19 Selected Answer: A Upvotes: 1

Equation of circle as represented in the question

Comment 17

ID: 518834 User: moumou Badges: - Relative Date: 4 years, 2 months ago Absolute Date: Fri 07 Jan 2022 09:10 Selected Answer: - Upvotes: 1

F(x) for A, B, and C will always have positive values as a result; for A you would need a third dimension Z to represent the data; only D, cos(X), can be represented as the shown classification. This is a math question.

Comment 17.1

ID: 593386 User: NR22 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Wed 27 Apr 2022 22:15 Selected Answer: - Upvotes: 1

A B C will only have positive values
imaginary numbers (i + j) : am I a joke to you?

16. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 77

Sequence
270
Discussion ID
79343
Source URL
https://www.examtopics.com/discussions/google/view/79343-exam-professional-data-engineer-topic-1-question-77/
Posted By
YorelNation
Posted At
Sept. 2, 2022, 11:01 a.m.

Question

Your neural network model is taking days to train. You want to increase the training speed. What can you do?

  • A. Subsample your test dataset.
  • B. Subsample your training dataset.
  • C. Increase the number of input features to your model.
  • D. Increase the number of layers in your neural network.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 19 comments Click to expand

Comment 1

ID: 894048 User: mantwosmart Badges: Highly Voted Relative Date: 1 year, 10 months ago Absolute Date: Fri 10 May 2024 15:58 Selected Answer: - Upvotes: 9

Answer: B. Subsample your training dataset.

Subsampling your training dataset can help increase the training speed of your neural network model. By reducing the size of your training dataset, you can speed up the process of updating the weights in your neural network. This can help you quickly test and iterate your model to improve its accuracy.

Subsampling your test dataset, on the other hand, can lead to inaccurate evaluation of your model's performance and may result in overfitting. It is important to evaluate your model's performance on a representative test dataset to ensure that it can generalize to new data.

Increasing the number of input features or layers in your neural network can also improve its performance, but this may not necessarily increase the training speed. In fact, adding more layers or features can increase the complexity of your model and make it take longer to train. It is important to balance the model's complexity with its performance and training time.
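A minimal sketch of what option B means in practice, assuming a generic in-memory dataset (the dataset contents and the 10% fraction are illustrative placeholders):

```python
import random

random.seed(42)

# Hypothetical training set: (features, label) pairs. In practice this would
# be your real training data; here it is a stand-in list.
full_training_set = [([float(i), float(i) % 7.0], i % 2) for i in range(100_000)]

# Subsample the TRAINING data (option B): a random fraction cuts the work per
# epoch proportionally, at some cost in accuracy. The test set is left intact
# so evaluation stays representative (which is why option A is wrong).
def subsample(dataset, fraction):
    k = int(len(dataset) * fraction)
    return random.sample(dataset, k)

train_subset = subsample(full_training_set, 0.1)
print(len(train_subset))  # 10000 examples instead of 100000
```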

Comment 2

ID: 971433 User: crazycosmos Badges: Most Recent Relative Date: 1 year, 7 months ago Absolute Date: Sat 03 Aug 2024 21:58 Selected Answer: B Upvotes: 3

B is correct

Comment 3

ID: 969714 User: Vipul1600 Badges: - Relative Date: 1 year, 7 months ago Absolute Date: Fri 02 Aug 2024 07:20 Selected Answer: - Upvotes: 1

B should be correct. Increasing the layers can also affect the training time, and may introduce vanishing gradients, hence D may not be correct.

Comment 4

ID: 884637 User: email2nn Badges: - Relative Date: 1 year, 10 months ago Absolute Date: Mon 29 Apr 2024 21:24 Selected Answer: - Upvotes: 1

answer is B

Comment 5

ID: 848594 User: juliobs Badges: - Relative Date: 1 year, 11 months ago Absolute Date: Sat 23 Mar 2024 20:15 Selected Answer: B Upvotes: 2

Reduce training time and probably accuracy too.

Comment 6

ID: 818536 User: MingSer Badges: - Relative Date: 2 years ago Absolute Date: Fri 23 Feb 2024 00:22 Selected Answer: B Upvotes: 1

All others are wrong.

Comment 7

ID: 789466 User: PolyMoe Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 27 Jan 2024 10:26 Selected Answer: B Upvotes: 1

of course !

Comment 8

ID: 786103 User: samdhimal Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 03:57 Selected Answer: - Upvotes: 2

B. Subsampling your training dataset can decrease the amount of data the model needs to process and can speed up training time. However, it can lead to a decrease in the model's accuracy.

Although that shouldn't matter, since we are not even in the testing phase yet and we aren't looking for accuracy.

Comment 9

ID: 782114 User: GCPpro Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 20 Jan 2024 11:34 Selected Answer: - Upvotes: 2

B is the answer, as we are concerned about speed, not accuracy.

Comment 10

ID: 770040 User: ler_mp Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 09 Jan 2024 06:10 Selected Answer: B Upvotes: 1

The answer is B. Building a more complex model by increasing the number of layers will not reduce the training time.

Comment 11

ID: 750635 User: slade_wilson Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Wed 20 Dec 2023 09:09 Selected Answer: B Upvotes: 4

By SubSampling the training data, you will reduce the training time.

In case of D, if you increase the number of layers, then the model's accuracy will be increased. But it will not reduce the time required to train the model.

Comment 12

ID: 745650 User: DGames Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Fri 15 Dec 2023 03:50 Selected Answer: D Upvotes: 1

Increasing speed helps to train quicker. Option B, subsampling, would also help, but it drops the model's accuracy, so I think option D is the way to go.

Comment 12.1

ID: 825467 User: jin0 Badges: - Relative Date: 2 years ago Absolute Date: Fri 01 Mar 2024 02:55 Selected Answer: - Upvotes: 2

That would absolutely make model training slower, because not only the inference throughput but also the back-propagation computation would increase, so D should not be the answer. The only valid answer among those options is B, even though it drops performance.

Comment 13

ID: 688181 User: pluiedust Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sat 07 Oct 2023 02:56 Selected Answer: B Upvotes: 4

It is B. D would improve the accuracy, not speed.

Comment 14

ID: 683890 User: Chavoz Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 01 Oct 2023 01:49 Selected Answer: - Upvotes: 1

It's B. D Would be for increase performance

Comment 15

ID: 665998 User: crismo04 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Mon 11 Sep 2023 11:57 Selected Answer: - Upvotes: 1

If you increase the number of layers, you increase the training time, right?

Comment 16

ID: 663792 User: HarshKothari21 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 08 Sep 2023 18:35 Selected Answer: - Upvotes: 1

Both B and D seem correct.

Comment 16.1

ID: 741551 User: jkhong Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 11 Dec 2023 09:51 Selected Answer: - Upvotes: 3

Increasing D will increase training time

Comment 17

ID: 657206 User: YorelNation Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sat 02 Sep 2023 11:01 Selected Answer: B Upvotes: 2

Only valid answer

17. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 191

Sequence
290
Discussion ID
79643
Source URL
https://www.examtopics.com/discussions/google/view/79643-exam-professional-data-engineer-topic-1-question-191/
Posted By
ducc
Posted At
Sept. 3, 2022, 3:46 a.m.

Question

You are developing a new deep learning model that predicts a customer's likelihood to buy on your ecommerce site. After running an evaluation of the model against both the original training data and new test data, you find that your model is overfitting the data. You want to improve the accuracy of the model when predicting new data. What should you do?

  • A. Increase the size of the training dataset, and increase the number of input features.
  • B. Increase the size of the training dataset, and decrease the number of input features.
  • C. Reduce the size of the training dataset, and increase the number of input features.
  • D. Reduce the size of the training dataset, and decrease the number of input features.

Suggested Answer

B

Answer Description Click to expand


Community Answer Votes

Comments 14 comments Click to expand

Comment 1

ID: 680408 User: John_Pongthorn Badges: Highly Voted Relative Date: 2 years, 11 months ago Absolute Date: Mon 27 Mar 2023 07:19 Selected Answer: B Upvotes: 11

There 2 parts and they are relevant to each other
1. Overfit is fixed by decreasing the number of input features (select only essential features)
2. Accuracy is improved by increasing the amount of training data examples.
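One minimal, pure-Python illustration of part 1 (feature reduction): dropping near-constant features is just one simple selection heuristic among many (the data matrix and variance threshold below are made up for the example):

```python
import statistics

# Toy feature matrix: 3 features per example. The third feature is nearly
# constant, so it carries little signal; dropping such features is one simple
# form of the feature reduction that helps against overfitting. Real pipelines
# also use e.g. correlation with the label or model-based selection.
rows = [
    [1.0, 10.0, 0.50],
    [2.0, 12.0, 0.50],
    [3.0,  9.0, 0.51],
    [4.0, 11.0, 0.50],
]

def keep_informative_features(matrix, min_variance=0.01):
    n_features = len(matrix[0])
    keep = [
        j for j in range(n_features)
        if statistics.pvariance([row[j] for row in matrix]) >= min_variance
    ]
    return [[row[j] for j in keep] for row in matrix], keep

reduced, kept = keep_informative_features(rows)
print(kept)  # [0, 1] -- the near-constant third feature is dropped
```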

Comment 1.1

ID: 680409 User: John_Pongthorn Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Mon 27 Mar 2023 07:19 Selected Answer: - Upvotes: 2

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

Comment 2

ID: 1122029 User: Matt_108 Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sat 13 Jul 2024 20:28 Selected Answer: B Upvotes: 2

Option B: the model learned to listen to too much stuff/noise. We need to reduce that by decreasing the number of input features, and we need to give the model more data by increasing the amount of training data.

Comment 3

ID: 966864 User: NeoNitin Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Tue 30 Jan 2024 07:54 Selected Answer: - Upvotes: 1

Increase the size of the training dataset: By adding more diverse examples of customers and their buying behavior to the training data, the model will have a broader understanding of different scenarios and be better equipped to generalize to new customers.

Increase the number of input features: Providing the model with more relevant information about customers can help it identify meaningful patterns and make better predictions. These input features could include things like the customer's age, past purchase history, browsing behavior, or any other relevant data that might impact their buying likelihood.

Comment 4

ID: 901141 User: vaga1 Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Sat 18 Nov 2023 14:36 Selected Answer: B Upvotes: 1

A. can be a solution in a specific case, but it is not the academic answer, as we do not know the quantity and proportion of records (n) and variables (k) added. More records and more variables together can lead to even more overfitting, also due to the curse of dimensionality; adding a variable is much more impactful than adding records.
B. just more records can lead to a more robust estimation, and fewer variables lead to at most the same estimation, but potentially reduce the fit on the training set.
C. reducing n in favor of k is never a choice; it is against logic and will lead to more overfitting.
D. decreasing both will reduce overfitting for sure, but at the price of losing robustness in the model's predictive power.

Comment 5

ID: 763415 User: AzureDP900 Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Sun 02 Jul 2023 00:43 Selected Answer: - Upvotes: 1

B. Increase the size of the training dataset, and decrease the number of input features.

Comment 6

ID: 668848 User: pluiedust Badges: - Relative Date: 2 years, 12 months ago Absolute Date: Tue 14 Mar 2023 12:41 Selected Answer: B Upvotes: 2

B is correct

Comment 7

ID: 663116 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Wed 08 Mar 2023 07:54 Selected Answer: - Upvotes: 3

Answer B
https://machinelearningmastery.com/impact-of-dataset-size-on-deep-learning-model-skill-and-performance-estimates/

Comment 8

ID: 662701 User: HarshKothari21 Badges: - Relative Date: 3 years ago Absolute Date: Tue 07 Mar 2023 19:32 Selected Answer: B Upvotes: 3

Option B
Feature selection is one of the ways to resolve overfitting, which means reducing the features.
When the size of the training data is small, the network tends to have greater control over the training data, so increasing the size of the data would help.

Comment 9

ID: 661136 User: YorelNation Badges: - Relative Date: 3 years ago Absolute Date: Mon 06 Mar 2023 14:09 Selected Answer: B Upvotes: 1

The best option is not mentioned: generalize your neural net by decreasing the complexity of its structure.

Apart from that, I guess you could remove some features and increase the size of the training dataset ==> B

Comment 10

ID: 659541 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Sat 04 Mar 2023 23:56 Selected Answer: B Upvotes: 1

B. Increase the size of the training dataset, and decrease the number of input features.

Sorry, B is right. Read through extensive best-practices on ML.

Comment 11

ID: 658883 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Sat 04 Mar 2023 04:18 Selected Answer: D Upvotes: 1

D is correct

Comment 12

ID: 658032 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 06:39 Selected Answer: - Upvotes: 1

D. Reduce the size of the training dataset, and decrease the number of input features.
Reveal Solution

Comment 13

ID: 657963 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 04:46 Selected Answer: B Upvotes: 1

B. Increase the size of the training dataset, and decrease the number of input features.

18. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 151

Sequence
297
Discussion ID
79680
Source URL
https://www.examtopics.com/discussions/google/view/79680-exam-professional-data-engineer-topic-1-question-151/
Posted By
ducc
Posted At
Sept. 3, 2022, 6:49 a.m.

Question

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

  • A. Use Vertex AI for training existing Spark ML models
  • B. Rewrite your models on TensorFlow, and start using Vertex AI
  • C. Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
  • D. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery

Suggested Answer

C

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 893800 User: vaga1 Badges: Highly Voted Relative Date: 2 years, 10 months ago Absolute Date: Wed 10 May 2023 11:53 Selected Answer: A Upvotes: 6

The question is: is it faster to move a Spark ML job to Vertex AI or to Dataproc? I am personally not sure; I would go for Dataproc as notebooks are not mentioned, but reading the Google article:

https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/

"Dataproc Serverless components for Vertex AI Pipelines that further simplify MLOps for Spark, Spark SQL, PySpark and Spark jobs."

Comment 1.1

ID: 1075807 User: emmylou Badges: - Relative Date: 2 years, 3 months ago Absolute Date: Mon 20 Nov 2023 21:10 Selected Answer: - Upvotes: 3

But you would need to rewrite your models, which can be a blocker

Comment 2

ID: 963254 User: vamgcp Badges: Highly Voted Relative Date: 2 years, 7 months ago Absolute Date: Wed 26 Jul 2023 04:23 Selected Answer: C Upvotes: 6

Option C : It is the most rapid way to migrate your existing training pipelines to Google Cloud.
It allows you to continue using your existing Spark ML models.
It allows you to take advantage of the scalability and performance of Dataproc.
It allows you to read data directly from BigQuery, which is a more efficient way to process large datasets

Comment 3

ID: 1243655 User: Anudeep58 Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Sun 07 Jul 2024 04:51 Selected Answer: C Upvotes: 2

Vertex AI is better suited for TensorFlow or scikit-learn models. Direct Spark ML support isn't native to Vertex AI, making this a less straightforward migration path.

Comment 4

ID: 1163901 User: mothkuri Badges: - Relative Date: 2 years ago Absolute Date: Sat 02 Mar 2024 04:55 Selected Answer: - Upvotes: 1

C
The question is about rapid lift and shift, so code changes should be minimal.

Comment 5

ID: 1124996 User: GCP001 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 17 Jan 2024 14:32 Selected Answer: C Upvotes: 1

C looks more suitable as the data is already in BigQuery.
Ref - https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml

Comment 6

ID: 1122008 User: Matt_108 Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Sat 13 Jan 2024 20:59 Selected Answer: C Upvotes: 1

Option C, agreed with other comments

Comment 7

ID: 1100401 User: MaxNRG Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Tue 19 Dec 2023 08:13 Selected Answer: C Upvotes: 4

Use Cloud Dataproc, BigQuery, and Apache Spark ML for Machine Learning
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
Using Apache Spark with TensorFlow on Google Cloud Platform
https://cloud.google.com/blog/products/gcp/using-apache-spark-with-tensorflow-on-google-cloud-platform

Comment 8

ID: 1098699 User: Nandababy Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Sun 17 Dec 2023 07:33 Selected Answer: - Upvotes: 1

Why not option D? Spinning up a Spark cluster on Compute Engine could potentially be the best approach for a rapid migration, as the team won't have to rework the model (maybe only a few configuration changes), and getting data from BigQuery, which is needed only periodically rather than all the time, would be easy.
With Dataproc there would be more code changes, which could eventually take more time.
With Vertex AI, Spark ML isn't supported natively, and training would be a black box.

For me the answer should be D.

Comment 9

ID: 1015478 User: barnac1es Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Sun 24 Sep 2023 06:01 Selected Answer: C Upvotes: 2

Dataproc for Spark: Google Cloud Dataproc is a managed Spark and Hadoop service that allows you to run Spark jobs seamlessly on Google Cloud. It provides the flexibility to run Spark jobs using Spark MLlib and other Spark libraries.

BigQuery Integration: You mentioned that your data is being migrated to BigQuery. Dataproc has native integration with BigQuery, allowing you to read data directly from BigQuery tables. This eliminates the need to export data from BigQuery to another storage system before processing it with Spark.

Rapid Migration: This approach allows you to quickly migrate your existing Spark ML models and training pipelines without the need for a complete rewrite or extensive changes to your existing workflows. You can continue using your Spark ML models while adapting them to read data from BigQuery.
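A lift-and-shift along the lines of option C could look roughly like the sketch below. The cluster name, region, bucket, and script path are placeholders; the jar is the publicly hosted spark-bigquery connector, which lets the unchanged Spark code read tables via `spark.read.format("bigquery")`.

```shell
# Sketch only: cluster name, region, bucket, and script path are placeholders.
# Create a Dataproc cluster for the lifted-and-shifted Spark ML jobs.
gcloud dataproc clusters create ctr-training-cluster \
    --region=us-central1 \
    --num-workers=4

# Submit the existing PySpark training job; the spark-bigquery connector jar
# lets the Spark code read training data directly from BigQuery, e.g.
#   spark.read.format("bigquery").option("table", "project.dataset.table")
gcloud dataproc jobs submit pyspark gs://my-bucket/train_ctr_model.py \
    --cluster=ctr-training-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```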

Comment 10

ID: 1013035 User: DeepakVenkatachalam Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Thu 21 Sep 2023 13:40 Selected Answer: - Upvotes: 1

They are talking about rapid lift and shift, in which case a Dataproc cluster will be the right one for Spark ML models. So I think the answer is C.

Comment 11

ID: 1012245 User: ckanaar Badges: - Relative Date: 2 years, 5 months ago Absolute Date: Wed 20 Sep 2023 13:27 Selected Answer: A Upvotes: 4

The updated answer seems A based on the following article:

https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/

Comment 12

ID: 991430 User: FP77 Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Sun 27 Aug 2023 14:15 Selected Answer: C Upvotes: 1

The answer is C. Spin up a Cloud Dataproc cluster, migrate the Spark jobs there, and link the cluster to BigQuery with the connector. It's a straightforward solution.

Comment 13

ID: 964273 User: knith66 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 03:51 Selected Answer: C Upvotes: 3

If you wanted to use Vertex AI for training Spark ML models, you would typically need to convert your Spark ML code to another supported machine learning framework like TensorFlow or scikit-learn. Then you could use Vertex AI's pre-built training and prediction services for those frameworks.

Comment 14

ID: 953240 User: wan2three Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Sun 16 Jul 2023 12:54 Selected Answer: A Upvotes: 1

Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery using standard SQL queries on existing business intelligence tools and spreadsheets, or you can export datasets from BigQuery directly into Vertex AI Workbench and run your models from there.
https://cloud.google.com/vertex-ai#all-features:~:text=Data%20and%20AI%20integration

Comment 15

ID: 931230 User: blathul Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 23 Jun 2023 07:09 Selected Answer: C Upvotes: 1

Dataproc is a managed Spark and Hadoop service on Google Cloud, which makes it an ideal choice for migrating your existing Spark ML training pipelines. By using Dataproc, you can continue to leverage Spark and its ML capabilities without the need for significant code changes or rewriting your models.
By combining Dataproc and BigQuery, you can create Spark jobs or workflows in Dataproc that read data from BigQuery and train your existing Spark ML models. This approach allows you to quickly migrate your training pipelines to Google Cloud and take advantage of the scalability and performance benefits of both Dataproc and BigQuery.

Comment 16

ID: 930319 User: KC_go_reply Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Thu 22 Jun 2023 11:01 Selected Answer: C Upvotes: 4

It is obviously C) Dataproc, since we don't want to rewrite the training from scratch, highly prefer Dataproc for anything Hadoop/Spark ecosystem, and Vertex AI doesn't support *training* with SparkML (but deploying existing models).

Comment 17

ID: 924918 User: Takshashila Badges: - Relative Date: 2 years, 8 months ago Absolute Date: Fri 16 Jun 2023 08:41 Selected Answer: C Upvotes: 1

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

19. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 192

Sequence
304
Discussion ID
79526
Source URL
https://www.examtopics.com/discussions/google/view/79526-exam-professional-data-engineer-topic-1-question-192/
Posted By
PhuocT
Posted At
Sept. 2, 2022, 7:54 p.m.

Question

You are implementing a chatbot to help an online retailer streamline their customer service. The chatbot must be able to respond to both text and voice inquiries.
You are looking for a low-code or no-code option, and you want to be able to easily train the chatbot to provide answers to keywords. What should you do?

  • A. Use the Cloud Speech-to-Text API to build a Python application in App Engine.
  • B. Use the Cloud Speech-to-Text API to build a Python application in a Compute Engine instance.
  • C. Use Dialogflow for simple queries and the Cloud Speech-to-Text API for complex queries.
  • D. Use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.

Suggested Answer

D

Answer Description Click to expand


Community Answer Votes

Comments 8 comments Click to expand

Comment 1

ID: 657682 User: PhuocT Badges: Highly Voted Relative Date: 3 years ago Absolute Date: Thu 02 Mar 2023 20:54 Selected Answer: D Upvotes: 12

D is correct:
https://cloud.google.com/dialogflow/es/docs/how/detect-intent-tts#:~:text=Dialogflow%20can%20use%20Cloud%20Text,to%2Dspeech%2C%20or%20TTS.

Comment 2

ID: 1102804 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Fri 21 Jun 2024 17:35 Selected Answer: D Upvotes: 2

The best option would be to use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.

Dialogflow is a conversational AI platform that allows for easy implementation of chatbots without needing to code. It has built-in integration for both text and voice input via APIs like Cloud Speech-to-Text. Defining intents and entity types allows you to map common queries and keywords to responses. This would provide a low/no-code way to quickly build and iteratively improve the chatbot capabilities.

Options A and B would require heavy coding to handle speech input/output. Option C still requires coding the complex-query handling. Only option D leverages the full capabilities of Dialogflow to enable no-code chatbot development and ongoing improvement as more conversational data is collected. Hence, option D is the best approach given the requirements.

Comment 3

ID: 968102 User: Lanro Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 31 Jan 2024 15:34 Selected Answer: D Upvotes: 1

The low-code or no-code requirement makes it easy to decide.

Comment 4

ID: 725617 User: Atnafu Badges: - Relative Date: 2 years, 9 months ago Absolute Date: Wed 24 May 2023 07:08 Selected Answer: - Upvotes: 1

D
https://cloud.google.com/dialogflow/es/docs/how/detect-intent-tts#:~:text=Dialogflow%20can%20use%20Cloud%20Text,to%2Dspeech%2C%20or%20TTS.

Comment 5

ID: 697463 User: devaid Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Mon 17 Apr 2023 16:15 Selected Answer: D Upvotes: 4

D definitely, as the documentation says (especially that you can call the detectIntent method for audio inputs):
https://cloud.google.com/dialogflow/es/docs/how/detect-intent-tts
Also, the Speech-to-Text API does nothing more than transcribe.

Comment 6

ID: 663123 User: TNT87 Badges: - Relative Date: 3 years ago Absolute Date: Wed 08 Mar 2023 08:00 Selected Answer: - Upvotes: 4

Answer D

https://cloud.google.com/dialogflow/es/docs/how/detect-intent-tts

Comment 7

ID: 660000 User: nwk Badges: - Relative Date: 3 years ago Absolute Date: Sun 05 Mar 2023 12:41 Selected Answer: - Upvotes: 2

https://cloud.google.com/dialogflow/es/docs/how/detect-intent-stream
Vote D

Comment 8

ID: 657965 User: ducc Badges: - Relative Date: 3 years ago Absolute Date: Fri 03 Mar 2023 04:49 Selected Answer: C Upvotes: 3

C. Use Dialogflow for simple queries and the Cloud Speech-to-Text API for complex queries.

This seems the best answer here, but not the best answer in the real world.
But per the question, the answer must be the combination of both Dialogflow and the Speech API.

20. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 157

Sequence
308
Discussion ID
17212
Source URL
https://www.examtopics.com/discussions/google/view/17212-exam-professional-data-engineer-topic-1-question-157/
Posted By
-
Posted At
March 22, 2020, 7:33 a.m.

Question

Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters, and received an area under the curve (AUC) of 0.87 on the validation set. You want to increase the AUC of the model. What should you do?

  • A. Perform hyperparameter tuning
  • B. Train a classifier with deep neural networks, because neural networks would always beat SVMs
  • C. Deploy the model and measure the real-world AUC; it's always higher because of generalization
  • D. Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 18 comments Click to expand

Comment 1

ID: 106342 User: aadaisme Badges: Highly Voted Relative Date: 5 years, 3 months ago Absolute Date: Thu 10 Dec 2020 03:17 Selected Answer: - Upvotes: 42

Seems to be A. Preprocessing/scaling should be done with input features, instead of predictions (output)
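A small pure-Python check of why option D cannot work: AUC is a ranking statistic, so any monotone rescaling of the model's outputs leaves it unchanged, and hyperparameter tuning (option A) is the lever that can actually move it. The scores and labels below are made up for the example.

```python
# AUC equals the probability that a randomly chosen positive example is scored
# above a randomly chosen negative one (ties count as half). Computed directly
# from (score, label) pairs, with no ML library needed.
def auc(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    pairs = concordant = 0
    for p in positives:
        for n in negatives:
            pairs += 1
            if p > n:
                concordant += 1
            elif p == n:
                concordant += 0.5
    return concordant / pairs

labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.6, 0.8, 0.2]

print(auc(scores, labels))                        # 0.888... (8 of 9 pairs ranked correctly)
print(auc([10 * s + 3 for s in scores], labels))  # identical: monotone scaling changes nothing
```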

Comment 2

ID: 157813 User: FARR Badges: Highly Voted Relative Date: 5 years ago Absolute Date: Sun 14 Feb 2021 07:54 Selected Answer: - Upvotes: 11

A
Deep learning is not always the best solution.
D talks about fudging the output, which is wrong.

Comment 3

ID: 1100423 User: MaxNRG Badges: Most Recent Relative Date: 1 year, 8 months ago Absolute Date: Wed 19 Jun 2024 07:53 Selected Answer: A Upvotes: 3

https://www.quora.com/How-can-I-improve-Precision-Recall-AUC-under-Imbalanced-Classification

Comment 4

ID: 893864 User: vaga1 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Fri 10 Nov 2023 14:36 Selected Answer: A Upvotes: 2

B and C are simply not true. D is modifying the scoring, making it not reliable anymore. A makes sense and can potentially increase the model's accuracy.

Comment 5

ID: 891120 User: rishu2 Badges: - Relative Date: 2 years, 4 months ago Absolute Date: Tue 07 Nov 2023 05:30 Selected Answer: A Upvotes: 1

a is the correct answer

Comment 6

ID: 812944 User: musumusu Badges: - Relative Date: 2 years, 6 months ago Absolute Date: Fri 18 Aug 2023 12:40 Selected Answer: - Upvotes: 1

Answer A.
Why not B? Deep neural networks are better for sure, but an AUC of 0.87 is already good. Don't go for a complex and time-consuming model; an AUC of more than 0.95 can be a sign of overfitting.
Now just check the SVM params for hyperparameter tuning to see if you can bring it close to 0.9-0.95.

Comment 7

ID: 789584 User: Kvk117 Badges: - Relative Date: 2 years, 7 months ago Absolute Date: Thu 27 Jul 2023 12:26 Selected Answer: A Upvotes: 1

A is the correct answer.

Comment 8

ID: 712769 User: Dan137 Badges: - Relative Date: 2 years, 10 months ago Absolute Date: Sun 07 May 2023 02:45 Selected Answer: - Upvotes: 1

Also a good read is: https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview

Comment 9

ID: 520139 User: medeis_jar Badges: - Relative Date: 3 years, 8 months ago Absolute Date: Sat 09 Jul 2022 11:01 Selected Answer: A Upvotes: 2

As Spider7 mentioned, "by performing tuning rather than using the model default parameters there's a way to increase the overall model performance" --> A.

Comment 10

ID: 486433 User: JG123 Badges: - Relative Date: 3 years, 9 months ago Absolute Date: Wed 25 May 2022 04:19 Selected Answer: - Upvotes: 1

Correct: A

Comment 11

ID: 477142 User: Spider7 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Thu 12 May 2022 20:03 Selected Answer: - Upvotes: 3

0.89 is already not bad, but by performing tuning rather than using the model's default parameters there's a way to increase the overall model performance --> A.

Comment 11.1

ID: 477145 User: Spider7 Badges: - Relative Date: 3 years, 10 months ago Absolute Date: Thu 12 May 2022 20:06 Selected Answer: - Upvotes: 1

0.87 precisely
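The tuning loop Spider7 describes can be sketched as a plain grid search over SVM-style hyperparameters. A minimal sketch: `validation_auc` here is a hypothetical stand-in for the expensive cross-validated scoring run (e.g. an SVC scored with ROC AUC) you would do in practice.

```python
def grid_search(grid, validation_auc):
    """Exhaustively try every (C, gamma) combination and keep the
    hyperparameter setting with the best validation AUC."""
    best_params, best_score = None, float("-inf")
    for C in grid["C"]:
        for gamma in grid["gamma"]:
            score = validation_auc(C, gamma)
            if score > best_score:
                best_params, best_score = {"C": C, "gamma": gamma}, score
    return best_params, best_score

# Hypothetical objective that peaks at C=1.0, gamma=0.1 -- a toy
# stand-in for a real cross-validation run.
def validation_auc(C, gamma):
    return 0.92 - 0.01 * abs(C - 1.0) - 0.1 * abs(gamma - 0.1)

grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
params, score = grid_search(grid, validation_auc)
```

In practice you would replace the exhaustive loop with a managed tuner (the AI Platform hyperparameter tuning service linked in a later comment does this with Bayesian optimization), but the idea is the same: search the parameter space instead of accepting defaults.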

Comment 12

ID: 416078 User: hdmi_switch Badges: - Relative Date: 4 years, 1 month ago Absolute Date: Fri 28 Jan 2022 11:18 Selected Answer: - Upvotes: 3

Not C, because a real-world AUC value usually falls between 0.5 and 1.0, so deploying wouldn't help.

A seems the most straightforward.

Comment 13

ID: 316628 User: Mitra123 Badges: - Relative Date: 4 years, 5 months ago Absolute Date: Tue 21 Sep 2021 19:41 Selected Answer: - Upvotes: 1

For a large enough training set, a DNN will most likely beat an SVM; the opposite may or may not be true. It also depends on the complexity of the problem, which we don't know from the question. For images or NLP, I'd say B can be a good answer.
However, if we decide to stick with the SVM, D reduces overfitting and may increase AUC.
I am torn between the two!

Comment 14

ID: 292812 User: ArunSingh1028 Badges: - Relative Date: 4 years, 6 months ago Absolute Date: Tue 17 Aug 2021 19:21 Selected Answer: - Upvotes: 1

Ans: D. When the model is overfitted and we want to increase the AUC, we always perform hyperparameter tuning, increase regularization, decrease the number of input features, etc.

Comment 15

ID: 202380 User: nitinbhatia Badges: - Relative Date: 4 years, 10 months ago Absolute Date: Mon 19 Apr 2021 07:18 Selected Answer: - Upvotes: 2

AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values. So the answer should be A.
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=en
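The scale-invariance point is easy to verify numerically. A minimal sketch, using toy labels and scores of my own invention: the pairwise-ranking form of AUC below is equivalent to the area under the ROC curve, and any positive linear rescaling of the scores (answer D's "scaling factor") leaves it unchanged.

```python
def auc_from_scores(y_true, scores):
    """Pairwise-ranking AUC: the fraction of (positive, negative) pairs
    in which the positive example receives the higher score (ties 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# Rescaling the predictions preserves their ranking, so the AUC
# is identical -- tuning a scaling factor cannot raise it.
scaled = [2.0 * s + 5.0 for s in scores]

auc_raw = auc_from_scores(y_true, scores)
auc_scaled = auc_from_scores(y_true, scaled)
```

Here `auc_raw` and `auc_scaled` come out identical, which is exactly why option D is a non-answer.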

Comment 16

ID: 200885 User: arghya13 Badges: - Relative Date: 4 years, 11 months ago Absolute Date: Fri 16 Apr 2021 07:07 Selected Answer: - Upvotes: 3

Definitely not D
https://developers.google.com/machine-learning/crash-course/classification/check-your-understanding-roc-and-auc

Comment 17

ID: 161008 User: saurabh1805 Badges: - Relative Date: 5 years ago Absolute Date: Thu 18 Feb 2021 19:13 Selected Answer: - Upvotes: 4

A for me; see the link below for more details.

https://towardsdatascience.com/understanding-hyperparameters-and-its-optimisation-techniques-f0debba07568

21. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 174

Sequence
311
Discussion ID
80144
Source URL
https://www.examtopics.com/discussions/google/view/80144-exam-professional-data-engineer-topic-1-question-174/
Posted By
AWSandeep
Posted At
Sept. 4, 2022, 10:50 p.m.

Question

You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company's mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer's stated intention for contacting customer service. About 70% of customer requests are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer, more complicated requests. Which intents should you automate first?

  • A. Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests.
  • B. Automate the more complicated requests first because those require more of the agents' time.
  • C. Automate a blend of the shortest and longest intents to be representative of all intents.
  • D. Automate intents in places where common words such as 'payment' appear only once so the software isn't confused.

Suggested Answer

A

Answer Description Click to expand


Community Answer Votes

Comments 5 comments Click to expand

Comment 1

ID: 1101914 User: MaxNRG Badges: - Relative Date: 1 year, 8 months ago Absolute Date: Thu 20 Jun 2024 20:07 Selected Answer: A Upvotes: 3

This is the best approach because it follows the Pareto principle (80/20 rule). By automating the most common 10 intents that address 70% of customer requests, you free up the live agents to focus their time and effort on the more complex 30% of requests that likely require human insight/judgement. Automating the simpler high-volume requests first allows the chatbot to handle those easily, efficiently routing only the trickier cases to agents. This makes the best use of automation for high-volume simple cases and human expertise for lower-volume complex issues.

Comment 2

ID: 961957 User: vamgcp Badges: - Relative Date: 2 years, 1 month ago Absolute Date: Wed 24 Jan 2024 22:11 Selected Answer: A Upvotes: 1

Option A: By automating the intents that cover a significant majority (70%) of customer requests, you target the areas with the highest volume of interactions. This helps reduce the load on live agents, enabling them to focus on more complicated and time-consuming inquiries that require their expertise.

Comment 3

ID: 923425 User: Takshashila Badges: - Relative Date: 2 years, 2 months ago Absolute Date: Thu 14 Dec 2023 20:32 Selected Answer: A Upvotes: 1

A is the answer.

Comment 4

ID: 672048 User: SMASL Badges: - Relative Date: 2 years, 11 months ago Absolute Date: Sat 18 Mar 2023 09:55 Selected Answer: - Upvotes: 4

Correct answer: A

As it states in the documentation: "If your agent will be large or complex, start by building a dialog that only addresses the top level requests. Once the basic structure is established, iterate on the conversation paths to ensure you're covering all of the possible routes an end-user may take." (https://cloud.google.com/dialogflow/cx/docs/concept/agent-design#build-iteratively)

Therefore, you should initially automate the 70% of requests that are simpler before automating the more complicated ones.

Comment 5

ID: 659534 User: AWSandeep Badges: - Relative Date: 3 years ago Absolute Date: Sat 04 Mar 2023 23:50 Selected Answer: A Upvotes: 1

A. Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests.