1. PROFESSIONAL-DATA-ENGINEER Topic 1 Question 14
Question
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)
- A. There are very few occurrences of mutations relative to normal samples.
- B. There are roughly equal occurrences of both normal and mutated samples in the database.
- C. You expect future mutations to have different features from the mutated samples in the database.
- D. You expect future mutations to have similar features to the mutated samples in the database.
- E. You already have labels for which samples are mutated and which are normal in the database.
Suggested Answer
AC
Community Answer Votes
- AC: 65 most voted
- AD: 62
- BD: 9

Comment 1
I think that AD makes more sense. D is the explanation you gave. Of the rest, A makes more sense: any anomaly detection algorithm assumes a priori that you have many more "normal" samples than mutated ones, so that you can model normal patterns and detect patterns that are "off" from that normal pattern. For that, you will always need the number of normal samples to be much bigger than the number of mutated samples.
Comment 1.1
Guys, it's A & C.
Anomaly detection has two basic assumptions:
->Anomalies only occur very rarely in the data. (a)
->Their features differ from the normal instances significantly. (c)
link -> https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Unsupervised%20Anomaly%20Detection%20for%20Univariate%20%26%20Multivariate%20Data.&text=Anomaly%20detection%20has%20two%20basic,from%20the%20normal%20instances%20significantly.
Comment 1.1.1
I don't agree on C. Anomaly detection assumes "Their features differ from the NORMAL INSTANCES significantly" and in the C option you have:
"You expect future mutations to have different features from the MUTATED SAMPLES IN THE DATABASE".
IMHO answer D fits better: "D. You expect future mutations to have similar features to the mutated samples in the database." In other words: expect future anomalies to be similar to the anomalies that we already have in the database.
Comment 1.2
As per ChatGPT, it can be different (C). That's how unsupervised anomaly detection works: as long as they are different from "normal" tissues, they would be detected.
Comment 2
A instead of B:
"Anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data."
Comment 3
I think it's A & C. The anomaly detection algorithm trains on "normal" data, and anything that deviates from that is classified as "mutated", so there are no common characteristics or patterns for mutated data. This is why statement C is correct.
Comment 4
A. There are very few occurrences of mutations relative to normal samples. This is a strong characteristic supporting unsupervised anomaly detection. Anomaly detection methods often work well when the "normal" class is the majority and anomalies are rare outliers. The mutated samples, being few, would likely appear as anomalies relative to the cluster of normal samples.
D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic supports unsupervised anomaly detection. If future mutations are expected to share similar characteristics with the existing mutated samples (even if they are rare), an anomaly detection method could potentially learn the pattern of the normal samples and flag anything significantly different (including the mutated samples) as an anomaly.
Comment 5
Unsupervised anomaly detection. Characteristics are few anomalies and future mutations different. So A and C.
Comment 6
AC. On answer C: unsupervised anomaly detection methods are particularly useful when you don't have labeled examples of the anomalies you're trying to detect. Why not D? If future mutations are similar to existing ones, a supervised model trained on labeled examples of known mutations would likely be more accurate in classifying new samples.
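To make the "why not D" argument concrete, here is a toy sketch, not a real tissue model. It assumes a single made-up numeric feature and uses a hypothetical nearest-centroid classifier as a stand-in for "a supervised model trained on labeled examples". The function name `fit_nearest_centroid` and all sample values are illustrative assumptions.

```python
from statistics import mean

def fit_nearest_centroid(samples_by_label):
    """Toy supervised classifier: assign x to the label with the closest mean."""
    centroids = {label: mean(values) for label, values in samples_by_label.items()}

    def predict(x):
        # Pick the label whose centroid is nearest to x.
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict

# Made-up labeled training data (one feature per sample).
predict = fit_nearest_centroid({
    "normal": [9.0, 9.5, 10.0, 10.5, 11.0],   # centroid 10.0
    "mutated": [19.5, 20.0, 20.5],            # centroid 20.0
})

print(predict(20.2))  # mutation similar to the known ones -> "mutated"
print(predict(-2.0))  # novel mutation, far from everything -> "normal"
```

In this sketch the supervised model handles the option D scenario well, but a mutation unlike the labeled ones is assigned to whichever known class is closer, here "normal". That is the case where scoring distance from the normal samples alone is the better fit.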
Comment 7
The answer should be A and C.
Unsupervised anomaly detection is useful when labels are unavailable, or when anomalies are rare and distinct.
Hence A is definitely correct because anomaly detection excels when anomalies are rare compared to normal data.
I think C is correct because by adding new mutation data that is similar to the existing mutation data, the model will learn in a broader sense what constitutes a 'mutation', which leads to better generalization. If the new data is too similar to the existing mutation data (answer D), the model might overfit to those specific examples. However, the new data should still share some fundamental characteristics with the existing data so that the model can recognize it as belonging to the same anomaly category.
Comment 8
I think it's A & C. For an anomaly detection model, the ratio of normal to abnormal samples is expected to be high. 'C' because the model is expected to be adaptive, meaning it detects abnormal features that can differ from the abnormal features it is currently being trained on.
Comment 9
The keyword is unsupervised anomaly detection, so A is correct. We assume, and should ensure, that the majority of the data represents 'normal'. Unsupervised methods are good for detecting unknown patterns, thus C could be correct.
Comment 9.1
Correcting my answer: AD should be better. Unsupervised methods are usually used for grouping data. So, if future mutations have similar features to the mutated samples, our trained model should group them with the anomalies even though no labels exist.
Comment 10
AD: to use unsupervised anomaly detection the anomalies a) must be rare b) they must differ from the NORMAL. So...
A: mutated samples must be scarce compared to normal tissue.
D: yes, we expect the future mutated samples to have similar features to the mutated samples currently in the database.
Why not C? If I train my model with mutated samples with specific characteristics, I do not expect it to find different mutations. In the future, when new mutations appear, I would retrain my model including those new samples.
Comment 11
Anomaly detection has two basic assumptions:
*Anomalies only occur very rarely in the data.
*Their features differ from the normal instances significantly.
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called “normal” instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection.
The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them.
https://www.science.gov/topicpages/u/unsupervised+anomaly+detection
A, because “Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal”. B is for supervised anomaly detection. https://en.wikipedia.org/wiki/Anomaly_detection
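As a rough illustration of the unsupervised task described above (unlabeled, mostly-normal data), here is a minimal sketch using a robust z-score based on the median and the median absolute deviation. This is one simple technique chosen for illustration, not the method the question has in mind. The synthetic one-feature data, the helper name `fit_robust_scorer`, and the 3.5 cutoff are all assumptions.

```python
import random
import statistics

def fit_robust_scorer(values):
    """Fit median and MAD on unlabeled data; return a distance-from-normal score."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)

    def score(x):
        # 1.4826 rescales MAD to be comparable to a standard deviation
        # under a normal distribution.
        return abs(x - med) / (1.4826 * mad)

    return score

rng = random.Random(0)
normal = [rng.gauss(10.0, 1.0) for _ in range(500)]   # the majority
mutated = [rng.gauss(20.0, 1.0) for _ in range(5)]    # rare (option A)
unlabeled = normal + mutated                          # no labels are used

score = fit_robust_scorer(unlabeled)
THRESHOLD = 3.5  # assumed cutoff, would need tuning on real data

def is_anomaly(x):
    return score(x) > THRESHOLD

print(is_anomaly(20.5))  # resembles the known mutations
print(is_anomaly(-2.0))  # unlike any mutation in the data
print(is_anomaly(10.2))  # typical normal sample
```

Because the score depends only on distance from the bulk of the data, both the familiar and the novel mutation are flagged while the typical sample is not. The estimate of "normal" stays reliable here only because the mutated samples are a tiny fraction of the data.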
Comment 12
A - anomaly detection is used for detecting rare events, meaning it is expected that there are far fewer of those than of normal ones.
D - you expect the future mutations to be similar to the mutations you already have, so that you can detect them (pattern recognition)
Comment 13
A makes sense
C and D compare future mutations to mutated samples in the database
The question is pretty badly worded… If we were to run a full unsupervised anomaly detection over the entire dataset, C and D will be true, since some future mutations may be similar to current mutations and some will be significantly different to current mutations.
The question is suggesting "labelling" tissue samples using unsupervised anomaly detection, and subsequently using the labels with a supervised algorithm to classify future samples. If this interpretation of the question is correct, then D makes sense
Comment 14
The answer should be AD.
A: anomalies should be few. If there are many mutated samples, we should do classification instead, because an unsupervised method will give a lot of false positives.
D: future anomalies should come from the same distribution as the present anomalies! Otherwise our anomaly detection will not generalize to future samples.
Comment 15
A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated.
D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.
Comment 16
D should be correct. You expect future samples to correlate with the training samples. That's the whole point of a learning procedure. If you do not expect them to have similar features, then why would you use the features in the training samples in the first place? A is also correct, since anomaly labels would be seen rarely.