Wednesday 13 December 2017

Downstream SNP Prediction using the MTSA method

I covered much of this topic in a presentation I gave at the FTDNA Annual Conference in Houston (Nov 2017) and you can watch it on YouTube here. The relevant section is from 37 minutes 40 seconds onwards.

Let's imagine that the Tree of Mankind (aka Y-Haplotree) starts with "genetic Adam" (some 250,000 years ago) and splits into progressively more downstream branches as the timeline approaches modern day. These downstream branches can be identified by downstream SNP marker testing of your Y chromosome (with tests such as SNP Packs, and in particular the Big Y). This downstream Y-SNP testing helps locate your position on the Tree of Mankind and potentially this can prove very helpful for a variety of reasons:

  • It can help ensure that you have been grouped accurately in a specific "genetic family" (within a Surname Project, for example)
  • It can help determine your ancestral origins - at times the actual country, and potentially even the region or county ... this helps focus your genealogical research
  • It can identify your nearest genetic neighbours and their associated surnames ... which in turn can tie you into the genealogy of a specific 'clan' or sept
  • It can identify branches within a genetic family and which one you sit on (it can also be useful in generating a Mutation History Tree)
  • It can highlight the risk of Chance Matches due to Convergence amongst your list of matches

But ... the Big Y test is expensive. The technique below shows how to predict the likely Big Y result without doing the test. In that way you can reap some of the benefits of the Big Y without actually having to do it.

The technique is called Downstream SNP Prediction because we will be predicting what SNP markers you are likely to test positive for "downstream" i.e. approaching the modern era, say within the last 500-1500 years. And the MTSA in the title stands for Matches Terminal SNP Analysis - in other words, you will be analysing the terminal SNPs of each person on your list of Y-DNA matches generated from the Y-STR test that you have previously done (be it the Y-DNA-37, Y-DNA-67 or Y-DNA-111).

The technique is quite simple. It just takes a little bit of time to complete (about 10 minutes). But there is one major caveat - it does not always work. And once you see the results, you will have to make a judgement call on whether or not you think the result is likely to be reliable. But when it does work, it works well.

Essentially the MTSA method involves collecting the terminal SNPs of all of your Y-STR matches and then seeing where each SNP in turn sits on the Tree of Mankind. 

If they all sit on the same branch, then you probably do too. If they sit on widely different branches, then the results are untrustworthy (in this particular instance), and the method has not been able to predict which downstream SNP you are likely to test positive for. As a consequence, formal SNP testing (Big Y or otherwise) will be necessary to determine your position on the Tree of Mankind.

The Methodology

Here is a list of the steps involved in Downstream SNP Prediction using the MTSA method. We will go through them later in detail one by one:

  1. To start, sign in to your FTDNA account and open your Y-DNA Matches List
  2. Sort your matches list by "Haplogroup"
  3. Note down the terminal SNPs and how often each one occurs - repeat this step for each marker level (111, 67, 37, & 25). 
  4. Plot each SNP in turn on the Haplotree
  5. Assess whether or not the SNPs fall on a single line of descent coming down the Haplotree ...
    1. if they do, there is a good chance that you will also follow this line of descent and end up on the same downstream branch (or a branch very close by)
    2. if they do not fall on the same single line of descent, then the technique has not worked in this instance because Convergence is present
  6. Make a judgement call on how reliable you think the results are

Now let's look at each step in detail.

Step 1 - open your Y-DNA Matches list

Step 2 - sort your matches by Haplogroup ... just click on the title "Y-DNA Haplogroup" and this will arrange your list of matches alphabetically by their Terminal SNP.

This individual has 183 matches at the 25 marker level (top left)

Step 3 - note down the terminal SNPs and how often each one occurs

In the example above, this would produce a list like this:
  • BY3441
  • CTS7030
  • DF13
  • FGC10116
  • FGC10117 (x2)
  • FGC10125 (x2)
  • FGC28987
  • L1065
  • L1335 (x8)
  • etc ...

1) I don't bother recording the frequency of single SNPs. Thus, any SNP in the list without a number in brackets has only occurred once in the list.
2) I ignore any known "upstream" SNPs (e.g. M269, L21, etc) as these are too far upstream to be informative.
3) This exercise should be repeated at each marker level (111, 67, 37 & 25). In practice, the 25 marker level currently appears to be the most informative.
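For anyone who prefers to do the tallying in a script rather than on paper, here is a minimal sketch in Python. It assumes you have typed in (or copied) one terminal SNP per match; the variable names and the contents of the UPSTREAM set are my own illustrative choices, and DF13 is included there only because it is treated as too far upstream later in this post.

```python
from collections import Counter

# One entry per match, as read off the sorted matches list.
# Only the SNPs visible in the partial example above are included.
terminal_snps = (
    ["BY3441", "CTS7030", "DF13", "FGC10116"]
    + ["FGC10117"] * 2
    + ["FGC10125"] * 2
    + ["FGC28987", "L1065"]
    + ["L1335"] * 8
)

# Assumption: SNPs treated as too far upstream to be informative.
UPSTREAM = {"M269", "L21", "DF13"}

# Count how often each remaining terminal SNP occurs.
tally = Counter(snp for snp in terminal_snps if snp not in UPSTREAM)

# Print in the same style as the list above: frequency only when above 1.
for snp, count in sorted(tally.items()):
    print(f"{snp} (x{count})" if count > 1 else snp)
```

The same tally would be repeated for each marker level, with a separate list per level.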

Step 4 - Plot each SNP in turn on the Haplotree

This is the most time-consuming part of the exercise but you will get quicker with practice. To be comprehensive, it is best to identify the SNP Progression for each SNP in turn. The SNP Progression is simply the series of SNPs that characterise each branching point on the line of descent to the "terminal SNP" in question.

Thus the SNP Progressions associated with the list above would be listed as follows:
  • BY3441 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > DF49/S474 > Z2980 > Z2976 > DF23 > Z2961 > FGC6540 > FGC6562 > FGC6545 > BY3442 > BY3437 > BY3441
  • CTS7030 ... equivalent to L1065
  • DF13 ... too far upstream 
  • FGC10116 ... equivalent to FGC10117
  • FGC10117 (x2) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > FGC10125 > FGC10117
  • FGC10125 (x2) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > FGC10125
  • FGC28987 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > Z16325 > S744 > S764 > FGC28987
  • L1065 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065
  • L1335 (x8) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530
  • etc ... 

1) the easiest way to find the SNP Progression is simply to google "YTREE" and the SNP in question. This will bring you to Alex Williamson's Big Tree, each page of which has the SNP Progression for the particular branch of the Y-Haplotree under discussion (as in the diagram below for the first SNP in the list).

2) Sometimes the google approach will bring you to a branch slightly upstream of the SNP you want and you will have to search the webpage for the more downstream SNP. Do this by pressing cmd+F (ctrl+F on a PC) to FIND the SNP in question.
3) Sometimes the SNP won't be on the Big Tree and you may have to use the FTDNA or YFULL Haplotrees instead in order to find where the particular SNP sits on the tree. 
4) Sometimes you may have to check to see if the SNP has an alternative name
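Once the SNP Progressions have been looked up, it can help to store them in a form that is easy to compare. The sketch below is one possible way of doing that in Python, using the progressions from the example list above. The names (BASE, progressions, EQUIVALENTS, progression_for) are my own illustrative choices, and the progressions still have to be looked up by hand as described.

```python
# Shared upstream part of every progression in the example.
BASE = ["R-P312/S116", "Z290", "L21/S145", "DF13", "Z39589"]
TO_L1065 = BASE + ["L1335/S530", "L1065"]

# SNP Progressions copied from the example list, keyed by terminal SNP.
progressions = {
    "BY3441": BASE + [
        "DF49/S474", "Z2980", "Z2976", "DF23", "Z2961", "FGC6540",
        "FGC6562", "FGC6545", "BY3442", "BY3437", "BY3441",
    ],
    "FGC10117": TO_L1065 + ["FGC10125", "FGC10117"],
    "FGC10125": TO_L1065 + ["FGC10125"],
    "FGC28987": TO_L1065 + ["Z16325", "S744", "S764", "FGC28987"],
    "L1065": TO_L1065,
    "L1335": BASE + ["L1335/S530"],
}

# SNPs noted in the example as equivalent to another SNP.
EQUIVALENTS = {"CTS7030": "L1065", "FGC10116": "FGC10117"}


def progression_for(snp):
    """Return the SNP Progression for a terminal SNP, following equivalents."""
    return progressions[EQUIVALENTS.get(snp, snp)]


print(" > ".join(progression_for("FGC10116")))
```

Keeping equivalents in their own lookup means that a SNP such as CTS7030 is counted on the same branch as L1065 rather than being treated as a separate branch.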

Step 5 - Do the SNPs fall on a single line of descent?

Comparing the SNP Progressions above, a pattern clearly emerges. The majority of the SNP Progressions are on a single line of descent, at least as far down as L1065. The exception is the first SNP (BY3441), which splits off from the rest, two branches above L1065.

Below L1065, there are at least two branches - one via FGC10125 (5 instances - count carefully - count bullet points 4-6), the other via Z16325 (bullet point 7). So the SNPs do fall on a single line of descent ... up to a point. And beyond that point, there is some disparity ... some discordance ... different SNPs on different (i.e. separate) branches of the Haplotree. 

But a single man cannot sit on two conflicting branches. He can only ever sit on one branch. Beyond a certain point, the predicted branches are contradictory. And this discordance indicates that some of his Y-STR matches are Chance Matches due to Convergence.

Chance Matches could also conceivably be due to an extreme lack of Divergence (i.e. the Y-STR signature / haplotype is passed down unchanged for many thousands of years), but the chances of this being the cause are probably very low.
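The comparison in Step 5 can also be sketched in code. The Python below is only an illustration, built from the example progressions above: it finds the part of the line of descent that every match shares, and then counts how many matches go down each branch immediately below a given SNP. The function names and the (path, number of matches) layout are my own assumptions.

```python
from collections import Counter

BASE = ["R-P312/S116", "Z290", "L21/S145", "DF13", "Z39589"]
TO_L1065 = BASE + ["L1335/S530", "L1065"]

# (SNP Progression, number of matches) for each informative terminal SNP.
weighted_paths = [
    (BASE + ["DF49/S474", "Z2980", "Z2976", "DF23", "Z2961", "FGC6540",
             "FGC6562", "FGC6545", "BY3442", "BY3437", "BY3441"], 1),  # BY3441
    (TO_L1065, 1),                                         # CTS7030 (= L1065)
    (TO_L1065 + ["FGC10125", "FGC10117"], 1),              # FGC10116 (= FGC10117)
    (TO_L1065 + ["FGC10125", "FGC10117"], 2),              # FGC10117
    (TO_L1065 + ["FGC10125"], 2),                          # FGC10125
    (TO_L1065 + ["Z16325", "S744", "S764", "FGC28987"], 1),  # FGC28987
    (TO_L1065, 1),                                         # L1065
    (BASE + ["L1335/S530"], 8),                            # L1335
]


def shared_prefix(paths):
    """Return the leading run of SNPs that every path has in common."""
    prefix = []
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        prefix.append(level[0])
    return prefix


def branches_below(paths_with_counts, snp):
    """Count the matches on each branch immediately below `snp`."""
    counts = Counter()
    for path, n in paths_with_counts:
        if snp in path:
            i = path.index(snp)
            if i + 1 < len(path):  # matches that stop at `snp` pick no branch
                counts[path[i + 1]] += n
    return counts


common = shared_prefix([path for path, _ in weighted_paths])
print("All matches agree as far down as:", common[-1])
print("Branches below Z39589:", dict(branches_below(weighted_paths, "Z39589")))
print("Branches below L1065:", dict(branches_below(weighted_paths, "L1065")))
```

More than one branch below a SNP is the discordance described above: the matches cannot all be on the individual's own line of descent.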

Step 6 - make a judgement call

So where is this particular individual likely to sit on the Tree of Mankind? Based purely on the (partial) data presented above, he sits ...
  • Probably below Z39589 (estimated probability ... what? say ... 99%? 95%?)
  • Probably below L1335 (estimated probability ... 16 out of 17 instances = about 94%?)
  • Probably below L1065 (estimated probability ... 8 out of 9 instances = about 89%?)
  • Probably below FGC10125 (estimated probability ... 5 out of 7 instances = about 71%?)
  • Possibly below Z16325 instead (estimated probability ... 1 out of 7 instances = about 14%?)
  • Unlikely to be below DF49 (estimated probability ... 1 out of 17 instances = about 6%?)

These probabilities are relatively crude, but certainly give a strong impression that the individual in question is highly likely to test positive for L1065, and below that is more likely to test positive for FGC10125 than for any of the other downstream SNPs.
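The crude probabilities can be reproduced with a simple count. The sketch below is my own reading of the arithmetic, not a formal statistical method: for each SNP, matches whose terminal SNP stops upstream of it are left out (they say nothing either way), and the remaining matches are split into those that sit below the SNP and those that do not. The function names are illustrative.

```python
BASE = ["R-P312/S116", "Z290", "L21/S145", "DF13", "Z39589"]
TO_L1065 = BASE + ["L1335/S530", "L1065"]

# (SNP Progression, number of matches) for each informative terminal SNP.
weighted_paths = [
    (BASE + ["DF49/S474", "Z2980", "Z2976", "DF23", "Z2961", "FGC6540",
             "FGC6562", "FGC6545", "BY3442", "BY3437", "BY3441"], 1),  # BY3441
    (TO_L1065, 1),                                         # CTS7030 (= L1065)
    (TO_L1065 + ["FGC10125", "FGC10117"], 1),              # FGC10116 (= FGC10117)
    (TO_L1065 + ["FGC10125", "FGC10117"], 2),              # FGC10117
    (TO_L1065 + ["FGC10125"], 2),                          # FGC10125
    (TO_L1065 + ["Z16325", "S744", "S764", "FGC28987"], 1),  # FGC28987
    (TO_L1065, 1),                                         # L1065
    (BASE + ["L1335/S530"], 8),                            # L1335
]


def path_to(snp):
    """Return the progression down to `snp`, taken from any path containing it."""
    for path, _ in weighted_paths:
        if snp in path:
            return path[: path.index(snp) + 1]
    raise KeyError(snp)


def support(snp):
    """Return (matches below `snp`, matches that are informative about `snp`)."""
    target_path = path_to(snp)
    below = informative = 0
    for path, n in weighted_paths:
        # A match that stops upstream of the target is uninformative: skip it.
        stops_upstream = (
            len(path) < len(target_path) and path == target_path[: len(path)]
        )
        if stops_upstream:
            continue
        informative += n
        if snp in path:
            below += n
    return below, informative


for snp in ["L1335/S530", "L1065", "FGC10125", "Z16325", "DF49/S474"]:
    below, total = support(snp)
    print(f"{snp}: {below} out of {total} = about {below / total:.0%}")
```

With the partial example data, this gives the same fractions as the list above (16 out of 17, 8 out of 9, 5 out of 7, 1 out of 7 and 1 out of 17).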

So while this exercise has not identified a specific downstream SNP with 100% probability, it has pointed us in a specific direction and has identified a "most likely candidate", namely FGC10125 (about 70% probability) ... or maybe, some SNP below it, possibly FGC10117.

The SNP FGC10125 appears to have arisen some time at least 1150 years ago, so the exercise has potentially moved us down the Haplotree to a branch that arose within the last 1000-1500 years.

In addition, it has identified with even greater confidence (about 90% probability) that the individual sits somewhere below L1065 for which there happens to be a dedicated SNP Pack. So rather than doing an upstream SNP Pack like the R1b-M343&M269 Backbone Panel, this individual may choose to do the more downstream R1b-L1065 SNP Pack ... which (from the above) is likely to be appropriate with 90% probability. I always caution my project members that there is a chance (10% in this instance) that they will be wasting their money. The choice is theirs.

But before doing any downstream SNP Pack test (the R1b-L1065 SNP Pack in this example), it is always advisable to check that the SNP Pack actually contains the "further downstream" SNPs of interest (extracted from the list of matches' terminal SNPs above). And in this instance, the R1b-L1065 SNP Pack contains all the "more downstream" SNPs identified in the list above. So it would be a good choice to make in this instance ... if the individual did not want to spend money on the Big Y.
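This check is a simple comparison of two lists, and can be sketched as follows. Please note that pack_snps below is only a placeholder: the real contents of any SNP Pack must be taken from the list published for that pack, and the names used here are my own illustrative choices.

```python
# Terminal SNPs from the example that sit below L1065.
snps_of_interest = {"FGC10116", "FGC10117", "FGC10125", "FGC28987"}

# Placeholder only: replace with the SNP list published for the pack
# you are considering. This is NOT the real contents of any SNP Pack.
pack_snps = {"FGC10116", "FGC10117", "FGC10125", "FGC28987"}

# Any SNP of interest that the pack does not test for.
missing = snps_of_interest - pack_snps

if missing:
    print("Not covered by the pack:", ", ".join(sorted(missing)))
else:
    print("The pack covers all the SNPs of interest.")
```

If anything is reported as not covered, the pack may not be able to confirm (or rule out) that particular branch.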

The Output

Several different types of profile can emerge from this exercise and they broadly fall into the following categories:
  1. all the evidence points to a single downstream branch of the Y-Haplotree (say, within the last 1000 years)
  2. most of the evidence points to a single downstream branch, but there is some minor downstream discordance within the last 2000 years or so, with several "very downstream" branches predicted
  3. most / all of the evidence points to a major subclade branch (say, about 2000-4000 years ago) but, below this, many downstream branches are predicted indicating major downstream discordance
  4. the evidence suggests several conflicting upstream branches of the Y-Haplotree (e.g. L21, U106, M198) and only some or none of the evidence points to a single major subclade. Thus in this case, major upstream discordance is present and accurate Downstream SNP Prediction is not possible

The various degrees of discordance arise due to Convergence. This is when by chance, and over the passage of time, the descendants of one branch of the Haplotree develop a similar set of Y-STR marker values to the descendants of another branch of the Haplotree. Thus the genetic signatures of the descendants of both branches look similar and they match each other i.e. they appear in each other's matches list. This suggests there is a close connection (say, within several hundred years) when in fact the common ancestor lived several thousand years ago. They sit on completely different branches of the Haplotree, but their Y-STR signatures suggest they could be close cousins (when in fact they are not).

Here are a few examples of each profile.

Scenario 1 - no discordance, everything points to a single downstream branch

This scenario occurs with Farrell Group 2. Using the MTSA method on many of this group's members, and then plotting the resulting terminal SNPs onto a diagram of the Haplotree, indicates that they all fall on a single line of descent, and predicts that the members of this group will test positive for the downstream SNP FGC20561.

There is little or no evidence of Convergence in this group - all the STR matches appear to be "genuine" ("true positive") matches, and none of them appear to be Chance Matches due to Convergence.

MTSA of many Farrell Group 2 members predicts they will test positive for FGC20561

Scenario 2 - minor downstream discordance

The exercise described above (to illustrate the methodology) indicated that the individual's Y-STR matches all sat on a single line of descent as far down as Z39589. Immediately after that there was some "minor discordance" (one match tested positive for DF49), but the majority of the group continued downstream to L1335 and L1065. Thereafter, there was some more discordance in the group, with 5 going down the path of FGC10125 and one turning down to Z16325. Thus, all the evidence was concordant down to Z39589 (100%), a majority of the available evidence was concordant down to L1065 (89%), and a smaller majority of the available evidence was concordant down to FGC10125 (71%). And from this we can conclude that this individual and his Y-STR matches share a common ancestor on the branch of the tree characterised by Z39589, and probably share another common ancestor further downstream on the branch characterised by L1065, and possibly share another common ancestor on the FGC10125 branch.

This is a fairly typical profile that emerges from this exercise. It takes you so far down the Haplotree but no further. Additional SNP testing will be needed to confirm the predictions.

In this scenario, Convergence is present, but it does not exert an influence until we get quite far downstream. Thus the common ancestor for the group is relatively far downstream, certainly below the major subclade level (about 2000-4000 years ago), and probably within the last 1500 years. In the example above, the major subclade L1065 is at least 1800 years old and the downstream SNP FGC10125 is at least 1150 years old. In the diagram below, the major subclade L226 is at least 1450 years old, and the downstream SNP FGC5628 is at least 1100 years old.

Two Discordant Downstream Branches occurring below major subclade R-L226 

Scenario 3 - major downstream discordance 

In this scenario, the MTSA methodology identifies many Discordant Downstream Branches, frequently with no particular sub-branch predominating. The individual is predicted to sit somewhere below a major subclade branch but there are so many candidates further downstream that no reasonable prediction can be made.

However, it remains clear that the individual does fall below a major subclade branch and therefore the associated subclade SNP Pack may be an appropriate test to take (if the individual does not want to purchase the Big Y). The SNP Pack will need to be checked to see if any relevant SNPs are included therein.

In the diagram below, MTSA predicts that the individual will sit on a branch downstream of M222 (a SNP marker known to be associated with significant Convergence). However, there are at least 6 different branches below M222 that the MTSA methodology predicts as possible candidates for the individual's branch. This person went on to do the Big Y test and the confirmed branch he actually sits on turned out to be none of the candidates predicted by MTSA. This illustrates the importance of making a judgement call on the reliability of the predictions.

Several Discordant Downstream Branches indicate major downstream discordance

Scenario 4 - major upstream discordance

In the final scenario, there are multiple Discordant Upstream Branches making it impossible to predict which subclade of the Haplotree the individual belongs to. For example, some matches sit on L21, others on U106, and others on M198 - all upstream SNPs that are thousands of years old. Under these circumstances, actual Big Y testing is the only option for defining where on the haplotree the individual sits.

I generally use the terms Upstream and Downstream in crude approximation to the nearest major subclade, which tends to be in the range of 2000-4000 years ago. Upstream is roughly more than 4000 years ago; and Downstream is roughly less than 2000 years ago. But these are approximations.

Some Final Words

Downstream SNP Prediction using the MTSA method can be surprisingly predictive in many cases.

Currently it works best at the 25 marker level, simply because there are many more matches at this level and therefore many more datapoints. However, I always check the higher marker levels first and also check for consistency across the different marker levels. I have rarely explored 12 marker results (because the risk of Convergence at this level is so high) but occasionally they can appear useful (nevertheless, a large grain of salt needs ingestion).

Predicting the "most likely" terminal SNP for an individual allows more targeted "confirmatory" SNP testing (via a SNP Pack or single SNP test) and potentially saves the customer money.

It also helps identify Chance Matches due to Convergence, and thus gives some indication of the extent of Convergence within the individual's match list. In subsequent blog posts, we will explore how the MTSA methodology can facilitate quantification of the extent of Convergence not just within an individual's match list, but also for an entire genetic group within a surname project.

I'd like to say a big thank you to Ralph Taylor, James Irvine & Debbie Kennett for helping shape my ideas on this subject.

Maurice Gleeson
Dec 2017