Wednesday 13 December 2017

Downstream SNP Prediction using the MTSA method

I covered much of this topic in a presentation I gave at the FTDNA Annual Conference in Houston (Nov 2017) and you can watch it on YouTube here. The relevant section is from 37 minutes 40 seconds onwards.

Let's imagine that the Tree of Mankind (aka Y-Haplotree) starts with "genetic Adam" (some 250,000 years ago) and splits into progressively more downstream branches as the timeline approaches modern day. These downstream branches can be identified by downstream SNP marker testing of your Y chromosome (with tests such as SNP Packs, and in particular the Big Y). This downstream Y-SNP testing helps locate your position on the Tree of Mankind and potentially this can prove very helpful for a variety of reasons:

  • It can help ensure that you have been grouped accurately in a specific "genetic family" (within a Surname Project, for example)
  • It can help determine your ancestral origins - at times the actual country, and potentially even the region or county ... this helps focus your genealogical research
  • It can identify your nearest genetic neighbours and their associated surnames ... which in turn can tie you into the genealogy of a specific 'clan' or sept
  • It can identify branches within a genetic family and which one you sit on (it can also be useful in generating a Mutation History Tree)
  • It can highlight the risk of Chance Matches due to Convergence amongst your list of matches

But ... the Big Y test is expensive. The technique below tells how to predict the Big Y result without doing the test. In that way you can reap the benefits of the Big Y without actually having to do it. 

The technique is called Downstream SNP Prediction because we will be predicting what SNP markers you are likely to test positive for "downstream" i.e. approaching the modern era, say within the last 500-1500 years. And the MTSA in the title stands for Matches Terminal SNP Analysis - in other words, you will be analysing the terminal SNPs of each person on your list of Y-DNA matches generated from the Y-STR test that you have previously done (be it the Y-DNA-37, Y-DNA-67 or Y-DNA-111).

The technique is quite simple. It just takes a little bit of time to complete (about 10 minutes). But there is one major caveat - it does not always work. And once you see the results, you will have to make a judgement call on whether or not you think the result is likely to be reliable. But when it does work, it works well.

Essentially the MTSA method involves collecting the terminal SNPs of all of your Y-STR matches and then seeing where each SNP in turn sits on the Tree of Mankind. 

If they all sit on the same branch, then you probably do too. If they sit on widely different branches, then the results are untrustworthy (in this particular instance), and the method has not been able to predict which downstream SNP you are likely to test positive for. As a consequence, formal SNP testing (Big Y or otherwise) will be necessary to determine your position on the Tree of Mankind.

The Methodology

Here is a list of the steps involved in Downstream SNP Prediction using the MTSA method. We will go through them later in detail one by one:

  1. To start, sign in to your FTDNA account and open your Y-DNA Matches List
  2. Sort your matches list by "Haplogroup"
  3. Note down the terminal SNPs and how often each one occurs - repeat this step for each marker level (111, 67, 37, & 25). 
  4. Plot each SNP in turn on the Haplotree
  5. Assess whether or not the SNPs fall on a single line of descent coming down the Haplotree ...
    1. if they do, there is a good chance that you will also follow this line of descent and end up on the same downstream branch (or a branch very close by)
    2. if they do not fall on the same single line of descent, then the technique has not worked in this instance because Convergence is present
  6. Make a judgement call on how reliable you think the results are

Now let's look at each step in detail.

Step 1 - open your Y-DNA Matches list

Step 2 - sort your matches by Haplogroup ... just click on the title "Y-DNA Haplogroup" and this will arrange your list of matches alphabetically by their Terminal SNP.

This individual has 183 matches at the 25 marker level (top left)

Step 3 - note down the terminal SNPs and how often each one occurs

In the example above, this would produce a list like this:
  • BY3441
  • CTS7030
  • DF13
  • FGC10116
  • FGC10117 (x2)
  • FGC10125 (x2)
  • FGC28987
  • L1065
  • L1335 (x8)
  • etc ...

1) I don't bother recording the frequency of single SNPs. Thus, any SNP in the list without a number in brackets has only occurred once in the list.
2) I ignore any known "upstream" SNPs (e.g. M269, L21, etc) as these are too far upstream to be informative.
3) This exercise should be repeated at each marker level (111, 67, 37 & 25). In practice, the 25 marker level appears to be the most informative (currently).
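If you have copied the Haplogroup column into a list, the tally in Step 3 can be sketched in a few lines of Python. This is only an illustration: the `haplogroup_column` list is made-up sample data (it reproduces the counts in the example above, plus a few M269 and L21 entries to show the filtering), and `UPSTREAM` is simply whichever SNPs you choose to ignore.

```python
from collections import Counter

# SNPs treated as too far upstream to be informative (your own choice).
UPSTREAM = {"M269", "L21"}

# Hypothetical terminal SNPs copied from one marker level of a matches list.
haplogroup_column = (
    ["M269"] * 3 + ["L21"] * 2
    + ["BY3441", "CTS7030", "DF13", "FGC10116"]
    + ["FGC10117"] * 2 + ["FGC10125"] * 2
    + ["FGC28987", "L1065"] + ["L1335"] * 8
)

# Count each terminal SNP, skipping the upstream ones.
tally = Counter(snp for snp in haplogroup_column if snp not in UPSTREAM)

# Show the frequency only when a SNP occurs more than once.
summary = [f"{snp} (x{n})" if n > 1 else snp for snp, n in sorted(tally.items())]
print("\n".join(summary))
```

Running the same tally separately for each marker level (111, 67, 37 & 25) makes it easy to compare the lists for consistency.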

Step 4 - Plot each SNP in turn on the Haplotree

This is the most time-consuming part of the exercise but you will get quicker with practice. To be comprehensive, it is best to identify the SNP Progression for each SNP in turn. The SNP Progression is simply the series of SNPs that characterise each branching point on the line of descent to the "terminal SNP" in question.

Thus the SNP Progressions associated with the list above would be listed as follows:
  • BY3441 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > DF49/S474 > Z2980 > Z2976 > DF23 > Z2961 > FGC6540 > FGC6562 > FGC6545 > BY3442 > BY3437 > BY3441
  • CTS7030 ... equivalent to L1065
  • DF13 ... too far upstream 
  • FGC10116 ... equivalent to FGC10117
  • FGC10117 (x2) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > FGC10125 > FGC10117
  • FGC10125 (x2) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > FGC10125
  • FGC28987 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065 > Z16325 > S744 > S764 > FGC28987
  • L1065 ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530 > L1065
  • L1335 (x8) ... 
    • R-P312/S116 > Z290 > L21/S145 > DF13 > Z39589 > L1335/S530
  • etc ... 

1) The easiest way to find the SNP Progression is simply to google "YTREE" and the SNP in question. This will bring you to Alex Williamson's Big Tree, each page of which has the SNP Progression for the particular branch of the Y-Haplotree under discussion (as in the diagram below for the first SNP in the list).

2) Sometimes the google approach will bring you to a branch slightly upstream of the SNP you want and you will have to search the webpage for the more downstream SNP. Do this by pressing Cmd+F (Ctrl+F on a PC) to find the SNP in question.
3) Sometimes the SNP won't be on the Big Tree and you may have to use the FTDNA or YFULL Haplotrees instead in order to find where the particular SNP sits on the tree. 
4) Sometimes you may have to check to see if the SNP has an alternative name

Step 5 - Do the SNPs fall on a single line of descent?

Comparing the SNP Progressions above, a pattern clearly emerges. The majority of the SNP Progressions are on a single line of descent, at least as far down as L1065. The exception is the first SNP (BY3441), which splits off from the rest, two branches above L1065.

Below L1065, there are at least two branches - one via FGC10125 (5 instances - count carefully - count bullet points 4-6), the other via Z16325 (bullet point 7). So the SNPs do fall on a single line of descent ... up to a point. And beyond that point, there is some disparity ... some discordance ... different SNPs on different (i.e. separate) branches of the Haplotree. 

But a single man cannot sit on two conflicting branches. He can only ever sit on one branch. Beyond a certain point, the predicted branches are contradictory. And this discordance indicates that some of his Y-STR matches are Chance Matches due to Convergence.

Chance Matches could also conceivably be due to an extreme lack of Divergence (i.e. the Y-STR signature / haplotype is passed down unchanged for many thousands of years), but the chances of this being the cause are probably very low.
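The comparison in Step 5 can be sketched in Python by treating each SNP Progression as a list read from the top of the tree downwards. This is a rough illustration only: the function names are my own, and the alternative SNP names (e.g. S116, S145) are left out for brevity.

```python
def is_single_line(progressions):
    """True if every SNP Progression is a prefix of the longest one."""
    longest = max(progressions, key=len)
    return all(p == longest[:len(p)] for p in progressions)

def shared_line(progressions):
    """SNPs shared by every progression, read from the top down."""
    shared = []
    for level in zip(*progressions):
        if len(set(level)) > 1:  # the progressions part company here
            break
        shared.append(level[0])
    return shared

# SNP Progressions from Step 4 (alternative names omitted).
base = ["P312", "Z290", "L21", "DF13", "Z39589"]
l1335 = base + ["L1335"]
l1065 = l1335 + ["L1065"]
fgc10125 = l1065 + ["FGC10125"]
fgc10117 = fgc10125 + ["FGC10117"]
fgc28987 = l1065 + ["Z16325", "S744", "S764", "FGC28987"]
by3441 = base + ["DF49", "Z2980", "Z2976", "DF23", "Z2961", "FGC6540",
                 "FGC6562", "FGC6545", "BY3442", "BY3437", "BY3441"]

print(is_single_line([l1335, l1065, fgc10125, fgc10117]))            # one line of descent
print(is_single_line([l1335, l1065, fgc10125, fgc10117, fgc28987]))  # discordant below L1065
print(shared_line([fgc10117, fgc28987])[-1])  # last SNP these two have in common
print(shared_line([by3441, fgc10117])[-1])    # where BY3441 splits off
```

A result of `False` from `is_single_line` is the signal that at least some of the matches are sitting on separate branches.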

Step 6 - make a judgement call

So where is this particular individual likely to sit on the Tree of Mankind? Based purely on the (partial) data presented above, he sits ...
  • Probably below Z39589 (estimated probability ... what? say ... 99%? 95%?)
  • Probably below L1335 (estimated probability ... 16 out of 17 instances = about 94%?)
  • Probably below L1065 (estimated probability ... 8 out of 9 instances = about 89%?)
  • Probably below FGC10125 (estimated probability ... 5 out of 7 instances = about 71%?)
  • Possibly below Z16325 (estimated probability ... 1 out of 7 instances = about 14%?)
  • Possibly below DF49 (estimated probability ... 1 out of 17 instances = about 6%?)

These probabilities are relatively crude, but certainly give a strong impression that the individual in question is highly likely to test positive for L1065, and below that is more likely to test positive for FGC10125 than for any of the other downstream SNPs.
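For anyone who wants to reproduce the fractions above, here is a rough Python sketch of the counting. It rests on one assumption that I have made explicit in the code: a match whose terminal SNP lies upstream of the SNP being assessed (e.g. an L1335 match when assessing L1065) tells us nothing either way and is left out of the denominator. DF13 is excluded as too far upstream, and the function name `support` is my own.

```python
base = ["P312", "Z290", "L21", "DF13", "Z39589"]
l1335 = base + ["L1335"]
l1065 = l1335 + ["L1065"]
fgc10125 = l1065 + ["FGC10125"]

# terminal SNP -> (number of matches, SNP Progression)
matches = {
    "BY3441": (1, base + ["DF49", "Z2980", "Z2976", "DF23", "Z2961", "FGC6540",
                          "FGC6562", "FGC6545", "BY3442", "BY3437", "BY3441"]),
    "CTS7030": (1, l1065),                    # equivalent to L1065
    "FGC10116": (1, fgc10125 + ["FGC10117"]),  # equivalent to FGC10117
    "FGC10117": (2, fgc10125 + ["FGC10117"]),
    "FGC10125": (2, fgc10125),
    "FGC28987": (1, l1065 + ["Z16325", "S744", "S764", "FGC28987"]),
    "L1065": (1, l1065),
    "L1335": (8, l1335),
}

def support(target_path):
    """Return (matches below the target SNP, informative matches)."""
    target = target_path[-1]
    on_branch = off_branch = 0
    for count, path in matches.values():
        if target in path:
            on_branch += count
        elif path[-1] not in target_path:
            off_branch += count  # sits on a different branch
        # else: terminal SNP is upstream of the target, so uninformative
    return on_branch, on_branch + off_branch

targets = [base, l1335, l1065, fgc10125, l1065 + ["Z16325"], base + ["DF49"]]
for path in targets:
    hits, total = support(path)
    print(f"{path[-1]}: {hits} out of {total} = about {round(100 * hits / total)}%")
```

These are simple proportions of the available matches, not formal probabilities, so they should be read as a rough guide only.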

So while this exercise has not identified a specific downstream SNP with 100% probability, it has pointed us in a specific direction and has identified a "most likely candidate", namely FGC10125 (about 70% probability) ... or maybe some SNP below it, possibly FGC10117.

The SNP FGC10125 appears to have arisen some time at least 1150 years ago, so the exercise has potentially moved us down the Haplotree to a branch that arose within the last 1000-1500 years.

In addition, it has identified with even greater confidence (about 90% probability) that the individual sits somewhere below L1065 for which there happens to be a dedicated SNP Pack. So rather than doing an upstream SNP Pack like the R1b-M343&M269 Backbone Panel, this individual may choose to do the more downstream R1b-L1065 SNP Pack ... which (from the above) is likely to be appropriate with 90% probability. I always caution my project members that there is a chance (10% in this instance) that they will be wasting their money. The choice is theirs.

But before doing any downstream SNP Pack test (the R1b-L1065 SNP Pack in this example), it is always advisable to check that the SNP Pack actually contains the "further downstream" SNPs of interest (extracted from the list of matches' terminal SNPs above). And in this instance, the R1b-L1065 SNP Pack contains all the "more downstream" SNPs identified in the list above. So it would be a good choice to make in this instance ... if the individual did not want to spend money on the Big Y.

The Output

Several different types of profile can emerge from this exercise and they broadly fall into the following categories:
  1. all the evidence points to a single downstream branch of the Y-Haplotree (say, within the last 1000 years)
  2. most of the evidence points to a single downstream branch, but there is some minor downstream discordance within the last 2000 years or so, with several "very downstream" branches predicted
  3. most / all of the evidence points to a major subclade branch (say, about 2000-4000 years ago) but, below this, many downstream branches are predicted indicating major downstream discordance
  4. the evidence suggests several conflicting upstream branches of the Y-Haplotree (e.g. L21, U106, M198) and only some or none of the evidence points to a single major subclade. Thus in this case, major upstream discordance is present and accurate Downstream SNP Prediction is not possible

The various degrees of discordance arise due to Convergence. This is when, by chance and over the passage of time, the descendants of one branch of the Haplotree develop a similar set of Y-STR marker values to the descendants of another branch. The genetic signatures of the descendants of both branches therefore look similar and they match each other, i.e. they appear in each other's matches list. This suggests a close connection (say, within several hundred years) when in fact the common ancestor lived several thousand years ago. They sit on completely different branches of the Haplotree, but their Y-STR signatures suggest they could be close cousins (when in fact they are not).

Here are a few examples of each profile.

Scenario 1 - no discordance, everything points to a single downstream branch

This scenario occurs with Farrell Group 2. Using the MTSA method on many of this group's members, and then plotting the resulting terminal SNPs onto a diagram of the Haplotree, indicates that they all fall on a single line of descent, and predicts that the members of this group will test positive for the downstream SNP FGC20561.

There is little or no evidence of Convergence in this group: all the STR matches appear to be "genuine", "true positive" matches, and none of them appear to be Chance Matches due to Convergence.

MTSA of many Farrell Group 2 members predicts they will test positive for FGC20561

Scenario 2 - minor downstream discordance

The exercise described above (to illustrate the methodology) indicated that the individual's Y-STR matches all sat on a single line of descent as far down as Z39589. Immediately after that there was some "minor discordance" (one match tested positive for DF49), but the majority of the group continued downstream to L1335 and L1065. Thereafter, there was some more discordance in the group, with 5 going down the path of FGC10125 and one turning down to Z16325. Thus, all the evidence was concordant down to Z39589 (100%), a majority of the available evidence was concordant down to L1065 (89%), and a smaller majority of the available evidence was concordant down to FGC10125 (71%). And from this we can conclude that this individual and his Y-STR matches share a common ancestor on the branch of the tree characterised by Z39589, and probably share another common ancestor further downstream on the branch characterised by L1065, and possibly share another common ancestor on the FGC10125 branch.

This is a fairly typical profile that emerges from this exercise. It takes you so far down the Haplotree but no further. Additional SNP testing will be needed to confirm the predictions.

In this scenario, Convergence is present, but it does not exert an influence until we get quite far downstream. Thus the common ancestor for the group is relatively far downstream, certainly below the major subclade level (about 2000-4000 years ago), and probably within the last 1500 years. In the example above, the major subclade L1065 is at least 1800 years old and the downstream SNP FGC10125 is at least 1150 years old. In the diagram below, the major subclade L226 is at least 1450 years old, and the downstream SNP FGC5628 is at least 1100 years old.

Two Discordant Downstream Branches occurring below major subclade R-L226 

Scenario 3 - major downstream discordance 

In this scenario, the MTSA methodology identifies many Discordant Downstream Branches, frequently with no particular sub-branch predominating. The individual is predicted to sit somewhere below a major subclade branch but there are so many candidates further downstream that no reasonable prediction can be made.

However, it remains clear that the individual does fall below a major subclade branch and therefore the associated subclade SNP Pack may be an appropriate test to take (if the individual does not want to purchase the Big Y). The SNP Pack will need to be checked to see if any relevant SNPs are included therein.

In the diagram below, MTSA predicts that the individual will sit on a branch downstream of M222 (a SNP marker known to be associated with significant Convergence). However, there are at least 6 different branches below M222 that the MTSA methodology predicts as possible candidates for the individual's branch. This person went on to do the Big Y test and the confirmed branch he actually sits on turned out to be none of the candidates predicted by MTSA. This illustrates the importance of making a judgement call on the reliability of the predictions.

Several Discordant Downstream Branches indicate major downstream discordance

Scenario 4 - major upstream discordance

In the final scenario, there are multiple Discordant Upstream Branches making it impossible to predict which subclade of the Haplotree the individual belongs to. For example, some matches sit on L21, others on U106, and others on M198 - all upstream SNPs that are thousands of years old. Under these circumstances, actual Big Y testing is the only option for defining where on the haplotree the individual sits.

I generally use the terms Upstream and Downstream in crude approximation to the nearest major subclade, which tends to be in the range of 2000-4000 years ago. Upstream is roughly more than 4000 years ago; and Downstream is roughly less than 2000 years ago. But these are approximations.

Some Final Words

Downstream SNP Prediction using the MTSA method can be surprisingly predictive in many cases.

Currently it works best at the 25 marker level, simply because there are many more matches at this level and therefore many more datapoints. However I always check the higher marker levels first and also check for consistency across the different marker levels. I have rarely explored 12 marker results (because the risk of Convergence at this level is so high) but occasionally they can appear useful (nevertheless, a large grain of salt needs ingestion).

Predicting the "most likely" terminal SNP for an individual allows more targeted "confirmatory" SNP testing (via a SNP Pack or single SNP test) and potentially saves the customer money.

It also helps identify Chance Matches due to Convergence within an individual's match list, and thus gives some indication of the extent of Convergence within that list. In subsequent blog posts, we will explore how the MTSA methodology can facilitate quantification of the extent of Convergence, not just within an individual's match list, but also for an entire genetic group within a surname project.

I'd like to say a big thank you to Ralph Taylor, James Irvine & Debbie Kennett for helping shape my ideas on this subject.

Maurice Gleeson
Dec 2017

Friday 17 November 2017

FTDNA Holiday Sale until Dec 31 2017

FamilyTreeDNA have launched their Annual Holiday Sale. This runs from the last day of the Annual FTDNA Conference (Nov 12th 2017) until the end of the year. So now is the time to buy FTDNA tests and take advantage of some of their lowest prices ever. They also make perfect Birthday, Thanksgiving & Christmas gifts for friends and family.

2017 Holiday Sale Discounts

There are discounts on many of their products including upgrades on mtDNA and Y-DNA. The discounts represent approximately a 10-30% reduction from the usual price.

There is a special offer regarding the Big Y test. The usual price is $575 but there is a $100 discount in the sale. Further discounts are possible with the vouchers described below. But everyone who buys a Big Y test will automatically get a FREE upgrade to the Y-DNA-111 test. So if you have only tested your Y-DNA to the 37 marker level, buying the Big Y will get you a free upgrade to 111 markers (which would normally cost you $188).

Even if you haven't done a Y-DNA-37 test yet, you can order it at the Sale Price, and use a voucher for a further discount, and then once it has registered on the system, you can order the Big Y test and get the $100 Sale Price discount, and any additional voucher discount, and a free upgrade to 111 markers. This is a very good deal indeed!
So if you were very lucky, you could get the Y-DNA-37 for $109 (using a $20 voucher) plus the Big Y for $375 (using a $100 voucher) and the free upgrade to 111 markers. This would normally cost $169 + $575 + $188 = $932 but you would be getting it for $484. This is only about 52% of the price you would normally pay.

As mentioned above, you can use Holiday Reward vouchers to lower the sale prices even further. These will be issued every Monday until the end of the Sale but each voucher only lasts for 7 days so you have to use them quickly. In effect, this may reduce the cost of the Family Finder atDNA test to $49 and Y-DNA-37 to $109.

A $20 voucher for the Y-DNA-67 test

To access your voucher, simply log on to your FTDNA account and click on the Holiday Reward icon on your home page. If you make a purchase during the Sale, you frequently get a Bonus Reward as well. This gives further discounts on other tests.

And if you want to use the voucher for yourself, simply click on the Enjoy Rewards button and the product will be added to your Cart and the discount applied. Alternatively you can give the voucher to friends or family by clicking on the Share Rewards button. Each voucher can only be used once, and must be used before the weekly deadline.

A lot of people donate any vouchers they are not using so check the ISOGG Facebook group and Genetic Genealogy Ireland Facebook group for any unused vouchers that you might be able to take advantage of. Be warned, they go fast so you might have to try several before you find one that works.

Enjoy the Sale!

Maurice Gleeson
Nov 2017

Saturday 19 August 2017

DNA Workshop at the British Institute 2017 (Salt Lake City)

I will be running a week-long workshop on DNA at the British Institute 2017 in Salt Lake City (Sep 18-22) and everyone is welcome to attend.

The workshop is for Beginners and Intermediates alike. It covers every aspect of DNA-testing for genealogy, including the basic science, the use of Y-DNA for surname research, and working with your DNA matches to break through Brick Walls.

There are still places left so visit this link for full details and booking ...

Participants will also get a 150-page book to accompany the course.

See you there!

Maurice Gleeson
Aug 2017

Here is the schedule for the week.

Monday (introduction and mitochondrial DNA)
  • Introduction to DNA - the 3 main types of DNA, the different types of test
  • Applying DNA to your genealogy - what each test can tell you, which test is best for your particular genealogical question, examples of the application of each type of test
  • A brief look at minor topics: Neanderthal DNA, Human Migration, Medical Applications
  • A comparison of the different companies and what they offer
  • A brief look at resources available to help the genetic genealogist (from beginner to advanced)
  • A closer look at mitochondrial DNA and how it can be used for genealogy - mtDNA and ancestral homeland localisation
Tuesday (Y-DNA - STR markers & SNP markers)
  • A deeper dive into Y-DNA 
  • How Y-DNA can help you locate your ancestral homeland
  • Surname Projects - what surname projects can tell you over and above your own Y-DNA results - how they help identify ancestral homelands - piggybacking onto the longest genealogies
  • A close look at the benefits of joining Y-DNA Haplogroup Projects, Geographic Projects, Special Interest Projects
  • Y-DNA SNP marker testing - what it tells us about the human evolutionary tree - how it can be used to link in to the ancient genealogies - where it will be in 5 years
  • How to build family trees using Y-DNA STR markers and SNP markers when you have run out of people
Wednesday (Autosomal DNA)
  • What is atDNA? What is xDNA? Patterns of inheritance. Recombination. Independent assortment.
  • atDNA and ethnic makeup: What does it really tell us? A close look at the People of the British Isles project and the Irish DNA Atlas - can the results inform our own genealogy? How ethnic admixture results can help inform your own family history?
  • What is a match? When is a match not a match?
  • Using a step by step approach to analysing your atDNA matches
Thursday (Autosomal DNA)
  • Explaining ICW matches and the importance of overlapping segments
  • The significance of Triangulated Groups (TGs) - generating TGs using known cousins - generating TGs using unknown cousins
  • Using Third party tools to harness the power of your atDNA matches 
    • Gedmatch
    • DNAGedcom
    • GenomeMate
    • DNAadoption
    • ... and others
Friday (Practical Application of DNA to Genealogy)
  • Applying all types of DNA in practice using Adoptees as an example:
  • Using Y-DNA to predict the surname
  • Using close atDNA matches to identify probable close ancestors (great grandparents or closer)
  • Developing a testing strategy to narrow down candidates for birth parents
  • Other testing strategies for specific attendee problems
  • General Q&A Session

Friday 30 June 2017

Criteria for Grouping People into Y-DNA Genetic Families

One of the main tasks of Surname Project administrators is to place new members into the appropriate genetic group within their surname project.

Having run a variety of surname projects for the last few years, I have come up with a set of criteria I use on a routine basis to place newcomers into existing genetic groups and also to identify new genetic groups. I call these criteria Markers of Potential Relatedness (MPRs). And (not surprisingly) these can be thought of as indicators that two people may be "related" to each other, which for the purposes of surname projects means somewhere in the last 1000 years or so. This arbitrary timepoint is chosen because many European surnames were introduced about 1000 years ago (in particular British and Irish surnames), although they only became commonplace several centuries thereafter.

This approach to grouping works best with hereditary surnames (i.e. passed from father to son) but should also work with patronymic (and other) surnames, except that (in these latter cases) criteria 1 and 8 will not apply. The discussion below is very much from the standpoint of hereditary surname projects.

Not all criteria have to be met. But the more criteria that are met, the higher the likelihood of two people being related. This is particularly important in relation to SDSs (Surname or DNA Switches; also known as NPEs, Non-Paternity Events), as it may be difficult to distinguish a match that is an SDS (e.g. adoption, illegitimacy) from one that is due to Convergence.

Below is a list of these criteria and we will consider each one in turn. Some of these Markers of Potential Relatedness (MPRs) have nothing to do with DNA. If two people have the same surname, or the same unusual surname variant, or have a similar ancestral homeland, or even an ancestor with the exact same name, then these can be indicators that the two people are related. And because they don't rely on genetics I simply call them "traditional markers" as opposed to "genetic markers". 

MPRs for deciding if two or more people are related within the last 1000 years

In practice, the most useful indicators (or at least the ones I most frequently use) are Markers 1, 2, 6 and 7. And if a new project member is grouped on the basis of these "main" markers, it usually becomes apparent that they meet many of the remaining criteria also.

1. The members have the same surname

This is an obvious criterion, especially for surname projects that deal with hereditary surnames. If two people share the same surname, the next question is: are they related? And it would seem a reasonable supposition that there is a much higher probability that they are related on their direct male lines (within the last 1000 years) if they do share a surname than if they don't.

Problems tend to arise when there is some doubt over what is a valid surname variant and what is not. For example, are Malley and Malloy surname variants? Are Farrell and Farris surname variants? What happens when you get both types of variant testing positive for M222? Do you group them together or keep them apart? Only other MPRs (such as downstream SNP testing) can answer these questions.

2. The Genetic Distance (GD) between two people indicates a (very) close relationship

The threshold for "declaring a match" between two people varies with the number of STR markers tested (see below). These thresholds are arbitrary, but the intention is to get the right balance between false positives and false negatives - in other words, letting the wrong people in and keeping the right people out (known more technically as specificity and sensitivity).

Most people do the Y-DNA-37 test initially and I would usually feel very confident grouping together people with the same surname if their GD was 2/37 or less, and reasonably confident of grouping them together if the GD was 4/37 or less. The exception is where there is evidence of Convergence, as indicated (for example) by the terminal SNPs of their matches sitting on a wide variety of distantly related "upstream" branches of the Y-Haplotree (Tree of Mankind). We'll talk about this some more in item 7 below.

In addition, Convergence is a common occurrence in certain subclades, such as M222 and L226. When I see these terminal SNPs in a new project member, alarm bells start ringing, my level of conservatism increases, and I start looking to MPRs other than Genetic Distance to decide if two people belong in the same genetic family.
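As a rough illustration of how the 37-marker thresholds above might be applied, here is a short Python sketch. Two things are assumptions on my part: the Genetic Distance is computed as a simple sum of step differences (real tools may treat multi-copy markers and null values differently), and the haplotypes are made-up numbers, not real results. The function names are purely illustrative.

```python
def genetic_distance(haplotype_a, haplotype_b):
    """Simple step-wise GD: sum of absolute differences, marker by marker."""
    if len(haplotype_a) != len(haplotype_b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(a - b) for a, b in zip(haplotype_a, haplotype_b))

def gd_confidence_37(gd, same_surname=True, convergence_suspected=False):
    """Apply the 2/37 and 4/37 thresholds described above."""
    if not same_surname or convergence_suspected:
        return "rely on other MPRs"
    if gd <= 2:
        return "very confident"
    if gd <= 4:
        return "reasonably confident"
    return "rely on other MPRs"

# Hypothetical 37-marker haplotypes (values invented for illustration).
group_member = [13, 24, 14, 11] + [12] * 33
newcomer = group_member.copy()
newcomer[1] += 1    # one-step difference on one marker
newcomer[10] -= 2   # two-step difference on another

gd = genetic_distance(group_member, newcomer)
print(f"GD = {gd}/37:", gd_confidence_37(gd))
print("With Convergence suspected:", gd_confidence_37(gd, convergence_suspected=True))
```

The thresholds are arbitrary and can be tightened for subclades such as M222 and L226 where Convergence is common.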


This technique for grouping people together will miss outliers - people who do indeed belong in the same genetic family but whose ancestors branched away from the main group many many generations ago. For example, in the Gleeson DNA Project, several of the members of Lineage II (all confirmed to be related by Big-Y SNP testing) have a GD of 10/37 compared to other group members, and that would usually preclude them being grouped together.

3. The TiP24 score is >80% compared to the group modal haplotype

I don't use this marker so much anymore but it can be a useful way of assessing if a newcomer belongs in a given genetic family, especially if there is insufficient data regarding SNP markers among their STR matches. The potential benefit of this method is that it takes into account the varying mutation rates of STR markers whereas GD does not.

It involves generating a TiP Report between a new project member and the member closest to the modal haplotype for a given genetic family within the project, and then looking at the percentage probability of being related within 24 generations. We call this the TiP24 Score (for lack of a better term). If this is >80% (an arbitrary figure, which can be adjusted to suit your personal preference), then the newcomer can be considered to be "likely to be related" and therefore placed in that specific genetic family.

It is important to note that the use of the TiP24 Score is not an attempt to date when two people are related, merely to ascertain if two people are likely to be related. The TiP24 Score is simply an attempt to standardise GD comparisons, given that we know that a GD of (say) 4/37 on slow-mutating markers is much more significant than a GD of 4/37 on fast-mutating markers. The former (probably) indicates a much more distant relationship than the latter.

This technique works best for those related within the last several hundred years, but will miss outliers. I have several people in the Gleeson DNA Project (confirmed to be related via SNP testing) whose TiP24 Score with other members is as low as 1%.

Also, the TiP24 Score is likely to be tripped up by Convergence (in the same way that GD is) and is therefore of limited utility in such circumstances.

4. There is a clear Genetic Distance Demarcation between project members within a genetic cluster & project members outside it

Administrators have access to a tool called the "Y-DNA Genetic Distance" tool. This permits comparisons between any person in the project and every other person in the project. Often, there will be a clear demarcation between a newcomer's range of GDs to a particular genetic family and all other genetic families within the project.

In the example below, the newcomer matches 9 members of R1b-Genetic Family 2 with a GD ranging from 4/67 to 9/67. Thereafter, the GD jumps to 16/67 and higher. This stark demarcation in GD suggests strongly that the newcomer falls within R1b-Genetic Family 2.

This also suggests that Convergence is unlikely to be an issue here (otherwise we might expect to see a more gradual increase in GD values, rather than the jump from 9 to 16 that we see here).

This technique works best with 111 or 67 marker comparisons. Demarcations are much less obvious using 37 marker comparisons.

The GD between the newcomer & other members
shows a clear demarcation between
one particular genetic family and all others
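
The demarcation can also be looked for mechanically. The following is only a minimal sketch: the GD values are hypothetical (chosen to resemble the example above, all at the same marker level), and the function name `largest_gd_gap` is my own, not part of any FTDNA tool.

```python
def largest_gd_gap(gds):
    """Return (lower, upper, gap) for the widest jump between consecutive sorted GD values."""
    ordered = sorted(gds)
    if len(ordered) < 2:
        raise ValueError("need at least two GD values")
    # Compare each GD with the next one up and keep the pair that is furthest apart
    lower, upper = max(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return lower, upper, upper - lower


# Hypothetical GDs between a newcomer and every other project member
gds = [4, 5, 6, 7, 7, 8, 9, 9, 9, 16, 17, 19, 21]
lower, upper, gap = largest_gd_gap(gds)
print(f"Largest jump: {lower} -> {upper} (gap of {gap})")
```

A wide gap after a tight run of small GDs is the pattern described above; a gradual increase with no obvious gap would leave the grouping (and the question of Convergence) open.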

5. Presence of Rare Marker Values or a Unique STR Pattern among genetic group members

The idea here is that if one or more people share a Rare Marker Value, then it stands to reason that they are more likely to be related to each other, especially if they all share the same surname.

Leo Little's spreadsheet of STR marker value frequencies is very useful for identifying those values which are particularly rare, even though the spreadsheet only covers seven of the main haplogroups (E3a, E3b, G, I, J2, R1a, R1b). What constitutes "rare" is a moveable feast but a frequency less than 5% would not be unreasonable.

Usually these rare marker values emerge after several people have been grouped together. Any newcomers thereafter who share this rare marker value can be further assessed for membership of the specific genetic family wherein the rare marker value occurs. A famous example is Group B of the Wheaton Surname Project where 3 "rare" marker values occur within the first 12 markers (with incidences of 5%, 1% & 8% in the "general" R1b population). The chance of all three occurring together in the general population is 1 in 62,000. Therefore, any Wheaton who matches these 3 STR marker values can be automatically allocated to Group B (with 99.99% confidence). And they only need a 12-marker test to do so.

Leo Little's spreadsheet of marker value frequencies
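
As a rough illustration of how such a combined frequency can be estimated, the sketch below simply multiplies the individual frequencies. This assumes the marker values occur independently of each other, which is my simplifying assumption and not something the spreadsheet guarantees. Note that multiplying the rounded percentages quoted above (5%, 1%, 8%) gives about 1 in 25,000, which does not reproduce the 1 in 62,000 figure; that figure may rest on unrounded frequencies or a different calculation.

```python
import math


def joint_frequency(frequencies):
    """Multiply per-marker frequencies, assuming the marker values occur independently."""
    return math.prod(frequencies)


# Rounded incidences quoted for the three Wheaton Group B marker values
wheaton_freqs = [0.05, 0.01, 0.08]
p = joint_frequency(wheaton_freqs)
one_in = round(1 / p)
print(f"Joint frequency: {p:.6f} (about 1 in {one_in:,})")
```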

An allied concept is that of the Unique STR Pattern (USP, also known by various other terms such as STR Motif). In short, these are a selection of STR marker values (usually between 3 and 8 in number) that are "unique" to just a few people within a surname project and which indicate that the people concerned are likely to be related to each other.

A good example from the Gleeson DNA Project shows that several members had a relatively Unique STR Pattern (USP) which predicted that they were related (Branch E and F below). This was later confirmed by SNP testing of the two branches.

Unique STR Patterns (USP) predict the existence of a Branch E and F (last 6 entries)
Branch E signature ... 464b=17, 607=14, 576=17
Branch F signature ... 391=10, 458=17, 459=9-9, 576=17
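
Checking a newcomer against these signatures is a simple lookup. Here is a minimal sketch using the two signatures quoted above; the newcomer's marker values are hypothetical, and the names `SIGNATURES` and `matching_branches` are my own.

```python
# Signatures as quoted above; marker values are kept as strings so that
# multi-copy markers such as 459=9-9 can be compared directly
SIGNATURES = {
    "Branch E": {"464b": "17", "607": "14", "576": "17"},
    "Branch F": {"391": "10", "458": "17", "459": "9-9", "576": "17"},
}


def matching_branches(haplotype, signatures=SIGNATURES):
    """Return the branches whose every signature marker value appears in the haplotype."""
    return [
        branch
        for branch, signature in signatures.items()
        if all(haplotype.get(marker) == value for marker, value in signature.items())
    ]


# Hypothetical newcomer (only a handful of markers shown)
newcomer = {"391": "10", "458": "17", "459": "9-9", "576": "17", "607": "15", "464b": "16"}
print(matching_branches(newcomer))
```

A match on a signature is only a prediction of relatedness; as noted above, it still needs to be confirmed by SNP testing.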

Robert Casey has developed this concept extensively and you can hear him talk about it in this video here.

6. SNP testing is consistent among the members of the particular group

The advent of Next Generation Sequencing (producing tests like the Big Y and the array of SNP Packs) has created a SNP tsunami. And as more people SNP test, their predicted red SNP is being converted to a green confirmed SNP on the project's Y-DNA Results page.

As a result, many groups within a surname project are having their "Terminal SNP" characterised. And this allows us to compare any SNP markers that the newcomer has tested with the SNP markers that characterise the various groups within our surname project. If they are discordant, then the newcomer is ruled out from membership of those particular genetic families. But if they agree with each other, especially if they are SNPs quite far downstream, then this is further supportive evidence that the newcomer belongs in a specific genetic family.

The phrase terminal SNP is a bit of a misnomer. It should be restated as "current terminal SNP" and simply means the "most downstream" SNP marker that you have currently tested. And what is meant by "most downstream"? Imagine the Tree of Mankind (the Y-Haplotree) as starting with genetic Adam (upstream) about 250,000 years ago and the various branches emerging from him and continuously branching over many thousands of years into finer and finer "more downstream" branches, until these finer branches start approaching the origin of surnames (roughly 1000 years ago) and a genealogical timeframe. So your "most downstream" branch would be the branch characterised by your "most downstream" SNP marker ... which in turn is determined by your current level of SNP testing. For example, your Y-DNA 37 STR results will predict which Haplogroup branch you sit on (let's say it is R-M269, which arose about 13.5K years ago), and the R-M269 SNP Pack will take you a little further down Branch R (say to Z255, 4000 years ago), and the R-Z255 SNP Pack will take you even further downstream (maybe to 2000 years ago), but the Big Y test will take you the furthest (maybe down to 500 years ago).

In the example below, all the green confirmed SNPs sit below the SNP marker that defines Gleeson Lineage II, namely A5631. Therefore any newcomer who matches any of these SNPs (even if he has a large GD to everyone in the project) can be reliably grouped into Lineage II. The abbreviated SNP Progressions (or SNP Signatures) for each of the individual SNPs are detailed below:

  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707 
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880

Predicted SNPs (red) and Confirmed SNPs (green)
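
The grouping rule just described (any SNP at or below A5631 places the newcomer in Lineage II) can be sketched in a few lines. The progressions are the four listed above; the function names and the two example newcomers are hypothetical.

```python
PROGRESSIONS = [
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880",
]


def parse(progression):
    """Split 'R-Z255 > Z16437 > ...' into ['Z255', 'Z16437', ...]."""
    names = [name.strip() for name in progression.split(">")]
    return [name[2:] if name.startswith("R-") else name for name in names]


def snps_below(defining_snp, progressions):
    """Collect every SNP that sits at or below defining_snp in any progression."""
    below = set()
    for progression in progressions:
        path = parse(progression)
        if defining_snp in path:
            below.update(path[path.index(defining_snp):])
    return below


def sits_in_lineage(positive_snps, lineage_snps):
    """True if the newcomer has tested positive for any SNP at or below the defining SNP."""
    return any(snp in lineage_snps for snp in positive_snps)


lineage_ii = snps_below("A5631", PROGRESSIONS)
print(sits_in_lineage({"BY5706"}, lineage_ii))  # positive for a SNP below A5631
print(sits_in_lineage({"Z255"}, lineage_ii))    # only an upstream SNP tested so far
```

A result of False for an upstream-only tester does not rule him out of Lineage II; it simply means his current level of SNP testing cannot place him there.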

The predicted red SNPs are almost always much further upstream on the Tree of Mankind than the green confirmed SNPs. Think of the upstream SNPs as closer to Genetic Adam (250,000 years ago) and the downstream SNPs as closer to a genealogical timeframe (say, 1000 years ago).

7. SNP predictions are consistent (Matches’ Terminal SNP Analysis) 

NB: "SNP Predictions" here does not mean the red predicted SNP you get in the Haplogroup column (see figure above) when you first get your Y-DNA-37 results. It refers to SNPs much further downstream than that, usually within the last 5000 years and frequently within the last 2000 years.

If a newcomer to the surname project has not undertaken downstream SNP testing, it is still possible to guess what his downstream "terminal SNP" will be by simply analysing the terminal SNPs of his STR matches. I call this the Matches' Terminal SNP Analysis. It is a relatively simple technique that takes a little time to complete. Here are the steps in the process:

1) First, open up the Y-DNA Matches page and adjust the Matches Per Page setting so that all the matches are on the one page.

2) Next click on the heading in the Y-DNA Haplogroup column so that all of the matches are sorted by their terminal SNP.

3) Make a list of all the terminal SNPs (you can ignore the SNPs that are way upstream e.g. M269, P312, L21, etc)

4) Find out where each SNP sits on the Y-Haplotree, and (most importantly) the major subclade to which it belongs. You can do this in either of two ways: a) launch FTDNA's Haplotree, press Ctrl+F (Cmd+F on a Mac) and enter the SNP name. Once you find it, trace the branch back up to the previous branching point, make a note of the SNP there, and repeat the process until you arrive at a known subclade SNP; or b) google the following: "ytree" and the SNP name ... and this will bring you to the relevant page on the Big Tree. Then simply copy and paste the SNP Progression from the top of the page.

A google search for: ytree a5631

5) Both of the above methods will result in you having a SNP Progression for each SNP in the Matches List (see example below). If all (or most) of these SNP Progressions fall below a certain subclade, then the likelihood is that the newcomer will also test positive for some SNP below this subclade level. It may even be possible to predict that he sits on one of maybe two or three "way downstream" branches. And this can be strong supportive evidence that he is related to certain project members and should be grouped in a particular genetic family.

If on the other hand, the various SNP Progressions associated with this list of SNPs indicate that the newcomer is matching to multiple distinct upstream branches of the Haplotree, then no firm conclusions can be drawn about the newcomer's likely terminal SNP and therefore this information cannot be used to help place him in a specific genetic family.

6) As a result of this analysis, I may write to the newcomer and suggest they skip the upstream SNP Pack (e.g. R-M269) and move down to the more relevant downstream subclade SNP Pack (e.g. R-Z255) and purchase that one ... warning them that there is a 1% chance that my assessment may be wrong (but I haven't been wrong yet).

Output of the MTSA for a new project member
(he was advised to do the R-L1065 SNP Pack)
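
The core of step 5 (do most of the matches' SNP Progressions fall below one subclade?) can be sketched as follows. This is my own simplified illustration, not a tool I actually use: the 75% cut-off for "most" is arbitrary, the match list is hypothetical (the off-branch SNP name is a placeholder), and every progression is assumed to start from the same upstream SNP.

```python
from collections import Counter


def predict_subclade(progressions, min_share=0.75):
    """Return the most downstream SNP that at least min_share of matches sit at or below.

    Each progression is a list of SNP names ordered upstream -> downstream.
    Returns None when no SNP reaches the required share (no firm conclusion).
    """
    if not progressions:
        return None
    total = len(progressions)
    # Count how many matches have each SNP somewhere in their progression
    counts = Counter(snp for path in progressions for snp in set(path))
    best, best_depth = None, -1
    for path in progressions:
        for depth, snp in enumerate(path):
            if counts[snp] / total >= min_share and depth > best_depth:
                best, best_depth = snp, depth
    return best


# Hypothetical matches list, one SNP Progression per match
matches = [
    ["Z255", "Z16437", "Z16438", "BY2852", "A5631", "A5629"],
    ["Z255", "Z16437", "Z16438", "BY2852", "A5631", "A5629", "BY5706"],
    ["Z255", "Z16437", "Z16438", "BY2852", "A5631", "A5629", "A5628", "Y16880"],
    ["Z255", "Z16437", "Z16438", "BY2852"],
    ["Z255", "PLACEHOLDER-SNP"],  # a match on a different branch below Z255
]
print(predict_subclade(matches))
```

When the matches are spread across multiple distinct upstream branches, no SNP reaches the cut-off and the function returns None, which corresponds to the "no firm conclusions" case described above.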

There are SNP Packs available for most of the major subclades and it is important to know what these are. You can see a list of them by logging in to your FTDNA account, clicking on Upgrade, then Advanced Tests, then SNP Packs from the drop-down menu.

Surprisingly, this analysis works best at the 25-marker level (because there are usually too few matches at the 37, 67 and 111 marker levels).

Occasionally I will have to check for the existence of equivalent SNPs or alternative names (if the SNP in question does not turn up in the FTDNA Haplotree or the Big Tree).

8. The same surname variant is predominant in a genetic group

This usually emerges after the new project member has been grouped on the basis of the previous MPRs described above. This serves to support and validate the decision to group the newcomer in the specific genetic family.

9. The same MDKA location is present in the particular genetic group

As above. This serves to illustrate how essential it is to encourage all project members to include the birth location of their Most Distant Known Ancestor (MDKA / EKA) in the Genealogy section of their personal FTDNA webpages. After their surname, their ancestor's birth location is the single most important piece of information.

Always include the birth location of the EKA / MDKA

10. The same MDKA is present in the particular genetic group

This is the ultimate validation that the grouping based on the preceding MPRs is valid and accurate.

For a more detailed discussion of these various criteria, watch the video below. This is suitable for beginners, those who have already done the Y-DNA test and want to find out what it means, and for Surname Project Administrators.

Maurice Gleeson
June 2017

Tuesday 13 June 2017

WDYTYA 2017 - videos going online

This year was the last year of Who Do You Think You Are? - Live! The event was an annual staple of the British genealogical calendar for the last 10 years. Starting in 2008 in Olympia in London, it moved to the National Exhibition Centre in Birmingham in 2014. The event attracted thousands of attendees year on year, and in 2009 Brian Swann, ISOGG-UK representative, persuaded FamilyTreeDNA to sponsor a stand at the event.

Shortly thereafter, the DNA Workshop began. And for the past several years this has been kindly sponsored by FamilyTreeDNA and run by volunteers from ISOGG. Each year the lecture schedule has attracted a host of international and local speakers, both academics and citizen scientists. And in addition, videos of the presentations have been made available free of charge on our dedicated YouTube channel as a service to the genetic genealogy community.

It is sad to see the demise of the WDYTYA event. It was a wonderful way of keeping in touch with friends and colleagues, and everyone looked forward to the manic three days of early mornings and late nights. Hopefully another annual event will rise to take its place. Nevertheless, at least the videos of the presentations will serve as a lasting legacy of the ten year run of WDYTYA.

The last batch of videos ever are ready to be uploaded to the YouTube channel and this will happen every Monday and Thursday over the coming 6 weeks or so. Three videos have already been uploaded and are attracting a large audience:

The Science of Admixture Percentages (Garrett Hellenthal)

DNA, emigration and shipping (Brian Swann)

Autosomal DNA demystified (Debbie Kennett)

The schedule of lectures from this year's event is shown below (click the image to enlarge). Most of the presenters gave permission to upload their lectures and a big thank you is due to the speakers for their generosity.

And this year we managed to get better audio recordings than ever before. Who would have thought that dangling a small microphone in front of the loudspeaker and recording a separate audio track on your iPhone would be the best way of conquering the ever-present background noise from the 10,000 people in the auditorium?!


Maurice Gleeson
June 2017

Thursday 1 June 2017

Convergence - quantifying Back & Parallel Mutations (Part 1)

In a recent post I explored the concept of Convergence and made the point that the mechanism by which Convergence arises is via a combination of Parallel Mutations and Back Mutations in the STR marker values. These mutations are changes that occurred at some time in the past but because they remain hidden to us in the present, we cannot tell when they occurred or how frequently they occurred just by looking at two sets of STR results from people living today.

However, there is a way around this problem. Or at least a partial solution.

By using a combination of STR data and SNP data we can build a Mutation History Tree that is a more accurate representation of the branching structure of the "family tree" for a specific genetic group. And this type of tree allows us to more easily (and more accurately) spot Back Mutations and Parallel Mutations.

I did this for one particular genetic family in one of my surname projects - the North Tipperary Gleesons (Lineage II of the Gleeson DNA Project). This tree is a "best fit" tree, by which I mean a tree constructed in such a way as to explain the STR & SNP data in the most parsimonious way i.e. with the fewest branches that will accommodate or "fit" the data. This approach is also called the "maximum parsimony" approach and is often used when building cladograms or phylogenetic trees. The Mutation History Tree (MHT) is simply another type of cladogram. You can read about the process of how the tree was developed in this blog post here and subsequent posts.

But a key point here is that this "best fit" tree is likely to change as more data becomes available. And to illustrate this point, I'm going to compare the current version of the tree (Dec 2016) with the next version that is being prepared following the recent availability of new data from 12 sets of Z255 SNP Pack results.

Below is the current version of the MHT for Lineage II. By comparing each mutation in the tree with every other one, we can identify which mutations are Back Mutations (occurring on a single line of descent) and which are Parallel Mutations (occurring on two or more lines of descent). I have highlighted the Back Mutations in yellow and the Parallel Mutations in green.

Back Mutations in yellow, Parallel Mutations in green
from Gleeson Lineage II MHT (version Dec 2016)

Parallel Mutations occur in the following lines of descent:
  • CDYb 40-39 ... A, E, D, F (4 times)
  • CDYa 39-38 ... A, B, C, F (4 times)
  • 464c 17-16 ... A x2, D (3 times)
  • 461 12-11 ... A, B (2 times)
  • 576 18-19 ... A, D (2 times)
  • 390 23-24 ... A, B, C (3 times)
  • 390 24-23 ... B, C (2 times)
  • 456 16-15 ... B, D (2 times)
  • and so on ...

Back Mutations are more difficult to count, and to conceptualise. Whether you consider the value as mutating forward or back is entirely dependent on your reference point. If our anchor is the upstream Z255 branch, then the original value of marker 390 (for example) is 24, mutating (forward) to 23 on the Z16438 branch, and then back to 24 (in parallel) on Branches A, B & C, and then back to 23 (again in parallel) on Branches B & C. So there are several points to make here:
  • This is in fact a Back Mutation that occurs in parallel in 3 separate lines of descent. It is thus both a Back Mutation (relative to its earlier value of 24 on the Z255 branch) and a Parallel Mutation, occurring at (presumably) different time points in Branches A, B & C. It is thus coloured yellow and green.
  • It can also be considered a Triple Mutation relative to the Z255 branch - in the sense that it mutates forward to 23 then back to 24, then back to 23 again. But what happens if it flips forward and back 5 times? What would we call that? And what do we call it if it goes two steps forward and one step back? This is where terminology fails us. I'm not sure if there is a standardised way of describing these different kinds of mutation (if there is, please leave a comment below).
  • The mutation 390 24-23 occurs in Branches B & C ... relative to its value of 24 in the Z255 branch, this could be considered a Parallel Forward Back Forward Mutation ... for Pete's Sake!!
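
To make the dependence on the reference point concrete, here is a small sketch that labels each change in a marker's value as "forward" (moving away from the anchor value) or "back" (moving towards it). The labelling rule is my own simplification and assumes single-step changes where consecutive values always differ; it is not a standard terminology.

```python
def classify_steps(values, anchor):
    """Label each change as 'forward' (away from the anchor value) or 'back' (towards it).

    Simplification: assumes consecutive values differ and never jump across the anchor.
    """
    labels = []
    for previous, current in zip(values, values[1:]):
        if abs(current - anchor) > abs(previous - anchor):
            labels.append("forward")
        else:
            labels.append("back")
    return labels


# Marker 390 along Branch B or C: 24 (Z255) -> 23 (Z16438) -> 24 -> 23
history = [24, 23, 24, 23]
print(classify_steps(history, anchor=24))      # anchored on the upstream Z255 branch
print(classify_steps(history[1:], anchor=23))  # anchored on a downstream branch where 390 = 23
```

Anchored on Z255, the same line of descent reads forward, back, forward; anchored on a downstream branch where the value is already 23, the last two changes read forward, back. The mutations are identical, only the labels change.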

But let's just focus on the Back Mutations that occur downstream of the branch characterised by the STR mutation (710 36-37), just above the A5627 SNP Block. This "710 branch" incorporates all the Gleesons of Lineage II, from Branch A to F.* On this overarching branch for Lineage II, the value of the STR marker 390 is 23 and Back Mutations are as follows:
  • 390 24-23 ... B, C ... this is the only Back Mutation below the "710 branch"
  • And it is also a Parallel Mutation
  • All the other yellow Back Mutations are relative to the upstream Z255 branch, and not our downstream "710 branch", and so are not counted in this particular exercise.

So, let's generate some statistics from these numbers:
  • The total number of mutations below the "710 branch" (irrespective of whether they are forward or back) is 71.
  • There are 69 Forward Mutations (i.e. away from the original value of the relevant marker on the "710 branch")
    • 31 Forward Mutations show an increase in the number (e.g. 9 to 10)
    • 38 Forward Mutations show a decrease in the number (e.g. 9 to 8)
  • There are 2 Back Mutations 
    • both Back Mutations show a decrease in the number (i.e. 24 to 23)
  • There are 26 Parallel Mutations
  • Forward Mutations outnumber Back Mutations by a ratio of 34.5 : 1
  • Parallel Mutations outnumber Back Mutations by a ratio of 13 : 1
  • There are 16 people in this tree, and if we make the big assumption that the "710 branch" starts 1000 years ago (i.e. roughly at the time of the introduction of the Gleeson surname), then over the course of 1000 years, the rate of each type of mutation is (crudely) as follows:
    • Forward Mutations = 69/16 = 4.3125 mutations per "line of descent" per 1000 years
    • Back Mutations = 2/16 = 0.125 mutations per "line of descent" per 1000 years
    • Parallel Mutations = 26/16 = 1.625 mutations per "line of descent" per 1000 years
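
The arithmetic behind these figures is easy to re-run when the tree is revised. A minimal sketch, using the counts from the list above (the variable names are mine, and the 1000-year span is the same "big assumption" already stated):

```python
# Counts taken from the list above (mutations below the "710 branch")
forward, back, parallel = 69, 2, 26
lines_of_descent = 16  # number of people in the tree

total = forward + back
print(f"Total mutations: {total}")
print(f"Forward : Back  = {forward / back} : 1")
print(f"Parallel : Back = {parallel / back} : 1")

# Crude rates per line of descent over the assumed 1000 years
for name, count in [("Forward", forward), ("Back", back), ("Parallel", parallel)]:
    print(f"{name}: {count / lines_of_descent} per line of descent per 1000 years")
```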

These are crude estimates but they give some idea of the relative importance of Parallel Mutations compared to Back Mutations. And applying this information to the phenomenon of Convergence, it would seem that Back Mutations play a very minor role compared to Parallel Mutations.

This conjecture is supported by some recent modelling work undertaken by Dave Vance and written up for the L21 Yahoo Discussion Forum. In Dave's simple model, which is an extremely useful basis for further discussion, the "average tree" could expect to have a ratio of Parallel to Back Mutations in the range of 25:1 to 50:1.

This is a lot higher than what I have shown in my MHT for the Lineage II Gleesons, but this can be partly explained by the fact that there are only 16 people in my Gleeson sample, and we are looking at (perhaps) only the last 1000 years. I would predict that the ratio will increase further as 1) I add more people to the sample; and 2) the duration of observation is extended backward from 1000 years ago (the 710 Branch) to 4300 years ago (the Z255 Branch).

In subsequent posts we will see how these calculations stand up when we add in additional data from 12 SNP Pack results and reconfigure the MHT for Gleeson Lineage II into the next version of the "best fit" model. And we will also attempt to quantify the total number of Back & Parallel Mutations below the upstream marker Z255. And lastly, we will attempt to quantify Convergence itself.

Maurice Gleeson
June 2017

* the Big Y results of a 10th member of the group indicate that this branch is characterised by the SNP A5631 although this result is not reflected in this version of the MHT