DNA and Family Tree Research: Criteria for Grouping People into Y-DNA Genetic Families

One of the main tasks of Surname Project administrators is to place new members into the appropriate genetic group within their surname project.

Having run a variety of surname projects for the last few years, I have come up with a set of criteria I use on a routine basis to place newcomers into existing genetic groups and also to identify new genetic groups. I call these criteria Markers of Potential Relatedness (MPRs). And (not surprisingly) these can be thought of as indicators that two people may be "related" to each other, which for the purposes of surname projects means somewhere in the last 1000 years or so. This arbitrary timepoint is chosen because many European surnames were introduced about 1000 years ago (in particular British and Irish surnames), although they only became commonplace several centuries thereafter.

This approach to grouping works best with hereditary surnames (i.e. passed from father to son) but should also work with patronymic (and other) surnames, except that (in these latter cases) criteria 1 and 8 will not apply. The discussion below is very much from the standpoint of hereditary surname projects.

Not all criteria have to be met. But the more criteria that are met, the higher the likelihood of two people being related. This is particularly important in relation to SDSs (Surname or DNA Switches; also known as NPEs, Non-Paternity Events), as it may be difficult to distinguish a match that is an SDS (e.g. adoption, illegitimacy) from one that is due to Convergence.

Below is a list of these criteria and we will consider each one in turn. Some of these Markers of Potential Relatedness (MPRs) have nothing to do with DNA. If two people have the same surname, or the same unusual surname variant, or have a similar ancestral homeland, or even an ancestor with the exact same name, then these can be indicators that the two people are related. And because they don't rely on genetics I simply call them "traditional markers" as opposed to "genetic markers".

MPRs for deciding if two or more people are related within the last 1000 years

In practice, the most useful indicators (or at least the ones I most frequently use) are Markers 1, 2, 6 and 7. And if a new project member is grouped on the basis of these "main" markers, it usually becomes apparent that they meet many of the remaining criteria also.

1. The members have the same surname

This is an obvious criterion, especially for surname projects that deal with hereditary surnames. If two people share the same surname, the next question is: are they related? And it would seem a reasonable supposition that there is a much higher probability that they are related on their direct male lines (within the last 1000 years) if they do share a surname than if they don't.

Problems tend to arise when there is some doubt over what is a valid surname variant and what is not. For example, are Malley and Malloy surname variants? Are Farrell and Farris surname variants? What happens when you get both types of variant testing positive for M222? Do you group them together or keep them apart? Only other MPRs (such as downstream SNP testing) can answer these questions.

2. The Genetic Distance (GD) between two people indicates a (very) close relationship

The threshold for "declaring a match" between two people varies with the number of STR markers tested (see below). These thresholds are arbitrary, but the intention is to get the right balance between false positives and false negatives - in other words, letting the wrong people in and keeping the right people out (known more technically as specificity and sensitivity).

Most people do the Y-DNA-37 test initially and I would usually feel very confident grouping together people with the same surname if their GD was 2/37 or less, and reasonably confident of grouping them together if the GD was 4/37 or less. The exception is where there is evidence of Convergence, as indicated (for example) by the terminal SNPs of their matches sitting on a wide variety of distantly related "upstream" branches of the Y-Haplotree (Tree of Mankind). We'll talk about this some more in item 7 below.

In addition, Convergence is a common occurrence in certain subclades, such as M222 and L226. When I see these terminal SNPs in a new project member, alarm bells start ringing, my level of conservatism increases, and I start looking to MPRs other than Genetic Distance to decide if two people belong in the same genetic family.

from www.familytreedna.com/privacy-policy.aspx

This technique for grouping people together will miss outliers - people who do indeed belong in the same genetic family but whose ancestors branched away from the main group many many generations ago. For example, in the Gleeson DNA Project, several of the members of Lineage II (all confirmed to be related by Big-Y SNP testing) have a GD of 10/37 compared to other group members, and that would usually preclude them being grouped together.
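The GD thresholds discussed in this section can be sketched in Python. This is a minimal illustration, assuming each haplotype is stored as a dictionary of marker name to repeat count (the marker names and values below are hypothetical). It counts GD as the sum of absolute repeat differences, which is a simplification: the GD reported by FTDNA may treat some multi-copy markers differently. The 2/37 and 4/37 bands are the ones given above and apply to a full 37-marker comparison; the three-marker excerpt only shows the arithmetic.

```python
def genetic_distance(hap_a, hap_b):
    """Sum of absolute repeat differences over the markers both haplotypes share."""
    shared = hap_a.keys() & hap_b.keys()
    return sum(abs(hap_a[m] - hap_b[m]) for m in shared)


def confidence_at_37(gd):
    """Map a GD at 37 markers to the rough confidence bands given in the text."""
    if gd <= 2:
        return "very confident"
    if gd <= 4:
        return "reasonably confident"
    return "look at other MPRs"


# Hypothetical three-marker excerpts, for illustration only.
newcomer = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
member = {"DYS393": 13, "DYS390": 25, "DYS19": 15}

gd = genetic_distance(newcomer, member)
print(gd, confidence_at_37(gd))
```

A GD above 4/37 falls into the third band rather than being rejected outright, since outliers such as the Lineage II members mentioned above can still belong to the group.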

3. The TiP24 score is >80% compared to the group modal haplotype

I don't use this marker so much anymore but it can be a useful way of assessing if a newcomer belongs in a given genetic family, especially if there is insufficient data regarding SNP markers among their STR matches. The potential benefit of this method is that it takes into account the varying mutation rates of STR markers whereas GD does not.

It involves generating a TiP Report between a new project member and the member closest to the modal haplotype for a given genetic family within the project, and then looking at the percentage probability of being related within 24 generations. We call this the TiP24 Score (for lack of a better term). If this is >80% (an arbitrary figure, which can be adjusted to suit your personal preference), then the newcomer can be considered to be "likely to be related" and therefore placed in that specific genetic family.

It is important to note that the use of the TiP24 Score is not an attempt to date when two people are related, merely to ascertain if two people are likely to be related. The TiP24 Score is simply an attempt to standardise GD comparisons, given that we know that a GD of (say) 4/37 on slow-mutating markers is much more significant than a GD of 4/37 on fast-mutating markers. The former (probably) indicates a much more distant relationship than the latter.

This technique works best for those related within the last several hundred years, but will miss outliers. I have several people in the Gleeson DNA Project (confirmed to be related via SNP testing) whose TiP24 Score with other members is as low as 1%.

Also, the TiP24 Score is likely to be tripped up by Convergence (in the same way that GD is) and is therefore of limited utility in such circumstances.

4. There is a clear Genetic Distance Demarcation between project members within a genetic cluster & project members outside it

Administrators have access to a tool called the "Y-DNA Genetic Distance" tool. This permits comparisons between any person in the project and every other person in the project. Often there will be a clear demarcation between a newcomer's range of GDs to a particular genetic family and all other genetic families within the project.

In the example below, the newcomer matches 9 members of R1b-Genetic Family 2 with a GD ranging from 4/67 to 9/67. Thereafter, the GD jumps to 16/67 and higher. This stark demarcation in GD suggests strongly that the newcomer falls within R1b-Genetic Family 2.

This also suggests that Convergence is unlikely to be an issue here (otherwise we might expect to see a more gradual increase in GD values, rather than the jump from 9 to 16 that we see here).

This technique works best with 111 or 67 marker comparisons. Demarcations are much less obvious using 37 marker comparisons.

The GD between the newcomer & other members shows a clear demarcation between one particular genetic family and all others
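The demarcation check described in this section can be sketched in Python. This is only an illustration: the function name `largest_gap` and the individual GD values are hypothetical, chosen to mirror the pattern in the example (nine values from 4 to 9, then 16 and higher). It sorts the newcomer's GDs to every project member and reports the biggest jump between consecutive values.

```python
def largest_gap(gds):
    """Return (lower, upper, gap) for the biggest jump between consecutive sorted GDs."""
    ordered = sorted(gds)
    if len(ordered) < 2:
        raise ValueError("need at least two GD values")
    pairs = zip(ordered, ordered[1:])
    lower, upper = max(pairs, key=lambda p: p[1] - p[0])
    return lower, upper, upper - lower


# Hypothetical GDs at 67 markers: nine members of one family, then everyone else.
gds_at_67 = [4, 5, 5, 6, 7, 7, 8, 9, 9, 16, 17, 19, 22]

lower, upper, gap = largest_gap(gds_at_67)
print(f"cluster ends at GD {lower}, next value is {upper} (gap of {gap})")
```

A large single gap supports placing the newcomer in the family below it. A list that climbs gradually with no clear gap would be more in keeping with Convergence.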

5. Presence of Rare Marker Values or a Unique STR Pattern among genetic group members

The idea here is that if one or more people share a Rare Marker Value, then it stands to reason that they are more likely to be related to each other, especially if they all share the same surname.

Leo Little's spreadsheet of STR marker value frequencies is very useful for identifying those values which are particularly rare, even though the spreadsheet only covers seven of the main haplogroups (E3a, E3b, G, I, J2, R1a, R1b). What constitutes "rare" is a moveable feast but a frequency less than 5% would not be unreasonable.

Usually these rare marker values emerge after several people have been grouped together. Any newcomers thereafter who share this rare marker value can be further assessed for membership of the specific genetic family wherein the rare marker value occurs. A famous example is Group B of the Wheaton Surname Project where 3 "rare" marker values occur within the first 12 markers (with incidences of 5%, 1% & 8% in the "general" R1b population). The chances of these occurring within the general population are 1 in 62,000. And therefore, any Wheaton who matches these 3 STR marker values can be automatically allocated to Group B (with 99.99% confidence). And they only need a 12-marker test to do so.
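The arithmetic behind a combined frequency like this can be sketched in Python. The sketch assumes the three marker values occur independently of one another, so their frequencies can simply be multiplied. It uses the rounded percentages quoted above, which give roughly 1 in 25,000; the 1 in 62,000 figure presumably reflects unrounded frequencies that are not reproduced here.

```python
from math import prod

# Rounded frequencies quoted in the text for the three rare values.
frequencies = [0.05, 0.01, 0.08]

# Assumption: the three markers vary independently of one another.
joint = prod(frequencies)
print(f"joint frequency: {joint:.6f} (about 1 in {round(1 / joint):,})")
```

Either way, the combination is rare enough that a shared surname plus all three values is very unlikely to be a coincidence.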

Leo Little's spreadsheet of marker value frequencies

An allied concept is that of the Unique STR Pattern (USP, also known by various other terms such as STR Motif). In short, these are a selection of STR marker values (usually between 3 and 8 in number) that are "unique" to just a few people within a surname project and which indicate that the people concerned are likely to be related to each other.

A good example from the Gleeson DNA Project shows that several members had a relatively unique STR Pattern (USP) which predicted that they were related (Branch E and F below). This was later confirmed by SNP testing of the two branches.

Unique STR Patterns (USP) predict the existence of a Branch E and F (last 6 entries)
Branch E signature ... 464b=17, 607=14, 576=17
Branch F signature ... 391=10, 458=17, 459=9-9, 576=17
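Checking a newcomer against signatures like these can be sketched in Python. The two signatures are copied from the caption above; the function name `matching_branches`, the dictionary layout and the newcomer's values are hypothetical. Marker values are kept as strings so that a multi-copy value such as 459=9-9 can be compared exactly as written.

```python
# Signatures copied from the figure caption; values kept as strings so that
# multi-copy markers such as 459=9-9 can be compared as written.
SIGNATURES = {
    "Branch E": {"464b": "17", "607": "14", "576": "17"},
    "Branch F": {"391": "10", "458": "17", "459": "9-9", "576": "17"},
}


def matching_branches(haplotype, signatures=SIGNATURES):
    """Return the branches whose every signature value appears in the haplotype."""
    return [
        branch
        for branch, signature in signatures.items()
        if all(haplotype.get(marker) == value for marker, value in signature.items())
    ]


# Hypothetical newcomer, showing only the markers relevant here.
newcomer = {"391": "10", "458": "17", "459": "9-9", "464b": "15", "576": "17", "607": "15"}
print(matching_branches(newcomer))
```

This sketch requires every value in a signature to match. A looser rule (say, all but one) would catch members who have since mutated on one of the signature markers, at the cost of more false positives.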

Robert Casey has developed this concept extensively and you can hear him talk about it in this video here.

6. SNP testing is consistent among the members of the particular group

The advent of Next Generation Sequencing (producing tests like the Big Y and the array of SNP Packs) has created a SNP tsunami. And as more people SNP test, their predicted red SNP is being converted to a green confirmed SNP on the project's Y-DNA Results page.

As a result, many groups within a surname project are having their "Terminal SNP" characterised. And this allows us to compare any SNP markers that the newcomer has tested with the SNP markers that characterise the various groups within our surname project. If they are discordant, then the newcomer is ruled out from membership of those particular genetic families. But if they agree with each other, especially if they are SNPs quite far downstream, then this is further supportive evidence that the newcomer belongs in a specific genetic family.

The phrase terminal SNP is a bit of a misnomer. It should be restated as "current terminal SNP" and simply means the "most downstream" SNP marker that you have currently tested. And what is meant by "most downstream"? Imagine the Tree of Mankind (the Y-Haplotree) as starting with genetic Adam (upstream) about 250,000 years ago and the various branches emerging from him and continuously branching over many thousands of years into finer and finer "more downstream" branches, until these finer branches start approaching the origin of surnames (roughly 1000 years ago) and a genealogical timeframe. So your "most downstream" branch would be the branch characterised by your "most downstream" SNP marker ... which in turn is determined by your current level of SNP testing. For example, your Y-DNA 37 STR results will predict which Haplogroup branch you sit on (let's say it is R-M269, which arose about 13.5K years ago), and the R-M269 SNP Pack will take you a little further down Branch R (say to Z255, 4000 years ago), and the R-Z255 SNP Pack will take you even further downstream (maybe to 2000 years ago), but the Big Y test will take you the furthest (maybe down to 500 years ago).

In the example below, all the green confirmed SNPs sit below the SNP marker that defines Gleeson Lineage II, namely A5631. Therefore any newcomer who matches any of these SNPs (even if he has a large GD to everyone in the project) can be reliably grouped into Lineage II. The abbreviated SNP Progressions (or SNP Signatures) for each of the individual SNPs is detailed below:

R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629
R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706
R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707
R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880
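The check applied to these progressions can be sketched in Python. The function names `parse_progression` and `on_or_below` are hypothetical; the sketch simply tests whether the group's defining SNP (A5631 for Lineage II) appears anywhere on the path leading to a tester's terminal SNP.

```python
def parse_progression(text):
    """Split 'R-Z255 > Z16437 > ...' into a list of SNP names."""
    return [snp.strip() for snp in text.split(">")]


def on_or_below(progression, defining_snp):
    """True if the defining SNP lies on the path to the terminal SNP."""
    return defining_snp in parse_progression(progression)


progressions = [
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880",
]

print(all(on_or_below(p, "A5631") for p in progressions))
```

A tester whose progression stops above A5631 (for example at Z16438) returns False here. That does not rule him out of the group; it means his testing has not yet gone far enough downstream to tell.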

Predicted SNPs (red) and Confirmed SNPs (green)

The predicted red SNPs are almost always much further upstream on the Tree of Mankind than the green confirmed SNPs. Think of the upstream SNPs as closer to Genetic Adam (250,000 years ago) and the downstream SNPs as closer to a genealogical timeframe (say, 1000 years ago).

7. SNP predictions are consistent (Matches’ Terminal SNP Analysis)

NB: "SNP Predictions" here does not mean the red predicted SNP you get in the Haplogroup column (see figure above) when you first get your Y-DNA-37 results. It refers to SNPs much further downstream than that, usually within the last 5000 years and frequently within the last 2000 years.

If a newcomer to the surname project has not undertaken downstream SNP testing, it is still possible to guess what his downstream "terminal SNP" will be by simply analysing the terminal SNPs of his STR matches. I call this the Matches' Terminal SNP Analysis. It is a relatively simple technique that takes a little time to complete. Here are the steps in the process:

1) First, open up the Y-DNA Matches page and adjust the Matches Per Page setting so that all the matches are on the one page.

2) Next click on the heading in the Y-DNA Haplogroup column so that all of the matches are sorted by their terminal SNP.

3) Make a list of all the terminal SNPs (you can ignore the SNPs that are way upstream e.g. M269, P312, L21, etc)

4) Find out where each SNP sits on the Y-Haplotree, and (most importantly) the major subclade to which it belongs. You can do this in either of two ways: a) launch FTDNA's Haplotree, press Ctrl+F (Cmd+F on a Mac) and enter the SNP name. Once you find it, trace the branch back up to the previous branching point, make a note of the SNP there, and repeat the process until you arrive at a known subclade SNP; or b) google the following: "ytree" and the SNP name ... and this will bring you to the relevant page on the Big Tree. Then simply copy and paste the SNP Progression from the top of the page.

A google search for: ytree a5631

5) Both of the above methods will result in you having a SNP Progression for each SNP in the Matches List (see example below). If all (or most) of these SNP Progressions fall below a certain subclade, then the likelihood is that the newcomer will also test positive for some SNP below this subclade level. It may even be possible to predict that he sits on one of maybe two or three "way downstream" branches. And this can be strong supportive evidence that he is related to certain project members and should be grouped in a particular genetic family.

If on the other hand, the various SNP Progressions associated with this list of SNPs indicate that the newcomer is matching to multiple distinct upstream branches of the Haplotree, then no firm conclusions can be drawn about the newcomer's likely terminal SNP and therefore this information cannot be used to help place him in a specific genetic family.

6) As a result of this analysis, I may write to the newcomer and suggest they skip the upstream SNP Pack (e.g. R-M269) and move down to the more relevant downstream subclade SNP Pack (e.g. R-Z255) and purchase that one ... warning them that there is a 1% chance that my assessment may be wrong (but I haven't been wrong yet).
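The comparison in step 5 can be sketched in Python. The function name `common_progression` is hypothetical, and the example progressions are illustrative only (they reuse branch names from the Lineage II example). The sketch finds the stretch of the tree that every match's SNP Progression shares, reading from the top.

```python
from itertools import takewhile


def common_progression(progressions):
    """Return the SNPs shared, in order from the top, by every progression."""
    paths = [[snp.strip() for snp in p.split(">")] for p in progressions]
    shared = takewhile(lambda level: len(set(level)) == 1, zip(*paths))
    return [level[0] for level in shared]


# Hypothetical progressions for the terminal SNPs of a newcomer's matches.
matches = [
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706",
    "R-Z255 > Z16437 > Z16438",
]

shared = common_progression(matches)
print(" > ".join(shared))
```

The last SNP in the shared stretch is the deepest subclade on which all the matches agree. Note that this sketch insists on every match agreeing, which is stricter than the "all (or most)" rule in step 5; a single stray match on a different branch would shorten the result, so in practice such outliers would need to be set aside first.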

Output of the MTSA for a new project member
(he was advised to do the R-L1065 SNP Pack)

There are SNP Packs available for most of the major subclades and it is important to know what these are. You can see a list of them by logging in to your FTDNA account, clicking on Upgrade, then Advanced Tests, then SNP Packs from the drop-down menu.

Surprisingly, this analysis works best at the 25-marker level (because there are usually too few matches at the 37, 67 and 111 marker levels).

Occasionally I will have to use www.Ybrowse.org to check for the existence of equivalent SNPs or alternative names (if the SNP in question does not turn up in the FTDNA Haplotree or the Big Tree).

8. The same surname variant is predominant in a genetic group

This usually emerges after the new project member has been grouped on the basis of the previous MPRs described above. This serves to support and validate the decision to group the newcomer in the specific genetic family.

9. The same MDKA location is present in the particular genetic group

As above. This serves to illustrate how essential it is to encourage all project members to include the birth location of their Most Distant Known Ancestor (MDKA / EKA) in the Genealogy section of their personal FTDNA webpages. After their surname, their ancestor's birth location is the single most important piece of information.

Always include the birth location of the EKA / MDKA

10. The same MDKA is present in the particular genetic group

This is the ultimate validation that the grouping based on the preceding MPRs is valid and accurate.

For a more detailed discussion of these various criteria, watch the video below. This is suitable for beginners, those who have already done the Y-DNA test and want to find out what it means, and for Surname Project Administrators.

Maurice Gleeson

June 2017

6 comments:

John, 1 July 2017 at 04:58
Very useful set of suggestions here. One question, when you state "related" in terms of "2. The Genetic Distance (GD) between two people indicates a (very) close relationship"?

I have 8 Loughney's tested. All have roots in or around Killala, County Mayo. At 111 markers, the GD between all of them and myself starts at 3 and goes to 8. Interestingly, I seem to be in the middle of all of the matches, as when I view some of them, they have a GD greater than 10 to some of the other testers.
Unknown, 2 July 2017 at 17:20
Wow! Thank you Maurice. I am not up to doing this just yet but I do hope to be able to not too far in the future. You certainly have inspired me!
Kevin C, 22 May 2021 at 16:51
Over the last 20 years, I have generally grouped Campbell Project results informally, but have applied most of the same rules codified in this article. In general, my heuristic grouping strategy was to include like kits if they matched 1-2 at 12 markers, 2-3 at 25 markers, 3 at 37, 6 at 67, and 8-9 at 111. Deep SNP results were not available in the dark ages of testing and are only infrequently available today for those people that have done extensive testing. I acknowledge that SNPs are more accurate, and I would prefer to use SNPs for grouping, but SNPs are just not uniformly available. (Rule 2, Ibid Rule 6) FTDNA is long overdue in improving their prediction algorithm and should offer more than M269 to most people.

Regarding Rule 3 – FTDNATip scores, I generally agree with what the author says except that I have found that these scores can be a rough estimate of time to the most recent common ancestor. My empiric experience with known ancestors in the Campbell project leads me to conclude that 70%-75% confidence is a good place to start when trying to determine the approximate number of prior generations to your common ancestors.

Rule 4 – Clear Demarcation. This is good in theory but tough in practice. One thing that I know that my groupings suffer from is what I call “genetic drift”. For example, if A and B are grouped together because they are a 34/37 match, and B and C and C and D for the same reason, then it's possible to end up with a grouping that includes A and D that match 28/37. This is not desirable and can happen at any number of markers. I acknowledge that this has occurred in some of the Campbell groups, and that these groups should be reswizzled. I have been hesitant to start this since it will likely be a very large task.

Rule 5 – Rare Markers. A few comments. First Little’s marker frequency table that is referenced is very old and really too high level at the R1b level of granularity. Better is Mike Walsh’s table of L21 frequencies. See my August 11, 2016 pinned post on this subject. Secondly, using these tables I published an article in the Spring 2007 CCSNA Journal noticing that “it has been discovered that while allele DYS458=20 appears in 2% of the overall R1b population, it occurs with a frequency of 31% among Campbell R1b participants.”

Rule 7 – Better SNP predictions. As many of you are aware, I have been doing this as a service for the last several years. I have built an L21 spreadsheet that takes the terminal SNPs of your matches and predicts a much farther downstream SNP. This works for most project members and allows you to verify a downstream SNP more cheaply by testing this alone instead of multiple panel tests, working yourself down the tree.

Rules 9 and 10 – Most Distant Known Ancestors. One of the things I’m most proud of is that I recognized the importance of MDKA very early. Since the project’s outset in 2006, we have been collecting lineages from every Y-STR participant. Our current Lineage Summary document is approaching 500 pages and will benefit researchers for generations. I am unaware if any other surname project has done this as long or as methodically as the Campbell project.

Friday, 30 June 2017
