There are several important questions that face Project Administrators of Surname DNA Projects:
- Why should I group people together?
- How should I group people together?
- What does each group tell me?
As an Administrator of 15 DNA Projects for a variety of Irish Surnames, I have pondered these issues, explored different alternatives, fallen down rabbit-holes, and revised my thinking. So here is my current streamlined approach - no doubt it will evolve further as time goes by. These are just my own personal musings - other admins may differ in their approach (and that's fine - there is no right way or wrong way to run a project). And the discussion below applies only to Surname DNA Projects - other DNA Projects will have different reasons for grouping and therefore alternative grouping strategies.
I offer these thoughts and ideas so that project members may get a better understanding of the thinking behind the process of grouping people, and so that project administrators might pick up a few useful tips - please take what you like and discard the rest.
So let's explore each of these topics in turn.
Why should I group people together?
For me, the purpose of a Surname DNA Project is to study the surname. That may seem obvious but it has several important implications.
Firstly, fixed inherited surnames arose in Ireland about 1000 years ago and in England & Scotland about 800 years ago. Wales was a bit later still (with some parts of Wales not adopting the practice of a fixed inherited surname until the 1850s). This defines the period of study as being roughly the last 1000 years. And therefore, we should aim to create subgroups of people who are related to each other within that timeframe.
For Irish and Scottish surnames at least, anything beyond 1000 years ago steps into the realm of Clan history, and that in itself is a fascinating area of research, but one that falls more under the remit of geographic projects (e.g. the Munster Irish project), haplogroup projects (e.g. R-L226 project), or even specific clan projects (e.g. Ancient Breifne Clans project).
So for surname projects, we should be aiming to identify groups of related people, with the same surname, who are likely to be related to each other within the last 1000 years. Such groups are likely to descend from a single individual who was the progenitor of the surname for that group.
And if we are lucky, we may be able to make a case for having identified the genetic signature of the first person to bear the name 1000 years ago. For Irish surnames, we may even be able to link this to some of the Traditional Genealogies and therefore to a specific Irish clan, thus connecting project members with a much deeper part of their ancestral heritage.
How should I group people together?
Some years ago I developed the concept of Markers of Potential Relatedness (MPR). Simply said, these are markers that point you toward the conclusion that two or more people are related to each other. And by "related" I mean within the last 1000 years.
These Markers of Potential Relatedness help us to identify people who may be related within the last 1000 years and who therefore belong within the same subgroup.
You can see a presentation that takes a deep dive into this concept in this video here, but the most useful MPRs in practice (and the main ones I use for grouping people together) are as follows:
- a known relationship
- same downstream SNP
- close Genetic Distance to people with the same surname
- same USP (Unique STR Pattern)
Let's go through each in turn.
A Known Relationship
The first one is obvious - if two people have a known relationship, then clearly they are "related within the last 1000 years" and therefore belong in the same group. Some people may not know that they are related (e.g. 4th cousins) but have the same common ancestor showing up in the "Paternal Ancestor Name" column on the project's Results Page. A little communication between these project members can confirm the connection and justify their being grouped together.
Same downstream SNP
If two people share the same "downstream" SNP (i.e. close to 1000 years old or less), then I group them together, especially if they have the same surname.
Rob Spencer's Admin Utilities tool is a great way of seeing exactly where a particular SNP sits and what SNPs sit above it. Entering any SNP will generate the SNP Sequence for that SNP.
TMRCA dates for downstream SNPs can be checked by simply googling the SNP name and YFULL.
People with the same downstream SNP but a different surname may be an indication that a Surname Switch has happened at some point in the past - the trouble is that without other information, you won't know on whose ancestral line the switch occurred. Then you are faced with the classic question: which came first? - the Fry chicken or the Boylan egg?
The Big Y test gives much more definitive data than SNP Packs or single SNP tests and is my preferred (and recommended) method of SNP-testing.
Close Genetic Distance to people with the same surname
When a new member joins one of my projects, the first thing I do is check whether or not he has the surname being studied (or one of its potential variants). I then check his Y-STR Matches and see if he matches any other project members - if he does, I assign him to the same group that they are in. I will also double check that any downstream SNP data he has is consistent with the SNP results of other members of that group. And I may also check to see if he shares any Unique STR Pattern that characterises that particular group (see below).
Much of the time this criterion is perfectly fine for grouping people together, but we can run into major difficulties if there is significant Convergence present i.e. just by chance, the genetic profile of a person is similar to the profile of many other "non-related" people. This has been a significant issue with the M222 groups in some of my projects.
You can recognise when Convergence is likely to be present by looking at the number of matches - if a project member has a huge number of matches to a wide variety of different surnames, then Convergence is likely and most of these would be "false positive" matches. Yes, he does share a common ancestor with every single match but this may be thousands of years ago rather than hundreds. In other words, the connection is a lot further back than it looks. And it may be well beyond the arbitrary 1000 year threshold we have set for defining subgroups.
In this situation, I would group everyone with the same surname (or variant) into the same large overarching group (call it, say, Group 3). All of these people may or may not be related within the last 1000 years.
Then within this large group, I would create subgroups (3a, 3b, etc) of people with known downstream SNP data that places them on a downstream branch of the Tree of Mankind close to our 1000 year threshold. I may look up the age of the SNP on YFULL to make sure the TMRCA date is roughly somewhere between 1000-1500 years ago.
Having created these SNP-defined subgroups, I would then add in non-SNP-tested individuals based on much more restrictive Genetic Distance criteria than those used for "declaring a match" i.e. 2/37, 4/67 and 5/111 as opposed to 4/37, 7/67 and 10/111. This approach minimises the risk of inappropriate grouping but does not get rid of it completely. Ultimately the only way of being sure that someone has been placed in the correct subgroup is for that person to do the Big Y test to identify their SNP profile. This is the recommended course of action for anyone who has not managed to make it into one of the SNP-defined subgroups.
Participants who do not meet the criteria for a subgroup are left in the overarching group (Group 1 in this example)
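The two sets of Genetic Distance thresholds described above can be expressed as a simple lookup. This is only a sketch: the threshold values come from the text, but the function names and the three labels are illustrative choices, and the Genetic Distance is assumed to be read from the FTDNA match list rather than calculated here.

```python
# Maximum Genetic Distance allowed at each marker level (values from the text).
MATCH_THRESHOLDS = {37: 4, 67: 7, 111: 10}    # used for "declaring a match"
GROUPING_THRESHOLDS = {37: 2, 67: 4, 111: 5}  # stricter, used for subgrouping


def within_threshold(genetic_distance, markers_tested, thresholds):
    """Return True if the distance is within the limit for that marker level."""
    if markers_tested not in thresholds:
        raise ValueError(f"no threshold defined for {markers_tested} markers")
    return genetic_distance <= thresholds[markers_tested]


def classify(genetic_distance, markers_tested):
    """Label a comparison using the stricter grouping limits first."""
    if within_threshold(genetic_distance, markers_tested, GROUPING_THRESHOLDS):
        return "subgroup candidate"
    if within_threshold(genetic_distance, markers_tested, MATCH_THRESHOLDS):
        return "match only"
    return "no match"
```

For example, `classify(3, 37)` returns "match only": a Genetic Distance of 3 at 37 markers is close enough to count as a match but not close enough to place a non-SNP-tested member in an SNP-defined subgroup.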
A good example of this process in practice is from my O'Malley DNA Project. Many Mayo O'Malleys test positive for the M222 SNP marker. I placed them in Group 3 - a large overarching group for all M222+ O'Malleys. So far, downstream SNP testing has identified 6 subgroups below this. The common ancestor for all 6 subgroups lived about 2000 years ago (the TMRCA for the M222 SNP Block), and the common ancestor for each subgroup lived about 1000 years ago or less. You can read a detailed account of this specific example in this blog post here.
The common ancestor for each of the individual 6 subgroups is within the last 1000 years
Same USP (Unique STR Pattern)
When a group of people have the same value for several specific STR markers, this can indicate a specific "signature" for that particular group and anyone with the same signature can be deemed to be "related" and thus should be grouped with them. The number of STR markers that make up a Unique STR Pattern varies a lot, but the more markers involved, the more robust the USP.
USPs were easy to spot on the Results Pages of the old WorldFamilies.Net (WFN) website (sadly now defunct) and a similar scheme would be most welcome on FTDNA's Results Pages. The WFN website compared each group's genetic signature against the signature (modal haplotype) of an upstream branch of the Tree of Mankind and thus identified any USPs and presented them as coloured columns on their Results Pages. The coloured pattern in the diagram below beautifully portrays the Unique STR Pattern within different subgroups of the Gleason DNA Project.
It is much more difficult to see USPs on the FTDNA pages because they are not highlighted in colour. You would need to use Dave Vance's SAPP Programme or Chase Ashley's Y-DNA Grouping App to highlight any USPs.
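The comparison described above (a group's signature against an upstream modal) can be sketched in a few lines. This is an illustration of the idea only, not how WFN, SAPP or the Y-DNA Grouping App actually work; the function names are hypothetical and the marker values are invented for the example.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value at each marker across a list of {marker: value} dicts."""
    markers = haplotypes[0].keys()
    return {m: Counter(h[m] for h in haplotypes).most_common(1)[0][0]
            for m in markers}


def unique_str_pattern(group_modal, upstream_modal):
    """Markers where the group's modal value differs from the upstream modal."""
    return {m: v for m, v in group_modal.items()
            if m in upstream_modal and upstream_modal[m] != v}


# Invented values for illustration only.
upstream = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
group = [
    {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10},
    {"DYS393": 13, "DYS390": 25, "DYS19": 15, "DYS391": 10},
]

usp = unique_str_pattern(modal_haplotype(group), upstream)
```

In this made-up example the group differs from the upstream modal at two markers (DYS390 and DYS391), and those two values together would be its candidate USP. A pattern based on so few markers would be weak; as noted above, the more markers involved, the more robust the USP.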
So those are the main methods I use for assigning project members to a specific group.
In addition, I have some general advice on formatting the name for each group:
- number each group (01, 02, 03, etc) - it makes it easier to refer to when writing about it or discussing it with project members.
- include the possible ancestral location (this may be obvious from the MDKA information)
- include the abbreviated SNP Sequence (get it from Rob Spencer's Admin Utilities)
- include any specific guidance (e.g. if R-M269, upgrade to Big Y) or point members toward additional information (e.g. see Updates tab in About section for Next Steps) - this may include links to haplogroup, geographic & clan projects that they should join, as well as useful general information (e.g. how to get the most out of your Y-DNA test, essential things everyone should do).
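If you maintain group names in a spreadsheet or script, the naming advice above can be applied consistently with a small helper. This is a sketch only: the function name, the separator and the example values are assumptions made for illustration, not an FTDNA convention.

```python
def format_group_name(number, location="", snp_sequence="", guidance=""):
    """Build a group label: zero-padded number, then any non-empty parts."""
    parts = [f"{number:02d}", location, snp_sequence, guidance]
    # Skip anything left blank so the label has no empty segments.
    return " | ".join(p for p in parts if p)


# Illustrative values only.
label = format_group_name(3, "Mayo", "R-M269 > L21 > M222", "upgrade to Big Y")
```

Zero-padding the number (01, 02, 03) keeps the groups in the intended order when they are sorted as text.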
What does each group tell me?
Far more has been written about how to group project members than about how to analyse the resultant groups. The grouping process only takes you half-way ... you then need to analyse each group in turn. If the overall objective of a Surname DNA Project is to study the surname, then grouping merely lays the foundation upon which subsequent analysis is based.
The sort of questions that can be explored in any analysis of a specific group include: where is the group from? does this link us to the known history of the surname? how old is the group? what is the branching structure? how did the name evolve over time? is there an association with a pre-surname clan?
A practical example of how to approach analysis of individual groups is detailed in this video here (delivered at the O'Malley Clan Gathering in 2019).
Having a clear picture of the desired outcomes of your research allows you to create more specific project goals. Thus the objectives for any surname study could include the following:
- To identify distinct genetic groups of people carrying surname X (or one of its variants)
- To analyse each genetic group and assess where it came from, how old it is, and whether there is any connection to a pre-surname "clan"
- To communicate the conclusions of the analysis for each genetic group
- To help focus project members on specific directions for their own ongoing genealogical research
After all this work, you will need an effective way of communicating it to your project members. Different admins use different methods. Some publish regular updates on the project website on FTDNA. Others create a separate website or blog or newsletter or annual report. Whatever method you choose, you should plan to keep your project members informed about the current status of the project and any new developments affecting specific groups. Also bear in mind that you will eventually need to pass this task on to a successor so it is wise to design your communication strategy with this in mind.
Hope you find something of use among these hints and tips.
Always great to read your articles! Thanks again.
I have posted some comments on this article and the underlying 2017 post in the Campbell FTDNA activity feed. Here are some of my comments on this very excellent set of blog posts.
Over the last 20 years, I have generally grouped results in the Campbell project informally, but using the same rules codified in this article. In general, my heuristic grouping strategy was to include like kits if they matched 1-2 at 12 markers, 2-3 at 25 markers, 3 at 37, 6 at 67, and 8-9 at 111. Deep SNP results were not available in the dark ages of testing and are only infrequently available today for those people that have done extensive testing. I acknowledge that SNPs are more accurate, and I would prefer to use SNPs for grouping, but SNPs are just not uniformly available. (Rule 2, Ibid Rule 6) FTDNA is long overdue in improving their prediction algorithm and should offer more than M269 to most people.
Regarding Rule 3 – FTDNATip scores, I generally agree with what the author says except that I have found that these scores can be a rough estimate of time to the most recent common ancestor. My empiric experience with known ancestors in the Campbell project leads me to conclude that 70%-75% confidence is a good place to start when trying to determine the approximate number of prior generations to your common ancestors.
Rule 4 – Clear Demarcation. This is good in theory but tough in practice. One thing I know is that my Campbell groupings suffer from what I call “genetic drift”. For example, if A and B are grouped together because they are a 34/37 match, and B and C, and C and D, are grouped for the same reason, then it's possible to end up with a grouping that includes A and D, who match only 28/37. This is not desirable and can happen at any number of markers.
Rule 5 – Rare Markers. A few comments. First, Little’s marker frequency table that is referenced in the 2017 post is very old and really too high level at the R1b level of granularity. Better for my purposes is Mike Walsh’s table of L21 frequencies (https://drive.google.com/file/d/1dUmxIAd9IaMjCdYfaIQVwtdzFcTnOL73/view?usp=sharing). Secondly, using these tables I published an article in the Spring 2007 CCSNA Journal noticing that “it has been discovered that while allele DYS458=20 appears in 2% of the overall R1b population, it occurs with a frequency of 31% among Campbell R1b participants.”
Rule 7 – Better SNP predictions. I have been doing this as a service for Campbell Project participants for the last several years. I have built an L21 spreadsheet that takes the terminal SNPs of your matches and predicts a much farther downstream SNP. This works for most project members and allows you to verify a downstream SNP more cheaply by testing this alone instead of multiple panel tests, working yourself down the tree. There are just too many SNPs to keep track of. FTDNA should really create a report or tool that places a random collection of SNPs in “tree context.”
Rules 9 and 10 – Most Distant Known Ancestors. One of the things I’m most proud of in the Campbell project is that I recognized the importance of MDKA very early. Since the project’s outset in 2006, we have been collecting lineages from every Y-STR participant. Our current Lineage Summary document is approaching 500 pages and will benefit researchers for generations. I am unaware if any other surname project has done this for as long or as methodically as the Campbell project.
Finally, about a year or two ago there was a free tool link posted that automatically did STR clustering from a downloaded FTDNA spreadsheet. I can’t seem to find the link. Maybe others have it. In any case, FTDNA could vastly add value if they suggested groupings/clustering for a project based upon the combination of STR matches, SNPs tested, and deep SNP prediction based on the terminal SNPs of matches and any atDNA testing that has been done as described in the article.
Well done Maurice.
This sounds like a wonderful way to handle a surname project, but unfortunately I'm not learned enough to understand any of it. Thanks for your help. I'll keep studying.
Very clearly presented and helpful. Dispels some of the mystery behind the way the admins on FTDNA have grouped members of my own surname project!