Tuesday 7 April 2020

Toward a Methodology for confirming the DNA signature of an Irish Clan

Introduction


This article will be of particular interest to anyone engaged in Irish Clan research or those who wish to do so. If you are a Project Administrator of an Irish Surname Project or Haplogroup Project, I hope you find the suggestions below of practical relevance to your own project. I also hope this article sparks an ongoing conversation about this topic because I feel we are on the brink of a new and exciting era in Irish Clan research, so please feel free to leave a comment or question in the Comments section below.

I revisited the seminal work of Dennis Wright from his 2009 JOGG paper and attempted to repeat some of the previous analyses using present-day data. Dennis established that the Irish Type III signature (IT3, characterised by a distinctive pattern of STR marker values) was associated with the tribe of the Dál gCais (and hence with the many Dalcassian surnames derived from its descendants - see a list of these surnames in Footnote 4). These findings were later augmented by the discovery of a SNP marker (L226) that appeared to be present in all those with the IT3 signature. More information is available on Dennis' Irish Type III website.

Any historical DNA signature will have emerged in a specific geographic location and may have subsequently given rise to specific surnames over time. Thus DNA is associated with both geography and surnames. And we can exploit this triangular relationship to analyse the association between a specific DNA signature and reported Clan history (both in terms of Traditional Clan territory and Clan-associated surnames).

The Triangular Interplay between Irish Surnames, Geography & DNA

Dennis considered four criteria for establishing that there was a significant association between IT3 and Dalcassian (DC) surnames. These are summarised in the slide below ...



With the availability of new data and methods in the 10 years since the paper was written, these can be expanded into 7 criteria, initially simply using the interplay of Surnames, Geography & DNA that Dennis explored in his 2009 paper.

Comparing men with and without a specific DNA signature, men with the DNA signature have ...
  • a greater frequency of earliest known Direct Male Line ancestors from the Traditional Clan Territory [Criterion 1]
  • and a greater incidence of Clan-associated Surnames [Criterion 2]
Comparing men who have earliest known Direct Male Line ancestors in the Traditional Clan Territory to those who do not, those with such ancestors have ...
  • a higher prevalence of the specific DNA signature [Criterion 3]
  • and a greater incidence of Clan-associated Surnames (this is not a DNA-related criterion)
Comparing men with and without Clan-associated Surnames, those with Clan-associated surnames have ...
  • a higher frequency of the specific DNA signature [Criterion 4]
  • and a greater prevalence of DML ancestors from the Traditional Clan Territory (this is not a DNA-related criterion)

To the 4 criteria identified above, we can add 3 additional criteria to give a total of 7:
  1. People with extensive genealogically validated pedigrees (e.g. >20 generations) that go back to the Clan in question, also share the same specific DNA signature [Criterion 5]
  2. The branching structure in the Clan-associated Surnames Family Tree corresponds with that of the Genetic Family Tree [Criterion 6]
  3. Analysis of ancient remains of known historical figures descended from the Clan reveals that the specific DNA signature is present [Criterion 7]

Let's apply each of these criteria in turn using the example of L226 and the Dál gCais, and critique the evidence generated for each criterion along the way. Methodology and process are discussed in the footnotes so just the top-line results and a critique (including limitations) are discussed below.


The 7 Criteria applied to the Clan of the Dál gCais (IT3 / L226)


Criterion 1 - Comparing men with and without a specific DNA signature, men with the DNA signature have a greater frequency of earliest known Direct Male Line ancestors from the Traditional Clan Territory

There is insufficient data on the public Results Pages of the L226 project (and others) to ascertain where the earliest known Direct Male Line ancestor of each tested individual was born. FTDNA should have an extra column on the Results Pages with MDKA (Most Distant Known Ancestor) birth location data. I have asked them to do this. We need county-level MDKA birth location data in order to ascertain if there is a surplus of participants with MDKAs from the Traditional Clan Territory. We also need to bear in mind that the "Country of Origin" column on some public pages frequently contains guesses rather than known facts.

This is key information and its lack of ready accessibility is a serious obstacle to completing this work.

In his 2009 paper, Dennis found a preponderance of IT3 individuals with Direct Male Line MDKA birth locations within the Traditional Clan Territory of the Dál gCais (an area previously known as Thomond, which incorporated parts of the present-day counties Clare, Limerick & north Tipperary). Of the 103 people with the IT3 signature who had recorded an MDKA birth location, 28 had MDKAs from Clare, Tipperary or Limerick (the counties closest to the Traditional Clan Territory). In short, 28/103 = 27% of the IT3 sample had MDKA origins in or near the Clan territory. With the additional data accumulated over the past 10 years, we could now repeat this exercise for other subclades with a strong Irish presence, such as DF105, Z255, CTS4466, L1335, etc. The proportion of these comparative samples with MDKA origins in the Clan territory should be a lot lower (maybe in the 1-3% range). These figures could be compared to the original 27% and subjected to a Chi-squared test (or Fisher's exact test where the counts are small) to ascertain statistical significance.
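To give a feel for how much uncertainty sits around the 27% figure, here is a minimal sketch that computes the proportion and an approximate 95% confidence interval. The choice of the Wilson score interval (and the function name `wilson_interval`) is my own assumption for illustration; it is not a method used in the 2009 paper.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Figures quoted from the 2009 paper: 28 of 103 IT3 men with a recorded
# MDKA birth location had MDKAs from Clare, Tipperary or Limerick.
it3_local, it3_total = 28, 103
proportion = it3_local / it3_total
low, high = wilson_interval(it3_local, it3_total)
print(f"{proportion:.1%} (95% CI roughly {low:.1%} to {high:.1%})")
```

If a comparison subclade really did come in at 1-3%, it would sit well below the lower end of this interval, but that can only be confirmed once the MDKA birth location data is available.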

But to perform this analysis we need MDKA birth location data. And this is not currently provided by FTDNA in an accessible manner, even though they have this data for many test-takers in their database. Hopefully this will change in the not too distant future.




Another consideration to bear in mind is that this type of analysis is vulnerable to several biases, among them: test-takers guessing their ancestral origins (due to lack of confirmatory records, romantic notions, wishful thinking, or pure laziness); population disruption events and internal migrations (such as Cromwell banishing the Old English families "to Hell or Connaught"); and lineal extinction, where descendant lines die out or "daughter out". Rob Spencer's Extinction Simulator lets you explore how frequently male lineages go extinct for a variety of different fertility rates. (The same simulator can also be applied to the current Covid-19 pandemic to assess different reproduction rates and how long it will take for each to lead to viral extinction.)

As a result of these (and other) confounding factors, a detectable signal may not survive after all these years. But the only way to ascertain this is to do the analysis ... and for that we need the MDKA birth location data that is currently not readily accessible.


Criterion 2 - Comparing men with and without a specific DNA signature, men with the DNA signature have a greater incidence of Clan-associated Surnames

I compared the proportion of Dalcassian (DC) surnames (Footnote 4) among men in various Haplogroup R subclade projects, including: L226, M222, DF105, Z255, L1335, CTS4466 (see Footnotes 1,2,3,4 for considerations regarding methodology and process).


There is a clear signal of a surplus of DC surnames in the L226 project (20.6%) but not in the other projects (0.6% to 3%). Using an online Chi-Squared Test calculator, the proportion of those with DC surnames in the L226 group was compared to the summed numbers of those in all the other subclades combined (excluding DF105 because it is itself a subclade of M222). The results were highly significant as indicated in the screenshot below. We can thus conclude that Criterion 2 is fulfilled.


Several interesting questions arise from this analysis. Why is the percentage not greater? Why isn't the proportion of L226 participants with DC surnames closer to 100%? There are several potential explanations for this (if you think of any more, please leave a note in the Comments section below).
  • NPEs / SDSs (Non-Paternity Events, Surname or DNA Switches) ... there are many causes for these but the Take Home Message is that the risk of a surname switch on your Direct Male Line since 1000 AD is about 50%. Therefore, as time proceeds, the proportion of Dalcassian descendants with non-DC names would tend to increase.
  • "Surname Gravity" ... this is a term coined by Bart Jaski in his presentation at GGI2019 Dublin. It refers to the fact that powerful clans attracted allegiance from non-genetically-related tenants, neighbours, vassals, servants, slaves, etc. Thus more powerful clans / surnames should have much greater Y-DNA diversity, compared to lesser clans. We see some evidence of this in Dennis Wright's 2009 paper - Table 7 reveals that the IT3 signature occurs in 60% (12/20) of men with the Casey surname but only 35% (24/68) of men with the O'Brien surname (which was a more powerful clan than the Casey sept). 
  • The source from which the list of DC surnames was extracted (namely O'Hart) may be incomplete and some DC surnames may be missing. Other sources (apart from O'Hart) should be consulted to refine the list.
  • Some DC surnames may never have been recorded in the ancient genealogies ... e.g. the Limerick O'Malleys of Group 2 of the O'Malley DNA Project are L226+ and are documented as living in Thomond since the 1100s. They may be DC in origin but are not named as such in any of the ancient genealogical texts (or at least those I have consulted thus far).
  • Many surnames are multi-origin in nature - therefore some DC surnames are also associated with non-DC origins. For example, even the name O'Brien (which is one of the most prominent Dalcassian surnames) may also have origins in several other places. It may be an anglicisation of O'Braoin which in turn had 4 possible origins (in Westmeath, Meath, Mayo & Roscommon). Thus there may be at least 5 distinct genetic signatures which go back to the individual founders of each surname.
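To put the 50% figure from the first bullet above into perspective, the sketch below converts a cumulative risk into an equivalent constant per-generation rate. It assumes about 30 years per generation and roughly 1000 years since 1000 AD, and it assumes the risk is the same and independent in every generation; all of these are simplifying assumptions of mine, not figures from any study.

```python
def per_generation_rate(total_risk, generations):
    """Constant per-generation rate that compounds to the given cumulative risk."""
    return 1 - (1 - total_risk) ** (1 / generations)

def cumulative_risk(rate, generations):
    """Chance of at least one surname/DNA switch over the given number of generations."""
    return 1 - (1 - rate) ** generations

YEARS = 1000                 # roughly 1000 AD to the present (rounded)
YEARS_PER_GENERATION = 30    # assumption, not a measured value
generations = round(YEARS / YEARS_PER_GENERATION)

rate = per_generation_rate(0.5, generations)
print(f"{generations} generations -> about {rate:.1%} per generation")
```

Under these assumptions, a switch rate of only about 2% per generation is enough to reach a 50% cumulative risk, which is why a sizeable share of Dalcassian descendants would be expected to carry non-DC surnames today.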


Criterion 3 - Comparing men who have earliest known Direct Male Line ancestors in the Traditional Clan Territory to those who do not, those with such ancestors have a higher prevalence of the specific DNA signature

Again, we need MDKA birth location data in order to complete this analysis and this is lacking currently. If we were able to conduct this analysis, it should show that more people with MDKA birth locations in Clare, Limerick & Tipperary test positive for L226 compared to people with MDKA birth locations elsewhere in Ireland.

We can use a less robust approach by comparing the participants in specific geographic projects in or near the Traditional Clan Territory (namely Clare, Limerick & Tipperary) against those in nearby counties. There are specific county geographic projects for Clare (n=1445) and Tipperary (n=94) but not for Limerick. As regards the nearby counties, there are specific geographic projects for counties Cork (n=612), Kerry (n=109), Wexford (n=282), Kilkenny (n=403), and Greater Galway (n=565). However, the Cork & Wexford projects do not allow access to their Results Pages for people who are not in the projects (i.e. there is no publicly-accessible data for these projects).

We can potentially extract data from the Results Pages of the various projects above and compare the proportion of L226+ participants in each project with the proportion in the Clare & Tipperary projects.
This is a messy analysis because we do not know if we are double-counting some participants (i.e. the same person could be in several projects). Also, it would be wrong to assume that all people in a particular county project have proven direct male line ancestry from that particular county - it is likely that many participants will be exploring the possibility that they have such ancestry and their Direct Male Line may in fact be from an entirely different county.

Nevertheless, it is a useful exercise to undertake. And here are the results ...


As anticipated there are higher proportions of L226+ participants in the Clare & Tipperary projects (21-24%) than in projects based around nearby counties (2-7%). Numbers are based solely on SNP data, comparing those within each project who tested positive for L226 (or one of its downstream SNPs) with those who tested positive for a SNP on an entirely distinct branch of the Tree of Mankind. Numbers were small for Tipperary (19) and should therefore be interpreted with caution. However, the numbers from Clare were much more substantial and the analysis is therefore more robust.

Using an online Chi-Squared Test calculator, Chi-squared tests produced the following p values (see Footnote 6 for full details) ...
  • Clare vs Kerry ... p = 0.006899
  • Clare vs Kilkenny ... p = 0.000246
  • Clare vs Galway ... p = 0.00001
  • Tipperary vs Kerry ... p = 0.160434
  • Tipperary vs Kilkenny ... p = 0.119305
  • Tipperary vs Galway ... p = 0.000531

Not surprisingly two of the comparisons involving Tipperary were not significant (probably due to the small numbers involved) but all of the comparisons between Clare and nearby counties were statistically significant. Combining the numbers from Clare & Tipperary and comparing them to the combined numbers from Kerry, Kilkenny & Greater Galway produced a highly significant result and indicates that the frequency of L226 is almost 6 times higher in Clare & Tipperary than in nearby counties.



Thus this crude analysis suggests that those men with presumed direct male line ancestry from Clare and/or Tipperary have a higher frequency of testing positive for L226 than those with presumed ancestry from nearby counties. This therefore suggests that Criterion 3 is fulfilled.


Criterion 4 - Comparing men with and without Clan-associated Surnames, those with Clan-associated surnames have a higher frequency of the specific DNA signature 

The first step is to identify 2 subgroups - those with and without DC surnames. The second step is to determine which of these have the IT3 USP (Unique STR Pattern) or are L226+ ... and that potentially means assessing each subject in turn - a daunting task if we are assessing several thousand subjects.

There are several options for identifying subjects with Clan-related surnames. Data could be extracted from the Ireland Y-DNA Project, or from the Munster Irish Project, or from individual haplogroup or surname studies. The process for extracting and processing the data collected from the public Results Pages of the Ireland Y-DNA Project is discussed in Footnote 5.

Using data from the Ireland Y-DNA Project, and a conservative approach, the following numbers were analysed:
  • DC group (those with Dalcassian surnames - see Footnote 4)
    • 24 subjects were L226+
    • 122 subjects were L226-
    • 141 subjects were "uncertain"
    • thus the proportion of this group who were L226+ was 16.4% ("uncertain" cases were excluded)
  • non-DC group (those without Dalcassian surnames)
    • 97 subjects were L226+
    • 2162 subjects were L226-
    • 2072 subjects were "unlikely" to be L226
    • 3645 subjects were "uncertain"
    • thus the proportion of this group who were L226+ was 4.3% (excluding the "unlikely" numbers in the denominator) and 2.2% (including the "unlikely" numbers in the denominator). "Uncertain" cases were excluded from both analyses.


    Using an online Chi-Squared Test calculator, the results for the most conservative approach (i.e. excluding the "unlikely" and "uncertain" numbers in the denominators) are highly statistically significant, as indicated in the screenshot below.
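For readers who would like to reproduce this kind of test without an online calculator, here is a minimal sketch of a Chi-squared test on the 2x2 table above (DC vs non-DC surnames, L226+ vs L226-). It uses the standard Pearson formula with no continuity correction, which is an assumption on my part and may differ slightly from the settings of the online calculator that was actually used.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts from the Ireland Y-DNA Project analysis above
# ("unlikely" and "uncertain" cases excluded).
dc_pos, dc_neg = 24, 122            # Dalcassian surnames
non_dc_pos, non_dc_neg = 97, 2162   # all other surnames

chi2, p_value = chi_squared_2x2(dc_pos, dc_neg, non_dc_pos, non_dc_neg)
print(f"DC: {dc_pos / (dc_pos + dc_neg):.1%} L226+")
print(f"non-DC: {non_dc_pos / (non_dc_pos + non_dc_neg):.1%} L226+")
print(f"chi-squared = {chi2:.1f}, p = {p_value:.2g}")
```

The same function can be reused for any of the other pairwise comparisons in this article, provided the four underlying counts are known.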



    Why is the proportion of men with a DC surname who are L226+ only 16%? Why is it not more than this? One reason for this is that the list of DC surnames is not exclusively Dalcassian. Many of the surnames may be multi-origin in nature, rather than arising exclusively from the tribe of the Dál gCais. Even the surname O'Brien may have non-DC origins (as mentioned above under Criterion 2). This will have the effect of "diluting" the DC surname group.

    Another reason is the occurrence of a Surname or DNA Switch along the direct male line. Some of the possible causes were discussed above. The effect of these SDS phenomena is to increase the number of DC surnames that are not associated with the DC DNA signature (L226).

    It is important to remember that these are not random samples. There are many factors operating that can introduce bias and skew the results one way or the other. For example ...
    • Selection bias - many surnames will be over-represented because of the sterling efforts of project administrators to recruit members to their particular surname projects. Those surnames which do not have a dedicated surname project may thus be under-represented. And those surnames with more-able project admins may be over-represented.
    • Founder effects - many of the men who have undertaken Y-DNA tests are Diaspora Irish as opposed to Local Irish. Founder effects in the families of Irish emigrants to the US may thus cause some DNA signatures to be more frequent than others. This could skew the results but such skew is likely to happen with the more downstream branches of the surname (i.e. post-1000 AD) rather than with the Clan-associated branches further upstream (i.e. pre-1000 AD). 
    Nevertheless, the large numbers involved in these analyses are likely to reduce the influence of any bias or skew resulting from over-representation of a particular subgroup or surname. We can thus conclude that Criterion 4 is fulfilled.


    Criterion 5 - People with an extensive genealogically validated pedigree (e.g. >20 generations) that goes back to the Clan in question, also share the same specific DNA signature

    Sir Conor O'Brien has a 30-generation pedigree back to Brian Boru and tests positive for L226. Thus Criterion 5 is fulfilled. This lends a lot of support for the accuracy of his personal Direct Male Line pedigree and bolsters the conclusions based on other analyses that L226 is associated with the Clan of the Dál gCais.

    It would provide further helpful confirmatory evidence if it were possible to test another direct male line descendant who triangulates on the same common ancestor (in this case Brian Boru). This would have the added advantage of helping to characterise the DNA signature of a specific ancient Irish king for whom we know the dates of when he lived (from the ancient texts). And this in turn would help "calibrate the genetic clock" in this particular portion of the Tree of Mankind (i.e. give a more accurate assessment of the number of Years per SNP).



    Using extensive pedigrees also raises the question: how are such extensive lineages validated?

    In Ireland, some lineages will have been confirmed by the Chief Herald and the Genealogical Office, but even this system is not infallible. Some lineages may have been widely researched by previous generations and may appear in official publications such as Burke's Peerage or Debrett's Peerage. But again, both of these are known to be peppered with the (sometimes not so) occasional mistake.

    The benchmark for this kind of work is how Kevin Schurer researched the pedigree for Richard III. This was published in the supplementary information for the journal article that describes how Richard III was identified.


    Criterion 6 - The branching structure in the Clan-associated Surnames Family Tree corresponds with that of the Genetic Family Tree

    It is easy enough to extract surnames that "evolved" from various Irish Clans from publications like O'Hart's Irish Pedigrees or Bart Jaski's Genealogical Tables and to build a Clan-associated Surname "family tree" (CST). Both of the above resources have the added advantage of including dates of when various chiefs / kings reigned or died, and so approximate dates of birth can be surmised for individuals within the tree. This is important when we come to consider branching points and the dates thereof.

    These publications are an excellent starting point for building a Clan-associated Surnames Tree (CST) but there are several important caveats to bear in mind when using these or similar resources:
    • Some pedigrees have been completely fabricated, usually for political gain or to augment someone's social standing. This is frequently cited as a reason for caution when using the ancient genealogies. In general, the earlier ones are likely to be more accurate than the later ones (see Katharine Simms' book entitled Medieval Gaelic Sources).
    • These pedigrees are frequently condensed and some individuals along the descendant lines may have been omitted. Consequently, this also potentially omits branching points within the overall tree and will give an inflated calculation for the number of years per generation.
    • In other instances, individuals with similar names have been accidentally conflated (i.e. erroneously assumed to be the same person). This may have the effect of shortening the descendant line (and thus increasing the number of years per generation) or may associate the wrong descendant branch with the wrong immediate ancestor.
    • Mistakes happen - some pedigrees have accidental omissions of certain descendants. For example, in the Clan-associated Surnames Tree (CST) below, there is no mention of the names McNamara, Quin, or O'Dea, surnames that arose from the brothers of Blad (son of Cas) and are therefore Dalcassian in origin. (My thanks to Bart Jaski for pointing this out).
    • Similarly, there are occasionally accidental inclusions or misplacements of people in the pedigree. 
      • For example, as in the CST below, Quirk are said to derive from Anluan brother of Brian Boru according to a poem in An Leabhar Muimhneach (ALM, ed. 1923, Ó Donnchadha), p. 330 = Origin Theory No. 1. But this may well be a mistake, as Brian Boru also has an ancestor named Anluan who has a grandson named Corc (Ua Cuirc > Quirk) = Origin Theory No. 2. Yet on p. 302 of ALM, Brian's other brother, Echtigern, is via his son Cináed said to be the ancestor of Muinter Chuirc = Origin Theory No. 3. The Múscraige on the other side of the Shannon were ruled by Ua Cuirc kings, and they are not Dál Cais = Origin Theory No. 4. So this raises the question: is the surname Quirk truly Dalcassian in origin? And if it is, exactly where in the Dalcassian tree does it branch from?
      • Similarly, in the CST below O'Cormacan etc. may well be McCormack, but we have no pedigree of the family, only the testimony in ALM p. 237 that Muinter Chormacáin and also Muinter Sheasnáin, and others, but not O'Curry, descend from Óengus son of Eochaid Ballderg, the latter being a forefather of Brian. (Again, my thanks to Bart Jaski for this analysis).
      • Thus, some members in the Tree below may be wrongly positioned within the tree or may not be Dalcassian at all. 

    Below is the CST for the tribe of the Dál gCais as it appears in Dennis Wright's 2009 paper but with SNP markers for some of the surnames in the tree superimposed as coloured dots. The SNP Sequences (or SNP Progressions) for each of these markers are given in the superimposed tabular summary for each of the relevant surnames. From this you can see where the genetic branching points occur.

    Clan Surnames Tree (CST)
    Mutation History Tree (MHT)

    Above is a "genetic family tree" that uses DNA markers instead of named individual ancestors. This Mutation History Tree (MHT) details the line of descent from a common DNA marker (L226 in this instance) and identifies all the branching points within the tree where a new SNP mutation arose, thus forming a new and distinct line of descent. The dates for each branching point in this tree were crudely calculated using SNP counting. This is not an optimal methodology but it is one of the most consistent methodologies for the time period in question (i.e. the time of the Irish Clans, which is roughly the period from 2500 years ago to 1000 years ago). The time estimates from YFULL often do not have sufficient data points, the estimates from the Big Tree tend to be better for more distant points in time (say, >2000 years), and the TiP Report is too inaccurate. No method is perfect. A better method is badly needed.

    So how do the two trees compare? There is good consistency (but not 100% agreement) between the two trees (CST & MHT). Many of the surnames found in the CST are also found in the MHT. In addition, the overall progenitor for both trees appears to fall in roughly the same time period (about 350 AD).

    Nevertheless, as mentioned above, the dating of each branch in the MHT needs to be further improved. Current methods of estimating dates of formation are sub-optimal, due to a variety of different factors:
    • SNP counting is less reliable on the more downstream branches (due to fewer data points). 
    • The number of Years per SNP (YPS) varies considerably depending on a number of factors including the size of families down through the generations. Powerful clans will have more descendants and hence better Y-DNA coverage to detect SNP mutations, thus lowering the average number of years per SNP. For the L226 Project, the YPS is calculated to be 70 years (compared to the population average of 84 years per SNP - a 17% decrease).
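The arithmetic behind SNP counting, and the sensitivity of the resulting dates to the chosen Years per SNP figure, can be sketched as follows. The 70 and 84 year figures are the ones quoted above; the SNP count of 10 is purely hypothetical and is only there to show how far the two rates drift apart.

```python
def years_from_snps(snp_count, years_per_snp):
    """Crude age estimate for a branch: number of SNPs multiplied by Years per SNP."""
    return snp_count * years_per_snp

L226_YPS = 70         # quoted above for the L226 Project
POPULATION_YPS = 84   # quoted above as the population average

pct_decrease = (POPULATION_YPS - L226_YPS) / POPULATION_YPS * 100
print(f"L226 YPS is about {pct_decrease:.0f}% lower than the population average")

example_snps = 10     # hypothetical SNP count, for illustration only
for label, yps in (("L226 rate", L226_YPS), ("population rate", POPULATION_YPS)):
    print(f"{example_snps} SNPs at the {label}: {years_from_snps(example_snps, yps)} years")
```

Even over a handful of SNPs the two rates give estimates more than a century apart, which illustrates why the choice of YPS matters so much for dating branching points in the Clan period.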

    However, the two trees are not 100% superimposable:
    • McMahon branches off the MHT earlier than in the CST
    • Curry also branches off the MHT earlier than in the CST
    • there are two branching points for the Casey surname - is one of them an NPE / SDS?
    • McCormick is equated with O'Cormacan but this may be incorrect

    This lack of 100% consistency could be because ...
    1. the Clan Surnames Tree is wrong (e.g. the ancient genealogies are incorrect or have been interpreted incorrectly), or ...
    2. the Mutation History Tree is wrong (e.g. some SNP markers have been placed in the wrong sequence on the Tree), or ...
    3. NPEs / SDSs have occurred over the course of time and some branches have a surname that does not go back to the progenitor of that branch
    4. other reasons ...
    Further Steps:
    1) the MHT is currently just based on SNP data. We could add in extra subjects with the USP (Unique STR Pattern) of IT3 and repeat the analysis.
    2) the comparison of the two trees could be subjected to some sort of statistical analysis of "fit" or "concordance" ... but I'm not familiar with the type of statistical analysis this would entail.


    Criterion 7 - Analysis of ancient remains of known historical figures descended from the Clan reveals that the specific DNA signature is present

    One way to prove that Brian Boru carried L226 would be to exhume his remains and test them. This is not as far-fetched as it sounds. The Barrymore DNA Project obtained permission to enter the mausoleum where some of the Earls of Barrymore were buried and took samples from the 3 sets of skeletal remains that they found there. Jim Barry (one of the Project Admins) presented his findings at Genetic Genealogy Ireland in 2016 (see the video of the presentation here).

    There are relics of famous chiefs and kings all over Europe. Some of these have already been examined and DNA extracted (e.g. Louis XVI of France). And this has helped to prove or disprove certain theories and hypotheses. So this particular approach could be a useful additional criterion that could provide conclusive proof.

    However, there are several obstacles and challenges that would need to be overcome:
    1. is the location of specific ancient remains known?
    2. are the remains definitely those of the person in question?
    3. can permission be obtained to excavate the site and recover the remains?
    4. can a suitable sample be extracted (e.g. from the petrous part of the temporal bone)?
    5. what type of analysis can be performed on the DNA sample? How many STRs can be generated? Is a USP (Unique STR Pattern) present? Is Next Generation Sequencing of the Y chromosome (e.g. Big Y) possible?
    6. can a terminal SNP be identified? if not, can it be predicted from a USP? (the caveat here is that we are comparing it to modern data)
    7. can the remains be radio-carbon dated?


    Weighting the criteria in order of importance


    Having reviewed the 7 criteria, we can reorder them in potential order of importance thus:
    1. Analysis of ancient remains of known historical figures descended from the Clan reveals that the specific DNA signature is present [Criterion 7]
    2. People with an extensive pedigree (e.g. >20 generations) that goes back to the Clan in question, also share the same specific DNA signature [Criterion 5]
    3. The branching structure in the Clan-associated Surnames Family Tree corresponds with that of the Genetic Family Tree [Criterion 6]
    4. Comparing men with and without a specific DNA signature, men with the DNA signature have a greater incidence of Clan-associated Surnames [Criterion 2]
    5. Comparing men with and without Clan-associated Surnames, those with Clan-associated surnames have a higher frequency of the specific DNA signature [Criterion 4]
    6. Comparing men with and without a specific DNA signature, men with the DNA signature have a greater frequency of earliest known Direct Male Line ancestors from the Traditional Clan Territory [Criterion 1]
    7. Comparing men who have earliest known Direct Male Line ancestors in the Traditional Clan Territory to those who do not, those with such ancestors have a higher prevalence of the specific DNA signature [Criterion 3]

    Ultimately, it may be that no single criterion on its own provides sufficient evidence that confirms or refutes a theory to an acceptable level of scientific certainty (e.g. p < 0.01). In these circumstances, a Bayesian Statistical Analysis (like the one used to confirm the identity of the remains of Richard III or Thomas Kent) may have to be employed in order to confirm or refute a particular theory. This would be coupled with an Odds Ratio statistic.
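As a very rough sketch of how a Bayesian combination of evidence might work, the snippet below multiplies prior odds by a likelihood ratio for each criterion to arrive at a posterior probability. The likelihood ratios and the 50:50 prior are entirely hypothetical placeholders (they are not derived from the data in this article), and the approach assumes the criteria are independent of one another, which would need to be justified in any real analysis.

```python
from math import prod

def posterior_probability(prior_probability, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios (Bayes in odds form)."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical likelihood ratios, for illustration only:
# how much more likely each piece of evidence is if the Clan association is true.
hypothetical_lrs = {"Criterion 2": 2.0, "Criterion 5": 5.0}

posterior = posterior_probability(0.5, hypothetical_lrs.values())
print(f"Posterior probability: {posterior:.1%}")
```

The appeal of this approach is that several individually modest pieces of evidence can add up to a strong overall conclusion, provided each likelihood ratio can be defended.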

    Alternatively a combined metric could be developed to convert information from the various possible analyses into a single outcome measure that could be subjected to statistical analysis.


    Conclusions

    1. Seven criteria are presented that can be used to help establish the likelihood (or not) of a particular DNA signature being that of a specific Irish Clan. 
    2. Using the example of L226 (which has previously been shown to be associated with the tribe of the Dál gCais, in Dennis Wright's seminal paper from 2009), 5 of the 7 criteria were adequately met. 
    3. Where statistical analysis could be performed (Criteria 2, 3 and 4), the results were highly statistically significant.
    4. Criterion 1 could not be explored due to a lack of adequate data on the FTDNA website. Such data is held by the company but is inadequately presented on their Results Pages. This failing needs to be urgently addressed.
    5. This methodology can be repeated for other Clans and will help advance Irish Clan research.
    6. The data raise interesting questions about the evolution of Irish DNA signatures over time and the potential forces that influence their survival or annihilation.
    Maurice Gleeson
    Mar 2020

    Acknowledgements
    A draft version of this blog post was reviewed by several colleagues who fed back very useful comments, insights and challenges that improved the final version. It is because of the collaborative nature of the genetic genealogy community that we can together make strides toward a common goal and the cause of this citizen science can be advanced. I thank all my colleagues for their efforts to make me a better scientist.

    Footnotes & Resources

    1) The specific subclade projects were chosen as they are known to contain a large number of participants who have ancestry from Ireland. L226 is associated with the Dál gCais (as established in the 2009 paper), M222 is associated with the Northern Uí Neill as is the downstream SNP DF105, Z255 is associated with southeast Ireland / Leinster, L1335 is associated with Ireland, and CTS4466 is predominant in Southwest Ireland.

    2) In addition, the estimated age of these subclades is likely to fall within (or close to) the time period within which the Irish Clans were formed, which is taken to be approximately 1000-2500 years before present (i.e. 500 BC to 1000 AD). The approximate age of each subclade is summarised in the table below using estimates from both the YFULL and Big Tree websites (both of which have major limitations).
    • L226 ... 1350-1915 ybp
    • M222 ... 2088-2700 ybp
    • DF105 ... 1560-1900 ybp
    • Z255 ... 1700-2035 ybp
    • L1335 ... 2200-3151 ybp
    • CTS4466 ... 1650-2316 ybp
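
    The comparison described in this footnote amounts to checking how each estimated age range sits relative to the assumed Clan-formation window. A minimal Python sketch, using the figures listed above; the function and variable names are illustrative assumptions, and the three-way split is only one possible reading of "within (or close to)".

```python
# Assumed Clan-formation window, in years before present (ybp).
CLAN_WINDOW = (1000, 2500)

# Estimated age ranges (ybp) as listed in this footnote.
SUBCLADE_AGES = {
    "L226": (1350, 1915),
    "M222": (2088, 2700),
    "DF105": (1560, 1900),
    "Z255": (1700, 2035),
    "L1335": (2200, 3151),
    "CTS4466": (1650, 2316),
}


def relation_to_window(age_range, window=CLAN_WINDOW):
    """Return 'within', 'overlaps' or 'outside' for an age range."""
    low, high = age_range
    w_low, w_high = window
    if w_low <= low and high <= w_high:
        return "within"
    if low <= w_high and high >= w_low:
        return "overlaps"
    return "outside"


for name, ages in SUBCLADE_AGES.items():
    print(f"{name}: {ages[0]}-{ages[1]} ybp -> {relation_to_window(ages)}")
```

    On these estimates, four of the six subclades fall entirely within the window and two (M222 and L1335) straddle its older boundary.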


    3) Process for extracting data from Haplogroup subclade projects:
    • I first copied the data from the public Results Pages of the relevant projects and pasted it into an Excel spreadsheet. 
    • I assumed that everyone in each project was either confirmed to be positive for the overarching project-specific SNP (i.e. M222, Z255, etc) or was judged to be highly likely to be positive for that particular SNP. 
    • I deleted all the STR marker value data so I was just left with the surname, MDKA (where available), country of origin, & Haplogroup columns. 
    • I then added an extra column with sequential numbers for each row in the worksheet (this would allow the original configuration to be easily restored). 
    • I then sorted the entire worksheet by Surname. 
    • I compared this list of project-related surnames with a list of Dalcassian surnames extracted from the 2009 paper, using the following formula to compare two lists of surnames: =COUNTIF(J:J,@H:H). 
    • the number of names was counted and the relative percentages of DC and non-DC surnames were calculated.
    • at a later stage, I could go through each list manually and see if any DC surname spelling variants occur in each project and have been omitted from this initial analysis. I could also see if the original list of Dalcassian surnames could be expanded to include common variants.
    • the spreadsheet containing this analysis can be downloaded from the following Dropbox link ...  https://www.dropbox.com/s/9anhg0h81rje8w7/Dalcassian%20surname%20frequencies.xlsx?dl=0
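
    The COUNTIF comparison described above can also be sketched outside Excel. The Python below is a minimal illustration, assuming two plain lists of surnames; the function names and the sample surnames are placeholders rather than the project data. Stripping whitespace and ignoring case before comparing avoids missed matches caused by stray spaces or capitalisation.

```python
from collections import Counter


def normalise(surname):
    """Trim stray spaces and ignore case so that 'Casey ' matches 'casey'."""
    return surname.strip().casefold()


def split_by_reference(project_surnames, reference_surnames):
    """Count participants whose surname is / is not on the reference list."""
    reference = {normalise(name) for name in reference_surnames}
    counts = Counter(
        "DC" if normalise(name) in reference else "non-DC"
        for name in project_surnames
    )
    total = sum(counts.values())
    percentages = {}
    if total:
        percentages = {group: 100.0 * n / total for group, n in counts.items()}
    return counts, percentages


# Placeholder data for illustration only (not taken from any project).
project = ["O'Brien", "Casey ", "Smith", "considine", "Murphy"]
dalcassian = ["O'Brien", "Casey", "Considine"]

counts, percentages = split_by_reference(project, dalcassian)
print(dict(counts), percentages)
```

    An exact match like this behaves as COUNTIF does, so spelling variants still have to be added to the reference list by hand.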
    4) This is the amended list of Dalcassian surnames that was used in this initial analysis. It is taken from the 2009 paper (which in turn extracted it from O Hart's Irish Pedigrees; or the Origin and Stem of the Irish Nation) and refined to include close variants of the original list (e.g. removing the O before O'Casey to give the additional surname Casey; removing the Mac in MacConsidine to give Considine). On a technical note, all spaces AFTER each name have to be removed as otherwise the Excel formula in Footnote 3 above does not identify all the possible occurrences ...



    5) There are several options for identifying subjects with Clan-related surnames - using data from the Ireland Y-DNA Project, or from the Munster Irish Project, or from individual surname studies for specific DC surnames.

    Ireland Y-DNA Project
    1. The Ireland Y-DNA Project has over 8000 participants. The data from the public Results Page was copied and downloaded to an Excel spreadsheet (available via Dropbox here). The surnames therein were compared to the Amended Initial List of Dalcassian surnames (4) and this exercise identified 287 participants with DC surnames out of a total of 8263 subjects (3.5%). The remaining surnames could be considered to be non-DC. Thus it was relatively easy to identify those with and without DC surnames (and to separate them into 2 separate worksheets). The non-DC group consisted of 7976 individuals.
    2. The next big problem is to figure out how many subjects in each group (DC & non-DC) are L226+ or have the IT3 USP (Unique STR Pattern). There are 3 subgroups:
      1. those who are L226+ or have a terminal SNP downstream of L226 (the problem here is trying to identify where any given SNP is positioned on the Haplotree and thus determining if it is downstream of L226, upstream of L226, or on an entirely different branch). Rob Spencer's FTDNA Admin Utilities would have been a useful tool to use for this exercise but I only realised this after the exercise was completed.
      2. those with a terminal SNP potentially upstream of L226 - these might prove positive for L226 or one of its downstream SNPs if they undertook further testing
      3. those with an estimated terminal SNP (red text). In this case, allocation to the L226+ group would depend on assessment of each individual case for the IT3 USP
    3. Trying to figure out the exact position of any given SNP on the Haplotree can be very time-consuming (even with Rob Spencer's FTDNA Admin Utilities). Some of them are very far downstream and may not appear in any of the commonly used haplotrees (YFull, Big Tree). I had to go down through the FTDNA Public Y-Haplotree to identify the downstream branches and then search for them in the Excel spreadsheet to ascertain how many of the downstream branches were represented among the 8263 total project participants ... but this process only uses the "lead SNP" for each SNP block and fails to take account of any phyloequivalent SNPs - thus some relevant subjects might have been missed. This exercise identified 120 subjects who were positive for L226 or one of its downstream SNPs ...
      • Only 19 of the 8263 subjects are recorded as being L226+
      • A further 90 have a "DC" SNP (DC = Dalcassian) - it was assumed that all DC SNPs fell below L226. It is really handy that so many SNPs downstream of L226 have been labelled with the DC prefix.
      • 2 subjects are recorded as FGC5660+
      • ZZ31 ... 1
      • FGC5659 ... 4
      • FT71838 ... 1
      • FGC5647 ... 1
      • FGC5639 ... 2
      • There were no subjects in the Ireland Y-DNA Project who were recorded as being positive for any of the following SNPs downstream of L226 ... Z17669, FGC79628, BY39756, BY4101, BY62163, FGC5628, BY4102, BY122665, A6097, FT62906, FT159770, FT14526, FT14460, FT72187, BY4103, ZZ34_1, Y5610, YFS231286, FT120209, FGC13418, BY5212, Y44000, FT96594, PH184, BY174767, FT101316
      • Rob Spencer's SNP Tracker and especially his FTDNA Admin Utilities would have been useful tools for this exercise (one lives and learns).
    4. These SNPs were thereafter identified in the DC and non-DC subgroups. As a result, 3 subgroups emerged: L226+, L226-, and Uncertain (either upstream SNPs, or insufficiently tested SNPs).
      • In the DC group, L226 and its downstream SNPs could initially be identified in 24 subjects. Non-L226 subjects (n=38) were initially identified by being in a non-R Haplogroup or as belonging to well-known distinct subclades (e.g. M222, DF105). Uncertain subjects were identified if they had an obvious upstream terminal SNP (e.g. M269, Z253, L21, etc; n=135). For the remaining terminal SNPs (n=90), a search (using Cmd+F) was conducted in the L226 branch on the Big Tree for each of the terminal SNPs listed among the 287 subjects. The Big Tree was chosen because it contains phyloequivalent SNPs (which would therefore be found by any search). This search was undertaken only in the DC group because it contained fewer subjects than the non-DC group (and hence the assessment was more manageable). Some SNPs had to be checked against the FTDNA Public Y-Haplotree, the ISOGG Tree, and Ybrowse.org to ascertain if they were upstream of L226. The final analysis yielded 24 subjects who were L226+ and 122 who were L226-, leaving the remaining 141 undetermined. Analysing just the 146 subjects who could be confidently allocated to the L226+ or L226- subgroups, 16.4% were L226+, and 83.6% were L226-.
    5. Turning to the non-DC group, first it was necessary to identify any SNPs that were upstream of L226 and place them in the "uncertain" category. To do this, each of the SNP blocks above L226 was identified via the Big Tree and copied and pasted into an Excel spreadsheet. Each SNP was allocated a separate cell using the "text to columns" function and then prefixed with R-. This column of SNPs was then compared to the column of SNPs in the non-DC group using the function: =COUNTIF(F:F,@T:T). In so doing, 258 instances of upstream SNPs were identified. These were excluded from the denominator. Most had already been identified by visual inspection. The exceptions were: L459 (1) & Z2542 (3). This exercise may not have captured all the phyloequivalent upstream SNPs but hopefully it captured most of them.
    6. A fourth group (column) was added - "Unlikely L226+" - and all other green terminal SNPs were placed in this group. The vast majority would be on distinct adjacent branches but it could not be ruled out that some might have been upstream of L226 as a result of insufficient downstream testing. 
    7. The analysis was performed with and without the "Unlikely L226" numbers in the denominator (the "uncertain" subgroup was excluded from both analyses).
      • total number of subjects = 7976
        • 97 subjects (1.2%) tested positive for L226 or a downstream SNP
        • 2162 subjects (27.1%) were definitely not L226+
        • 2072 (26%) were "unlikely L226"
        • 3645 (45.7%) were uncertain (i.e. unrecorded SNP or upstream terminal SNP)
      • the percentage of those with non-DC surnames who were L226+ was ... 
        • 97 / (97 + 2162) = 4.3% (without the "unlikely L226" included)
        • 97 / (97 + 2162 + 2072) = 2.2% (with the "unlikely L226" included)
    8. So comparing the DC and non-DC groups, the proportion of each who were L226+ was ...
      • 16.4% vs 4.3% (conservative approach, with "unlikely L226" excluded)
      • 16.4% vs 2.2% (less conservative approach, with "unlikely L226" included)
    9. The numbers in the DC group could be further augmented by establishing which had the IT3 USP. This would entail assessing the haplotype of each individual and identifying which ones had a) the full IT3 USP, or b) a partial IT3 USP. This exercise was not undertaken at this time.
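
    The allocation of terminal SNPs to subgroups described in steps 4 to 6 can be sketched as a simple lookup. This is a minimal Python illustration, not the procedure actually used: the SNP sets contain only the examples named in this footnote, the assumption that every DC-prefixed SNP lies below L226 is carried over from step 3, labels are assumed to carry a haplogroup prefix (e.g. R-L226), and the function name and sample labels are hypothetical. It follows the route taken for the non-DC group, where other confirmed SNPs are placed in an "Unlikely L226+" subgroup.

```python
from collections import Counter

# SNPs at or below L226 that were found in the project (from step 3).
L226_AND_DOWNSTREAM = {"L226", "FGC5660", "ZZ31", "FGC5659",
                       "FT71838", "FGC5647", "FGC5639"}
# Examples of upstream SNPs named in the text; a real list would be longer.
UPSTREAM_OF_L226 = {"M269", "Z253", "L21", "L459", "Z2542"}
# Examples of well-known distinct subclades named in the text.
DISTINCT_SUBCLADES = {"M222", "DF105"}


def classify(terminal_snp, confirmed=True):
    """Allocate one participant to a subgroup from the terminal SNP label.

    `terminal_snp` is a prefixed label such as 'R-L226'; `confirmed` is
    False for estimated (red text) haplogroups.
    """
    if not terminal_snp or not confirmed:
        return "Uncertain"
    haplogroup, _, snp = terminal_snp.partition("-")
    if haplogroup != "R":
        return "L226-"                  # non-R haplogroup
    if snp in L226_AND_DOWNSTREAM or snp.startswith("DC"):
        return "L226+"
    if snp in UPSTREAM_OF_L226:
        return "Uncertain"              # might be L226+ if tested further
    if snp in DISTINCT_SUBCLADES:
        return "L226-"
    return "Unlikely L226+"             # other confirmed (green) SNPs


# Hypothetical labels for illustration only.
sample = ["R-L226", "R-DC33", "R-M269", "R-M222", "I-M253", "R-Z255", ""]
tally = Counter(classify(label) for label in sample)
print(dict(tally))
```

    A lookup like this only matches the "lead SNP" names it has been given, so it shares the limitation noted in step 3: phyloequivalent SNPs are missed unless they are added to the sets.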
    A similar exercise could be repeated using the public data from the Munster Irish Project or from Individual DC Surname Projects. The advantage of using the Munster Irish Project database is that the project focuses exclusively on surnames known to be associated with the province of Munster in southern Ireland which incorporates the ancient territories of the Dál gCais. Thus this would provide a more geographically-relevant comparator population.
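
    The arithmetic behind steps 4, 7 and 8 of the Ireland Y-DNA Project exercise is easy to re-run. The Python sketch below recomputes the proportions from the counts reported in this footnote and, as an addition that is not part of the original analysis, an odds ratio for the comparison that excludes the "unlikely L226" subgroup; the function and variable names are illustrative.

```python
def proportion(positive, negative):
    """Share of classified subjects who are L226+."""
    return positive / (positive + negative)


def odds_ratio(pos_a, neg_a, pos_b, neg_b):
    """Odds of being L226+ in group A relative to group B."""
    return (pos_a / neg_a) / (pos_b / neg_b)


# Counts reported in this footnote.
dc_pos, dc_neg = 24, 122             # DC surnames: L226+, L226-
non_dc_pos, non_dc_neg = 97, 2162    # non-DC: L226+, definitely not L226+
non_dc_unlikely = 2072               # non-DC: "unlikely L226"

dc_share = proportion(dc_pos, dc_neg)
non_dc_share = proportion(non_dc_pos, non_dc_neg)
non_dc_share_incl = proportion(non_dc_pos, non_dc_neg + non_dc_unlikely)
ratio = odds_ratio(dc_pos, dc_neg, non_dc_pos, non_dc_neg)

print(f"DC: {dc_share:.1%}")
print(f"non-DC, 'unlikely' excluded: {non_dc_share:.1%}")
print(f"non-DC, 'unlikely' included: {non_dc_share_incl:.1%}")
print(f"odds ratio ('unlikely' excluded): {ratio:.2f}")
```

    Keeping the counts in one place makes it straightforward to repeat the comparison for another project, such as the Munster Irish Project, by swapping in its numbers.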

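    6) Chi-squared tests using an online Chi-Squared Test calculator produced the following p values. In each table, a cell shows the observed count, followed by the expected count in round brackets and that cell's contribution to the chi-square statistic in square brackets.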


                              Clare                   Kerry                   Marginal Row Totals
    L226+                     68 (60.27) [0.99]       3 (10.73) [5.56]        71
    L226-                     213 (220.73) [0.27]     47 (39.27) [1.52]       260
    Marginal Column Totals    281                     50                      331 (Grand Total)

    The chi-square statistic is 8.3441. The p-value is .003869. Significant at p < .05.
    The chi-square statistic with Yates correction is 7.299. The p-value is .006899. Significant at p < .05.


                              Clare                   Kilkenny                Marginal Row Totals
    L226+                     68 (54.88) [3.14]       7 (20.12) [8.55]        75
    L226-                     213 (226.12) [0.76]     96 (82.88) [2.08]       309
    Marginal Column Totals    281                     103                     384 (Grand Total)

    The chi-square statistic is 14.5249. The p-value is .000138. Significant at p < .05.
    The chi-square statistic with Yates correction is 13.4386. The p-value is .000246. Significant at p < .05.

                              Clare                   Galway                  Marginal Row Totals
    L226+                     68 (45.14) [11.58]      3 (25.86) [20.21]       71
    L226-                     213 (235.86) [2.22]     158 (135.14) [3.87]     371
    Marginal Column Totals    281                     161                     442 (Grand Total)

    The chi-square statistic is 37.8731. The p-value is < 0.00001. Significant at p < .05.
    The chi-square statistic with Yates correction is 36.2346. The p-value is < 0.00001. Significant at p < .05.



                              Tipperary               Kerry                   Marginal Row Totals
    L226+                     4 (1.93) [2.23]         3 (5.07) [0.85]         7
    L226-                     15 (17.07) [0.25]       47 (44.93) [0.1]        62
    Marginal Column Totals    19                      50                      69 (Grand Total)

    The chi-square statistic is 3.4222. The p-value is .064324. Not significant at p < .05.
    The chi-square statistic with Yates correction is 1.9701. The p-value is .160434. Not significant at p < .05.

                              Tipperary               Kilkenny                Marginal Row Totals
    L226+                     4 (1.71) [3.05]         7 (9.29) [0.56]         11
    L226-                     15 (17.29) [0.3]        96 (93.71) [0.06]       111
    Marginal Column Totals    19                      103                     122 (Grand Total)

    The chi-square statistic is 3.9743. The p-value is .046199. Significant at p < .05.
    The chi-square statistic with Yates correction is 2.4264. The p-value is .119305. Not significant at p < .05.

                              Tipperary               Galway                  Marginal Row Totals
    L226+                     4 (0.74) [14.39]        3 (6.26) [1.7]          7
    L226-                     15 (18.26) [0.58]       158 (154.74) [0.07]     173
    Marginal Column Totals    19                      161                     180 (Grand Total)

    The chi-square statistic is 16.7427. The p-value is .000043. Significant at p < .05.
    The chi-square statistic with Yates correction is 12.0022. The p-value is .000531. Significant at p < .05.
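
    The figures above can be checked without an online calculator. The sketch below is a minimal Python implementation of the 2 x 2 chi-square test (with and without the Yates correction) that uses only the standard library; the function name is illustrative. It relies on the fact that, for one degree of freedom, the p-value equals erfc(sqrt(chi-square / 2)).

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test of independence for the 2 x 2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction, never allowed to go below zero.
        diff = max(0.0, diff - n / 2)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = n * diff ** 2 / denominator
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Clare vs Kerry: rows are L226+ / L226-, columns are Clare / Kerry.
stat, p = chi_square_2x2(68, 3, 213, 47)
stat_yates, p_yates = chi_square_2x2(68, 3, 213, 47, yates=True)
print(f"chi-square = {stat:.4f}, p = {p:.6f}")
print(f"with Yates correction: chi-square = {stat_yates:.4f}, p = {p_yates:.6f}")
```

    Several expected counts in the Tipperary tables are below 5, a situation in which Fisher's exact test is often preferred to the chi-square test, so those three comparisons may be worth re-checking that way.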

    An Excel spreadsheet with the data from each of the projects is available here.





    13 comments:

    1. That is a huge amount of work, Maurice - and does show what is now possible.

      I do have a passing interest in the surname Brian/Bryan. You may be unaware that there is a Barony in Pembrokeshire (Walwyns Castle) that was run by the De Brian / De Bryan family for many years. Whether any of them made their way over to Ireland, I don't know - but on paper I am working with Linda Horton on this, who has interests in the Brian surname. There are place-names like Brampton Bryan in Shropshire, which may relate to this family too.

      Brian

    2. I look forward to using this on one of the surnames I'm working on. Thank you.

    3. Maurice: You have trawled through a mass of data. You have thus developed a template. The use of the SNP data cuts through the triangle of geography/history/DNA. The SNP data leads the way. In the Fitzpatrick project we have a more contemporary time frame. The Lord and 1st Baron of Upper Ossory was born in 1485 AD. His pedigree has been established by Ronan Fitzpatrick and Steve Zalewski. Have a look in the Kilkenny project. Then see the number of (non-biased) testees who have SNPs in common with this historical figure. Needs some contact with you to elucidate.

    4. Hi,
      I noticed you have my surname in the non-Dal Cais list. My understanding is it was originally Mac Conchraidha.

      Replies
      1. I thought my surname would show up. It's "Crow".

      2. According to Woulfe, Mac Conchradha (and variants MacEnchroe, MacEncroe, Crough, Crowe) derived from 'son of Conchraidh' (a rare Irish personal name); the name of an old Thomond family, still well known in Clare, Tipperary and Limerick ... https://www.libraryireland.com/names/macc/mac-conchradha.php
        The ancestral territory suggests it may be Dalcassian, but not all Dalcassian surnames were recorded by O Hart (from which the list of surnames in the 2009 paper is derived).
        Crow was also a name associated with the "Adventurers for Lands in Ireland" ... https://www.libraryireland.com/Pedigrees2/names-cromwellian-adventurers-ireland.php

    5. As an interesting note on the NPEs theory, the O’Canannáin clan of Tir Conaill (Donegal today) is known to have two branches, the M222 and the L226. My Irish cousin Francis Columba Cannon did some research and has this to say, "My conclusion is that our Y-chromosome was introduced into the O’Canannáin clan during an event in 1153, when the "Men of Munster" (that is the Dal gCais army) were billeted with our ancestral clan in Tir Conaill (Donegal today) for probably a month. The billeting was arranged by the O'Neill high-king at that time probably for several reasons but primarily to revitalize the O’Canannáin army with robust young warriors. Dr Katharine Simms, referenced in Thomas Cannon's paper, suggested that the O’Canannáin dynasty was so weakened in the mid-twelfth century from constantly defending itself from other dynasties, that it had become dependent on the O'Neill dynasty for support. My research suggests that propagating warriors was a vital social function in the 'warlord phase' of early Irish history, and billeting provided a means of fulfilling this function."

      Replies
      1. Very interesting indeed. That certainly would explain why the surname is associated with several different SNP markers.
        The anglicised versions (Cannon, Canning) may also have at least two distinct Irish origins, according to Woulfe - one in Tir Conaill (as you mention above) and another in Ui Maine (Galway), the latter related to the O'Maddens ...
        1) https://www.libraryireland.com/names/oc/o-canannain.php
        2) https://www.libraryireland.com/names/oc/o-canain.php

    6. I did some research into the origins of various surnames in O'Hart's and also in An Leabhar Muimhneach and I tried to reconcile the DNA (from the results of the R-L226 project and the testers' surnames) with the genealogies and constructed a graph showing where I was theoretically successful with this. A lot of the things here are just speculation; other things are probably a lot more obvious inferences where the DNA and genealogies just make total sense together. I think the most obvious example of this is how descendants of three sons of Cas: Bloid, Caisin and Aonghus all branch off from sons of Z17669.

      Here is a link containing a detailed image of the chart I have constructed:
      https://i.imgur.com/KLNpieN.png

      Another image here shows the full pedigree along the main line:
      https://i.imgur.com/VuQP4VD.png

      Replies
      1. Nice graphs, Jack. This is exactly the type of research that needs to be undertaken. The genealogies may or may not be entirely accurate and evidence for this may show up when comparing the Clan Surnames Tree with the Mutation History Tree constructed from Y-DNA SNPs & STRs. I am exploring how these two trees can be compared for concordance and I hope to do a blog post about that at some stage in the future. In the meantime, comparing these Clan Surname Trees with the Mutation History Trees is a great way of generating working theories that can be subjected to further testing in due course. But there are significant limitations to be aware of - I don't think that DNA is likely to solve many problems of the type "was X really the son of Y or was he the great great grandson or the Uncle?"

    7. Maurice I'm coming back to this blog post after some time and thinking it through again. These are perhaps obvious comments and you HAVE covered them in the article though not as separate points of their own:

      1. Clearly the DNA signature selected for comparison should be as close as possible in time to the founding/original growth of the clan, mainly to differentiate older descendant lines as cleanly as possible so as not to confuse the analysis. TMRCA estimates from either SNPs or STRs are possible though vague of course. But choosing the right DNA signature (STR or SNP) close to the clan origins to me is an underlying additional criterion of its own... heaven knows we have enough cases of thousands-of-years-old haplogroups being force-fit into identification with historical period cultural groups!

      There are probably also a number of subsidiary "negative correlation" criteria like that remains from older periods than the clan don't exhibit the DNA signature, or no significant percentage of remains outside the territory exhibit the DNA signature, etc, that follow as corollaries to your 7 criteria.

      2. Your statistical approach I think is definitely valid but even with a measure of statistical significance it's hard to put bounds around what rises to "supported" versus "likely" etc. In general both geography and surname distributions are subject to what I would term "drift" - geographical distribution from the migrations of the DNA signature in and out compared to the territories in question as well as dilution from influx of other DNA signatures, and surname distributions from the usual i-NPE and e-NPE events as well as forces like your "surname gravity" (and not forgetting differing surname origins as well). (Note Y-DNA signatures have genetic drift as well but sufficient SNP structure keeps that accounted for). You've described analysis approaches for those, but I wonder if both could be estimated in advance - geographic "drift" from clan histories and local territory history, and surname "drift" from population statistics appropriate to the time periods and cultures in question. Again you've already walked through examples of doing just that in your L226 example, I'm just wondering if it can be extrapolated reliably enough into predictions that results can be assessed against.

      My last thought for now is that for clans which have not had a single dominant Y-DNA signature through their history, I wonder if the approach could still work with two or three dominant Y-DNA signatures? Geographic or surname "drift" might be too chaotic at some point I suppose.

      Replies
      1. Thanks for your comments, Dave - excellent points, all well made.

        Your first point about finding the SNP or STR signature closest to the time of Clan foundation is very important. And TMRCA estimates will always carry a huge margin of error. Your SAPP Programme is (to my mind) the best current method of estimating TMRCAs, especially when 851 STR marker values are used - currently these have to be manually extracted from individual Big Y-700 results but hopefully FTDNA will provide a way of including these (just the relevant ones i.e. only those markers with mutations between project members) on the Results Page of Surname DNA Projects so that they can be easily downloaded en masse in a csv file.

        Conceiving what might constitute a valid and useful statistical approach to confirming / refuting an association is a significant conundrum. I imagine this would take some major modelling work (as you suggest) and I don't know if some academic institution might be willing to take this on. And, like you say, trying to apply such a model when there is more than one dominant Y-DNA signature associated with a particular surname / clan could throw a rather large spanner in the works!

        Still, at least it looks like Clan Research will keep us busy for a while yet.
