DNA and Family Tree Research: April 2016

Wednesday, 27 April 2016

Which DNA Test is Best for Me?

People frequently ask the question: which DNA test should I do? Well, the answer is simple: the test that is best for you depends very much on the sort of questions you would like answered.

Below is a selection of the typical kind of questions that people ask about DNA testing and some brief answers to them. Be sure to explore the links for more information. Hopefully this will help you understand what each type of test can do and that in turn will help you decide which one is best for you.

How many types of DNA test are there?

There are 3 main tests you could do, and you could test one of several members of your family - it depends on what questions you would like answered:

Y-DNA traces your father’s father’s father’s line
mtDNA (mito or mitochondrial) traces your mother’s mother’s mother’s line
and atDNA (autosomal) traces ALL your ancestral lines and gives you your ethnic makeup.

Note that Y-DNA and mtDNA will only give you information about one ancestral line each, whereas atDNA gives you information on all your ancestral lines (but only has a reach of about 300 years, compared to 200,000 years with the other two types of test). You may wish to look at this YouTube video I made explaining the three types of test in more detail and giving examples of their application to genealogy.

Here are a few examples of questions you might want answers to and the best test to address each one:

How do I find out about my ethnic origins?

atDNA will tell you roughly what percentage of your DNA is from Europe, Asia, Africa, etc. It will also give you rough estimates on a sub-regional level (e.g. "Central Europe" or "France/Germany") but is unlikely to identify a particular country. Currently this ethnic admixture test (also known as biogeographical analysis) only gives crude estimates and will continue to be refined over time. Still, it makes for a pretty picture which the kids can print out and take to school.
The general opinion among genetic genealogists is that 23andme gives the best genetic ethnicity estimates, followed by Ancestry, and then FamilyTreeDNA. None of the tests is currently accurate enough to pinpoint ancestral homelands but they might point you in the right direction. Check out this blog post for more info - Making the best of what's not so good by Judy G Russell, The Legal Genealogist, 22 February 2015.
Both Y-DNA and mtDNA will tell you where that one particular ancestral line originated (e.g. Western Europe). And because both go back about 200,000 years to Africa, they will also give you the crude migration routes those particular ancestors took. More nice pictures for the kids' school projects.

How do I find out more about my surname and where it came from?

Do the Y-DNA test, and test the oldest generation available - that would be your father, paternal uncle, or paternal grandfather. If you are male, your Y-DNA should be essentially the same as your father's Y-DNA and your grandfather's (bar the occasional mutation), and so on all the way back on the direct male line.
... unless there has been an NPE along the way. NPE stands for Non-Paternity Event or Not the Parent Expected. Common causes are secret adoptions, infidelity within marriage, and illegitimacy. These happen in about 1% of cases per generation.
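To get a feel for what "about 1% per generation" adds up to over a long direct male line, here is a minimal Python sketch. It assumes (my simplification, not an established fact) that the rate is constant and that each generation is independent of the others; the function name is just for illustration.

```python
def prob_at_least_one_npe(generations: int, rate_per_generation: float = 0.01) -> float:
    """Chance of at least one NPE somewhere along a direct male line.

    Assumes a constant rate and independence between generations
    (a simplification); the 1% default is the rough figure quoted above.
    """
    if generations < 0:
        raise ValueError("generations must be zero or more")
    # Probability that NO NPE happened in any generation, then take the complement.
    return 1.0 - (1.0 - rate_per_generation) ** generations


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} generations: {prob_at_least_one_npe(n):.1%}")
```

On these assumptions the chance of an NPE somewhere in 10 generations is a little under 10%, which is why an unexpected Y-DNA result is not that unusual.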

Start off with the Y-DNA-37 test from FamilyTreeDNA and be sure to join any relevant surname or haplogroup projects. You can find these by searching for your surname on the FamilyTreeDNA website, which will give you a selection of relevant projects to join.

After reviewing the results of your Y-DNA-37 test, ask the Admins of projects you have joined for advice on what additional testing might be warranted. This could mean upgrading to 67 or 111 markers, or it could mean doing SNP marker testing. The Project Admins will advise.
If you want to explore your mother's surname, test her brother. He is the one who inherited the Y-DNA that goes with that surname.
You can research ANY surname in your family as long as you test the appropriate male cousin who bears that particular surname.

How do I connect with genetic cousins?

Y-DNA will connect you with genetic cousins with whom you share the same surname.
mtDNA will connect you with cousins on your mother’s mother’s mother’s line but this is the least useful of all 3 tests - because mtDNA mutates so slowly, even an exact match could mean a common ancestor several thousand years ago (rather than several hundred years ago in the case of Y-DNA).
atDNA will connect you with about 500-1000 cousins you never knew existed (if you have European ancestry). It gives you the most "bang for your buck". Most of them will be distant cousins, but you may spot a few familiar names in your list of matches. The majority will be unknown cousins who are related to you via unknown ancestors beyond your ancestral Brick Walls, or they will be "false positive matches" (particularly if the amount of DNA they share with you is small). You will have hours of fun (and I mean hours) trying to figure out how they are connected. This test has “Retirement Plan” written all over it. But there are two important questions to address:

who to test?
and which company to test with?

Who do I test?

Anyone. Anyone can do a DNA test. But it’s always a good idea to test the oldest members of the family first, for two reasons:

they will not always be around
they have more DNA from particular ancestors than you do
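To put rough numbers on the second reason: on average, the share of autosomal DNA inherited from any one ancestor halves with each generation. A minimal Python sketch of that arithmetic follows. These are expected averages only, as the actual amount inherited varies from person to person, and the function name is purely illustrative.

```python
def expected_share(generations_back: int) -> float:
    """Average fraction of autosomal DNA inherited from ONE ancestor.

    generations_back: 1 = parent, 2 = grandparent, 3 = great-grandparent, ...
    This is the expected (average) value; real inheritance varies around it.
    """
    if generations_back < 0:
        raise ValueError("generations_back must be zero or more")
    return 0.5 ** generations_back


if __name__ == "__main__":
    labels = ["parent", "grandparent", "great-grandparent", "2x great-grandparent"]
    for n, label in enumerate(labels, start=1):
        print(f"{label:<22} {expected_share(n):.2%}")
```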

Your mother for example would be a generation further back than you, and thus she will have twice the number of matches on your maternal side of the family compared to you … because she has twice as much “DNA from the maternal side of the family” - each generation loses 50% of the DNA from that side of the family, because only half of it is passed on from parent to child. So your Dad would only have (roughly) 25% of his DNA from his grandfather, you would have only 12.5%, and your son would have only 6.25%. The percentage inherited from any one specific ancestor roughly halves with each generation.
Testing yourself and a parent helps you isolate which side of the family your matches are from. So if you tested yourself and your Dad for example, any matches you both share in common have to be from his side of the family; and any matches that you have but he doesn’t, have to be from your mum’s side. Of course you could also test your mum to answer this same question, if she is still with us.
FamilyTreeDNA store the DNA samples for 25 years free of charge so this serves as a genetic legacy for future generations - could be important as the science of genetic genealogy progresses (and it has only been around for 15 years or so).
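The idea of testing yourself and a parent to sort your matches by side of the family can be sketched with simple set operations in Python. The match names below are made up, and the function is only an illustration of the logic described above. In practice a match could be related on both sides, or be a false positive, so treat the output as a first pass.

```python
def split_matches(my_matches, fathers_matches):
    """Sort my matches by whether my father also matches them.

    Shared with Dad  -> his side of the family.
    Not shared       -> assumed to be Mum's side (a simplification).
    """
    mine, dads = set(my_matches), set(fathers_matches)
    return {
        "paternal": sorted(mine & dads),
        "maternal": sorted(mine - dads),
    }


if __name__ == "__main__":
    # Hypothetical match lists, for illustration only.
    me = ["A. Farrell", "B. Gleeson", "C. Murphy", "D. Ryan"]
    dad = ["B. Gleeson", "D. Ryan", "E. Kelly"]
    print(split_matches(me, dad))
```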

Which company should I test with and how much does it cost?

there are 3 companies - FamilyTreeDNA, 23andme, and Ancestry.com. Each have their pros and cons.
Re Y-DNA: if you want to research your surname, then only FamilyTreeDNA offer an infrastructure for surname research. You would have to test with them if you wanted to join the Farrell DNA Project (for example). 23andme will tell you what Y-DNA haplogroup you belong to (useful for knowing your crude migration path out of Africa) but that’s it.
Re mtDNA: only FamilyTreeDNA and 23andme offer this test; Ancestry does not. It is of little use for genealogy. 23andme have it as part of their single test (you get Y-DNA, mtDNA, and atDNA all in one test) and FamilyTreeDNA offer it as a separate test. With FamilyTreeDNA, I would start with the mtDNAplus test because it is cheaper ($69) and may give you all the information you need.
Re atDNA: whichever company you test with, you should upload your atDNA data to Gedmatch (for free). Anyone can do this and it allows you to compare your data with that of people who have tested with other companies and who have uploaded their data to the Gedmatch website. This allows you to fish in 3 genepools instead of 1 (only partial pools in this instance because not everyone uploads their results to Gedmatch). Also, if you test with Ancestry, you should upload your data to FamilyTreeDNA (for $39) so you are fishing in 2 genepools instead of 1 (complete genepools in this instance).

23andme will give you a medical risk assessment as well as a ton of genetic cousins. However the medical component was suspended in the US by the FDA and only partially restored in 2015. You may get a more comprehensive range of medical data if you order the test via their outlets in Canada, the UK, Ireland, & Australia but you would need to ask them about this as the situation is likely to change. Also, if you are ordering from outside the country concerned, you may have to use a friend with an address there as a middleman (e.g. a friend with a Canadian address if you order via the Canadian outlet).
23andme give you all 3 DNA tests (Y, mt, and autosomal) for $199 in the US. It is more expensive than the other companies, and the Y-DNA and mtDNA tests give only limited results.
Ancestry just give you atDNA (no Y or mtDNA), usually for $99 although it can be $79 in their frequent Sales. There is very limited product support, no tools (such as a chromosome browser), and if you want to explore the results further you will need to upload to Gedmatch/FamilyTreeDNA. Also, no further testing is possible. The big advantage of Ancestry is that you can link your DNA results to your family tree and that will potentially allow you to compare your DNA with everyone else on Ancestry who has also done so. And many but not all have family trees ... so it can make finding the common ancestor a lot easier.
FTDNA (FamilyTreeDNA) store your sample for 25 years. Further testing can be done on the sample whenever you want (e.g. Y-DNA or mtDNA, or any future tests). Their atDNA test (called Family Finder) is $99, Y-DNA is $149 (for 37 markers; $129 if you buy it via a surname project) and mtDNA is $69 ($199 for the Full Mitochondrial Sequence, FMS).
I have tested with all 3 companies. I like FamilyTreeDNA the best and have had most success with them (i.e. my closest matches are on FamilyTreeDNA). My guess is that if you have Irish ancestry you will find most of your close matches on FamilyTreeDNA. If you have US colonial ancestry, you will probably find most of your matches on Ancestry.
The most cost-effective option for atDNA would be to test with Ancestry ($99), then transfer your results for $39 to FamilyTreeDNA, and upload them for free to Gedmatch. The cheapest option outside the US is FTDNA ($99).
There are also some other neat websites that offer very useful third-party functionality. DNAgedcom offers some tools, and I particularly like Don Worth’s ADSA spreadsheet, but historically it has only worked with FamilyTreeDNA results. They are developing it for Ancestry and 23andme.

So which test is best for you? Probably the atDNA test from FamilyTreeDNA or Ancestry if you are interested in general genealogy, or the Y-DNA-37 test from FamilyTreeDNA if you are specifically interested in exploring a particular surname.

Maurice Gleeson

April 2016

Monday, 25 April 2016

How to download your Ancestry DNA data and upload it to FTDNA

The DNA test from Ancestry is proving very popular and a lot of people are taking advantage of the low cost of the test, especially when there is a Sale on. Ancestry compares your DNA data to everyone in their database (over 1 million people at this stage) and gives you a list of your "matches" - many of these will be genetic cousins with whom you share a common ancestor who was born sometime in the last 250 years or so.

But you can get a lot more from your Ancestry DNA test if you upload the results to FamilyTreeDNA and take advantage of a second database of potential genetic cousins. FamilyTreeDNA (FTDNA) also have a database of close to 1 million people, and all these databases are expanding all the time. I have tested with both companies and (as of April 2016) I have 3671 matches with Ancestry and 828 matches with FTDNA. However, being Irish, I find that most of my closest matches are in the FTDNA database, not in the Ancestry database. So it is well worthwhile fishing in both pools.

The good news is that you can get your top 20 matches in the FTDNA database for free, and for a small fee ($39) you can reveal all your matches. So by paying the $39 transfer fee, you get to swim in two data pools for the price of 1.4 (the usual price of the Family Finder autosomal DNA test at FTDNA is $99).
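For anyone who wants to check the "price of 1.4" figure, here is the arithmetic as a small Python sketch. The $99 and $39 prices are the April 2016 figures quoted in these posts (the $99 Ancestry price is the usual non-Sale price) and will have changed since; the function names are only illustrative.

```python
ANCESTRY_TEST = 99   # usual Ancestry atDNA price in USD (April 2016)
FAMILY_FINDER = 99   # usual FTDNA Family Finder price in USD (April 2016)
TRANSFER_FEE = 39    # FTDNA fee to unlock a transferred Ancestry result


def price_ratio(base_price: float, extra_fee: float) -> float:
    """Total cost expressed as a multiple of the price of a single test."""
    return (base_price + extra_fee) / base_price


def saving_vs_two_tests() -> int:
    """Money saved by transferring instead of buying both tests outright."""
    return (ANCESTRY_TEST + FAMILY_FINDER) - (ANCESTRY_TEST + TRANSFER_FEE)


if __name__ == "__main__":
    print(f"Test plus transfer costs {price_ratio(ANCESTRY_TEST, TRANSFER_FEE):.1f}x one test")
    print(f"Saving compared with buying both tests: ${saving_vs_two_tests()}")
```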

Here's how you download your Ancestry DNA data and transfer it to FTDNA.

1. Log in to your Ancestry account and click on the DNA tab in the menu at the top.

2. Click on Settings (far right)

3. On the next page, in the box on the right, click on the Download Raw DNA Data button

4. Enter your password in the box indicated, and tick the check box below it ... then click on Confirm.

5. You will get a confirmation message asking you to check your email for a link to download your DNA.

6. Click on the green button in your email to confirm you want to download your raw DNA data.

7. This takes you to a new browser window where you will be asked to click on another green button (Download DNA Raw Data). Click on this (only once) and wait. This can take a while ...

... and be sure to note into which folder the DNA Raw Data file has been downloaded - you will need to access this folder in the next steps.

8. The next step is to upload the raw DNA data to FTDNA (FTDNA's instructions can be found here). Begin by visiting https://www.familytreedna.com/landing/atdna-landing.aspx.

9. New customers must enter their name and email address to get started. If you (or the person whose data you want to transfer) already have an FTDNA account, just click Already have a Family Tree DNA account? (Reminder: Each person needs to have their own kit number.) If you have an existing Y-DNA or mtDNA kit, be sure to sign in to it before launching the process.

Don't forget to click your gender and agree to the Terms of Service & Release Form.

10. Next, click Upload Raw Data to select and upload the raw data file from AncestryDNA from your computer. It is not necessary to unzip the file prior to uploading it.

The first round of results processing takes about an hour or so. An email notification will be sent to your registered email address after the raw data is processed. After the processing is complete, the free Family Finder experience includes:

your top matches
some Family Finder tools, such as the Chromosome Browser and the Family Finder Matrix. You can use the Chromosome Browser and Matrix with your matches to see the longest block and whether or not they’re an X match.

The free transfer has certain limitations compared to the full functionality you get for the $39 fee:

myOrigins is not included (genetic ethnic admixture estimates)
you cannot use the “in common with” feature
you cannot download any data
your results will not be seen by any of your matches

11. The price to unlock all of your matches and myOrigins results is $39. Once you’ve paid or unlocked your full results, it takes 3-5 business days for the full results to be completed.
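If you would like to check the downloaded file before uploading it in step 10, a short Python sketch like the one below can count the markers per chromosome without unzipping anything. It assumes the zip holds a single tab-separated text file with `#` comment lines, then a header row, then rows of rsid, chromosome, position, allele1, allele2. That is my understanding of the layout, not something documented here, so check your own file first. The function name and the file name in the usage line are hypothetical.

```python
import io
import zipfile


def count_markers(zip_source):
    """Count data rows per chromosome in the first file inside the zip.

    zip_source may be a path or a file-like object.
    ASSUMED layout: '#' comment lines, one header row, then tab-separated
    rows of rsid, chromosome, position, allele1, allele2.
    """
    counts = {}
    header_seen = False
    with zipfile.ZipFile(zip_source) as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                if not header_seen:
                    header_seen = True  # first non-comment line is the header
                    continue
                fields = line.split("\t")
                if len(fields) < 5:
                    continue  # ignore anything that is not a full data row
                chromosome = fields[1]
                counts[chromosome] = counts.get(chromosome, 0) + 1
    return counts
```

Usage would be something like `print(count_markers("dna-data.zip"))`, pointing at wherever your browser saved the file in step 7.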

Saturday, 23 April 2016

How to order the YFULL reanalysis of your Big Y data

To order the YFULL test, follow these step-by-step instructions:

1. first get hold of the link to your BAM file by following the instructions here …

2. then go to YFULL at https://www.yfull.com/

3. click on Order Now at the bottom

4. enter your details

check .BAM file and FTDNA Big Y and “Link to a sharing file” and enter the link you previously copied
Also tick “I am not a Robot” and the “agree to terms etc”
then click "Order now"

5. You should be notified of your results by email in about 4-6 weeks' time. At that point, you will be asked to pay for them. Payment will be in roubles as this is a company based in Russia.

6. Once you have the results, you should join any relevant group projects on YFULL so that Haplogroup Project Administrators can see your results. You can join projects by following these simple steps:

log in to YFULL with your username / ID number & password
click on YFULL in the top left corner
click on Groups toward the bottom of the menu list on the left (see below)
click on the appropriate group in The Group List
click on Join Request in the yellow menu bar at the top
follow the instructions

How to share the link to your Big Y BAM file

If you have done the Big Y test at FamilyTreeDNA, FTDNA provide the facility to download your raw data (in .vcf, .bed, and .BAM files) which allows you to have the data analysed and interpreted by a host of other resources (see this blog post for a selection of them).

Here is a step-by-step description of how you can download your .vcf and .bed files, and also get the link to your BAM file and share it with other people.

1. Firstly, just SIGN IN to your FamilyTreeDNA account with your kit number and Password.

2. Then click on your Big Y Matches ...

3. Then, click on the blue Download Raw Data button …

4. In the subsequent pop-up box, you will be given the option to submit a request to have your BAM file prepared for downloading. This may take a few weeks to prepare and you should be notified by email when it is ready.

Once the file has been prepared, repeat steps 1-3 above, and now when you click on the Download Raw Data button, you will get a pop-up box with green buttons at the bottom, like in the diagram below.

The "Download VCF" option will download the .vcf and .bed files to your computer, usually as a zipped file or with both files in a folder (about 1.7 MB).

The "Share BAM" option will allow you to copy a temporary link to your BAM file. The file itself is >600 MB in size and is far too big to be sent by email, so sending the link instead allows the recipient to download your file from the FTDNA website directly to their computer, without clogging up your mailbox or theirs.

You can also download your BAM file if you wish but this is >600 MB in size so it will take a while.

5. After clicking on Share BAM, click on the Copy button that appears next (this copies the link to your BAM file) …

6. Now simply paste the copied link into an email (e.g. to send it to James Kane or Alex Williamson) or the appropriate box on an application form (e.g. to send it to YFULL or FGC).
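If you are curious about what is inside the .vcf file you downloaded, here is a minimal Python sketch that tallies the variant records by their FILTER value. It assumes the file follows the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with header lines starting with `#`; I have not verified this against every Big Y file, and the function name and file name are hypothetical.

```python
from collections import Counter


def summarise_vcf(lines):
    """Count VCF records by the value in their FILTER column.

    lines: any iterable of text lines (e.g. an open file).
    ASSUMES standard VCF layout: tab-separated, FILTER is the 7th column,
    and header/meta lines start with '#'.
    """
    filters = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete record
        filters[fields[6]] += 1
    return filters
```

For example, `with open("variants.vcf") as handle: print(summarise_vcf(handle))` would show how many records carry each FILTER value.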

A Comparison of Resources for analysing your Big Y data

There are several very useful resources available to us for interpreting the results of our Big Y tests. Here is a brief summary of what we get and what we don't from each resource - all have their Pros and Cons and all add something to the overall interpretation of the results. Because this is a completely new area of science, and we are on the crest of the wave of scientific discovery, the different analyses from the different sources frequently produce different results, which in turn allow us to ask why and to refine our methodologies further. This will continuously change over time as we understand more, adjust our SNP declaring criteria, and refine our interpretation of the data. We can expect big changes to take place over the next few years.

The discussion below relates to my father's Big Y results and those of other members of Lineage II in the Gleason/Gleeson DNA Project.

FTDNA

The presentation of the Big Y results from FTDNA is currently quite limited. This is not surprising as they were the pioneers in this area and their first offering in terms of how the data is presented is now outdated. But this is due to change in the near future when they introduce their new Big Y features. What these are we don't yet know but we can expect exciting developments over the next few months. The march of progress carries on!

Currently we are given a list of close matches and the number and nature of Shared Novel Variants with each match (i.e. shared new SNPs), also Known SNPs that are not shared, and SNPs that are unique when comparing just two specific individuals. FTDNA also places us on their own version of the Haplotree (the human evolutionary tree) which tells us what SNPs lie at branching points above our own particular sub-branch.

Confusion arises from a number of different issues, some of them general points, some of them FTDNA-specific:

the separation of new ("Novel") SNPs from those already identified ("Known")
FTDNA's high threshold criteria for declaring a SNP mean that some SNPs are missed
often no SNP names are reported, only SNP positions - you have to go to YBrowse.org or YFULL to get specific information regarding the name of SNPs present at a particular location on the Y chromosome.

However, and most importantly, FTDNA provide the facility to download our raw data (in .vcf, .bed, and .BAM files) which allows us to have the data analysed and interpreted by a host of other resources. Details of how to access and share these files can be found here. The "Download VCF" option will download the .vcf and .bed files (about 1.6 MB) and the "Share BAM" option will allow you to copy a temporary link to your BAM file (which is >600 MB in size and so is far too big to be sent by email).

My current position on FTDNA's haplotree with details of SNPs tested
(green, positive; red, negative)

YFULL

YFULL gives us a more detailed analysis of BAM files and places us on their Y-Haplotree in relation to other people nearby (i.e. who have undergone NGS [Next Generation Sequencing] testing, like the Big Y). It also identifies our terminal SNP (or SNP block), the SNPs at branching points further upstream (and hence the Shared SNPs we have with our neighbours), and the unique / personal / private SNPs that each member possesses (currently).

Over and above FTDNA's analysis, YFULL tells us the following:

how many people on adjacent branches have tested and where they are from
SNP names and any "equivalent SNPs" (i.e. exactly the same position on the Y chromosome but alternative name)
time estimates for the formation of each SNP (and hence the particular branching point)
TMRCA estimates for the people in each sub-branch (with 95% Confidence Intervals)
information on each SNP including position on the Y, ancestral and derived values, alternative names, and reference sequence. This information can be supplemented by YBrowse.org
easier-to-understand presentation of currently unique (personal) SNPs with an estimate of their "quality"
data on about 500 STRs including the majority found in FTDNA's 111 marker panel - this can be helpful in calculating TMRCA estimates and narrows the 95% range around the estimate (compared to TMRCAs based on 111 marker data)

The Gleeson Lineage II portion of the YFULL haplotree

Full Genomes (FGC) Analysis

The FGC analysis of BAM files is comparable to the YFULL analysis but their public Y-Haplotree is not user-friendly and is of limited utility. FGC generates the following reports with the underlying files (processed BAM file, mtDNA, and STRs):

A detailed analysis of called variants report
A variant genotyping report
Haplogroup classification
Y-STR report, and
mtDNA report

clarifYDNA Analysis

clarifYDNA will reanalyse your Big Y data for $30 and produce a Y-DNA haplotree report from your results - this will be periodically updated as new data becomes available from other testers. Reports are in sync with a recent version of the ISOGG haplotree, and are able to indicate which aspects of the phylogenetic structure are robust and which are more tenuous - thus it combines aspects of the Y-haplotree that are both "established" and "provisional / experimental".

Unfortunately the tree is only available to subscribers and is not available to the public.

Click here for an Example analysis.

Haplogroup Project Administrators

The Administrators at the Z255 Haplogroup Project, and indeed the Admins of more upstream Haplogroup Projects (e.g. L21, R1b & subclades, etc) are an incredible resource and their respective Yahoo Discussion Groups are great places to post questions and get replies.

John Murphy puts together a regularly updated spreadsheet / haplotree for the Z255 group which has an advantage over YFULL's analysis - it incorporates new SNP discoveries from specific SNP Packs and single SNP testing and not just from NGS testing (Big Y, FGC).

The Gleeson Lineage II portion of John Murphy's spreadsheet

Alex Williamson's "Big Tree"

Alex is one of the most important people in the R1b research community and is a champion of data analysis, interpretation, and (most importantly) presentation. His Big Tree website (www.ytree.net) is a masterful display of complex data in a digestible format. He places us on his haplotree so that we can see our terminal SNP (block), SNPs at upstream branching points, and our neighbours on adjacent branches.

Advantages over YFULL include:

The Big Tree gives us our neighbours' names and places of origin, thus making it easier to form an impression of where a particular sub-branch might have formed and if it is specific to a particular surname.
Easy to navigate with lots of additional information by simply clicking on a surname or a SNP.
His graphics are superb.
His Mutation Matrix allows us to see which SNPs are shared and which SNPs are not between us and our closest neighbours.
His presentation of unique (personal) SNPs gives us not only an estimate of "quality" but also the region of the Y-chromosome in which they are found (this can be useful in judging if this is a true SNP and also how easy it would be to test for it in a bespoke SNP Panel).

The (current) 4 branches of the Gleeson Lineage II on the Big Tree

Nigel McCarthy's Z255 Subgroup

Nigel is another pioneer. He is one of the first people to combine SNP markers and STR markers into a single tree. We are lucky enough in the Gleeson Lineage II group that we are closely related to some of the people in Nigel's McCarthy DNA Project. As a result, Nigel has included us in the Z255 portion of his phylogenetic tree (Group E).

Nigel's own SNP analysis is complementary to the ones above and he too will occasionally discover new SNPs that others have not included in their analyses.

Major advantages over previous analyses include:

As well as the SNPs, he also presents STR data and the change in STR values at each branching point
He includes people who have not been tested on the Big Y (i.e. anyone with Y-DNA-67 or Y-DNA-111 results). As a consequence, his portion of the haplotree contains more Gleesons from Lineage II than any other tree - 12 members altogether (compared to 9 members on Alex's tree, 9 on John Murphy's, and 6 on the YFULL tree).
From Nigel's analysis it is possible to see where Back Mutations and Parallel Mutations occur in the STR markers.

The Gleeson Lineage II members in Nigel McCarthy's Group E of his McCarthy DNA Project

Mike W’s Haplotype Data for R1b-L21

Mike is an administrator of several FTDNA projects and has long been a leader in the genetic genealogy community. He maintains a very comprehensive spreadsheet that can be downloaded from the Links section of the R1b-L21 project Yahoo group or a smaller version from the Z255 Yahoo group. This spreadsheet collects the STRs for 67 and 111 markers, and SNPs from the Big Ys or other sources. A user can calculate his genetic distance in relation to the complete database and he can infer his haplotype according to the most common haplotype of his closest matches. The spreadsheet also calculates the group mode and several statistics required to characterize a particular group.
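To illustrate the two ideas mentioned here, genetic distance and the group mode, the following is a minimal Python sketch. It uses a simple step-wise count (the sum of the absolute differences at each shared marker), which is a simplification and is not necessarily how Mike's spreadsheet or FTDNA count it, particularly for multi-copy markers. The marker values are made up for illustration.

```python
from collections import Counter


def genetic_distance(a, b):
    """Simple step-wise distance: sum of absolute differences
    over the STR markers that both haplotypes have."""
    shared = a.keys() & b.keys()
    return sum(abs(a[marker] - b[marker]) for marker in shared)


def modal_haplotype(haplotypes):
    """Most common value at each marker across a group of haplotypes."""
    markers = set().union(*(h.keys() for h in haplotypes))
    modal = {}
    for marker in markers:
        values = [h[marker] for h in haplotypes if marker in h]
        modal[marker] = Counter(values).most_common(1)[0][0]
    return modal


if __name__ == "__main__":
    # Hypothetical three-marker haplotypes, for illustration only.
    group = [
        {"DYS393": 13, "DYS390": 24, "DYS19": 14},
        {"DYS393": 13, "DYS390": 25, "DYS19": 15},
        {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    ]
    mode = modal_haplotype(group)
    for member in group:
        print(genetic_distance(member, mode))
```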

James Kane's SNP Matrix

I am new to James' SNP Matrix but it too is a work of art, a magnum opus, not surprising for a 90 MB spreadsheet. Yet again, James' approach to NGS data analysis offers a fresh perspective and can detect possible/probable SNPs that have not turned up in other analyses. Having multiple analyses and interpretations of the same data is a great advantage - it allows us to see points of agreement and points of difference in the various approaches, and ultimately helps us to question the data more intelligently which in turn will lead to better analysis and interpretation.

James’ matrix compares the SNPs of all the participants while the other methods preselect the relevant SNPs. This capability is very important for the identification of new potential SNPs. These can be checked against other analyses for consistency or disagreement. Additionally, it is possible to evaluate unique SNPs or SNPs that belong to a particular group with different levels of quality or if they are part of the combBED area. It also provides positions for both the build 37 (GRCh37) and 38 (GRCh38) human reference genome sequence.

The matrix workbook requires BAMs for inclusion. The scripts visit each file for every variant location and output the read depth into a very large combined VCF file. The idea is to remove the ambiguity of BED files. The old HTML-based pages did include everything, but became unwieldy. They are being replaced soon.
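The general idea of checking every sample at every variant location can be sketched in a few lines of Python. To be clear, this is not James' actual workflow or code: the data structure, the `min_depth` threshold, and the sample data are all assumptions of mine for illustration. The point is simply that knowing the read depth lets you tell "tested and ancestral" apart from "not covered at all".

```python
def build_call_matrix(samples, min_depth=2):
    """Build a position-by-sample matrix of calls.

    samples: {sample_id: {"variants": set of positions with a derived call,
                          "depth": {position: number of reads covering it}}}
    min_depth: ASSUMED minimum read depth needed to trust a negative result.
    """
    # Every position where at least one sample shows a variant.
    positions = sorted(set().union(*(s["variants"] for s in samples.values())))
    matrix = {}
    for pos in positions:
        row = {}
        for sample_id, data in samples.items():
            if pos in data["variants"]:
                row[sample_id] = "derived"
            elif data["depth"].get(pos, 0) >= min_depth:
                row[sample_id] = "ancestral"   # covered, but no variant seen
            else:
                row[sample_id] = "no call"     # not enough reads to say
        matrix[pos] = row
    return matrix


if __name__ == "__main__":
    # Hypothetical data for two testers.
    samples = {
        "A": {"variants": {100, 200}, "depth": {100: 12, 200: 8, 300: 5}},
        "B": {"variants": {100, 300}, "depth": {100: 9, 300: 7}},
    }
    for pos, row in build_call_matrix(samples).items():
        print(pos, row)
```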

James also has a blog site and an Experimental Y Tree (currently being updated) with SNP names & their equivalents, surnames, places of origin, and TMRCA estimates.

Everyone who has done the Big Y test should send James a link to their BAM file so he can include them in his analysis. This looks at the data from yet another perspective and helps further with the interpretation. It should help clarify the discoveries from other sources and may even identify some additional SNPs.

Below are the instructions and an explanation that James has put together about sending him your BAM file for analysis and what will happen after that:

What’s needed?

A link to your Big Y BAM file using the “Share BAM” button on your Raw Results page. Let me know if detailed instructions would be helpful. Please also include this statement in the email:

As the owner or administrator of FTDNA kit#, [YOUR KIT#], I consent to allow analysis of the Y DNA contained in the provided BAM file. The results of this analysis may be used in the phylogenetic tree of haplogroup R or by independent researchers for scientific purposes.

What will be done?

Your BAM will be downloaded and realigned to GRCh38. This will allow a new VCF/BED to be created and compared with others. Results will be included in http://www.it2kane.org/matrix/R-P312.html. When sufficient analysis is available for the branches, it will be possible to include time to most recent common ancestor estimation based on these results. The new VCF/BED will be provided to those interested.

What won’t be done?

Unlike the commercial 3rd party analysis, you won’t get mtDNA, STR value estimates, or variant naming any time in the near future.

See below for my data use policy.

Raw Data Policies

In light of the recent dust-up between FTDNA and another 3rd party site, I have codified my data usage policies.

1. VCF/BED files submitted for analysis are made available for other R-L21 researchers using the R1b-L21(S145) Haplogroup and Subclades Y DNA forum hosted on Yahoo. This aids researchers to correctly assign variants to their related haplogroups.

2. Raw BAM data is retained in a password protected cloud storage account. The project recognizes there is a low probability that files may contain data not actually on the Y chromosome, which may reveal medically relevant information about the participant. BAM files may be individually shared with qualified researchers and analysts only after approval of the sample’s owner.

3. GRCh38 aligned versions of variant calls and BED coverage generated by the project’s bio-informatics workflow can be shared with researchers without the sample owner’s explicit authorization.

4. FTDNA kit #’s are displayed for convenience of related surname projects or haplogroups in all reporting. As the identifier is used to log into the FTDNA account this has security implications for the kit owner. Project members may request that reporting on tree or call matrix reports use an internal project ID instead.

5. Project members have the right to request that their raw data is removed from reporting at any time, but shared variants in the tree will be retained.

Some Closing Thoughts

This is a new science and we are still trying to get to grips with it. The pithy saying "Many hands make light work" applies quite nicely in this situation. It is only by looking at the data from a variety of different perspectives that we can hope to understand it better, and quickly. So we should be using all of the above utilities to analyse and interpret our Big Y results. This process (which previously would have taken decades to complete) can now be accomplished in a matter of years thanks to what is effectively a crowd-sourcing approach - a group of citizen scientists working together toward a common goal and using the internet to communicate and collaborate effectively.

There is still a lot of testing to be done - we need more people to do the NGS tests (Big Y, FGC tests, etc). And we need clever people to develop more tools for analysis, interpretation, & presentation of the data. But as this critical mass of people tested builds, and as our ability to analyse and interpret and present the data improves, we will begin to reap greater and greater dividends.

Software packages are being developed to help build combination family trees using SNP data, STR data, and standard genealogy. Already you are able to add DNA markers to your Family Tree on Ancestry. This will advance even further and trees will start to be linked online via downstream SNP markers.

Furthermore, for Irish genealogies at least, we will be able to link some of our family trees to the Ancient Annals and Genealogies, bringing us back to before the time of surnames, back to 900 AD, 800 AD, 700 AD, and even further.

In a few years, when we look back at this time in human history, we will be able to say ...

I was there.

I contributed to that.

I helped build the Evolutionary Tree of Mankind.

And I know exactly where I sit on it.

Maurice Gleeson
German Creamer
Lisa Little

April 2016

Tuesday, 5 April 2016

Chromosomes, Markers & Evolutionary Trees

This blog post (from the Gleason/Gleeson DNA Project blog) discusses some of the basic principles behind Y-DNA testing, the different types of DNA, the different types of marker, and their practical applications, including the construction of evolutionary trees.

Let's recap on some of the basic science behind Y-DNA as this will help you understand what you are seeing when you look at your results, and how your results can be applied in practice.

Chromosomes - a closer look

We have 46 chromosomes, arranged in 23 pairs. Each pair has 2 copies, one of which you got from your mother, the other from your father. So for example, you have one paternal chromosome 14 and one maternal chromosome 14. Before you were conceived, your father made a copy of each of his 46 chromosomes but only passed on one copy from each pair to you. Similarly your mother made copies of all her 46 chromosomes but only passed on to you one copy from each pair. In this way the 23 chromosomes you got from your father combined with the 23 from your mother to bring your chromosome quotient back up to the usual 46.

The 23rd pair is also known as the sex chromosomes. There are two types of sex chromosome - an X and a Y. At conception, if two X chromosomes combine, a female child is produced (XX). If an X and a Y chromosome combine, a male child is produced (XY). Women (XX) only have an X chromosome to pass on to their offspring, whereas men (XY) can pass on either an X or a Y to their offspring. Therefore the man's contribution decides the gender of the child. Women do not have a Y chromosome and so cannot do this particular DNA test.

Thus the Y chromosome is only passed on from Father to Son. This is why it is perfect for tracing the father's father's father's line and is the main type of DNA used for surname studies. Be aware though that it only assesses this single ancestral line, and if you go back 10 generations, this represents only 1 of your 1024 ancestors (which is equivalent to about 0.1% of your ancestors at that particular level).
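The arithmetic behind that figure can be checked with a short sketch. The function names below are my own and purely illustrative. The calculation counts ancestor "slots" in a pedigree, and assumes no pedigree collapse (i.e. no ancestor appearing in more than one slot).

```python
def ancestors_at_generation(generations: int) -> int:
    """Number of ancestor slots exactly `generations` steps back.

    1 generation back = 2 parents, 2 back = 4 grandparents, and so on.
    Assumes every slot is a distinct person (no pedigree collapse).
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 2 ** generations


def paternal_line_share(generations: int) -> float:
    """Fraction of that generation covered by the single father's-father's line."""
    return 1 / ancestors_at_generation(generations)


if __name__ == "__main__":
    n = 10
    print(ancestors_at_generation(n))        # 1024
    print(f"{paternal_line_share(n):.2%}")   # 0.10%
```

So at 10 generations the Y-DNA line accounts for 1 slot in 1024, or roughly 0.1% of the ancestors at that level.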

Each of our 46 chromosomes consists of a long double-stranded helix of DNA. If we unwrapped it, it would look like a long ladder extending into infinity, or a railway track running from New York to Los Angeles. It's huge. If you untwisted all 46 chromosomes from a single cell, it would stretch for 2-3 metres (6-10 feet). All the untwisted DNA from the human body would stretch to the moon and back several times.

All along the "ladder" are the nucleotide bases, like rungs in the ladder, binding each strand of the helix to the other strand of the helix. The bases are called A, T, C, and G, after the first letters in their respective names - Adenine, Thymine, Cytosine, & Guanine. A only ever binds with T, C only ever binds with G. You can remember this by thinking the straight-sided letters only bind to each other, and the curved letters bind only to each other. Each base pair effectively forms a rung in the ladder.

Because A only ever binds with T, and C only ever binds with G, if we know the sequence of bases on one strand of the helix, we automatically can tell what bases are on the other strand. Therefore, the sequence of bases along the DNA is only ever written as a single line of letters (e.g. ATCCGAATTGG). The sequence is read from what is called the 5' (5 prime) end of the DNA molecule (and is read toward the 3' end, like reading from left to right).
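Here is a minimal sketch of how the pairing rule lets you work out one strand from the other. The function names are my own. Because the two strands run in opposite directions, the partner strand read from its own 5' end is the reverse of the simple base-by-base complement.

```python
# Pairing rule from the text: A binds with T, C binds with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that pairs with each base, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand as it would be read from its own 5' end."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    example = "ATCCGAATTGG"  # the example sequence used above
    print(complement(example))          # TAGGCTTAACC
    print(reverse_complement(example))  # CCAATTCGGAT
```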

In each pair of chromosomes, the two copies (maternal and paternal) are virtually identical to each other in terms of size, length, morphology, etc. The exception is the sex chromosome pair, X and Y ... the X chromosome is 3 times bigger than the Y chromosome.

Although each chromosome in a pair is virtually identical, there are subtle differences between the nucleotide bases that run along the entire length. These variations in the bases are called mutations and can be identified because they occur at specific locations along the chromosome. These locations where mutations occur are referred to as DNA "markers". Each marker can be identified because it occurs at a specific position along the chromosome and thus can be given a particular name (e.g. DYS390 or Z255). People who share the same mutation may have inherited it from a shared Common Ancestor, and this is why DNA can be so helpful for genealogy.

A note on terminology: Y-DNA refers to the Y chromosome. Autosomal DNA refers to all the chromosomes EXCEPT the last pair (Pair 23, the sex chromosomes, X and Y - all the other chromosomes are called autosomes, hence autosomal DNA). Mitochondrial DNA refers to the DNA found in mitochondria (the "batteries" that power each cell). For a more detailed introduction to the three types of DNA test and how they are applied in genealogy, watch this YouTube video here.

The different types of DNA marker

There are two types of DNA marker - STR markers and SNP markers.

STR stands for Short Tandem Repeat and the key word here is "repeat". An STR marker is a sequence of bases repeated many times (e.g. CATCATCATCAT). In this example, the sequence is CAT and the repeat value of the sequence is 4. When the DNA is being copied before being passed on to any offspring, there are occasional mistakes made in the copying process. So for example, a copying mistake in the CAT sequence above might result in 3 repeats instead of 4, and so the value of that marker may shift from 4 in the parent to 3 in the offspring. This may be the first mistake to be made in this particular marker for many generations, and so not only will the male child differ from his father, grandfather, and great grandfather, but also from all his male siblings and cousins, who will all have a value of 4 for this particular marker.
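To make the idea of a "repeat value" concrete, here is a small illustrative sketch that counts the longest uninterrupted run of a motif in a sequence. It is a toy example only. It is not how a testing lab actually calls STR values, and the function name is my own.

```python
def count_repeats(sequence: str, motif: str) -> int:
    """Return the length (in repeats) of the longest unbroken run of `motif`."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Keep stepping forward one motif at a time while the motif repeats.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


if __name__ == "__main__":
    father = "CATCATCATCAT"  # 4 repeats of CAT, as in the example above
    son = "CATCATCAT"        # a copying mistake has dropped one repeat
    print(count_repeats(father, "CAT"))  # 4
    print(count_repeats(son, "CAT"))     # 3
```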

The second type of DNA marker is the SNP marker. SNP stands for Single Nucleotide Polymorphism, and the key idea here is "substitution" - a single base at a specific location changes from what it normally is to a different base (e.g. an A changes to a C or a T or a G). Whereas an STR marker involves several bases in a row, an SNP marker involves the substitution of only a single base.
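As a rough illustration of a single-base substitution, the sketch below compares two sequences of equal length and lists the positions where the base differs. The sequences and the function name are made up for the example. Real SNP calling works on aligned sequencing reads and is far more involved.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, sample_base) wherever the two sequences differ.

    Positions are counted from 1. Both sequences must be the same length,
    because a substitution changes a base without adding or removing any.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    reference = "ATCCGAATTGG"
    sample = "ATCCGCATTGG"  # the A at position 6 has been replaced by a C
    print(find_substitutions(reference, sample))  # [(6, 'A', 'C')]
```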

Kelly Wheaton has written some excellent blog posts about DNA markers on the Y chromosome. You can read them by clicking here - STR markers & SNP markers.

There are some very important characteristics of STR and SNP markers which are key to understanding how they are applied in surname studies:

Mutations in STR markers are written as the value of the marker (e.g. 12) whereas mutations in SNP markers are given names (e.g. Z255) or are written as the location on the chromosome followed by the change that occurred in the bases there. For example, 17349992 (G>A) indicates that a G has been replaced by an A at position 17349992.
The mutation rate of STR markers varies from marker to marker. Some mutate relatively quickly (e.g. 1 mutation every 5 generations) whilst others mutate very slowly (e.g. 1 mutation every 500 generations). Mutations in slow-mutating markers are very good for studying human migration, whereas mutations in fast-mutating markers can be very useful for genealogy research (in the last 500 years or so).
A big problem with STR markers is that they can mutate back as well as forward. So for example an STR marker may have a value of 4 which changes to a 3 and then back to a 4. The first mutation (4 to 3) may have occurred 1000 years ago, and the second one (3 back to 4) may have occurred 300 years ago. The trouble is that the Back Mutation masks the fact that there was a significant mutation 1000 years ago and this may result in people with the 4 value being assigned to the wrong branch of the human evolutionary tree and hence the wrong family tree!
Another problem with STR markers is the Parallel Mutation. This happens when two very separate branches of the same family experience the same mutation "in parallel", giving the impression that the two branches are more closely related than they actually are in reality.
A further problem with STR markers is that it is very difficult to identify a Back Mutation or a Parallel Mutation, and as a result we don't know how often they occur. We suspect that they happen fairly frequently; a marker value may mutate back perhaps as often as it mutates forward. We really don't know. But such "hidden" mutations may seriously confound our interpretation of the data and may result in people being placed on the wrong branches of the human evolutionary tree.
Convergence is the name given to the situation when Back Mutations and Parallel Mutations on STR markers result in people appearing to be more closely related to each other than they actually are. This is a big problem when comparing people at 12 markers, but less of a problem when comparing at higher numbers of markers (e.g. 37, 67, or 111). However, even at 67 markers significant Convergence has been detected.
On the other hand, SNP markers mutate much more slowly. And because there are so many of them, Back Mutations and Parallel Mutations are extremely rare (and easily spotted). For this reason, when using DNA markers to place people on the human evolutionary tree, SNP markers trump STR markers i.e. more reliance is given to SNP markers than to STR markers.
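The position-and-change notation for unnamed SNP markers described in the list above, e.g. 17349992 (G>A), is simple enough to read with a few lines of code. This is only a sketch of that one written format. The class and field names are my own, and other reports may format the same information differently.

```python
import re
from typing import NamedTuple


class SnpChange(NamedTuple):
    position: int      # location on the chromosome
    original: str      # base that was there before
    replacement: str   # base that replaced it


# Matches text such as "17349992 (G>A)"; spacing around the symbols is optional.
_PATTERN = re.compile(r"^\s*(\d+)\s*\(\s*([ACGT])\s*>\s*([ACGT])\s*\)\s*$")


def parse_snp(text: str) -> SnpChange:
    """Parse 'position (X>Y)' notation into its three parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in 'position (X>Y)' form: {text!r}")
    position, original, replacement = match.groups()
    if original == replacement:
        raise ValueError("a substitution must change the base")
    return SnpChange(int(position), original, replacement)


if __name__ == "__main__":
    change = parse_snp("17349992 (G>A)")
    print(f"{change.original} replaced by {change.replacement} at position {change.position}")
```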

Y-DNA, Population Migration, & the Human Evolutionary Tree

Because the Y chromosome is passed on virtually unchanged from father to son, and because mutations in the DNA markers along the Y chromosome happen relatively infrequently, it is also an extremely useful tool for studying the last great human migration out of the African Motherland (about 50,000 years ago) that ultimately led to the populating of the entire planet. There is an excellent interactive animation of human migration here, including the various ice ages and the catastrophic eruption of the Mount Toba volcano that almost destroyed Mankind.

Population geneticists have been studying the evolution of mutations on the human Y chromosome (and on mitochondrial DNA) for many years and have developed an evolutionary tree based on these mutations (called the Haplotree). They refer to each of the major branches of the tree as Haplogroups and have named them after the letters of the alphabet (e.g. Haplogroup R, or its subgroup Haplogroup R1b). You can think of a Haplogroup as a group of people with a broadly similar genetic signature.

As modern humans moved around Africa and then moved out of Africa and spread to different places around the world, the humans who moved to Europe developed a totally different set of mutations to those humans who moved to India or Australia (for example). Thus certain haplogroups are found more commonly in Europe (e.g. R1b, I2b) than in India (e.g. H, L) or Australia (e.g. C, T).

Furthermore, genetic genealogy is a very young science, and more markers are being discovered all the time (thanks to novel tests like the Big Y test from FTDNA). As a result, scientists are still discovering finer and finer sub-branches of the human evolutionary tree, and we are approaching the point where we will discover the finer branching patterns associated with individual surnames (such as those in the Gleason/Gleeson DNA Project).

The old nomenclature for the various branches of the tree used a long string of letters (e.g. R1b1a2a1a2c1e) but this has been superseded by a system that simply puts the main Haplogroup letter followed by the "terminal SNP" (e.g. R-Z255). You can still see both terminologies in use on the ISOGG tree.

The terminal SNP refers to the SNP marker that currently occurs at the end of a branch. The word "currently" is important because as new SNP markers are discovered the current terminal SNP marker is likely to be replaced with a new one, and we will continue to move further and further down the finer branches of the tree until we identify SNP markers that are specific for your own family branch and even single individuals.

This will eventually allow us to reconstruct family trees based on DNA marker mutations. These are sometimes called phylogenetic trees, sometimes cladograms or phylograms, but my favourite is Mutation History Trees because it sounds similar to Family History Trees. The difference between the two is that Family History Trees are constructed using named individuals, whereas Mutation History Trees use DNA markers. It should be possible to superimpose one upon the other and in this way we can look "beyond the Brick Wall" of individual pedigrees and see where different family branches are likely to connect. This in turn will help focus further documentary research.

There are various groups working on the human evolutionary tree and each has produced its own version of the haplotree:

The YCC Haplotree is produced by the Y-Chromosome Consortium. This is an academic effort and it is frequently out of date, being surpassed by the ISOGG tree which is updated much more frequently and harnesses the continuous output of genetic genealogists working on Haplogroup Projects (such as the R-Z255 & Subclades Project to which all members of Lineage II in the Gleason/Gleeson DNA Project belong). The most recent update of the YCC tree is from March 2015 but the tree itself is not user-friendly.
The ISOGG tree is the result of the efforts of ISOGG (the International Society of Genetic Genealogy), which co-ordinates the analysis and interpretation of the findings from various Haplogroup Projects and as a result has developed a much larger tree than the YCC Tree. It too is quickly out-dated as the pace of new SNP marker discovery advances and further sub-branches are discovered. Lineage II members can click here and search (Cmd+F or Ctrl+F) for Z255 to see where this particular sub-branch sits on the main Haplogroup R branch.
Several of the commercial companies have developed their own haplotrees which at times may be more advanced than the ISOGG tree, and at times less advanced:

FTDNA tree - this can be accessed from the Haplotree & SNPs page of your personal FTDNA webpage
YFULL Experimental Tree - YFULL is a company that offers SNP testing and will interpret the results of SNP testing carried out by other companies. This tree is relatively easy to navigate but again requires use of the Find function (Cmd+F or Ctrl+F).
FGC tree - like YFULL, FGC (Full Genomes Corporation) also offer SNP testing and interpretation. The visual presentation of the tree is not easy to navigate.

Haplogroup Project Administrators work at the coal face of scientific discovery in relation to the finer branches of their own particular haplogroup project. The R-Z255 & Subclades Haplogroup Project updates its draft tree periodically as new member results come in to the project. You have to sign up to the project to access these updates but here is the most recent update as of July 15th (for members only). It is important to appreciate the pivotal role that Haplogroup Project Administrators are playing in the ongoing discovery of the finer branches of the tree. Surname Project Admins will work closely with Haplogroup Project Admins to advise their project members regarding which tests to take next and why.
Alex Williamson's "Big Tree" is a tree that specifically focuses on the Haplogroup R-P312 branch of the human evolutionary tree (of which Z255 is a subgroup). Alex has done incredible work placing newly discovered SNP markers in their best estimated position on the tree, and most importantly for us, creating a visual representation that is easy to navigate and makes the current state of the tree so much more understandable. The members of Lineage II feature here too, in the Z255 subsection. There are two interesting features to Alex's tree:

if you click on the name of any individual, an analysis of their unique genetic signature comes up. Here is the analysis for member N74958 showing his position on the tree, his unique mutations, and his putative haplotype progression (i.e. the estimated progression of his mutations from previous ancestors).
the Overlay STR Feature allows you to compare the results for all STR markers (one by one) across the whole group. Here it is for DYS439.

Nigel McCarthy runs the McCarthy DNA Project and has pioneered the development of phylogenetic trees based on a combination of SNP and STR markers. Luckily for us in Lineage II, one particular area of his research is also focussed on the Z255 subclade to which we belong (Group E in his project). We'll be talking a lot about Nigel's work in due course as it is particularly relevant to the next steps in the DNA Project for Lineage II members.

The portion of Alex Williamson's "Big Tree" that deals specifically with members of Lineage II

You may have to read this several times before a lot of the information sinks in but stick with it - it's worth it! Knowing the basics behind the science of Y-DNA and how it can be applied will help you understand a lot of the discussion about SNP testing and Big Y results that will follow in subsequent posts.

Maurice Gleeson

30 July 2015

Friday, 1 April 2016

Getting the Most out of your Y-DNA Test

One of the most important responsibilities of Project Administrators is helping our project members to get the most out of their Y-DNA results. However, we commonly find that project members forget to put in basic information about their MDKA (Most Distant Known Ancestor), inadvertently leave their results as "Private" so that no one can see them, or do not include any information about their pedigree.

Here is a blog post I wrote recently for my Farrell DNA Project. It lists a series of actions that members can do to get the most out of their results, thus helping to ensure that their money is well spent. Please feel free to adapt it and use it for your own surname project if you wish.

No matter which test you have taken, there are a few essential actions you should take to get the most out of your Y-DNA test. You may not be able to do all of them all at once, so come back to this page often and check it out again to see if there is anything else you could be doing to maximise the value you get from your Y-DNA test.

You may wish to share the link to this page with any potential recruits so that they can see what they will get if they test.

Make yourself visible to your cousins

1) Add your Farrell Ancestral Line as a Comment on our Post your Pedigree Page. This will potentially help other people to connect with you. It would help if you could provide it in the following format:

1) James FARRELL b c1835 Ardagh, Co.Longford, d 12 Nov 1879 Keenagh, Co. Longford, m 13 Apr 1860 Maria COYLE, Keenagh, Co. Longford
2) Mortimer FARRELL b 1861 etc ... (birth, death, marriage details)
3) Abigail FARRELL b 1890 ... (as above)
4) Francis KENNEDY b 1914 … as above, but not including dates for a) births <100 years ago, b) marriages <75 years ago, or c) deaths <50 years ago
Researcher: (insert your initials here)
Your email address
DNA Kits: (insert your DNA kit numbers)
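The date-privacy rule in the format above (leave out dates for births less than 100 years ago, marriages less than 75 years ago, and deaths less than 50 years ago) can be expressed as a small check. This is only a sketch. The function name is my own, and comparing by calendar year rather than exact date is my own simplifying assumption.

```python
from datetime import date
from typing import Optional

# Minimum age (in years) an event must have before its date is posted.
PRIVACY_YEARS = {"birth": 100, "marriage": 75, "death": 50}


def should_include_date(event: str, event_year: int, today: Optional[date] = None) -> bool:
    """Return True if the date of this event is old enough to post publicly.

    Assumption: ages are compared by calendar year only.
    """
    if event not in PRIVACY_YEARS:
        raise ValueError(f"unknown event type: {event!r}")
    current_year = (today or date.today()).year
    return current_year - event_year >= PRIVACY_YEARS[event]


if __name__ == "__main__":
    as_of = date(2016, 4, 1)
    print(should_include_date("birth", 1861, as_of))  # True: more than 100 years ago
    print(should_include_date("birth", 1950, as_of))  # False: less than 100 years ago
```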

2) Use your kit number and password to Log in to your personal webpage and explore it. There are a lot of bits & pieces of information you can include on your personal webpage that will optimise your chances of successful collaboration with your DNA matches. And knowing what your DNA results can tell you will help you get the most out of them.

3) You should add your MDKA information (Most Distant Known Ancestor) including dates & locations for both birth and death. The format we recommend is the same as the one above, but you may have to abbreviate it as only a certain number of letters are allowed in this field. Location of birth is the most important piece of information. Here is an example:

James FARRELL b1835 Kilcullen, Co.Kildare, d1879 New York

To add this information, simply click on your name in the top right of your homepage - Account Settings - Genealogy - Most Distant Ancestors ... I have posted instructions on how to do this here.

4) Add your Ancestral Surnames (click on your name in the top right - Account Settings - Genealogy - Surnames). I suggest putting SURNAMES in capital letters and Locations in normal text, as this makes the surnames "jump out" and makes them easier for the reader to scan through.

5) Upload your Family Tree as a GEDCOM file so that you have a version of your family tree on your FTDNA webpages. This is particularly important if you have done a Family Finder test (autosomal DNA). You can also add your Family Tree manually if it is easier for you. And if you have a Family Tree online, leave a link to it in the About Me section of your Personal Profile. Click here for specific instructions on uploading a GEDCOM file - https://www.familytreedna.com/learn/ftdna/how-to-family-tree/

6) Optimise your Privacy settings so that your potential cousins can see your results:

Hover over your Name in the top right
Click on Account Settings, then the Privacy & Sharing tab at the end of the menu bar above
Then simply change the settings under My DNA Results by clicking on the words "Project Members" at the end, and on the next screen checking the box beside "Make my mtDNA & Y-DNA data public". Then press Save.

Before the change

After the change

Check out Project-related Resources

There are a lot of resources that are particularly relevant to the Farrell DNA Project and you should check out and use these as you feel appropriate.

1) Join the Farrell Clan Facebook group if you haven’t already. It is a great place for sharing information and keeping in touch with recent developments.

2) Check out the various pages and the previous posts on this website (the Farrell DNA Project blog). It has a lot of useful information that will help you understand your results.

3) Each Y-DNA-tested member of the project gets a Privatised Personalised Report which includes an interpretation of their results, how they relate to others within the project, and next steps for that particular individual. These reports are published here on this blog for the benefit of everyone - prospective new recruits can thus see what they will get and everybody benefits from the general educational aspect of these reports.

4) Join the relevant Haplogroup projects
Your results will reveal your haplogroup (your branch of the human Y-DNA tree and/or human mtDNA tree). Once your results arrive, make sure you join all the relevant projects as these will assist us in the further analysis of your data and in particular your deep ancestry (where in the world your particular ancestors originated several thousand years ago). The projects are run by volunteer project administrators and they are a rich source for advice, guidance, and support. Frequently there is an associated mailing list or Facebook group you can join to keep abreast of up-to-date developments (this is a fast-moving field).

Relevant Y-DNA haplogroup projects identified thus far include the following:

R1b - Groups 1-5 & Ungrouped (Haplogroup R) - should join R1b ALL subclades
R1a groups - should join the R1a Project
I1 groups - should join the I1 subclade project

If your haplogroup project is not listed here, you can see if there is a specific project for your haplogroup on this list: http://www.isogg.org/wiki/Y-DNA_haplogroup_projects

5) Join the relevant Geographical Projects
Relevant Y-DNA geographical projects may include the following:

There may be other geographical projects that are relevant to your ancestral line and you can find them on this list: http://www.isogg.org/wiki/Geographical_DNA_projects

Check out General Resources

There is a lot of information out there about genetic genealogy in general and it can be a bit confusing knowing where to find it. Below is a selection of our "best bits".

1) FTDNA have a lot of useful information in their Learning Centre. Be sure to check out the FAQs (Frequently Asked Questions).

2) The ISOGG wiki is a great place to start looking for general information about any topic related to genetic genealogy, including your particular type of test.

3) Read Kelly Wheaton's beginners’ guide to genetic genealogy: https://sites.google.com/site/wheatonsurname/beginners-guide-to-genetic-genealogy

4) Download and read the e-book from the resources tab on your myFTDNA homepage.

5) There are a variety of different YouTube videos on genetic genealogy which have been prepared by ISOGG members and Project Administrators.

Watch the recordings of the WDYTYA Live lectures: http://www.youtube.com/channel/UC7HQSiSkiy7ujlkgQER1FYw
Watch the FTDNA videos and webinars: http://www.familytreedna.com/learn/ftdna/webinars
Watch videos from Genetic Genealogy Ireland: https://www.youtube.com/channel/UCHnW2NAfPIA2KUipZ_PlUlw

6) Sign up to the relevant genetic genealogy mailing lists, forums and Facebook groups. These can be great sources of help if you have a specific question. See the list here: http://www.isogg.org/wiki/Genetic_genealogy_mailing_lists.
We particularly recommend:

The ISOGG Facebook group: https://www.facebook.com/groups/isogg
The ISOGG DNA Newbie list: https://groups.yahoo.com/neo/groups/DNA-NEWBIE/info
The Family Tree DNA Forums: http://forums.familytreedna.com

7) Read blogs written by experienced genetic genealogists. See this list of genetic genealogy blogs: http://www.isogg.org/wiki/Genetic_genealogy_blogs

8) Read the relevant articles about your specific DNA-test ...

Y-DNA - traces your father's father's father's line
Y-DNA basics: http://www.familytreedna.com/learn/dna-basics/ydna

Y-DNA user guide: http://www.familytreedna.com/learn/y-dna-results-pages

Y-DNA STR testing: http://www.familytreedna.com/learn/y-dna-testing

Y-DNA SNP testing and haplogroups: http://www.familytreedna.com/learn/y-dna-snp-testing-haplogroups

Mitochondrial DNA (mtDNA) - traces your mother's mother's mother's line

mtDNA basics: http://www.familytreedna.com/learn/dna-basics/mtdna

mtDNA user guide: http://www.familytreedna.com/learn/mtdna

mtDNA testing for advanced users: http://www.familytreedna.com/learn/mtdna-testing

These two pages are relevant if you have taken the full mitochondrial sequence (FMS) test:
mtDNACommunity: http://www.familytreedna.com/learn/mtdna-community

mtDNA scientific collaboration: http://www.familytreedna.com/learn/mtdna-results-donation

Autosomal DNA (atDNA) - traces all your ancestral lines

Family Finder user guide: http://www.familytreedna.com/faq/answers.aspx?id=47

Understanding Family Finder results: http://www.familytreedna.com/faq/answers.aspx?id=17

Understanding Population Finder results: http://www.familytreedna.com/faq/answers.aspx?id=22

Please let me know if any of these links are broken or cease working.