When kits are combined

So The Legal Genealogist finally got a chance to play with a new GEDmatch Genesis tool called the Superkit. And yes, in fact, there are differences between the match lists you get with a Superkit and the match lists you get from an individual company test.

So a quick review first: GEDmatch is a third-party website that allows people to compare DNA test results even if the tests were taken at different companies.[1] GEDmatch Genesis is the new, updated platform for this website.[2] There are a lot of free tools at GEDmatch, but some are available only if you support the website with a fee. That paid level is called Tier 1.[3]

The Superkit is a new Tier 1 utility that allows you to combine your DNA results from multiple testing companies into one super-duper-kit. The idea is that — although there’s a lot of overlap from company to company — the individual testing companies do look at somewhat different places in your autosomes for genealogically relevant markers. So, by combining results, you’d get a more complete set of markers than from any one test and a more accurate set — or at least more accurate rankings — of matches.

Think about it this way. Let’s say company A looks at your autosomal DNA (the kind of DNA used in the tests that help us compare our DNA to that of our biological cousins so we can work together to trace our family history[4]) in places 1, 2, 3, 4, 5, 8, and 10. And company B looks at places 1, 2, 3, 4, 5, 6 and 7. And company C looks at places 1, 2, 3, 4, 5, 9, and 10. Adding those three kits together would give us a combined look at places 1-10. Pretty cool, huh?
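As a toy sketch of that idea in Python, adding the kits together is just a set union. The place numbers are the made-up ones from the example above, not real marker positions, and the variable names are purely illustrative.

```python
# Toy illustration only: the "places" are the invented numbers from the
# example above, not real marker positions.
company_a = {1, 2, 3, 4, 5, 8, 10}
company_b = {1, 2, 3, 4, 5, 6, 7}
company_c = {1, 2, 3, 4, 5, 9, 10}

# Places that all three companies look at (the overlap).
shared = company_a & company_b & company_c

# Places covered once the three kits are added together.
combined = company_a | company_b | company_c

print(sorted(shared))    # [1, 2, 3, 4, 5]
print(sorted(combined))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```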

[Illustration: superkit threads]

Since I’ve tested with just about everybody under the sun, and I am a Tier 1 GEDmatch user, I figured it was worth trying to see what, if any, differences there might be between the test results from individual companies and this kind of superkit. You can combine tests from four companies, so I figured I’d combine results from 23andMe, AncestryDNA, Family Tree DNA and Living DNA.

After resolving a snag with my Living DNA results,[5] I went ahead and created a superkit with test data from those four companies. I then loaded the match lists from the superkit and all four companies into a spreadsheet to see what the differences might be.

The first thing I looked at was how the combination impacted the match rankings: who were the top matches with the superkit versus the top results from each of the companies?

None of the top five spots — my five siblings — changed position at all. I match them everywhere in exactly the same order: brother, sister, sister, brother, brother.

The number six spot is occupied by one of my uncles, my mother’s brother, and he’s in the number six spot across the board.

Things start to change at the number seven position. GEDmatch Genesis results for two of the companies — Living DNA and Ancestry — put one of my aunts in that number seven position, while the other two — 23andMe and Family Tree DNA — put another uncle in that slot. The superkit went with my aunt.

Now you’d think that the number eight slot, then, would definitely go to that uncle who got nosed out of the number seven position in a tie, right? No. The superkit promoted yet a third uncle into that position. He was in the number eight slot at Family Tree DNA, but in the number nine slot at the other three companies. Yet somehow he outranked his brother, who came in at number seven in two tests and number eight in the other two, but who slid to number nine in the combined superkit.

Positions 10 and 11 have the same back-and-forth: two companies put a nephew into the number 10 slot, two rank another aunt there. The superkit puts the nephew into position 10 and the aunt into position 11.

After that the changes become bigger and less easy to quantify. I was able to quickly note that known cousins who tested at Family Tree DNA ended up considerably farther down the rankings chart in the superkit. One cousin who’s in the number 23 position at Family Tree DNA doesn’t make the combined chart until position 62; another who’s in the number 27 position at Family Tree DNA doesn’t make the superkit list until position 45.
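To put numbers on those two drops, here is a minimal Python sketch using only the positions quoted above. The labels "cousin A" and "cousin B" and the helper name `rank_drop` are placeholders of mine, not anything GEDmatch provides.

```python
# Positions reported above: (rank at Family Tree DNA, rank in the superkit).
# "cousin A" and "cousin B" are placeholder labels.
positions = {
    "cousin A": (23, 62),
    "cousin B": (27, 45),
}


def rank_drop(company_rank, superkit_rank):
    """How many places a match fell going from the company list to the superkit list."""
    return superkit_rank - company_rank


drops = {name: rank_drop(*ranks) for name, ranks in positions.items()}
print(drops)  # {'cousin A': 39, 'cousin B': 18}
```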

I figured that would be because adding the other companies’ data into the combined results would produce a more complete set of data that would stack up differently against the match list.

But that’s not exactly all that’s happening here. There’s an actual difference in the amount of DNA in common and the size of the largest segment being reported between the superkit and the various testing companies.

The superkit says I have 37.8 cM in common with the cousin who’s in the superkit slot number 62, with a largest segment of 26.4 cM. Family Tree DNA reports 63.3 cM in common with a largest segment of 38 cM. The other company results range from 33.2 cM to 37.8 cM in common, with largest segment sizes from 17.1 cM to 26.4 cM. And the cousin in the superkit slot 45 shows there with 40.7 cM in common and a largest segment of 15.1 cM. The company data for that cousin ranges from 43.2 cM to 58.1 cM in common, with a largest segment of 15.7 cM to 19.2 cM.

Okay.

Waitaminnit.

The theory here is that we’re combining company A data at places 1, 2, 3, 4, 5, 8, and 10 with company B data at places 1, 2, 3, 4, 5, 6 and 7, company C data at places 1, 2, 3, 4, 5, 9, and 10 to get to a combined look at places 1-10, right? How then does the superkit have less reported data than one of the companies does in the first case and less than any of the companies reports in the second case?
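The puzzle can be laid out as a small comparison in Python. This is only a sketch built from the total-cM figures quoted above. Since only the lowest and highest company figures are given for each cousin, those are all it uses, and the function and key names are my own.

```python
# Total shared cM as reported above. Only the lowest and highest company
# figures are known for each cousin, so that is all this sketch compares.
cases = {
    "superkit slot 62": {"superkit": 37.8, "company_low": 33.2, "company_high": 63.3},
    "superkit slot 45": {"superkit": 40.7, "company_low": 43.2, "company_high": 58.1},
}


def describe(superkit, company_low, company_high):
    """Say where the superkit total falls relative to the company totals."""
    if superkit < company_low:
        return "below every company kit"
    if superkit < company_high:
        return "below at least one company kit"
    return "at or above every company kit"


summary = {name: describe(**values) for name, values in cases.items()}
for name, verdict in summary.items():
    print(f"{name}: {verdict}")
```

If combining kits only ever added data, you would expect neither cousin to land below a company total.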

It turns out that GEDmatch Genesis isn’t combining every bit of data from every one of the tests. It’s combining only a selected subset of the company data that it knows it’s going to use to create the new kit because the data in that subset contains the bits and pieces GEDmatch Genesis uses for matching; they appear to be the ones most useful for comparing people.[6] Any data that falls outside of that subset gets discarded. So yes, we’re getting 1-10, but tossing out any data that happens to fall outside that range, say, at 11-12.
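Here is a rough Python sketch of that keep-the-subset step, reusing the toy place numbers from earlier. Everything in it is an assumption for illustration: the extra places 11 and 12 are invented, `matching_subset` is a stand-in for whatever list GEDmatch Genesis actually uses, and this is not a description of how the site is really coded.

```python
# Hypothetical per-company coverage: the example places from earlier in the
# post, plus two invented extras (11 and 12) that fall outside the subset.
company_a = {1, 2, 3, 4, 5, 8, 10, 11}
company_b = {1, 2, 3, 4, 5, 6, 7, 12}
company_c = {1, 2, 3, 4, 5, 9, 10}

# Assumed stand-in for the subset of places used for matching.
matching_subset = set(range(1, 11))

everything_tested = company_a | company_b | company_c
superkit = everything_tested & matching_subset     # what gets kept
left_behind = everything_tested - matching_subset  # what gets discarded

print(sorted(superkit))     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sorted(left_behind))  # [11, 12]
```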

Now this may end up in the final analysis making the superkit more accurate for comparisons and matching against other superkits.

But when you look at that big ball of string in the illustration and how it’s made up of pieces of the individual balls of string from each of the data sets from each of the companies, remember that some data — and yes perhaps less useful data but data nonetheless — is being left behind.


SOURCES

Cite/link to this post: Judy G. Russell, “Superkit Sunday,” The Legal Genealogist (https://www.legalgenealogist.com/blog : posted 14 Apr 2019).

  1. See generally Judy G. Russell, “Updated look at GedMatch,” The Legal Genealogist, posted 26 Mar 2017 (https://www.legalgenealogist.com/blog : accessed 14 Apr 2019). It’s also the website of choice for law enforcement to use our DNA to try to solve cases.
  2. See ibid., “Genesis at GEDmatch,” posted 16 Dec 2018.
  3. See Kitty Cooper, “New Utilities at GEDmatch: Tier 1 for paid members,” Kitty Cooper’s Blog, posted 20 Oct 2014 (https://blog.kittycooper.com/ : accessed 7 Apr 2019).
  4. See generally ISOGG Wiki (https://www.isogg.org/wiki), “Autosomal DNA,” rev. 8 Apr 2019.
  5. The data file I was connected to at the Living DNA site did not contain my data. Still waiting for an explanation on how that happened, but at least I now do have my own test data.
  6. See generally Louis Kessler, “My Whole Genome Sequencing. The VCF File,” Behold Genealogy, posted 6 Feb 2019 ( : accessed 14 Apr 2019) (“The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most ‘bang for the buck’”).