
I suspect many people who frequent HN could use the MinION to generate the raw data, but generating gigabytes of DNA reads != assembling a genome. Remember, only this year was the very first genome for any human "completed". You're going to get thousands of overlapping reads of varying lengths. Then you'll need serious processing power to combine those overlaps, then overlap the overlaps, so to speak, and on and on. Which software to pick, and how to parameterize and use it, is the job of post-docs and others who like to tear their hair out. I haven't looked lately, but many one-off scientific machines have their own proprietary binary data formats for vendor lock-in, so that you must also buy their crappy software to process it; double-check that this isn't the case here.

When your first run fails, are you willing to pay just as much for the kits to run it again? (These machines very much follow the cheap-printer/expensive-ink model, IMO.)

Once you have some data, do you know how you'll BLAST it against annotated genomes to figure out whether you have mutation X? How do you interpret E-values, etc.?

For the OP company, when they say "download the data", do they mean the raw reads or the assemblies? Make sure this is spelled out (it likely is; I haven't looked lately). Do the downloads/service provide adequate metadata on the data-generation process, so that you can tease apart read errors from reality (real single-nucleotide mutations), etc.?

All fun stuff, but only for a very serious hobbyist, so to speak.



No one should be BLASTing individual reads... And pretty much no one is going to be assembling with OLC (overlap-layout-consensus), even for ONT data.

On the products page it says they provide FASTQs, an aligned BAM, and a VCF, which is exactly what one would expect. They're almost certainly just running the DRAGEN pipeline (or something similar) and giving you its output.


I think your typical programmer could automate the alignment, even without reading about existing algorithms. It isn't much harder than a standard interview question, especially if you're willing to use your hardware inefficiently.

You aren't trying to sequence the human genome from nothing, but instead have a reference genome to work with. You take each of the reads you get from the sequencer and find where it best matches up with the reference genome.
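To make the idea concrete, here's a toy sketch of that "find where each read best matches the reference" step. This is purely illustrative (the function names are mine, not from any real tool): real aligners like minimap2 or BWA handle indels, reverse complements, base qualities, and gigabase-scale references, none of which this does.

```python
# Toy "alignment": for each read, brute-force scan the reference and
# report the position with the fewest mismatches. Quadratic and naive,
# but it captures the core idea of reference-based alignment.

def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def naive_align(read: str, reference: str) -> tuple[int, int]:
    """Return (best_position, mismatch_count) for a read vs. reference."""
    best_pos, best_score = -1, len(read) + 1
    for i in range(len(reference) - len(read) + 1):
        score = hamming(read, reference[i:i + len(read)])
        if score < best_score:
            best_pos, best_score = i, score
    return best_pos, best_score

reference = "ACGTACGTTTGACCAGT"
print(naive_align("TTGACC", reference))  # (8, 0): exact match at position 8
```

At human-genome scale this brute-force scan is hopeless, which is exactly the efficiency point debated further down the thread; the idea itself, though, fits in a dozen lines.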

I think the wet portion is a far bigger challenge for the typical computer programmer hobbyist.

(Also I don't think $1,000 will get you, in addition to the sequencer, a flow cell sufficient to gather enough reads to tell the difference between sequencing errors and mutations?)


> your typical programmer could automate the assembly

Alignment. You could easily automate alignment of raw sequencing reads to their position in the reference genome. From there, even variant calling (finding your individual changes from the reference) is pretty easy as well.

Assembly is a bigger challenge: it takes the raw reads and puts them together into a larger (graph) structure. This is usually done de novo, though it can also be done against a reference. But it's not really routine, and at 30X sequence coverage it isn't even possible to the degree one would expect.

The wet-lab portion isn't terrible, and anyone with skills at making things (soldering, for example) should be able to do it -- once you've figured out some of the jargon.

The bigger challenge for someone not trained in this would be interpreting the variants you'd find. It can be quite difficult to figure out whether an A>C at a particular position is meaningful or not. Again, many of these can be filtered using existing tools, but even then, interpreting the results can be quite difficult.


> Alignment

Whoops; fixed!

> It can be quite difficult to figure out whether an A>C at a particular position is meaningful or not.

What about https://promethease.com ?


Promethease isn’t a bad place to start. I don’t know how it handles uncharacterized variants, but it’s a reasonable first stop.

Note: I do this work for cancer genomes (WGS, tumor and germline). Promethease is not something that I use for annotations, but my workflows are significantly more complex.

The biggest thing I’d look out for is mindset. Biology is very different from comp sci. CS is very deterministic. You change something here, you get a result there. Biology, on the other hand, is very messy. There are compensation mechanisms on top of compensation mechanisms, and everything favors life. If you see a deleterious variant, that still doesn’t mean you’ll see a phenotype (effect on you). It’s quite a different world and can be very scary if you are looking at your own data.


I think the biggest reason it's scary is a statistics literacy problem: everyone has lots of mutations, and a mutation is much more likely to be bad than good. This means that when people look at a Promethease report and see a long list of deleterious mutations they tend to take it, overall, as bad news: so much red! But since that's what you would expect to see even before looking at the report, it shouldn't move your opinion either way.


My prior is that any random variant would likely do nothing. It’s all about location, location, location. First off, a variant would need to be in a gene to do something (not entirely true, but a good enough approximation). Then, the variant would need to be in a coding region/exon. Finally, the variant would need to change the amino acid (there is a Twitter/bioRxiv war going on about this specifically now). Your DNA sequence is approximately 0.1% different from anyone else’s. But there are 3 billion bp in the human genome, so that’s an expected 3 million variants. And that’s just to start. And just to add to the prior probabilities — you’re probably living, right? So, whatever variants you might find haven’t been lethal yet! (Even better if you’ve made it to adulthood!)
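The back-of-envelope arithmetic above (both figures are the comment's own approximations, not precise values) works out like this:

```python
# Expected variant count: ~0.1% of ~3 billion base pairs differs
# between two people. Both numbers are rough approximations.
genome_size = 3_000_000_000   # approximate haploid human genome, in bp
unique_fraction = 0.001       # ~0.1% difference from another person
expected_variants = int(genome_size * unique_fraction)
print(f"{expected_variants:,}")  # 3,000,000
```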

And to complicate things further, one variant in a gene could do nothing if the other copy is still good. Even if you have two bad variants in a single gene, they might be on the same allele (from the same parent). In which case, you still have one good copy! It’s like how Iceman got hit twice in the same engine at the end of the original Top Gun. He still had one good engine to get home. For many genes, that’s all you need. For others, you might not even need one good copy as other genes could take over the same job.

Yes, there are some variants that you don’t want to see. Things that might indicate high likelihood of disease down the road. But even that is bound by statistical probabilities. Good news — some of those can’t be seen by this type of sequencing.

You’re absolutely right though — it can be scary. After looking at many genomes, it can be amazing that we all work as well as we do.


I used https://genvue.geneticgenie.org/ – it's free and I've found it to be the best starting point for further investigation. I also have access to Nebula Genomics' tools as part of the package I paid for. Yes, there is a fear factor when you're looking at your own genome. On the other hand, seeing you have an increased risk of something can be quite motivating to make some compensating lifestyle changes.


It's patronizing to say a programmer could just write algorithms complex enough that entire majors are dedicated to them at most universities. You have to realize that you're probably thinking of a naive solution that adds another order of runtime complexity, on datasets that easily reach 100GB.


The complexity comes from trying to do it efficiently, both in using the hardware well and in squeezing as much data as possible out of the sequencing reads you have. But if you relax both of those it's a manageable problem for an amateur.


Efficiency is crucial, though. Even cutting-edge published alignment tools running on the best AWS server take hours to process a 30x human genome. An amateur approach to NGS data from anything but a model organism would take years or decades to finish alignment on current hardware.


Part of why the best tools take so long is that they're trying really hard; an amateur approach can do a decent job in far less time by trying less hard. (No fuzzy matching, only exact matching on shortish chunks.)
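A sketch of what "exact matching on shortish chunks" can look like (function names and the tiny k are mine, for illustration): index every k-mer of the reference once, then place each read with a dictionary lookup on its leading k bases instead of scanning the whole reference.

```python
# Seed-and-index placement: one hash lookup per read instead of a full
# O(reference_length) scan. The trade-off: a read whose leading k-mer
# contains a sequencing error simply won't be placed.

from collections import defaultdict

K = 4  # real tools use k around 15-31; kept tiny for this toy example

def build_index(reference: str, k: int = K) -> dict:
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_positions(read: str, index: dict, k: int = K) -> list:
    """Candidate alignment positions from the read's leading k-mer."""
    return index.get(read[:k], [])

reference = "ACGTACGTTTGACCAGT"
index = build_index(reference)
print(seed_positions("TTGACC", index))  # [8]
```

Building the index is a one-time linear pass over the reference; after that each read costs a single lookup plus verification of the few candidate positions, which is the essence of how real seed-and-extend aligners get their speed.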

Additionally, we're talking about nanopore data, which means much longer reads, so it's going to be computationally a bit easier.



