November 15, 2007

Malaria’s Neat Information Problem

Posted by Eric at 8:29 am | Category: Biology, Literature, Science

Malaria is a very weird parasite with biology that we still really don’t understand.

For example, since it lives in the red blood cells, everything it makes has to come from, essentially, hemoglobin. Scientists think that’s partly why the parasite has a lot of As and Ts in its DNA, but not too many Gs and Cs. In fact, about 80% of its genome is made of As and Ts!

Now, this is a problem for malaria. If it has to use mostly two letters to write out everything, then it can’t say as much in the same length of DNA. For example, why are words and names in Hawaiian often so (stereotypically) long? It’s because Hawaiian doesn’t have as many different syllables as other languages, so Hawaiian speakers need longer words to pack in the same amount of information. Binary numbers are also longer than decimal numbers: 1573 is much shorter than the equivalent 11000100101. We say that each decimal digit is more information rich than each binary digit; computer scientists like to say that it has more “entropy.”
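To make the binary-versus-decimal comparison concrete, here is a minimal Python sketch that counts how many digits a number needs in a given base. The helper name `digits_needed` is made up for this illustration.

```python
def digits_needed(value: int, base: int) -> int:
    """Count how many digits `value` takes when written in `base`."""
    if value < 0 or base < 2:
        raise ValueError("value must be non-negative and base at least 2")
    count = 1
    # Each integer division by the base strips off one digit.
    while value >= base:
        value //= base
        count += 1
    return count


number = 1573
print(digits_needed(number, 10))  # 4 digits in decimal
print(digits_needed(number, 2))   # 11 digits in binary
print(bin(number)[2:])            # 11000100101
```

The smaller the alphabet, the more symbols it takes to write the same number.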

Biology has a similar problem. If you have only two letters, it’ll take longer for you to say something unique. Unfortunately, biology often needs to use stretches of DNA that are very specific and unique, which act like unique signposts or ID tags that proteins recognize when they’re looking for specific parts of DNA [1]. With only two letters, though, malaria’s signposts will be double the length! It costs energy to make DNA with longer “names”, and it takes another large dollop of energy to make the double-sized proteins that would recognize the double-length names.

So, malaria evolved to survive well in humans; maybe it’s found a way to avoid paying such a huge cost?

The Trick!

Yes, I think malaria has just such a trick! You see, it seems to save the Gs and Cs for the signpost parts of DNA, which turn genes on or off!

You can kind of see that in a new paper published in Molecular Cell [2] by Olivier Elemento, Noam Slonim, and Saeed Tavazoie. They basically crunched a lot of data on computers to find pieces of DNA that seem to control how genes are turned on and off, which are these signposts (since the proteins that control genes just find and follow the signposts in the DNA). What Elemento and the rest of ‘em found in malaria was that the proportion of Gs and Cs in the DNA rises in these signposts to a more reasonable 43%!

Cool, no? The authors don’t come right out and say that that’s the reason behind the GC-rich nature of the signposts they find. This is pure speculation on my part, and there are other reasons that the elements might be GC rich, but I think my speculation isn’t pure crap.

Let’s take a look at how much better 43% is than 20%. This will have a little bit of math in it, but settle down, I’ll skip over the unnecessary stuff. It’ll help see if my speculation is warranted.

So, How Much More?

We want some measurement that’ll tell us how long a word we need for a certain amount of information. For example, if I have just one letter (let’s say ‘X’), then I’ll never be able to tell you anything! Turn left, turn right, meeting’s canceled: even if I have an infinite amount of space, I can’t tell you any of it! (Note that not sending something counts as a letter; ‘space’ is equivalent to a letter here. In my hypothetical ‘one letter’ alphabet, spaces don’t exist, and so I can’t even stop!) If I have two letters, say ‘0’ and ‘1’, then I can uniquely identify 8 things with 3-letter-long words (000, 001, 010, etc.). Let’s say I have 4 letters, A, G, T, and C. Then with a 3-letter-long word, I can uniquely identify 64 different things with the same length word! Much nicer, eh?
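The counting here is just alphabet size raised to the word length. A small Python sketch (the function name `count_words` is invented for illustration) lists the 3-letter binary words and compares the totals:

```python
from itertools import product


def count_words(alphabet: str, length: int) -> int:
    """Number of distinct words of `length` letters drawn from `alphabet`."""
    return len(alphabet) ** length


# Spell out every 3-letter word in the two-letter alphabet.
binary_words = ["".join(letters) for letters in product("01", repeat=3)]
print(binary_words)  # ['000', '001', '010', '011', '100', '101', '110', '111']

print(count_words("X", 3))     # 1: a one-letter alphabet can only ever say one thing
print(count_words("01", 3))    # 8
print(count_words("AGTC", 3))  # 64
```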

Now, things get complicated when you have to follow percentages for how often you use certain letters. For example, if your words have to have a certain average percentage of As and Ts, such as 80%, then you can’t quite say as much, right? In a way, you can think of this situation as sitting somewhere in between the 4-letter alphabet and the 2-letter alphabet. If the required percentage for 2 of your 4 letters is 50%, then you have a perfect 4-letter alphabet. If you have to use those 2 letters 100% of the time, then you can’t use the other two letters at all, so you’re reduced to a 2-letter alphabet.

Using a formula [3] that generalizes the above concept, we can calculate that without adding Gs and Cs back into the signposts, malaria would have to have DNA addresses that are almost 40% longer in order to identify the same number of items uniquely! That’s a pretty big cost! Imagine if you needed to spend 40% more energy to do everything. Your 25 miles per gallon car would become a painful 18 mpg gas guzzler. Your $70 / month electricity bill would go up to almost $100! I mean, that’s a ton of energy wasted. And evolution seems to have come up with a simple solution to the problem!
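For anyone who wants to play with the numbers, here is a rough Python sketch of the calculation using Shannon’s entropy. It rests on simplifying assumptions that the post doesn’t spell out: A and T are taken to be equally common, G and C are taken to be equally common, and each letter is treated as independent of its neighbors. The function names are made up for this example.

```python
import math


def entropy_bits(at_fraction: float) -> float:
    """Shannon entropy per DNA letter, in bits.

    Assumption: A and T split `at_fraction` evenly, and G and C
    split the remainder evenly.
    """
    gc_fraction = 1.0 - at_fraction
    probs = [at_fraction / 2, at_fraction / 2, gc_fraction / 2, gc_fraction / 2]
    # Letters with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_length(at_background: float, at_signpost: float) -> float:
    """How many times longer a signpost must be at the background
    composition to carry the same information as one at the signpost
    composition."""
    return entropy_bits(at_signpost) / entropy_bits(at_background)


print(entropy_bits(0.5))  # 2.0 bits: a full 4-letter alphabet
print(entropy_bits(1.0))  # 1.0 bit: effectively a 2-letter alphabet

# Signposts at 43% GC means 57% AT.
print(f"{relative_length(0.80, 0.57):.2f}")  # 1.15 (80% AT, genome-wide)
print(f"{relative_length(0.90, 0.57):.2f}")  # 1.35 (90% AT, in-between regions)
```

Under these assumptions the penalty depends on which background you compare against: roughly 15% longer against the genome-wide 80% AT, and roughly 35% longer against the 90% AT of the in-between regions, which is the comparison closest to the “almost 40%” figure.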

It’s simultaneously marvelous and terrifying how efficient and systematic malaria is, but in a way, we can use that to our advantage, because we can assume that malaria has rapidly evolved to reproduce well in humans, and then we can explore its biology with hypotheses generated from that. I didn’t know, for example, that malaria’s DNA signposts had more Gs and Cs than the rest of the genome, but I figured it might, since natural selection would push malaria that way. (Intelligent design, now your turn for hypotheses. No? No hypotheses? No no, don’t cry, it’s not your fault you aren’t a real science.)

I’m really curious whether looking for particularly G- and C-rich pieces of DNA in malaria’s genome would help find new signposts faster, since right now we don’t know of too many. Malaria is forced to use Gs and Cs in the DNA for its genes, so it balances out the AT/GC ratio by draining its in-between regions of Gs and Cs (so there, the proportions are actually 90% As and Ts). That sounds like a possible science project somewhere…

1. The pieces of DNA are called ‘cis-acting elements’ for historical reasons that date back to the early days of genetics.

2. Elemento et al. (2007) “A Universal Framework for Regulatory Element Discovery across All Genomes and Data Types,” Mol Cell, 28: 337-350.

3. It’s called Shannon’s entropy.
