Use Case: Nucleic Acids
module Tutorial.DPair.DNA
import Data.DPair
import Data.Either
import Data.HList
import Data.List
import Data.List1
import Data.Singleton
import Data.String
import Data.Vect
import Text.CSV
%default total
We'd like to come up with a small, simplified library for running computations on nucleic acids: RNA and DNA. These are built from five types of nucleobases, three of which are used in both types of nucleic acids and two bases specific for each type of acid. We'd like to make sure that only valid bases are in strands of nucleic acids. Here's a possible encoding:
data BaseType = DNABase | RNABase
data Nucleobase : BaseType -> Type where
Adenine : Nucleobase b
Cytosine : Nucleobase b
Guanine : Nucleobase b
Thymine : Nucleobase DNABase
Uracile : Nucleobase RNABase
NucleicAcid : BaseType -> Type
NucleicAcid = List . Nucleobase
RNA : Type
RNA = NucleicAcid RNABase
DNA : Type
DNA = NucleicAcid DNABase
encodeBase : Nucleobase b -> Char
encodeBase Adenine = 'A'
encodeBase Cytosine = 'C'
encodeBase Guanine = 'G'
encodeBase Thymine = 'T'
encodeBase Uracile = 'U'
encode : NucleicAcid b -> String
encode = pack . map encodeBase
It is a type error to use Uracile
in a strand of DNA:
failing "Mismatch between: RNABase and DNABase."
errDNA : DNA
errDNA = [Uracile, Adenine]
Note, how we used a variable for nucleobases Adenine
, Cytosine
, and Guanine
: These are again universally quantified, and client code is free to choose a value here. This allows us to use these bases in strands of DNA and RNA:
dna1 : DNA
dna1 = [Adenine, Cytosine, Guanine]
rna1 : RNA
rna1 = [Adenine, Cytosine, Guanine]
With Thymine
and Uracile
, we are more restrictive: Thymine
is only allowed in DNA, while Uracile
is restricted to be used in RNA strands. Let's write parsers for strands of DNA and RNA:
readAnyBase : Char -> Maybe (Nucleobase b)
readAnyBase 'A' = Just Adenine
readAnyBase 'C' = Just Cytosine
readAnyBase 'G' = Just Guanine
readAnyBase _ = Nothing
readRNABase : Char -> Maybe (Nucleobase RNABase)
readRNABase 'U' = Just Uracile
readRNABase c = readAnyBase c
readDNABase : Char -> Maybe (Nucleobase DNABase)
readDNABase 'T' = Just Thymine
readDNABase c = readAnyBase c
readRNA : String -> Maybe RNA
readRNA = traverse readRNABase . unpack
readDNA : String -> Maybe DNA
readDNA = traverse readDNABase . unpack
Again, in case of the bases appearing in both kinds of strands, users of the universally quantified readAnyBase
are free to choose what base type they want, but they will never get a Thymine
or Uracile
value.
We can now implement some simple calculations on sequences of nucleobases. For instance, we can come up with the complementary strand:
complementRNA' : RNA -> RNA
complementRNA' = map calc
where calc : Nucleobase RNABase -> Nucleobase RNABase
calc Guanine = Cytosine
calc Cytosine = Guanine
calc Adenine = Uracile
calc Uracile = Adenine
complementDNA' : DNA -> DNA
complementDNA' = map calc
where calc : Nucleobase DNABase -> Nucleobase DNABase
calc Guanine = Cytosine
calc Cytosine = Guanine
calc Adenine = Thymine
calc Thymine = Adenine
Ugh, code repetition! Not too bad here, but imagine there were dozens of bases with only few specialized ones. Surely, we can do better? Unfortunately, the following won't work:
complementBase' : Nucleobase b -> Nucleobase b
complementBase' Adenine = ?what_now
complementBase' Cytosine = Guanine
complementBase' Guanine = Cytosine
complementBase' Thymine = Adenine
complementBase' Uracile = Adenine
All goes well with the exception of the Adenine
case. Remember: Parameter b
is universally quantified, and the callers of our function can decide what b
is supposed to be. We therefore can't just return Thymine
: Idris will respond with a type error since callers might want a Nucleobase RNABase
instead. One way to go about this is to take an additional unerased argument (explicit or implicit) representing the base type:
complementBase : (b : BaseType) -> Nucleobase b -> Nucleobase b
complementBase DNABase Adenine = Thymine
complementBase RNABase Adenine = Uracile
complementBase _ Cytosine = Guanine
complementBase _ Guanine = Cytosine
complementBase _ Thymine = Adenine
complementBase _ Uracile = Adenine
This is again an example of a dependent function type (also called a pi type): The input and output types both depend on the value of the first argument. We can now use this to calculate the complement of any nucleic acid:
complement : (b : BaseType) -> NucleicAcid b -> NucleicAcid b
complement b = map (complementBase b)
Now, here is an interesting use case: We'd like to read a sequence of nucleobases from user input, accepting two strings: The first telling us, whether the user plans to enter a DNA or RNA sequence, the second being the sequence itself. What should be the type of such a function? Well, we're describing computations with side effects, so something involving IO
seems about right. User input almost always needs to be validated or translated, so something might go wrong and we need an error type for this case. Finally, our users can decide whether they want to enter a strand of RNA or DNA, so this distinction should be encoded as well.
Of course, it is always possible to write a custom sum type for such a use case:
data Result : Type where
UnknownBaseType : String -> Result
InvalidSequence : String -> Result
GotDNA : DNA -> Result
GotRNA : RNA -> Result
This has all possible outcomes encoded in a single data type. However, it is lacking in terms of flexibility. If we want to handle errors early on and just extract a strand of RNA or DNA, we need yet another data type:
data RNAOrDNA = ItsRNA RNA | ItsDNA DNA
This might be the way to go, but for results with many options, this can get cumbersome quickly. Also: Why come up with a custom data type when we already have the tools to deal with this at our hands?
Here is how we can encode this with a dependent pair:
namespace InputError
public export
data InputError : Type where
UnknownBaseType : String -> InputError
InvalidSequence : String -> InputError
readAcid : (b : BaseType) -> String -> Either InputError (NucleicAcid b)
readAcid b str =
let err = InvalidSequence str
in case b of
DNABase => maybeToEither err $ readDNA str
RNABase => maybeToEither err $ readRNA str
getNucleicAcid : IO (Either InputError (b ** NucleicAcid b))
getNucleicAcid = do
baseString <- getLine
case baseString of
"DNA" => map (MkDPair _) . readAcid DNABase <$> getLine
"RNA" => map (MkDPair _) . readAcid RNABase <$> getLine
_ => pure $ Left (UnknownBaseType baseString)
Note, how we paired the type of nucleobases with the nucleic acid sequence. Assume now we implement a function for transcribing a strand of DNA to RNA, and we'd like to convert a sequence of nucleobases from user input to the corresponding RNA sequence. Here's how to do this:
transcribeBase : Nucleobase DNABase -> Nucleobase RNABase
transcribeBase Adenine = Uracile
transcribeBase Cytosine = Guanine
transcribeBase Guanine = Cytosine
transcribeBase Thymine = Adenine
transcribe : DNA -> RNA
transcribe = map transcribeBase
printRNA : RNA -> IO ()
printRNA = putStrLn . encode
transcribeProg : IO ()
transcribeProg = do
Right (b ** seq) <- getNucleicAcid
| Left (InvalidSequence str) => putStrLn $ "Invalid sequence: " ++ str
| Left (UnknownBaseType str) => putStrLn $ "Unknown base type: " ++ str
case b of
DNABase => printRNA $ transcribe seq
RNABase => printRNA seq
By pattern matching on the first value of the dependent pair we could determine, whether the second value is an RNA or DNA sequence. In the first case, we had to transcribe the sequence first, in the second case, we could invoke printRNA
directly.
In a more interesting scenario, we would translate the RNA sequence to the corresponding protein sequence. Still, this example shows how to deal with a simplified real world scenario: Data may be encoded differently and coming from different sources. By using precise types, we are forced to first convert values to the correct format. Failing to do so leads to a compile time exception instead of an error at runtime or - even worse - the program silently running a bogus computation.
Dependent Records vs Sum Types
Dependent records as shown for AnyVect a
are a generalization of dependent pairs: We can have an arbitrary number of fields and use the values stored therein to calculate the types of other values. For very simple cases like the example with nucleobases, it doesn't matter too much, whether we use a DPair
, a custom dependent record, or even a sum type. In fact, the three encodings are equally expressive:
Acid1 : Type
Acid1 = (b ** NucleicAcid b)
record Acid2 where
constructor MkAcid2
baseType : BaseType
sequence : NucleicAcid baseType
data Acid3 : Type where
SomeRNA : RNA -> Acid3
SomeDNA : DNA -> Acid3
It is trivial to write lossless conversions between these encodings, and with each encoding we can decide with a simple pattern match, whether we currently have a sequence of RNA or DNA. However, dependent types can depend on more than one value, as we will see in the exercises. In such cases, sum types and dependent pairs quickly become unwieldy, and you should go for an encoding as a dependent record.