We use cookies and other tracking technologies to improve your browsing experience on our site, analyze site traffic,
and understand where our audience is coming from. To find out more, please read our privacy policy.
By choosing 'I Accept', you consent to our use of cookies and other tracking technologies.
We use cookies and other tracking technologies to improve your browsing experience on our site, analyze site traffic,
and understand where our audience is coming from. To find out more, please read our privacy policy.
By choosing 'I Accept', you consent to our use of cookies and other tracking technologies. Less
We use cookies and other tracking technologies... More
Login or register to apply for this job!
Login or register to start contributing with an article!
Login or register to see more jobs from this company!
Login or register to boost this post!
Show some love to the author of this blog by giving their post some rocket fuel π.
Login or register to search for your ideal job!
Login or register to start working on this issue!
Engineers who find a new job through Functional Works average a 15% increase in salary π
Recently, I have been working on a very interesting and sophisticated project called typerepmap. A lot of advanced features and tricks were used during the development process and I have discovered many amusing and new sides of Haskell. So, I decided to share the ideas, steps, issues, etc. in this blog post.
If you want to skip all the funny parts, here is the link to the code itself:
The basic idea behind typerepmap is to have a data structure like Map, but where types serve as keys, and values stored in the map are of the type specified in the corresponding key.
An example image of this structure:
Int: 42
Bool: True
String: "Haskell"
There can be only one keyvalue pair for each type.
And here is an example written in pseudocode for better understanding:
We also want our values to be indexed by a polymorphic type, but that will be explained later.
Existing Solutions
There already exist some libraries that implement ideas similar to typerepmap:
typemap appears to resemble our project, however the interface is different. They track the elements in the types and don't provide the desired parametrization.
dependentmap is closer to our goal in terms of the interface but the package has a complete reimplementation of Data.Map.Lazy, and the goal of the typerepmap project is to have an efficient implementation based on primitive unboxed arrays.
Motivation
You might wonder what typerepmap brings to the table if there are other packages that aim to fulfil the same purpose. The primary goal is to use it in the caps library instead of the DMap type from dependentmap parametrized by TypeRep.
In caps the performance of lookups is extremely important so it makes sense to prioritize its performance above that of other functions.
Implementing TypeRepMap
Sections below describe the details of the implementation phases and the general concepts.
NOTE: in this blog post I am talking about ghc8.0.2 or higher.
Extensions used
The code snippets in this blog post assume that the following extensions are enabled:
XTypeApplications
XScopedTypeVariables
XGADTs
XTypeInType
XAllowAmbiguousTypes
Mapbased implementation
The reference implementation is more or less straightforward. It uses a lazy Map from containers as an internal representation of the desired data type.
Keys
Normally, types in Haskell are only present at compiletime: after typechecking, they are completely erased from the program. And yet we want to use types as keys in a map. This requires a runtime representation for types. Luckily, the standard library provides a runtime representation for types in the form of TypeRep. But there are actually two different definitions of TypeRep in base:
The one in Type.Reflection was introduced in GHC 8.2 and the old one was rewritten to be based on it. Type.Reflection.TypeRep has kind TypeRep :: k > * while the old one has kind TypeRep :: *.
To have the basic idea of what actually TypeRep is, you can think of the old TypeRep as an infinite ADT with all types enumerated as tag constructors:
data TypeRep = Int  Bool  Char  String  ...
and the new TypeRep is an equivalent to the infinite GADT:
data TypeRep (a :: k) where
Int :: TypeRep Int
Bool :: TypeRep Bool
Char :: TypeRep Char
String :: TypeRep String
...
If you are interested in the actual difference between old and new versions of the TypeRep and motivation for this change, here is a nice ICFP video by Simon Peyton Jones:
I use the old TypeRep that comes from Data.Typeable. And I have an explanation for that: there is a limitation in regular Map that all keys must be of the same type and this is not possible to achieve with parameterized TypeRep. Also, the old TypeRep will never be deprecated (from 8.2 it is just a different interface to the new TypeRep, so it's not obsolete), and it is sufficient for our goal to support older GHC versions.
Here is a usage example of basic TypeRep interface:
ghci> :t typeRep
typeRep :: Typeable a => proxy a > TypeRep
ghci> typeRep (Proxy @Int)
Int
ghci> :t it
it :: TypeRep
Values
For the first prototype, I decided to use Dynamic as values in our TypeRepMap.
ghci> :t toDyn
toDyn :: Typeable a => a > Dynamic
ghci> toDyn True
<<Bool>>
ghci> fromDynamic (toDyn "Haskell") :: Maybe String
Just "Haskell"
insert :: forall a . Typeable a => a > TypeRepMap > TypeRepMap
insert val = TypeRepMap . LMap.insert (typeRep (Proxy @a)) (toDyn val) . unMap
lookup :: forall a . Typeable a => TypeRepMap > Maybe a
lookup = fromDynamic <=< LMap.lookup (typeRep $ Proxy @a) . unMap
When looking at the Dynamic data type implementation
data Dynamic = Dynamic TypeRep Obj
type Obj = Any
you can notice that it already stores TypeRep inside, so it seems like it's a bit suboptimal decision due to redundancy. And we can safely use Any as our value type.
According to the Dynamic implementation, we can use unsafeCoerce function for the conversion to Any and from Any.
ghci> let x = lookup $ insert (11 :: Int) empty
ghci> x :: Maybe Int
Just 11
ghci> x :: Maybe ()
Nothing
All right, we have a simple working version. But there are ways to improve it.
Parameterization
The next step is to parametrize our data type by type variable f with kind f :: k > *. This f will be the interpretation of our keys. Such parameterization allows us to encode additional structure common between all elements, making it possible to use TypeRepMap to model a variety of things from extensible records to monadic effects. This sort of parametrization may be familiar to users of vinyl records.
Note that the input kind is k β we want to support arbitrary kinds as well. Since TypeRep is polykinded, the interpretation can use any kind for the keys (see some examples in documentation).
newtype TypeRepMap (f :: k > *) = TypeRepMap
{ unMap :: LMap.Map TypeRep Any
}
The implementation of the functions stays the same, but the types are different:
insert :: forall a f . Typeable a => f a > TypeRepMap f > TypeRepMap f
lookup :: forall a f . Typeable a => TypeRepMap f > Maybe (f a)
Our previous implementation is just equivalent to TypeRepMap Identity in the context of the described design.
NOTE: Another reason to get rid of the Dynamic: if we keep it then we have to specify Typeable (f a) constraint instead of Typeable a in the type declarations. And having Typeable a constraint would let us implement the following function efficiently:
hoist :: (forall x. f x > g x) > TypeRepMap f > TypeRepMap g
Vectorbased implementation
The next step is to write an alternative implementation based on unboxed vectors, which is expected to be faster.
We want to use Vector (TypeRep, Any) instead of our lazy map. This vector is going to be sorted by TypeRep. insert/lookup algorithms should be implemented manually in the following way:
insert: allocate a new vector of n + 1 element, copy everything from the initial vector adding the new element and don't forget to keep the sorting.
lookup: the simple binary search.
The point of the unboxed vector is that it helps to get rid of the pointer indirection. If we take just Vector we will observe this picture (Ty stands for TypeRep and El stands for an element):
Unfortunately, TypeRep doesn't have the Unbox instance and it looks like it's not possible to write it. So instead of storing TypeRep we will be storing a Fingerprint. Basically, Fingerprint is the hash for TypeRep, so it makes sense to move in this direction.
If we take a look at the Ord instance of SomeTypeRep in base we'll see that it's confirmed that Fingerprint is unique for each TypeRep. That means it's okay to use Fingerprint as a key instead of TypeRep.
Vector
This is initial vectorbased implementation:
data TypeRepMap (f :: k > *) = TypeRepMap
{ fingerprints :: Vector Fingerprint
, anys :: Vector Any
}
We want to use unboxed vector as a type for the fingerprints field of TypeRepMap.
Every unboxed vector is the newtype wrapper over some primitive vector. In order to use an unboxed vector of Fingerprint we need to implement an instance of the Prim typeclass from the primitive package for Fingerprint. It was proposed to add this instance under this issue in primitive library (having this instance inside library would simplify implementation a lot):
As the reference for Prim instance implementation, we can use the Storable type class which contains similar functions. There is already the instance Storable for Fingerprint. An assumption is that there is no significant difference between Storable and Prim for our lookup checks and we can use storable vector instead of unboxed one. For more information about the difference between those typeclasses see this SO answer.
Though our initial assumptions were false and turned out that Storable doesn't give the desired performance boost as shown with benchmarks.
Optimal Vector
According to the source code, Fingerprint is a pair of (Word64, Word64). So instead of having a single vector of Fingerprints we can have a vector of Word64 where Fingerprint with number i stored on 2 * i and 2 * i + 1 indices.
But actually, itβs better to split it into two separate vectors of Word64 where one vector stores the first halves of Fingerprint and the other one stores the second halves correspondingly. It makes the implementation easier and also faster (checked with benchmarks) because of the assumption that it should be almost always enough to compare only the first part and it makes key comparison faster.
After all described optimizations were done our structure took the following form:
data TypeRepMap (f :: k > *) = TypeRepMap
{ fingerprintAs :: Unboxed.Vector Word64
, fingerprintBs :: Unboxed.Vector Word64
, anys :: Boxed.Vector Any
}
And the lookup function was implemented like this:
lookup :: forall a f . Typeable a => TypeRepVector f > Maybe (f a)
lookup tVect = fromAny . (anys tVect V.!)
<$> binarySearch (typeRepFingerprint $ typeRep $ Proxy @a)
(fingerprintAs tVect)
(fingerprintBs tVect)
It uses a manually implemented version of the binary search algorithm optimized for unboxed vectors. The algorithm initially performs a binary search using the fingerprintAs vector only. And then, after finding the first half, walks through the fingerprintBs vector.
At first, a simple naive binary search was implemented but later it was rewritten into a cacheoptimized binary search ([see the description here](http://bannalia.blogspot.ru/2015/06/cachefriendlybinarysearch.html )) which boosted the performance significantly.
Arraybased implementation
But thatβs not all. Later we noticed that every vector has the following definition:
data Vector a = Vector {# UNPACK #} !Int
{# UNPACK #} !Int
{# UNPACK #} !(Array a)
As you can see it contains two Int fields. So we can make our representation more optimal by using Array instead of boxed vector and PrimArray instead of unboxed vector directly in the TypeRepMap data type.
After all optimizations the final shape of the TypeRepMap is following:
Initially, I was frustrated about this part because I had no idea how to create the Map of 1000 elements as that means I needed to somehow generate 1000 types. But there was actually a pretty elegant solution for this puzzle β polymorphic recursion.
Let's introduce the following data types as typelevel natural numbers:
data Z
data S a
Using these data types we can now implement the function which builds TypeRepMap of the desired size.
buildBigMap :: forall a . Typeable a
=> Int
> Proxy a
> TypeRepMap Proxy
> TypeRepMap Proxy
so when I run buildBigMap with size n and Proxy a, it calls itself recursively with n  1 and Proxy (S a) at each step, so the types are growing on each step.
But this wasnβt the only challenge in benchmarking TypeRepMap. There were also a few interesting things with benchmarks to keep in mind:
We should force maps to normal form before benchmarking.
We can't use rnf function. Deriving NFData instance for TypeRepMap is not possible because there can be no NFData for Any. We won't be able to use rnf because it would try to force both the keys and the values, as our values are Any (can't force them), but since evaluating the values is not important at all for the benchmark, we could try to define a function like rnf but without touching the values.
For Mapbased implementation we need to benchmark the lookup function on different depths of our tree (as Map is internally a tree). But the key can be very close to the root so our benchmarks wonβt be honest enough. Thus we need to test on different Proxys with different types.
Here is the diagram of how the treeβs been constructed. You can notice that the Char element is the direct child of the root:
size: 16
tree:
+Proxy * (S (S (S (S (S (S (S (S (S (S (S (S (S (S Z))))))))))))))

+Char
 
  +Proxy * (S (S (S (S (S (S Z))))))
  
 +Proxy * (S (S (S (S (S (S (S (S (S (S (S Z)))))))))))
 
 +

Proxy * (S (S (S (S (S (S (S (S (S (S (S (S (S Z)))))))))))))

 +Proxy * (S (S (S (S (S (S (S (S (S (S Z))))))))))
 
 +Proxy * (S (S (S (S (S (S (S (S (S Z)))))))))
  
  +Proxy * (S (S (S (S (S (S (S (S Z))))))))
 
+Proxy * (S (S (S (S (S (S (S Z)))))))

 +Proxy * (S Z)
 
 +Proxy * (S (S (S Z)))
  
  +Proxy * (S (S (S (S (S (S (S (S (S (S (S (S Z))))))))))))
 
+Proxy * (S (S (S (S Z))))

 +Proxy * (S (S Z))
 
+Proxy * (S (S (S (S (S Z)))))

+Proxy * Z
Since we can't predict how Ord on TypeRep will behave we need to select a Proxy from our range randomly, however, because our types were huge we introduced the following type family to solve that issue:
type family BigProxy (n :: Nat) :: * where
BigProxy 0 = Z
BigProxy n = S (BigProxy (n  1))
While running this version of benchmarks it turned out that rnf function was taking a lot of time mostly on normalisation of the enormous TypeRep keys which consisted of tall nested types like S (S (S ...)).
So, eventually, I end up using the ghc plugin ghctypelitsknownnat and the type of the buildBigMap became:
buildBigMap :: forall (a :: Nat) . KnownNat a
=> Int
> Proxy a
> TypeRepMap (Proxy :: Nat > *)
> TypeRepMap (Proxy :: Nat > *)
In order to benchmark lookup function we implemented a special function fromList to use in place of the bunch of inserts, so we will be able to see the real time measures of lookup operation itself.
data TF f where
TF :: Typeable a => f a > TF f
fromList :: [TF f] > TypeRepMap f
Now the buildBigMap function will have type
buildBigMap :: forall (a :: Nat) . KnownNat a
=> Int
> Proxy a
> [TF (Proxy :: Nat > *)]
> [TF (Proxy :: Nat > *)]
Benchmarks make 10 lookups to collect average performance statistics:
In this blog post, I wanted to show the difficulties, tricks, and useful information which I personally learned during the implementation of an optimized version of TypeRepMap. Also, I needed to somehow structure the knowledge Iβve gained while working on this project. You can say that some parts of the post can be skipped or might be irrelevant but I wrote it in such a way on purpose to highlight the topics that I find very hard to find and understand quickly. So I hope you too will find this knowledge useful!
Acknowledgments
Many thanks to Vladislav Zavialov (\@intindex) for mentoring this project! It was the great experience for me.
Bonus
A few more challenges on the way to the release typerepmap:
KindOf
During interface enhancement I ran into some weird issue described below.
Itβs nice to have the member function and it makes sense to implement it using already written lookup function:
member :: forall a f . Typeable a => TypeRepMap f > Bool
member trMap = case lookup @a trMap of
Nothing > False
Just _ > True
Type of the lookup function is the following:
lookup :: forall a f . Typeable a => TypeRepMap f > Maybe (f a)
Unfortunately, this implementation of member doesnβt compile! The problem is in the fact that the compiler canβt infer that type variable a and the argument to f have the same kind. These two functions have the following type with implicitly inferred kinds:
lookup :: forall {k} (a :: k) (f :: k > *) . Typeable a
=> TypeRepMap f > Maybe (f a)
member :: forall {k1} {k2} (a :: k1) (f :: k2 > *) . Typeable a
=> TypeRepMap f > Bool
After this ghc proposal is implemented, it should be possible to write such type signatures directly in code. The current workaround is to use this trick with KindOf type:
type KindOf (a :: k) = k
member :: forall a (f :: KindOf a > Type) . Typeable a => TypeRepMap f > Bool
New TypeRep performance
During benchmarking the Mapbased implementation of TypeRepMap, very perceptible performance degradation was noticed. Here is the comparison table with the results we have with our Mapbased implementation.
We didnβt observe this performance degradation when we used Fingerprint as keys, so it's probably an issue with the new TypeRep.
KnownNat and Typeable
Initial version of buildBigMap function had this type signature:
buildBigMap :: forall (a :: Nat) . (Typeable a, KnownNat a) => ...
But, unfortunately, it became broken on GHC8.4.3! Turned out that Typeable and KnownNat constraints donβt play well together. This observation resulted in the following ghc ticket with quite an interesting discussion: