I love the huge curated datasets, but they don't cover everything. I have my own huge dataset, which is too large to distribute as an EntityStore, so is there a way to build my own analog of, say, ElementData?
1 Answer
Update: Template Notebook
I finally got around to finishing my curated data template notebook. You can see it here
Currently it only supports single-type data paclets, but a template for multi-type curated data is in the works and I'll test that with my accumulated Stack Exchange data.
In the meantime, I took some arbitrary airline safety data and turned it into a paclet. This is the construction notebook for it, so that you can get a sense for how it works.
Custom Curated Data Paclets
This answer is split into three parts: the basic setup (i.e. the answer to the question), an example, and a note on automating the process.
Setup
This is exactly what the DataPaclets` functionality is built for. Our core data will be stored in a set of three paclets, two of which will define our data in WDX and one of which will define our interface.
Base Interface
We'll start with the interface, as this tells us how to format our data. If we look at the DownValues for ElementData we see a pile of stuff in the DataPaclets`ElementDataDump` context. We can get most of the way to building our interface by simply copying these definitions into a separate .m file and replacing every instance of ElementData with our own data function, say PackageData. One thing to note is that you may see a "DefaultProperty" that the *Data interface loads. Set that to whatever property in your data you want to be the default or, if you want to use the Entity framework, set it to None.
There are a few extra things we'll need here, though. Notice that there's a function DataPaclets`ElementDataDump`Initialize. Its DownValues define how we can load the base data from a data paclet. We can simply copy these over, but note that we'll need to copy over the HoldFirst attribute too.
There are then a few other functions whose definitions we copy over (all in the DataPaclets`*Dump` context):
handlePropertyError
StandardNameQ
GroupQ
FileQ
KeyQ
PropertyQ
CompiledPropertyQ
internalElementData
This will define most of our primary interface. We need one last thing before our interface is set up, which is to set the initialization. Consider the OwnValues:
In[7]:= DataPaclets`ElementDataDump`$Properties // OwnValues
Out[7]= {HoldPattern[DataPaclets`ElementDataDump`$Properties] :>
DataPaclets`ElementDataDump`Initialize[
DataPaclets`ElementDataDump`$Properties]}
We need to set this for all of the following:
{
$Groups,
$GroupHash,
$PrivateGroups,
$KeySource,
$Keys,
$SubgroupToGroup,
$PropertyHash,
$CompiledProperties,
$SourceGroups,
$StandardName,
$Entities,
$EntityHash,
$Properties,
ComputeFunction
}
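This is why that HoldFirst attribute was crucial. Without it, setting
$Groups:=Initialize[$Groups]
throws iteration errors, because $Groups would be evaluated before Initialize receives it, which triggers the same definition again.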
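As a minimal sketch of why the attribute matters, here is the same lazy-initialization pattern in isolation. The names initialize and $groups, and the placeholder value, are hypothetical stand-ins and not the real DataPaclets` definitions:

```wolfram
(* Hypothetical names; this only illustrates the pattern, not the real loader *)
ClearAll[initialize, $groups];
SetAttributes[initialize, HoldFirst];

(* initialize receives the symbol itself, unevaluated, and assigns its real value *)
initialize[sym_Symbol] := (sym = {"GroupA", "GroupB"});

(* The first access triggers initialize; the Set above then replaces this delayed definition *)
$groups := initialize[$groups];

$groups
(* {"GroupA", "GroupB"} *)
```

After the first access the symbol simply holds its value, so the loader runs only once.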
Data Index Paclet
Now we can set up our data.
First we need to set up a *Data_Index paclet, which will contain a Data folder with the following:
{
"Index.wdx",
"Names.wdx",
"Entities.wdx",
"Properties.wdx",
"Functions.wdx",
"Groups.wdx",
"PrivateGroups.wdx"
}
Note that if you want to see an example of usage for each, you can just run this:
DataPaclets`ImportData[
"*Data_Index",
DataPaclets`GetDataPacletResource["*Data_Index",
"fname.wdx"
],
All
]
where fname is one of the preceding file names and the * is some type of data, say Chemical.
I'll describe what needs to be exported to each of these files anyway, though.
Index.wdx
We export this expression:
{
"Sources" -> {
"Data" -> {
"Part01" -> ents
}
},
"Properties" -> {
"Data" -> Thread@List@props
}
}
where ents is our list of entity names and props is our list of entity property names.
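As a concrete sketch of that export, assuming two entities and two properties (the names are borrowed from the PackageData example later in this answer, and the temporary output path is purely illustrative):

```wolfram
(* Illustrative entity and property names *)
ents = {"BTools", "MaTeX"};
props = {"Name", "Author"};

(* Thread@List@props wraps each property name in its own list: {{"Name"}, {"Author"}} *)
index = {
   "Sources" -> {"Data" -> {"Part01" -> ents}},
   "Properties" -> {"Data" -> Thread@List@props}
   };

(* In a real paclet this file would live in the Data folder of the *Data_Index paclet *)
file = FileNameJoin[{$TemporaryDirectory, "Index.wdx"}];
Export[file, index, "WDX"];

(* Read the file back to inspect what was written *)
Import[file, "WDX"]
```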
Names.wdx
{
...
{canonicalName_i,...,alternateName_ij,...}
...
}
where canonicalName_i is the CanonicalName for the ith Entity and alternateName_ij is the jth alternate name for that canonical name.
Entities.wdx
Compress@
Map[
Hash[#, "Adler32"] -> # &,
ents
]
This provides a numerical index to the entity name. I'm not sure using an Integer hash is necessary, but it seemed to be what was in some of the examples in the $UserBasePacletsDirectory.
Properties.wdx
Thread@List@props
Just the names of the properties. If a given property has a sub-property, you list that as a successive element in the list, just like the alternate names in the "Names.wdx" file.
Functions.wdx
{
"Primary" -> {
(HoldPattern[ComputeFunction[__]]:>___)...
},
"Helpers" -> {
"OwnValues"->{
fname->ov
},
"DownValues"->{
fname->ov
},
...
}
}
The ComputeFunction definitions will be used as fallbacks (I think) in our *Data function. I just leave both of these lists empty. The keys themselves do seem to be necessary, though: leaving the file entirely blank throws errors.
Groups.wdx
{
...
classname->{ ..., canonicalName_i, ...}
...
}
This is how to define an EntityClass in the data, I think. I haven't used it (I just left the list blank), but that's definitely the format one uses.
PrivateGroups.wdx
<same as Groups.wdx>
Presumably these are just never put in the public interface, but rather are only accessible via some inner functions. I just leave this blank.
Data Paclet
Now we can do the actual core data:
If you're going to do server-based distribution, the data should be chunked into portions small enough (it seems WRI uses ~2 MB?) that they can be quickly downloaded from a paclet server. Otherwise there's no reason not to just place all your data in a single WDX file.
Each chunk of the data should be exported to a "Partnn.wdx" file in the "Data" subfolder of the "*Data_Partnn" data paclet. Call that file wdx. We export it like this:
Export[wdx,
{
"Keys"->ents,
"Properties"->Thread@List@props,
"Data"->values
},
"DataTable"
]
where the values are just the values of the properties associated with each entity name in this chunk, and the ents and props are identical to what's in "Names.wdx" and "Properties.wdx", but restricted to only the entities in this chunk.
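A minimal sketch of the chunking step, assuming a fixed number of entities per part. The chunk size and entity names are hypothetical, and listing several parts as additional rules next to "Part01" is my extrapolation from the single-part index shown earlier:

```wolfram
(* Hypothetical entity names and chunk size *)
ents = {"pkgA", "pkgB", "pkgC", "pkgD", "pkgE"};
chunkSize = 2;

(* Split into chunks, allowing a shorter final chunk *)
chunks = Partition[ents, UpTo[chunkSize]];

(* Zero-padded part names: "Part01", "Part02", ... *)
partNames = "Part" <> IntegerString[#, 10, 2] & /@ Range[Length[chunks]];

(* One rule per part, in the shape used under "Sources" -> "Data" in Index.wdx *)
sources = Thread[partNames -> chunks]
(* {"Part01" -> {"pkgA", "pkgB"}, "Part02" -> {"pkgC", "pkgD"}, "Part03" -> {"pkgE"}} *)
```

Each chunk would then be exported to its own part paclet as described above.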
Usage / Entities
Note that this format means we only have to distribute two small paclets -- the interface and the index -- and the data can be downloaded in small chunks when necessary (although I have yet to plumb the details of how to set up one's own download server -- something that will come later).
Then if someone uses our data function, e.g. PackageData, they can use it like:
PackageData["BTools", "Name"]
or as
PackageData["BTools::ht1md", "Name"]
where BTools::ht1md is the true CanonicalName and the other is an alternate name from the "Names.wdx" file.
So that's fun.
But the issue is that this doesn't link into the Entity framework, and that's what really makes these *Data functions powerful. If we have version 11 this is okay, though, because we can make an implicit entity store which just routes to our *Data function to look up properties.
We can set this up like:
EntityStore[
dataTypeName -> <|
"EntityValidationFunction" -> (True &),
(* the per-property definitions go under the "Properties" key *)
"Properties" -> Join[
<|
"Label" -> <|
"DefaultFunction" ->
CanonicalName
|>
|>,
AssociationMap[
<|
"DefaultFunction" ->
EntityFramework`BatchApplied[
Function[ents, dataFunction[ents, #]]
]
|> &,
props
]
]
|>
]
where dataTypeName is the string name for our data type and dataFunction is our *Data function.
Then we append this to Internal`$DefaultEntityStores and we can link into the basic Entity framework. I copied that definition basically directly from the preloaded Earthquake EntityStore in there.
Then we'll need to vectorize our *Data lookups to handle lists of entities, but this isn't bad. Simply go to the section of your definitions that deals with:
*Data[obj_]
and add
*Data[obj_List] := *Data[#] & /@ obj;
before it. Then, for the section with:
*Data[obj_, prop__]
prepend
*Data[obj_List, prop__] :=
*Data[#, prop] & /@ obj;
*Data[obj_, prop_List] :=
*Data[obj, #] & /@ prop;
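To see the effect of those rules, here is a toy version. The function toyData and its lookup table are hypothetical stand-ins for a real *Data function and its paclet-backed data:

```wolfram
ClearAll[toyData];

(* Hypothetical toy table standing in for the paclet data *)
$toyTable = <|
   "BTools" -> <|"Name" -> "BTools"|>,
   "MaTeX" -> <|"Name" -> "MaTeX", "Author" -> "szhorvat"|>
   |>;

(* Thread over a list of entities, then over a list of properties *)
toyData[obj_List, prop__] := toyData[#, prop] & /@ obj;
toyData[obj_, prop_List] := toyData[obj, #] & /@ prop;

(* Scalar lookup: one entity name and one property name *)
toyData[obj_String, prop_String] :=
  Lookup[Lookup[$toyTable, obj, <||>], prop, Missing["NotAvailable"]];

toyData[{"BTools", "MaTeX"}, "Author"]
(* {Missing["NotAvailable"], "szhorvat"} *)

toyData["MaTeX", {"Name", "Author"}]
(* {"MaTeX", "szhorvat"} *)
```

With both list rules in place, a list of entities combined with a list of properties returns one row of property values per entity.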
Custom Servers
If you create a paclet server as detailed here (by the way, if you want to distribute via a free Cloud account as I do, see the answer after that one) you don't even need to distribute the data -- just set up your server to serve the data paclets. DataPaclets`GetDataPacletResource will download parts as necessary, and it calls the PacletManager, which will loop over the available paclet sites to find the appropriate data paclet.
Version 10 (coming soon, hopefully)
For v10 I still need to dig further into how to hook into the Entity interface, but I have some ideas.
Example
I set up a set of these paclets in a cloud paclet server. The server has since been migrated, but the way it works is as follows. We install a paclet server, then install the main paclet and use PacletSiteAdd to allow Mathematica to look for the other paclets on our server.
Before it was migrated, the paclet server did this to install a PackageData paclet, which would then load the _Index and _Part01 paclets. Then you could load it with <<PackageData`
And to search it:
In[46]:= PackageData["MaTeX"]
Out[46]= Entity["Package", "MaTeX"]
And then we could either work with the Entity or the PackageData function:
In[47]:= Entity["Package", "MaTeX"]["Author"]
Out[47]= "szhorvat"
Automated Creation
We can also automate this process, since we can simply set up a template with all these definitions.
I do that here.
Then if we have an EntityStore we can easily automate the creation of a curated data function, as the EntityStore framework was undoubtedly built to work in a similar way.
The top-level function in that package (note that the package is not stand-alone) will loop through all the definitions in an EntityStore and export them to the appropriate paclets, although I haven't implemented the data chunking yet.