
I love the huge curated datasets, but they don't cover everything. I have my own large dataset that is too much to distribute as an EntityStore, so is there a way to build my own analog of, say, ElementData?


1 Answer


Update: Template Notebook

I finally got around to finishing my curated data template notebook. You can see it here

Currently it only supports single-type data paclets, but a template for multi-type curated data is in the works and I'll test that with my accumulated Stack Exchange data.

In the meantime, I took some arbitrary airline safety data and turned it into a paclet. This is the construction notebook for it, so that you can get a sense for how it works.


Custom Curated Data Paclets

This answer is split into three parts: the basic setup (i.e., the answer to the question), an example, and a note on automating the process.

Setup

This is exactly what the DataPaclets` functionality is built for. Our core data will be stored in a set of three paclets, two of which will hold our data as WDX files and one of which will define our interface.

Base Interface

We'll start with the interface, as this tells us how to format our data. If we look at the DownValues for ElementData, we see a pile of stuff in the DataPaclets`ElementDataDump` context. We can get most of the way to building our interface by simply copying these definitions into a separate .m file and replacing every instance of ElementData with our own data function, say PackageData. One thing to note is that you may see a "DefaultProperty" that the *Data interface loads. Set that to whatever property in your data you want to be the default, or, if you want to use the Entity framework, set it to None.
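
A quick way to get a file to work from is to just Save the whole dump context and edit it. A sketch (the exact definitions you get will depend on your Mathematica version):

(* inspect the interface definitions *)
DownValues[ElementData] // Short

(* dump the ElementDataDump context to a file, then search-and-replace
   ElementData -> PackageData in that file before loading it *)
Save["PackageDataInterface.m", "DataPaclets`ElementDataDump`*"]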

There are a few extra things we'll need here, though. Notice that there's a function DataPaclets`ElementDataDump`Initialize. If we look at its DownValues, we see that they define how the base data is loaded from a data paclet. We can simply copy these over, but note that we'll need to copy over the HoldFirst attribute too.
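
For instance, if the copied definitions live in a hypothetical DataPaclets`PackageDataDump` context, that means also doing:

SetAttributes[DataPaclets`PackageDataDump`Initialize, HoldFirst]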

There are then a few other functions whose definitions we copy over (all in the DataPaclets`*Dump context):

handlePropertyError
StandardNameQ
GroupQ
FileQ
KeyQ
PropertyQ
CompiledPropertyQ
internalElementData

This will define most of our primary interface. We need one last thing before our interface is set up, which is to set the initialization. Consider the OwnValues:

In[7]:= DataPaclets`ElementDataDump`$Properties // OwnValues

Out[7]= {HoldPattern[DataPaclets`ElementDataDump`$Properties] :> 
  DataPaclets`ElementDataDump`Initialize[
   DataPaclets`ElementDataDump`$Properties]}

We need to set this for all of the following:

{
 $Groups,
 $GroupHash,
 $PrivateGroups,
 $KeySource,
 $Keys,
 $SubgroupToGroup,
 $PropertyHash,
 $CompiledProperties,
 $SourceGroups,
 $StandardName,
 $Entities,
 $EntityHash,
 $Properties,
 ComputeFunction
 }

This is why that HoldFirst attribute was crucial. Without it, setting

$Groups := Initialize[$Groups]

throws iteration errors.
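
In our own dump context the assignments would look something like this (a sketch, again assuming the hypothetical DataPaclets`PackageDataDump` context):

DataPaclets`PackageDataDump`$Properties :=
 DataPaclets`PackageDataDump`Initialize[
  DataPaclets`PackageDataDump`$Properties];

DataPaclets`PackageDataDump`$Groups :=
 DataPaclets`PackageDataDump`Initialize[
  DataPaclets`PackageDataDump`$Groups];

(* ...and similarly for the remaining symbols in the list above *)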

Data Index Paclet

Now we can set up our data.

First we need to set up a *Data_Index paclet which will contain a Data folder with the following:

{
 "Index.wdx",
 "Names.wdx",
 "Entities.wdx",
 "Properties.wdx",
 "Functions.wdx",
 "Groups.wdx",
 "PrivateGroups.wdx"
 }

Note that if you want to see an example of what goes in each of these, you can just run:

DataPaclets`ImportData[
 "*Data_Index",
 DataPaclets`GetDataPacletResource["*Data_Index", 
  "fname.wdx"
  ],
 All
 ]

where the fname is one of the preceding file names and the * is some type of data, say Chemical.

In any case, I'll go through what needs to be exported to each of these files.

Index.wdx

We export this expression:

{
 "Sources" -> {
   "Data" -> {
     "Part01" -> ents
     }
   },
 "Properties" -> {
   "Data" -> Thread@List@props
   }
 }

where ents is our list of entity names and props is our list of entity property names.
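
As a sketch, with indexDataDir standing in for the "Data" directory of our hypothetical PackageData_Index paclet, the export could look like:

Export[
 FileNameJoin[{indexDataDir, "Index.wdx"}],
 {
  "Sources" -> {"Data" -> {"Part01" -> ents}},
  "Properties" -> {"Data" -> Thread@List@props}
  }
 ]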

Names.wdx

{
 ...
 {canonicalName_i,...,alternateName_ij,...}
 ...
 }

where canonicalName_i is the CanonicalName for the ith Entity and alternateName_ij is the jth alternate name for the canonical name.
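
For example, reusing the BTools canonical name that appears later in this answer and a made-up one for MaTeX:

names = {
   {"BTools::ht1md", "BTools"},
   {"MaTeX::aa1bc", "MaTeX"}
   };
Export[FileNameJoin[{indexDataDir, "Names.wdx"}], names]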

Entities.wdx

Compress@
 Map[
  Hash[#, "Adler32"] -> # &,
  ents
  ]

This provides a numerical index to the entity name. I'm not sure using an Integer hash is necessary, but it seemed to be what was in some of the examples in the $UserBasePacletsDirectory.

Properties.wdx

Thread@List@props

Just the names of the properties. If a given property has a sub-property, you list that as a successive element in the list, just like the alternateNames in the "Names.wdx" file.

Functions.wdx

{
 "Primary" -> {
  (HoldPattern[ComputeFunction[__]]:>___)...
  },
 "Helpers" -> {
  "OwnValues"->{
    fname->ov
    },
  "DownValues"->{
    fname->dv
    },
  ...
  }
 }

The ComputeFunction rules will be used as fallbacks (I think) in our *Data function. I just leave both of these lists empty. They do seem to be necessary, though: leaving this file entirely blank throws errors.
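
So in practice the export is essentially empty. A sketch, keeping the keys but emptying the lists:

Export[
 FileNameJoin[{indexDataDir, "Functions.wdx"}],
 {
  "Primary" -> {},
  "Helpers" -> {"OwnValues" -> {}, "DownValues" -> {}}
  }
 ]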

Groups.wdx

{
 ...
 classname->{ ..., canonicalName_i, ...}
 ...
 }

This is how to define an EntityClass in the data, I think. I haven't used it (I just left the list blank), but that's definitely the format one uses.

PrivateGroups.wdx

<same as Groups.wdx>

Presumably these are just never put in the public interface, but rather are only accessible via some inner functions. I just leave this blank.
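
Since I leave both of these blank, their exports are trivial (a sketch):

Export[FileNameJoin[{indexDataDir, "Groups.wdx"}], {}];
Export[FileNameJoin[{indexDataDir, "PrivateGroups.wdx"}], {}];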

Data Paclet

Now we can do the actual core data:

If you're going to do server-based distribution, the data should be chunked into portions small enough (it seems WRI uses ~2 MB?) that they can be quickly downloaded from a paclet server; otherwise there's no reason not to just place all your data in a single WDX file.

Each chunk of the data should be exported to a "Partnn.wdx" in the "Data" subfolder of the "*Data_Partnn" data paclet.

Call that file wdx. We export it like this:

Export[wdx,
 {
  "Keys"->ents,
  "Properties"->Thread@List@props,
  "Data"->values
  },
 "DataTable"
 ]

where values are the property values associated with each entity name in this chunk, and ents and props are identical to what's in "Names.wdx" and "Properties.wdx" but restricted to only the entities in this chunk.
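
Putting that together, a chunked export might look like this sketch, where basePacletDir, chunkSize, and valueAssoc (an association from entity name to its row of property values) are all hypothetical names of my own:

MapIndexed[
 Function[{chunkEnts, idx},
  With[{nn = IntegerString[First[idx], 10, 2]},
   Export[
    FileNameJoin[{basePacletDir, "PackageData_Part" <> nn, "Data",
      "Part" <> nn <> ".wdx"}],
    {
     "Keys" -> chunkEnts,
     "Properties" -> Thread@List@props,
     "Data" -> Lookup[valueAssoc, chunkEnts]
     },
    "DataTable"
    ]
   ]
  ],
 Partition[ents, chunkSize, chunkSize, 1, {}] (* ragged final chunk *)
 ]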

Usage / Entities

Note that this format means we only have to distribute two small paclets -- the interface and the index -- and the data can be downloaded in small chunks when necessary (although I have yet to plumb the details of how to set up one's own download server -- something that will come later).

Then if someone uses our data function, e.g. PackageData, they can call it like:

PackageData["BTools", "Name"]

or as

PackageData["BTools::ht1md", "Name"]

where that BTools::ht1md is the true CanonicalName and the other is an alternate name from the "Names.wdx" file.

So that's fun.

But the issue is that this doesn't link into the Entity framework, and that's what really makes these *Data functions powerful. If we have version 11 this is okay, though, because we can make an implicit entity store which just routes property lookups to our *Data function.

We can set this up like:

EntityStore[
 dataTypeName -> <|
   "EntityValidationFunction" -> (True &),
   Join[
    <|
     "Label" -> <|
       "DefaultFunction" ->
        CanonicalName
       |>
     |>,
    AssociationMap[
     <|
       "DefaultFunction" ->
        EntityFramework`BatchApplied[
         Function[ents, dataFunction[ents, #]]
         ]
       |> &,
     props
     ]
    ]
   |>
 ]

where dataTypeName is the string name for our data type, dataFunction is our *Data function (e.g. PackageData), and props is our list of property names.

Then we append this to Internal`$DefaultEntityStores and we can link into the basic Entity framework. I copied that definition basically directly from the preloaded Earthquake EntityStore in there.
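
A minimal sketch of that registration, with store standing for the EntityStore expression built above:

AppendTo[Internal`$DefaultEntityStores, store]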

Then we'll need to vectorize our *Data lookups to handle lists of entities, but this isn't bad. Simply go to the section of your definitions that deals with:

*Data[obj_]

and add

*Data[obj_List] := *Data[#] & /@ obj;

before it, and then for the section with:

*Data[obj_, prop__]

prepend

*Data[obj_List, prop__] :=
 *Data[#, prop] & /@ obj;
*Data[obj_, prop_List] :=
 *Data[obj, #] & /@ prop;

Custom Servers

If you create a paclet server as detailed here (by the way, if you want to distribute via a free Cloud account as I do, see the answer after that one), you don't even need to distribute the data -- just set up your server to serve the data paclets. DataPaclets`GetDataPacletResource will download parts as necessary, and it calls the PacletManager, which will loop over the available paclet sites to find the appropriate data paclet.

Version 10 (coming soon, hopefully)

For v10 I still need to dig further into how to hook into the Entity interface but I have some ideas.


Example

I set up a set of these paclets on a cloud paclet server. The server has since been migrated, but the way it works is as follows: we set up a paclet server, install the main paclet, and use PacletSiteAdd so that Mathematica can look for the other paclets on our server.

Before it was migrated, the paclet server provided a PackageData paclet, which would in turn pull in the _Index and _Part01 paclets. Then you could load it with <<PackageData`
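
Roughly, the installation looked like this (the site URL below is just a placeholder, since the original server has moved):

Needs["PacletManager`"];
PacletSiteAdd["http://example.com/PacletServer"];
PacletInstall["PackageData"];
<< PackageData`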

And to search it:

In[46]:= PackageData["MaTeX"]

Out[46]= Entity["Package", "MaTeX"]

And then we could either work with the Entity or the PackageData function:

In[47]:= Entity["Package", "MaTeX"]["Author"]

Out[47]= "szhorvat"

Automated Creation

We can also automate this process, since we can simply set up a template with all these definitions.

I do that here.

Then, if we have an EntityStore, we can easily automate the creation of a curated data function, as the EntityStore framework was undoubtedly built to work in a similar way.

The top-level function in that package (note that the package is not stand-alone) will loop through all the definitions in an EntityStore and export them to the appropriate paclets, although I haven't implemented the data chunking yet.
