Machine Learning

Make your data beautiful again.

Classifier Commands

The monsterDB engine now utilises the WEKA libraries from the Machine Learning Group at the University of Waikato.

Full instructions on the WEKA classification algorithms can be found on their website and elsewhere on the internet. This page describes the usage of those libraries as embedded in the monsterDB core, including the new browser output mode.

arrf

The output from the arrf command is a JSON representation of the ARFF (Attribute-Relation File Format) standard used by WEKA.
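For reference, a classical ARFF text file declares the relation, its attributes with their types, and the data rows, as in the sketch below (relation and attribute names are invented for illustration); the arrf command emits the same kind of information as JSON documents rather than as this text format:

% Illustrative classic ARFF text format (not the JSON emitted by monsterDB)
@relation customers
@attribute name string
@attribute dob date "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
@attribute age numeric
@attribute hair {brown, blonde, red}
@data
'Jane Doe', '2001-05-04T00:00:00.000+0000', 21, brown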

Usage from CLI:
  aggregate.aCollection.arrf({...definition...}) | ...optional further command
or
  db.aCollection.aggregate([{ $arrf: { ... definition ...}}, ... optional further command ])

The command is part of the aggregate pipeline function set and can be used in the traditional (i.e. MongoDB) format or in the piped format that operating system users may be used to. arrf is used to prepare a set of objects from a collection; this can be the whole collection, as shown above, or a subset of a collection obtained by preceding the arrf with a match filter (or fuzzy filter).

Selective filter:
  aggregate.aCollection.match({ ... filter ... }) | arrf({...definition...}) | ... optional further command
or
  db.aCollection.aggregate([{$match: { ... filter ...} }, {$arrf: { ... definition ...}}, ... optional further command ])
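For example, a hypothetical customers collection could be filtered and converted like this (the collection and field names are invented for illustration):

aggregate.customers.match({country: "GB"}) | arrf({age: "numeric", hair: "nominal"})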

Definition: this document object must be passed to arrf and it can be empty. If populated, the document must contain the types of the attributes you would like to use. Normally, if an attribute is not found in this document its type will be guessed by the arrf generator; however, if you would like to force the generator to use nominal, string, numeric or date for any of the fields, this is how you would do it:

.... | arrf({name: "string", dob: "date yyyy-MM-dd'T'HH:mm:ss.SSSZ", age: "numeric", hair: "nominal"}) | ...

The date format is optional, but recommended.

Nominal on its own as a string value will not seed the nominal values; if you wish to seed the nominal values with certain strings, it can be done like this:

.... | arrf({name: "string", dob: "date yyyy-MM-dd'T'HH:mm:ss.SSSZ", age: "numeric", hair: ["brown", "blonde", "red"]}) | ...

This seeds the nominal values; if other values are found in the data they will be appended to the list. This can be useful if your nominal values would otherwise end up as a singleton list, which is unusable by many classifiers (e.g. J48): you could add a dummy value to ensure at least two values are classified.
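As an illustrative sketch, a hypothetical hair attribute whose source data only ever contains "brown" could be seeded with a dummy second value so that classifiers such as J48 still see two possible classes:

.... | arrf({hair: ["brown", "other"]}) | ...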

classifierBuild

As the name suggests, classifierBuild takes arrf-format output and builds a model from it using training set(s) and testing set(s), in the same way as would be done in R or the WEKA workbench.

Usage from CLI:
  aggregate.aCollection.classifierBuild({...options...})

An example of a classifierBuild is:

aggregate.gleif.match({}) | arrf() | classifierBuild({ "trees.J48": {numFolds: 10, className: "Entity.Solvency"} })

This example takes the output of the entire gleif collection, converts it to the json arrf format and then builds a decision tree on the data using the J48 algorithm.

The options are optional but highly recommended. They are provided in a document where each entry pertains to an algorithm; in this way you can produce multiple models from the same input set. The list of available algorithm names can be seen using the listClassifiers command.
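For example, a single pipeline could build both a J48 tree and a RandomTree from the same input; this is a sketch based on the example above, using algorithm names taken from the listClassifiers output shown further down:

aggregate.gleif.match({}) | arrf() | classifierBuild({ "trees.J48": {numFolds: 10, className: "Entity.Solvency"}, "trees.RandomTree": {numFolds: 10, className: "Entity.Solvency"} })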

Each classifier has the following options:

numFolds – how many folds (randomised training sets) to generate from the source data.

className – the attribute that we want to predict a classification for – i.e. if we are aiming to predict the solvency of a company then this field would be the className. Note that dot notation is used here because the data in the collection is nested; Solvency appears under Entity.
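For illustration, a document in the gleif collection might be shaped roughly like this (field values invented for the example), which is why the class attribute is addressed as Entity.Solvency:

{"Entity": {"Solvency": "SOLVENT", "LegalName": "Example Ltd"}, ...}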

classifierBuild will output a largely unreadable json document; you should store this as shown below. The rest of the fields in the options will also be stored with the output, which means you can use an _id field to ensure that you only keep one version of the model:

aggregate.gleif.match({}) | arrf() | classifierBuild({ "trees.J48": {numFolds: 10, className: "Entity.Solvency", "_id": "model1"} }) | out("modelsCollection")

For the interested, the model can be viewed using a standard find on the collection:

db.modelsCollection.find({_id: "model1"})

would produce something like:

{"numFolds":10,"_id":"models1","className":"Entity.Solvency","classifier":"trees.J48","accuracy":100,"createDate":"Mar 25, 2020 6:08:05 PM","model":{"m_storedObjectArray": [ ... ],"m_isCompressed":false,"m_isJython":false}}

It is often worth inspecting the model to check the accuracy value, which is added after the training set has been tested.

classifierPredict

Now that we have a model trained and stored in the database, we can use it to predict an outcome for a new dataset. This is done using the classifierPredict aggregate function:
classifierPredict({"modelFilter": {"from": "collectionName", filter: { ... filter ... } } } )
or
classifierPredict({"numFolds":10,"_id":"models1","className":"Entity.Solvency","classifier":"trees.J48","accuracy":100,"createDate":"Mar 25, 2020 6:08:05 PM","model":{"m_storedObjectArray": [ ... ],"m_isCompressed":false,"m_isJython":false}})
Because the output from classifierBuild cannot be piped into classifierPredict (the pipeline input is your data), classifierPredict needs to be told where to obtain the trained model.
 
Using modelFilter:
This can be done by passing a modelFilter document containing two fields, from and filter. from is the name of a collection in the same database; filter is in the format used by match and find in the database and can include complex arguments. The filter must resolve to a single document; subsequent documents will be ignored (the first is used).
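As an illustrative sketch, and assuming the filter supports the same MongoDB-style comparison operators as match and find (for example $gte, which is an assumption here), a model could be selected by its stored accuracy rather than by _id:

classifierPredict({"modelFilter": {"from": "modelsCollection", filter: {classifier: "trees.J48", accuracy: {$gte: 95}}}})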
 
Passing the model on the command line:
If you wish, you can instead pass the document containing the model directly, as per the classifierBuild output shown above.
 
An example is as follows:
aggregate.gleif.match({}) | arrf({}) | classifierPredict({"modelFilter": {"from": "modelsCollection", filter: {_id: "model1"}}})

Produces:

{... data ..., "Class":"SOLVENT"},

Each object passed to classifierPredict is returned with a Class field appended; this is the prediction.
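The predictions can be persisted in the same way as the models, by adding an out stage; this is a sketch reusing the out command shown for classifierBuild above, with an arbitrary collection name for the results:

aggregate.gleif.match({}) | arrf({}) | classifierPredict({"modelFilter": {"from": "modelsCollection", filter: {_id: "model1"}}}) | out("predictions")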

classifierTree

classifierTree is used to produce human-readable decision trees. Not all classifiers in WEKA will work yet, but any that produce DOT-format output should be fine.

Usage:
aggregate.collection.classifierTree({"modelFilter": {"from": "collection name", filter: { .... }}})

modelFilter uses from to select the collection in the same database and filter to select a single object in that collection (the first match is used).
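Continuing the earlier example, the J48 model stored as model1 could be drawn like this (a sketch reusing the collection and model names from the classifierBuild example above):

aggregate.gleif.classifierTree({"modelFilter": {"from": "modelsCollection", filter: {_id: "model1"}}})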

listClassifiers

Lists the usable classifiers in the system:

listClassifiers

produces:

{"value": "trees.HoeffdingTree"}, 
{"value": "bayes.net.EditableBayesNet"},
{"value": "bayes.net.BayesNetGenerator"},
{"value": "trees.J48"},
{"value": "trees.M5P"},
{"value": "pmml.consumer.TreeModel"},
{"value": "meta.CostSensitiveClassifier"},
{"value": "meta.RandomizableFilteredClassifier"}, {"value": "meta.CVParameterSelection"},
{"value": "trees.RandomTree"},
{"value": "trees.REPTree"},
{"value": "trees.LMT"},
{"value": "meta.AttributeSelectedClassifier"},
{"value": "meta.FilteredClassifier"},
{"value": "bayes.BayesNet"},
{"value": "misc.InputMappedClassifier"},
{"value": "bayes.net.BIFReader"}

Full example – Weather

Create a collection containing your weather data and the decision you made to play tennis or not:

Day  Outlook   Temp.  Humidity  Wind    Decision
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Rain      Mild   High      Weak    Yes
5    Rain      Cool   Normal    Weak    Yes
6    Rain      Cool   Normal    Strong  No
7    Overcast  Cool   Normal    Strong  Yes
8    Sunny     Mild   High      Weak    No
9    Sunny     Cool   Normal    Weak    Yes
10   Rain      Mild   Normal    Weak    Yes
11   Sunny     Mild   Normal    Strong  Yes
12   Overcast  Mild   High      Strong  Yes
13   Overcast  Hot    Normal    Weak    Yes
14   Rain      Mild   High      Strong  No

This can be done using a set of save commands, after which a model can be built using J48:

create collection weather
db.weather.save({'Day':'1','Outlook':'Sunny','Temp.':'Hot','Humidity':'High','Wind':'Weak','Decision':'No'})
db.weather.save({'Day':'2','Outlook':'Sunny','Temp.':'Hot','Humidity':'High','Wind':'Strong','Decision':'No'})
db.weather.save({'Day':'3','Outlook':'Overcast','Temp.':'Hot','Humidity':'High','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'4','Outlook':'Rain','Temp.':'Mild','Humidity':'High','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'5','Outlook':'Rain','Temp.':'Cool','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'6','Outlook':'Rain','Temp.':'Cool','Humidity':'Normal','Wind':'Strong','Decision':'No'})
db.weather.save({'Day':'7','Outlook':'Overcast','Temp.':'Cool','Humidity':'Normal','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'8','Outlook':'Sunny','Temp.':'Mild','Humidity':'High','Wind':'Weak','Decision':'No'})
db.weather.save({'Day':'9','Outlook':'Sunny','Temp.':'Cool','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'10','Outlook':'Rain','Temp.':'Mild','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'11','Outlook':'Sunny','Temp.':'Mild','Humidity':'Normal','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'12','Outlook':'Overcast','Temp.':'Mild','Humidity':'High','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'13','Outlook':'Overcast','Temp.':'Hot','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'14','Outlook':'Rain','Temp.':'Mild','Humidity':'High','Wind':'Strong','Decision':'No'})
aggregate.weather.match({}) | arrf({})  | classifierBuild({ "trees.J48": {numFolds: 10, "_id": "weather", className: "Decision"} }) | out("models")

 

 

Now set the output to your browser and draw the tree it made:

set output browser

 

aggregate.weather.classifierTree({"modelFilter": {"from": "models", filter: {_id: "weather"}}})

The browser should open and the tree will be shown.
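For this dataset, J48 typically splits on Outlook first; a text rendering of the kind of tree you should expect is sketched below (the browser output draws the same structure graphically, and the exact rendering may differ):

Outlook = Sunny
|   Humidity = High: No
|   Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain
|   Wind = Weak: Yes
|   Wind = Strong: No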

Now you can use your model:

create collection weather2

 

db.weather2.save({'Day':'15','Outlook':'Rain','Temp.':'Mild','Humidity':'Normal','Wind':'Strong', 'Decision': 'No'})

aggregate.weather2.match({}) | arrf({}) | classifierPredict({"modelFilter": {"from": "models", filter: {_id: "weather"}}})

Produces:

{
"Wind": "Strong",
"Outlook": "Rain",
"Temp.": "Mild",
"Humidity": "Normal",
"Day": "15",
"Decision": "No",
"Class": "No"
}

Definitely the wrong day for tennis! 🙂
