Classifier Commands
The monsterDB engine now utilises the WEKA libraries from the Machine Learning Group at the University of Waikato.
Full instructions on the WEKA classification algorithms can be found on their website and elsewhere on the internet. This page documents the usage of those libraries as embedded into the monsterDB core, including the new browser output mode.
arrf
The output from the arrf command will be in the JSON ARFF format.
usage from CLI: aggregate.aCollection.arrf({...definition...}) | ...optional further command
or
db.aCollection.aggregate([{ $arrf: { ... definition ...}}, ... optional further command ])
The command is part of the aggregate pipeline function set and can be used in the traditional (i.e. mongodb) format or using the piped format that operating system users will be used to. arrf is used to prepare a set of objects from a collection: this can either be the whole collection, as shown, or a subset of a collection, by preceding the arrf with a match filter (or fuzzy filter).
Selective filter: aggregate.aCollection.match({ ... filter ... }) | arrf({...definition...}) | ... optional further command
or
db.aCollection.aggregate([{$match: { ... filter ...} }, {$arrf: { ... definition ...}}, ... optional further command ])
Definition: this document object must be passed to arrf. It can be empty, but if populated it must contain the types of the attributes you would like to use. Normally, if an attribute is not found in this document its type will be guessed by the arrf generator; however, if you would like to force the generator to use nominal, string, numeric or date on any of the fields, this is how you would do it:
.... | arrf({name: "string", dob: "date yyyy-MM-dd'T'HH:mm:ss.SSSZ", age: "numeric", hair: "nominal"}) | ...
The date format is optional, but recommended
Nominal on its own as a string value will not seed the nominal values, if you wish to seed the nominal values with certain strings then it can be done like this:
.... | arrf({name: "string", dob: "date yyyy-MM-dd'T'HH:mm:ss.SSSZ", age: "numeric", hair: ["brown", "blonde", "red"]}) | ...
This seeds the nominal values; if other values are found in the data they will be appended to the list. This can be useful if your nominal values would otherwise end up as a singleton list, which is unusable by many classifiers (e.g. J48): you could add a dummy value to ensure at least 2 values are classified.
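For orientation, below is a rough sketch of what the generated JSON ARFF for the definition above might look like. The layout follows WEKA's JSON Instances convention (a header with attribute declarations, then the data rows); the exact key names and the sample row are illustrative, and monsterDB's actual output may differ in detail:

```json
{
  "header": {
    "relation": "aCollection",
    "attributes": [
      {"name": "name", "type": "string"},
      {"name": "dob",  "type": "date"},
      {"name": "age",  "type": "numeric"},
      {"name": "hair", "type": "nominal", "labels": ["brown", "blonde", "red"]}
    ]
  },
  "data": [
    {"values": ["Alice", "1990-01-01T00:00:00.000+0000", "30", "brown"]}
  ]
}
```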
classifierBuild
As the name suggests, classifierBuild will take an arrf format output and build a model from it using training set(s) and testing set(s), in the same way it would be done in R or the WEKA workbench.
Usage from CLI:
aggregate.aCollection.classifierBuild({...options...})
An example of a classifierBuild is:
aggregate.gleif.match({}) | arrf() | classifierBuild({ "trees.J48": {numFolds: 10, className: "Entity.Solvency"} })
The example would take the output of the entire gleif collection, convert it to a json arrf format and then build a decision tree on the data using the J48 algorithm.
The options are optional, but highly recommended. They are given in a document where each entry pertains to an algorithm; this way you can produce multiple models from the same input set. The list of available algorithm names can be seen using the listClassifiers command.
Each classifier has the following options:
numFolds – how many folds (randomised training sets) to generate from the source data.
className – the attribute that we are looking for a classification for – i.e. if we are aiming to predict the outcome of the solvency of the company then this field would be the className. Note the dot notation is used because the data in the collection is nested: Solvency appears under Entity.
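For example, the dot notation assumes the gleif documents nest the target field like this (the surrounding field names here are hypothetical):

```json
{"_id": "12345", "Entity": {"LegalName": "Example Ltd", "Solvency": "SOLVENT"}}
```

With this shape, className "Entity.Solvency" selects the nested Solvency value as the attribute to predict.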
classifierBuild will output a largely unreadable json document; you should store this as shown. The rest of the fields in the options will also be stored with the output, which means you can use an _id field to ensure that you only keep one version of the model:
aggregate.gleif.match({}) | arrf() | classifierBuild({ "trees.J48": {numFolds: 10, className: "Entity.Solvency", "_id": "model1"} }) | out("modelsCollection")
For the interested, the model can be viewed using a standard find on the collection:
db.modelsCollection.find({_id: "model1"})
would produce something like:
{"numFolds":10,"_id":"model1","className":"Entity.Solvency","classifier":"trees.J48","accuracy":100,"createDate":"Mar 25, 2020 6:08:05 PM","model":{"m_storedObjectArray": [ ... ],"m_isCompressed":false,"m_isJython":false}}
It is often important to inspect the model to assess the accuracy value, which is added after the training set is tested.
classifierPredict
Usage: classifierPredict({"modelFilter": {"from": "collectionName", filter: { ... filter ... } } })
or, passing a stored model document directly:
classifierPredict({"numFolds":10,"_id":"model1","className":"Entity.Solvency","classifier":"trees.J48","accuracy":100,"createDate":"Mar 25, 2020 6:08:05 PM","model":{"m_storedObjectArray": [ ... ],"m_isCompressed":false,"m_isJython":false}})
aggregate.gleif.match({}) | arrf({}) | classifierPredict({"modelFilter": {"from": "modelCollection", filter: {_id: "model1"}}})
Produces:
{... data ..., "Class":"SOLVENT"}
For each object passed to classifierPredict, it will be returned with a Class field appended; this is the prediction.
classifierTree
classifierTree is used to produce human readable decision trees. Not all classifiers in WEKA will work yet, but any that produce dot format output should be fine.
Usage:
aggregate.collection.classifierTree({"modelFilter": {"from": "collection name", filter: { .... }}})
Model filter uses from to select a collection in the same database and filter to select a single object in that collection (the first match is used).
listClassifiers
Lists the usable classifiers in the system:
listClassifiers
produces:
{"value": "trees.HoeffdingTree"},
{"value": "bayes.net.EditableBayesNet"},
{"value": "bayes.net.BayesNetGenerator"},
{"value": "trees.J48"},
{"value": "trees.M5P"},
{"value": "pmml.consumer.TreeModel"},
{"value": "meta.CostSensitiveClassifier"},
{"value": "meta.RandomizableFilteredClassifier"},
{"value": "meta.CVParameterSelection"},
{"value": "trees.RandomTree"},
{"value": "trees.REPTree"},
{"value": "trees.LMT"},
{"value": "meta.AttributeSelectedClassifier"},
{"value": "meta.FilteredClassifier"},
{"value": "bayes.BayesNet"},
{"value": "misc.InputMappedClassifier"},
{"value": "bayes.net.BIFReader"}
Full example – Weather
Create a collection containing your weather data and the decision you made to play tennis or not:
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
This can be done using a set of save commands, and then a model built using J48:
create collection weather
db.weather.save({'Day':'1','Outlook':'Sunny','Temp.':'Hot','Humidity':'High','Wind':'Weak','Decision':'No'})
db.weather.save({'Day':'2','Outlook':'Sunny','Temp.':'Hot','Humidity':'High','Wind':'Strong','Decision':'No'})
db.weather.save({'Day':'3','Outlook':'Overcast','Temp.':'Hot','Humidity':'High','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'4','Outlook':'Rain','Temp.':'Mild','Humidity':'High','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'5','Outlook':'Rain','Temp.':'Cool','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'6','Outlook':'Rain','Temp.':'Cool','Humidity':'Normal','Wind':'Strong','Decision':'No'})
db.weather.save({'Day':'7','Outlook':'Overcast','Temp.':'Cool','Humidity':'Normal','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'8','Outlook':'Sunny','Temp.':'Mild','Humidity':'High','Wind':'Weak','Decision':'No'})
db.weather.save({'Day':'9','Outlook':'Sunny','Temp.':'Cool','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'10','Outlook':'Rain','Temp.':'Mild','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'11','Outlook':'Sunny','Temp.':'Mild','Humidity':'Normal','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'12','Outlook':'Overcast','Temp.':'Mild','Humidity':'High','Wind':'Strong','Decision':'Yes'})
db.weather.save({'Day':'13','Outlook':'Overcast','Temp.':'Hot','Humidity':'Normal','Wind':'Weak','Decision':'Yes'})
db.weather.save({'Day':'14','Outlook':'Rain','Temp.':'Mild','Humidity':'High','Wind':'Strong','Decision':'No'})
aggregate.weather.match({}) | arrf({}) | classifierBuild({ "trees.J48": {numFolds: 10, "_id": "weather", className: "Decision"} }) | out("models")
Now set the output to your browser and draw the tree it made:
set output browser
aggregate.weather.classifierTree({"modelFilter": {"from": "models", filter: {_id: "weather"}}})
The browser should open and the tree will be shown.
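On this classic Play-Tennis dataset, J48 typically produces a tree equivalent to the following (shown here in WEKA's text form; the browser view renders the same structure graphically):

```
Outlook = Sunny
|   Humidity = High: No
|   Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain
|   Wind = Strong: No
|   Wind = Weak: Yes
```

You can verify against the table above that this tree classifies all 14 training days correctly.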
Now you can use your model:
create collection weather2
db.weather2.save({'Day':'15','Outlook':'Rain','Temp.':'Mild','Humidity':'Normal','Wind':'Strong', 'Decision': 'No'})
aggregate.weather2.match({}) | arrf({}) | classifierPredict({"modelFilter": {"from": "models", filter: {_id: "weather"}}})
Produces:
{
"Wind": "Strong",
"Outlook": "Rain",
"Temp.": "Mild",
"Humidity": "Normal",
"Day": "15",
"Decision": "No",
"Class": "No"
}
Definitely the wrong day for tennis! 🙂
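For comparison outside monsterDB, the same train-and-predict flow can be sketched in Python, with scikit-learn's DecisionTreeClassifier standing in for WEKA's J48 (an assumption: the two algorithms differ in pruning and split details, so the resulting tree and prediction are not guaranteed to match):

```python
# Illustrative stand-in for the monsterDB/WEKA flow above:
# DecisionTreeClassifier replaces J48, and DictVectorizer one-hot
# encodes the nominal attributes (Outlook, Temp, Humidity, Wind).
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# The 14 training days from the weather table, in order.
outlook = "Sunny Sunny Overcast Rain Rain Rain Overcast Sunny Sunny Rain Sunny Overcast Overcast Rain".split()
temp = "Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild".split()
humidity = "High High High High Normal Normal Normal High Normal Normal Normal High Normal High".split()
wind = "Weak Strong Weak Weak Weak Strong Strong Weak Weak Weak Strong Strong Weak Strong".split()
decision = "No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No".split()

rows = [{"Outlook": o, "Temp": t, "Humidity": h, "Wind": w}
        for o, t, h, w in zip(outlook, temp, humidity, wind)]

vec = DictVectorizer(sparse=False)  # one-hot encodes string-valued features
X = vec.fit_transform(rows)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, decision)

# Day 15: Rain, Mild, Normal, Strong
day15 = {"Outlook": "Rain", "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong"}
print(clf.predict(vec.transform([day15]))[0])
```

As with the stored J48 model, the fitted tree separates the 14 training examples perfectly; the printed value is the predicted Decision for day 15.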