Clustering and Coercing Data

Make your data beautiful again.

About Clustering & Coercing

 

Clustering

Being able to cluster data by similarity in demographic profiles is essential. In monsterDB this can only be done in a fuzzy collection, because the cluster command relies on the fuzzy setup in the collection. A fuzzy collection can be created based on the information in the fuzzy section of the website. Fuzzy matching is primarily limited to comparing and identifying data similarities within a single collection. Clustering, by contrast, uses the fuzzy setup to find similarities within any stream of data, regardless of whether the information is stored in a collection or is simply being processed by a stream.

Clustering takes any document, whether or not it holds data about a person, company, product or anything else, and determines whether it is similar to other documents based on its properties. To do this it is necessary to map the properties of a document to a concept using concept-table mappings. This is a process in which we identify all fields that represent a name, an id, or another identifying factor.
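The idea of mapping document properties to concepts can be illustrated outside monsterDB with a short Python sketch. This is not monsterDB's mapping format: the CONCEPT_FIELDS table, the helper names and the choice of field paths are all assumptions made for illustration, with the paths taken from the fields visible in the sample record on this page.

```python
from typing import Any

# Hypothetical mapping from a concept name to the dotted field paths that feed it.
# monsterDB's real concept-table mappings are defined in the fuzzy setup.
CONCEPT_FIELDS = {
    "Country": ["Entity.LegalAddress.Country"],
    "RegCoNum": ["Entity.RegistrationAuthority.RegistrationAuthorityEntityID"],
    "Address": ["Entity.LegalAddress.City", "Entity.LegalAddress.FirstAddressLine"],
}


def get_path(doc: dict, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_concepts(doc: dict) -> dict:
    """Collect the values for each concept, joining multiple fields with a space."""
    concepts = {}
    for concept, paths in CONCEPT_FIELDS.items():
        values = [get_path(doc, p) for p in paths]
        present = [str(v) for v in values if v not in (None, "")]
        if present:
            concepts[concept] = " ".join(present)
    return concepts


record = {
    "Entity": {
        "RegistrationAuthority": {"RegistrationAuthorityEntityID": "RC2176"},
        "LegalAddress": {
            "FirstAddressLine": "112 BROAD STREET",
            "City": "LAGOS",
            "Country": "NG",
        },
    }
}
print(to_concepts(record))
```

Once every record is reduced to the same set of concepts, records with different layouts can be compared field by field.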

Using cluster:
stream.collection.find({ some selection criteria }) | cluster( {score: 90.1} )

Cluster will look for records whose fuzzy similarity with each other is 90.1 or above. If more than one record shares such a similarity, those records will be grouped together to form a cluster:

{"_id":4,"size":15,"cluster":[{"Entity":{"LegalForm":{"OtherLegalForm":"LIMITED","EntityLegalFormCode":"8888"},"RegistrationAuthority":{"RegistrationAuthorityEntityID":"RC2176","RegistrationAuthorityID":"RA000469"},"OtherEntityNames":[],"LegalAddress":{"AdditionalAddressLine":"MARINA","PostalCode":"23401","FirstAddressLine":"112 BROAD STREET","City":"LAGOS","Country":"NG","Regio........

 

The output shows that the cluster has a numeric identifier and a size, which tells you how many records have been identified as similar. The cluster list holds the raw records used to create the cluster, in this case a list of 15 documents in full. No score is reported, but all matches are at or above the score supplied.
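The grouping behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: monsterDB's fuzzy scoring is not documented here, so the standard library's difflib.SequenceMatcher stands in for it, the function names are hypothetical, and the sketch assumes that records with no match above the threshold are left out of the output.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale (not monsterDB's fuzzy score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def cluster(names: list, score: float) -> list:
    """Group names whose pairwise similarity is at or above the given score."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Walk up to the group's root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that meets the threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= score:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)

    # Assumption: only groups with more than one member are reported.
    multi = [g for g in groups.values() if len(g) > 1]
    return [{"_id": n, "size": len(g), "cluster": g} for n, g in enumerate(multi)]


names = [
    "AFRICAN ALLIANCE INSURANCE PLC",
    "AFRICAN ALLIANCE INSURANCE PLC.",
    "LAGOS BREWERIES LIMITED",
]
for c in cluster(names, 90.1):
    print(c)
```

As in the monsterDB output, each result carries an identifier, a size and the raw members, but no score.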

Coerce Command

Whilst cluster enables you to find similarities between records in a stream or a collection, coerce then maps those records into a single representation of a concept. This gives you a high-level representation of each group of records, suitable for a search result or summary page. Here is an example:

stream.collection.find(....) | cluster( {score: 90.1} ) | coerce()

This tries to fit each record in the cluster into a conceptual representation:

{"_id":4,"size":15,"Company_Name":"AFRICAN ALLIANCE INSURANCE PLC","Country":"NG","RegCoNum":"RC2176","Address":"LAGOS 112 BROAD STREET"}

Each property of the document represents a conceptual item in the fuzzy definition. You will notice that the attributes of the underlying records are no longer visible. However, you can extend the command to ask for additional properties from the source documents to be included:

stream.collection.find(....) | cluster( {score: 90.1} ) | coerce( {projection: {location: "last"}} )

 

This would result in a more complete record:

{"_id":4,"size":15,"Company_Name":"AFRICAN ALLIANCE INSURANCE PLC","Country":"NG","RegCoNum":"RC2176","Address":"LAGOS 112 BROAD STREET","location":{"type":"Point","coordinates":[41,6]}}

You will notice that the location field, representing the geo point of the company, is now included. Its value is the last or first non-null instance of the field, as requested in the projection, taken in the natural order in which the clustering returns the records.
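The "first" and "last" selection can be sketched in Python as follows. The function name and the record layout are hypothetical, and the sketch only mirrors the rule described above (take the first or last non-null value in the order the records arrive); it is not monsterDB's implementation.

```python
from typing import Any, Optional


def project_field(records: list, field: str, mode: str) -> Optional[Any]:
    """Return the first or last non-null value of a field across a cluster's records."""
    if mode not in ("first", "last"):
        raise ValueError("mode must be 'first' or 'last'")
    # Keep only the records that actually carry a value for the field.
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return None
    return values[0] if mode == "first" else values[-1]


# Records in the order the clustering step is assumed to return them.
cluster_records = [
    {"name": "A", "location": None},
    {"name": "B", "location": {"type": "Point", "coordinates": [41, 6]}},
    {"name": "C"},
    {"name": "D", "location": {"type": "Point", "coordinates": [42, 5]}},
]

print(project_field(cluster_records, "location", "first"))
print(project_field(cluster_records, "location", "last"))
```

Records with a missing or null field are skipped, so the result depends only on the order of the records that do hold a value.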

Example

A full example of clustering and coercing, which finds documents in a geographic area and then clusters them at a similarity of 90.1%:

aggregate.gleif.match( { location: { $geoWithin: { type: "Polygon", coordinates: [ [ [ 41, 5 ], [ 41, 6 ], [ 42, 6 ], [ 42, 5 ], [ 41, 5 ] ] ] } } } ) | cluster( {score: 90.1} ) | coerce( { projection: { "location": "first" } } )

{
  "_id": 4,
  "size": 15,
  "Company_Name": "AFRICAN ALLIANCE INSURANCE PLC",
  "Country": "NG",
  "RegCoNum": "RC2176",
  "Address": "LAGOS 112 BROAD STREET",
  "location": {
    "type": "Point",
    "coordinates": [
      41,
      6
    ]
  }
}