edu.isi.karma.modeling.semantictypes
Class SemanticTypeUtil

java.lang.Object
  extended by edu.isi.karma.modeling.semantictypes.SemanticTypeUtil

public class SemanticTypeUtil
extends java.lang.Object

This class provides various utility methods that can be used by the semantic typing module.

Author:
Shubham Gupta

Constructor Summary
SemanticTypeUtil()
           
 
Method Summary
static java.util.ArrayList<java.lang.String> getTrainingExamples(edu.isi.karma.rep.Worksheet worksheet, edu.isi.karma.rep.HNodePath path)
          Prepares and returns a collection of training examples to be used in semantic types training.
static void identifyOutliers(edu.isi.karma.rep.Worksheet worksheet, java.lang.String predictedType, edu.isi.karma.rep.HNodePath path, edu.isi.karma.rep.metadata.Tag outlierTag, java.util.Map<CRFModelHandler.ColumnFeature,java.util.Collection<java.lang.String>> columnFeatures, CRFModelHandler crfModelHandler)
          Identifies the outlier nodes (table cells) for a given column.
static boolean populateSemanticTypesUsingCRF(edu.isi.karma.rep.Worksheet worksheet, edu.isi.karma.rep.metadata.Tag outlierTag, CRFModelHandler crfModelHandler)
          This method predicts semantic types for all the columns in a worksheet using the CRF modeling technique developed by Aman Goel.
static java.lang.String removeNamespace(java.lang.String uri)
          Removes the namespace from a given URI.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SemanticTypeUtil

public SemanticTypeUtil()
Method Detail

getTrainingExamples

public static java.util.ArrayList<java.lang.String> getTrainingExamples(edu.isi.karma.rep.Worksheet worksheet,
                                                                        edu.isi.karma.rep.HNodePath path)
Prepares and returns a collection of training examples to be used in semantic types training. The parameter TRAINING_EXAMPLE_MAX_COUNT caps the number of examples returned. The examples are chosen at random so that they are spread uniformly across the values in the column. Empty values are currently not included in the set.

Parameters:
worksheet - The target worksheet
path - Path to the target column
Returns:
Collection of training examples
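
The sampling behavior described above can be sketched as follows. This is a minimal illustration, not the Karma implementation: it works on a plain list of cell values instead of a Worksheet and HNodePath, the class and method names are hypothetical, the cap of 3 is an arbitrary stand-in for the real TRAINING_EXAMPLE_MAX_COUNT (whose value is not given on this page), and treating null or whitespace-only values as empty is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; names and values here are assumptions, not Karma code. */
final class TrainingExampleSketch {
    // Hypothetical cap; the real TRAINING_EXAMPLE_MAX_COUNT value is not documented here.
    static final int TRAINING_EXAMPLE_MAX_COUNT = 3;

    private TrainingExampleSketch() {}

    /** Returns up to TRAINING_EXAMPLE_MAX_COUNT non-empty values, picked at random. */
    static ArrayList<String> sample(List<String> columnValues, Random random) {
        // Drop empty cells first (assumption: null and whitespace-only count as empty).
        ArrayList<String> nonEmpty = new ArrayList<>();
        for (String value : columnValues) {
            if (value != null && !value.trim().isEmpty()) {
                nonEmpty.add(value);
            }
        }
        // Shuffling and taking a prefix gives every remaining cell the same chance.
        Collections.shuffle(nonEmpty, random);
        int count = Math.min(TRAINING_EXAMPLE_MAX_COUNT, nonEmpty.size());
        return new ArrayList<>(nonEmpty.subList(0, count));
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("Paris", "", "Rome", null, "Oslo", "Lima");
        System.out.println(sample(column, new Random()));
    }
}
```

Shuffling the whole column is simple but touches every cell; for very large columns, reservoir sampling would be an alternative.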

populateSemanticTypesUsingCRF

public static boolean populateSemanticTypesUsingCRF(edu.isi.karma.rep.Worksheet worksheet,
                                                    edu.isi.karma.rep.metadata.Tag outlierTag,
                                                    CRFModelHandler crfModelHandler)
This method predicts semantic types for all the columns in a worksheet using the CRF modeling technique developed by Aman Goel. It creates a SemanticType object for each column and puts it inside the SemanticTypes object for that worksheet. User-assigned semantic types are not replaced. It also identifies nodes (table cells) that are outliers and stores them in the outlierTag object.

Parameters:
worksheet - The target worksheet
outlierTag - Tag object that stores outlier nodes
crfModelHandler - The CRF Model Handler to use
Returns:
true if a semantic type object was replaced or added for the worksheet; false if nothing changed.
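
The update rule and return value described above can be sketched as follows. This is an illustration under stated assumptions, not the Karma implementation: TypeEntry is a hypothetical stand-in for SemanticType, plain maps stand in for the worksheet's SemanticTypes object and for the CRF predictions, and skipping a column whose predicted type equals the stored one is an assumed reading of "nothing changed".

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all types and names here are hypothetical. */
final class PopulateSketch {
    /** Hypothetical stand-in for a per-column semantic type entry. */
    static final class TypeEntry {
        final String type;
        final boolean userAssigned;

        TypeEntry(String type, boolean userAssigned) {
            this.type = type;
            this.userAssigned = userAssigned;
        }
    }

    private PopulateSketch() {}

    /** Applies predicted types per column; returns true if any entry was added or replaced. */
    static boolean applyPredictions(Map<String, TypeEntry> typesByColumn,
                                    Map<String, String> predictedByColumn) {
        boolean changed = false;
        for (Map.Entry<String, String> prediction : predictedByColumn.entrySet()) {
            TypeEntry existing = typesByColumn.get(prediction.getKey());
            if (existing != null && existing.userAssigned) {
                continue; // user-assigned types are never replaced
            }
            if (existing != null && existing.type.equals(prediction.getValue())) {
                continue; // assumption: an identical prediction counts as no change
            }
            typesByColumn.put(prediction.getKey(), new TypeEntry(prediction.getValue(), false));
            changed = true;
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, TypeEntry> types = new HashMap<>();
        types.put("name", new TypeEntry("PersonName", true));
        Map<String, String> predicted = new HashMap<>();
        predicted.put("name", "City");
        predicted.put("age", "Age");
        System.out.println(applyPredictions(types, predicted)); // true: "age" was added
        System.out.println(applyPredictions(types, predicted)); // false: nothing changed
    }
}
```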

identifyOutliers

public static void identifyOutliers(edu.isi.karma.rep.Worksheet worksheet,
                                    java.lang.String predictedType,
                                    edu.isi.karma.rep.HNodePath path,
                                    edu.isi.karma.rep.metadata.Tag outlierTag,
                                    java.util.Map<CRFModelHandler.ColumnFeature,java.util.Collection<java.lang.String>> columnFeatures,
                                    CRFModelHandler crfModelHandler)
Identifies the outlier nodes (table cells) for a given column.

Parameters:
worksheet - Target worksheet
predictedType - Type which was user-assigned or predicted by the CRF model for the given column. If the type for a given node is different from the predictedType, the node is tagged as an outlier and its id is stored in the outlier tag object
path - Path to the given column
outlierTag - The outlier tag object which stores all the outlier node ids.
columnFeatures - Features such as column name, table name that are required by the CRF Model to predict the semantic type for a node (table cell)
crfModelHandler - The CRF Model Handler to use
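
The comparison described for predictedType can be sketched as follows. This is an illustration under stated assumptions, not the Karma implementation: a map from node id to cell value stands in for the column, a Function stands in for the per-cell prediction made through CRFModelHandler, and the returned set stands in for the outlier Tag. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Illustrative sketch only; all names here are hypothetical. */
final class OutlierSketch {
    private OutlierSketch() {}

    /** Returns the ids of cells whose individually predicted type differs from the column type. */
    static Set<String> findOutlierIds(Map<String, String> cellValuesById,
                                      String predictedType,
                                      Function<String, String> cellLabeler) {
        Set<String> outlierIds = new LinkedHashSet<>();
        for (Map.Entry<String, String> cell : cellValuesById.entrySet()) {
            String cellType = cellLabeler.apply(cell.getValue());
            if (!predictedType.equals(cellType)) {
                outlierIds.add(cell.getKey()); // type mismatch: record the node id
            }
        }
        return outlierIds;
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("n1", "34");
        cells.put("n2", "Bob");
        cells.put("n3", "51");
        // Toy labeler standing in for the CRF model: all digits means "Age".
        Function<String, String> labeler = v -> v.matches("\\d+") ? "Age" : "Name";
        System.out.println(findOutlierIds(cells, "Age", labeler)); // prints [n2]
    }
}
```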

removeNamespace

public static java.lang.String removeNamespace(java.lang.String uri)
Removes the namespace from a given URI. It assumes that the namespace extends to the last '#' or last '/' in the URI string, so it should be used only for interface (display) purposes and not for reasoning or logic. The right way would be to store the namespaces map in memory and use that to remove the namespace from a URI.

Parameters:
uri - Input URI
Returns:
URI string with namespace removed
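
The heuristic described above can be sketched as follows. This is a minimal illustration, not the Karma source: the class name is hypothetical, cutting at whichever of '#' or '/' occurs later is one reading of the description, and returning the input unchanged when neither character is present is an assumption.

```java
/** Illustrative sketch only; not the Karma source. */
final class NamespaceSketch {
    private NamespaceSketch() {}

    /** Returns the part of the URI after the last '#' or '/', whichever occurs later. */
    static String removeNamespace(String uri) {
        int cut = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
        // Assumption: with no '#' or '/', there is no namespace to strip.
        return cut < 0 ? uri : uri.substring(cut + 1);
    }

    public static void main(String[] args) {
        System.out.println(removeNamespace("http://example.org/ontology#Person")); // Person
        System.out.println(removeNamespace("http://example.org/ontology/City"));   // City
    }
}
```

Because the cut point is purely textual, a URI that ends in '/' or '#' yields an empty string, which is one reason the result is suitable for display only.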