Python API¶
Searching a Datamart¶
- class datamart.Datamart(connection_url)¶
All Datamarts must implement this abstract class.
- abstract search(query)¶
This entry point supports search using a query specification.
The query specification supports querying datasets by keywords, named entities, temporal ranges, and geospatial ranges.
Datamart implementations should return a DatamartQueryCursor immediately.
- Parameters
query (DatamartQuery) – Query specification.
- Returns
A cursor pointing to search results.
- Return type
- abstract search_with_data(query, supplied_data)¶
Search using on a query and a supplied dataset.
This method is a “smart” search, which leaves the Datamart to determine how to evaluate the relevance of search result with regard to the supplied data. For example, a Datamart may try to identify named entities and date ranges in the supplied data and search for companion datasets which overlap.
To manually specify query constraints using columns of the supplied data, use the
search_with_data_columns()
method andTabularVariable
constraints.Datamart implementations should return a DatamartQueryCursor immediately.
- Parameters
query – Query specification
supplied_data – The data you are trying to augment.
- Rtype query
DatamartQuery
- Rtype supplied_data
container.Dataset
- Returns
A cursor pointing to search results containing possible companion datasets for the supplied data.
- Return type
- abstract search_with_data_columns(query, supplied_data, data_constraints)¶
Search using a query which can include constraints on supplied data columns (
TabularVariable
).This search is similar to the “smart” search provided by
search_with_data()
, but caller must manually specify constraints using columns from the supplied data; Datamart will not automatically analyze it to determine relevance or joinability.Use of the query spec enables callers to compose their own “smart search” implementations.
Datamart implementations should return a
DatamartQueryCursor
immediately.- Parameters
query (DatamartQuery) – Query specification
supplied_data (container.Dataset) – The data you are trying to augment.
data_constraints (list) – List of
TabularVariable
constraints referencing the supplied data.
- Returns
A cursor pointing to search results containing possible companion datasets for the supplied data.
- Return type
Search results¶
- class datamart.DatamartQueryCursor¶
Cursor to iterate through Datamarts search results.
- abstract get_next_page(*, limit=20, timeout=None)¶
Return the next page of results. The call will block until the results are ready.
Note that the results are not ordered; the first page of results can be returned first simply because it was found faster, but the next page might contain better results. The caller should make sure to check
DatamartSearchResult.score()
.- Parameters
- Returns
A list of
DatamartSearchResult
’s, or None if there are no more results.- Return type
Sequence[
DatamartSearchResult
] or None
- class datamart.DatamartSearchResult¶
This class represents the search results of a Datamart search.
Different Datamarts will provide different implementations of this class.
Those objects can be saved to a string for later using the
serialize()
anddeserialize()
methods, for example for use as part of a pipeline.- abstract augment(supplied_data, augment_columns=None, *, connection_url=None)¶
Produces a D3M dataset that augments the supplied data with data that can be retrieved from this search result. The augment methods is a baseline implementation of download plus augment.
Callers who want to control over the augmentation process should use the download method and use their own augmentation algorithm.
- Parameters
supplied_data (container.Dataset) – A D3M dataset containing the dataset that is the target for augmentation.
augment_columns (typing.List[
DatasetColumn
]) – If provided, only the specified columns from the Datamart dataset that will be added to the supplied dataset.connection_url (str) – A connection string used to connect to a specific Datamart deployment. If not provided, a different deployment might be used.
- Returns
The result of augmenting supplied_data with the search result
- Return type
container.Dataset
- classmethod deserialize(string)¶
Deserializes a search result back into an object.
- Parameters
string (str) – A string obtained via serialize().
- Returns
The search result, instance of the current class.
- Return type
- abstract download(supplied_data, *, connection_url=None)¶
Produces a D3M dataset (data plus metadata) corresponding to the search result. Every time the download method is called on a search result, it will produce the exact same columns (as specified in the metadata – get_metadata), but the set of rows may depend on the supplied_data. Datamart is encouraged to return a dataset that joins well with the supplied data, e.g., has rows that match the entities in the supplied data. Datamarts may ignore the supplied_data and return the same data regardless.
If the supplied_data is None, Datamarts may return None or a default dataset, based on the search query.
- Parameters
supplied_data (container.Dataset) – A D3M dataset containing the dataset that is the target for augmentation. Datamart will try to download data that augments the supplied data well.
connection_url (str) – A connection string used to connect to a specific Datamart deployment. If not provided, a different deployment might be used.
- Returns
The downloaded dataset.
- Return type
container.Dataset
- abstract get_augment_hint()¶
Returns specification for augmenting supplied data with the data that can be downloaded using this search result.
- abstract get_metadata()¶
Access the metadata of the dataset.
- Returns
The Datamart metadata of the dataset.
- Return type
DataMetadata
- abstract score()¶
Returns a non-negative score of the search result. Larger scores indicate better matches. Scores across Datamart implementations are not comparable.
- abstract serialize()¶
Serializes this result to a string.
Search query¶
- class datamart.DatamartQuery(keywords: List[str] = None, variables: List[VariableConstraint] = None)¶
A Datamart query consists of two parts:
A list of keywords.
A list of required variables. A required variable specifies that a matching dataset must contain a variable satisfying the constraints provided in the query. When multiple required variables are given, the matching dataset should contain variables that match each of the variable constraints.
The matching is fuzzy. For example, when a user specifies a required variable spec using named entities, the expectation is that a matching dataset contains information about the given named entities. However, due to name, spelling, and other differences it is possible that the matching dataset does not contain information about all the specified entities.
In general, Datamart will do a best effort to satisfy the constraints, but may return datasets that only partially satisfy the constraints.
- property keywords¶
Alias for field number 0
- property variables¶
Alias for field number 1
- class datamart.TabularVariable(columns, relationship)¶
Specifies that a matching dataset should contain variables related to given columns in the supplied_dataset.
The relation ColumnRelationship.CONTAINS specifies that string values in the columns overlap using the string equality comparator. If supplied_dataset columns consists of temporal or spatial values, then ColumnRelationship.CONTAINS specifies overlap in temporal range or geospatial bounding box, respectively.
The relation ColumnRelationship.SIMILAR specifies that string values in the columns overlap using fuzzy string matching.
The relations ColumnRelationship.CORRELATED and ColumnRelationship.ANTI_CORRELATED specify the columns are correlated and anti-correlated, respectively.
The relations ColumnRelationship.MUTUALLY_INFORMATIVE and ColumnRelationship.MUTUALLY_UNINFORMATIVE specify the columns are mutually and anti-correlated, respectively.
- Parameters
columns (typing.List[int]) – Specify columns in the dataframes of the supplied_dataset
relationship (ColumnRelationship) – Specifies how the the columns in the supplied_dataset are related to the variables in the matching dataset.
- class datamart.ColumnRelationship(value)¶
An enumeration.
- ANTI_CORRELATED = 4¶
- CONTAINS = 1¶
- CORRELATED = 3¶
- MUTUALLY_INFORMATIVE = 5¶
- MUTUALLY_UNINFORMATIVE = 6¶
- SIMILAR = 2¶
- class datamart.VariableConstraint¶
Abstract class for all variable constraints.
- class datamart.NamedEntityVariable(entities)¶
Specifies that a matching dataset must contain a variable including the specified set of named entities.
For example, if the entities are city names, the expectation is that a matching dataset must contain a variable (column) with the given city names. Due to spelling differences and incompleteness of datasets, the returned results may not contain all the specified entities.
- Parameters
entities (List[str]) – List of strings that should be contained in the matched dataset column.
- class datamart.TemporalVariable(start, end, granularity=None)¶
Specifies that a matching dataset should contain a variable with temporal information (e.g., dates) satisfying the given constraint.
The goal is to return a dataset that covers the requested temporal interval and includes data at a requested level of granularity.
Datamart will return best effort results, including datasets that may not fully cover the specified temporal interval or whose granularity is finer or coarser than the requested granularity.
- Parameters
start (datetime) – A matching dataset should contain a variable with temporal information that starts earlier than the given start.
end (datetime) – A matching dataset should contain a variable with temporal information that ends after the given end.
granularity (TemporalGranularity) – A matching dataset should provide temporal information at the requested level of granularity.
- class datamart.TemporalGranularity(value)¶
An enumeration.
- DAY = 3¶
- HOUR = 4¶
- MONTH = 2¶
- SECOND = 5¶
- YEAR = 1¶
- class datamart.GeospatialVariable(latitude1, longitude1, latitude2, longitude2, granularity=None)¶
Specifies that a matching dataset should contain a variable with geospatial information that covers the given bounding box.
A matching dataset may contain variables with latitude and longitude information (in one or two columns) that cover the given bounding box.
Alternatively, a matching dataset may contain a variable with named entities of the given granularity that provide some coverage of the given bounding box. For example, if the bounding box covers a 100 mile square in Southern California, and the granularity is City, the result should contain Los Angeles, and other cities in Southern California that intersect with the bounding box (e.g., Hawthorne, Torrance, Oxnard).
- Parameters
latitude1 (float) – The latitude of the first point
longitude1 (float) – The longitude of the first point
latitude2 (float) – The latitude of the second point
longitude2 (float) – The longitude of the second point
granularity (GeospatialGranularity) – Requested geospatial values are well matched with the requested granularity
Augmentation specification¶
- class datamart.AugmentSpec¶
Abstract class for D3M augmentation specifications
- class datamart.TabularJoinSpec(left_resource_id, right_resource_id, left_columns, right_columns)¶
A join spec specifies a possible way to join a left dataset with a right dataset. The spec assumes that it may be necessary to use several columns in each datasets to produce a key or fingerprint that is useful for joining datasets. The spec consists of two lists of column identifiers or names (left_columns, left_column_names and right_columns, right_column_names).
In the simplest case, both left and right are singleton lists, and the expectation is that an appropriate matching function exists to adequately join the datasets. In some cases equality may be an appropriate matching function, and in some cases fuzz matching is required. The join spec does not specify the matching function.
In more complex cases, one or both left and right lists contain several elements. For example, the left list may contain columns for “city”, “state” and “country” and the right dataset contains an “address” column. The join spec pairs up [“city”, “state”, “country”] with [“address”], but does not specify how the matching should be done e.g., combine the city/state/country columns into a single column, or split the address into several columns.
- class datamart.UnionSpec¶
A union spec specifies how to combine rows of a dataframe in the left dataset with a dataframe in the right dataset. The dataframe after union should have the same columns as the left dataframe.
Implementation: TBD