# Elasticsearch

Houston supports Elasticsearch, with both automatic and explicit schema definitions. You may add Elasticsearch support to an existing model with `app.extensions.register_elasticsearch_model`, which registers the specified model for automatic Elasticsearch indexing. The name of each index is derived automatically from the location of the registered model. For example, the `app.modules.users.models.User` model is automatically given the index name `app.modules.users.models.user` without any further input from the user.
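
The naming rule can be illustrated with a minimal sketch. It assumes the index name is simply the model's module path plus its class name, lowercased, with an optional prefix. The helper `derive_index_name` and the stand-in `User` class are hypothetical and are not part of Houston.

```python
def derive_index_name(model_cls, prefix=""):
    """Build an index name from a model's import location (assumed rule)."""
    # Assumption: module path + class name, lowercased
    name = f"{model_cls.__module__}.{model_cls.__name__}".lower()
    return f"{prefix}{name}"


# Stand-in class that pretends to live at app.modules.users.models
User = type("User", (), {"__module__": "app.modules.users.models"})

print(derive_index_name(User))
print(derive_index_name(User, prefix="testing."))
```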
The following `HoustonModel`s now support indexing and searching with Elasticsearch:

- Annotation
- AssetGroup
- AssetGroupSighting
- Asset
- AuditLog
- Collaboration
- EmailRecord
- Encounter
- FileUpload
- Individual
- Integrity
- Mission
- MissionCollection
- MissionTask
- Name
- Notification
- Organization
- Project
- Relationship
- Sighting
- SocialGroup
- User
## Usage

For all registered objects, the primary entry point for using Elasticsearch is `objs = cls.elasticsearch(body)`. This function searches a class with the provided Elasticsearch body and ALWAYS returns objects of that class. To retrieve all results from Elasticsearch, use `body=None` or `body={}`.

Elasticsearch returns only a list of GUIDs that match the given query. Every GUID is then checked against the local Houston database. If a GUID does not exist in the local database, it is automatically pruned from Elasticsearch. If it does exist, the object is loaded from the local Houston database. We do not rely on the cached copy in Elasticsearch; we use Elasticsearch only to resolve GUIDs for a given class and then load the objects fresh from the Houston DB.
The `cls.elasticsearch()` function also accepts a `load=False` argument, which returns a list of GUIDs instead of a list of objects. This is useful because lists of GUIDs are extremely fast to manipulate. With loading turned off, the API still guarantees that the GUIDs exist in the local database; it simply skips the costly step of loading those objects into memory from the database.
Note: This API does not support Elasticsearch-based sorting or paging. Any sorting or pagination arguments passed to Elasticsearch are ignored. All sorting and pagination are performed by Houston and support Houston model attributes.
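
The resolve, prune, and load flow described above can be sketched with plain Python containers. This is a toy model rather than Houston's implementation: the function `resolve_search_hits`, the dictionary standing in for the Houston database, and the set standing in for the Elasticsearch index are all hypothetical.

```python
import uuid


def resolve_search_hits(hit_guids, database, index, load=True):
    """Toy model of the search flow: verify GUIDs locally, prune stale ones."""
    verified = []
    for guid in hit_guids:
        if guid in database:
            verified.append(guid)
        else:
            # GUID is unknown locally, so drop it from the (fake) index
            index.discard(guid)
    if not load:
        return verified  # GUIDs only, all guaranteed to exist locally
    # Load fresh objects from the local store, never from the index
    return [database[guid] for guid in verified]


# Hypothetical stand-ins for the Houston DB and the Elasticsearch index
known, stale = uuid.uuid4(), uuid.uuid4()
database = {known: {"guid": known, "name": "example"}}
index = {known, stale}

objs = resolve_search_hits([known, stale], database, index)
guids = resolve_search_hits([known], database, index, load=False)
```

In this sketch the stale GUID is removed from the index as a side effect of the first search, which mirrors the automatic pruning behavior.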
## Supported Serialization Schemas

All registered models are serialized automatically unless a schema is specified. The `get_elasticsearch_schema()` function tells Elasticsearch how to serialize an object and is defined for the following models:
HoustonModel | Schema |
---|---|
Annotation | DetailedAnnotationSchema |
AssetGroup | CreateAssetGroupSchema |
AssetGroupSighting | BaseAssetGroupSightingSchema |
Asset | DetailedAssetTableSchema |
AuditLog | DetailedAuditLogSchema |
Collaboration | DetailedCollaborationSchema |
EmailRecord | BaseEmailRecordSchema |
Encounter | ElasticsearchEncounterSchema |
FileUpload | DetailedFileUploadSchema |
Individual | ElasticsearchSightingSchema |
Integrity | BaseIntegritySchema |
Mission | DetailedMissionSchema |
MissionCollection | CreateMissionCollectionSchema |
MissionTask | BaseMissionTaskTableSchema |
Name | DetailedNameSchema |
Notification | BaseNotificationSchema |
Organization | DetailedOrganizationSchema |
Project | BaseProjectSchema |
Relationship | DetailedRelationshipSchema |
Sighting | ElasticsearchSightingSchema |
SocialGroup | DetailedSocialGroupSchema |
User | UserListSchema |

If a schema is not specified, the Elasticsearch extension dynamically walks over an object and looks for JSON-serializable attributes. This automatic parsing supports all built-in JSON types and adds support for `datetime.datetime` and `uuid.UUID`.
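
As an illustration of schema-less serialization, the sketch below walks an object's attributes and keeps only values that can be encoded as JSON, converting `datetime.datetime` and `uuid.UUID` to strings. The function `auto_serialize`, the choice to skip underscore-prefixed attributes, and the ISO 8601 output format are assumptions made for this example; the extension's actual rules may differ.

```python
import datetime
import json
import uuid

JSON_TYPES = (str, int, float, bool, type(None), list, dict)


def auto_serialize(obj):
    """Collect public, JSON-serializable attributes from an object."""
    body = {}
    for key, value in vars(obj).items():
        if key.startswith("_"):
            continue  # assumption: private attributes are not indexed
        if isinstance(value, datetime.datetime):
            body[key] = value.isoformat()
        elif isinstance(value, uuid.UUID):
            body[key] = str(value)
        elif isinstance(value, JSON_TYPES):
            try:
                json.dumps(value)  # rejects containers with bad members
            except (TypeError, ValueError):
                continue
            body[key] = value
    return body


class Example:
    def __init__(self):
        self.guid = uuid.UUID("12345678-1234-5678-1234-567812345678")
        self.created = datetime.datetime(2022, 1, 1, 12, 0, 0)
        self.title = "demo"
        self.handle = object()  # not serializable, so it is skipped
        self._private = "hidden"


document = auto_serialize(Example())
```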
## Database Models

All of the existing `FeatherModel`s inherit from `ElasticsearchModel`. In addition, all `FeatherModel`s support an `indexed` timestamp and an `elasticsearchable` flag that indicates whether a given model is searchable in Elasticsearch.
### Serializing Objects

### Indexing Objects

### Fetching Objects

### Pruning Objects

## Elasticsearch Extension Features

### Automatic Database Changes & Elasticsearch Sessions

By default, any time a Houston DB object is added, modified, or deleted, we immediately update the Elasticsearch index for that object if its model is registered. This happens automatically and does not require any additional code. There is also a context manager that allows all changes to Elasticsearch to be made in bulk. This is handy when a large number of objects are changed in a single transaction and need to be sent to Elasticsearch. Just as individual database changes are slow, so is updating the Elasticsearch index one item at a time.
All `db.session.begin()` contexts automatically begin an Elasticsearch session. A session may also be started manually with `es.session.begin()`.
The context manager supports nesting, resets on exceptions, and configuration. Furthermore, the backend code will force any session to close if it has been open longer than `es.ELASTICSEARCH_MAXIMUM_SESSION_LENGTH` (defaults to 15 minutes). For configuration, you may pass the following arguments to `es.session.begin()`:

- `blocking=True` - Forces the batch operation to happen in the foreground if background Celery jobs are enabled.
- `verify=True` - Forces the batch operation to complete before continuing (similar to `es.es_checkpoint()`).
- `forced=True` - Forces any touched object to index on the spot, even if it has not been modified since it was last indexed.
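
To make the batching, nesting, and reset-on-exception behavior concrete, here is a self-contained toy session. It is not the Houston implementation: the class `ToySession` and its `track` method are hypothetical, and the sketch assumes that only the outermost context triggers the bulk flush and that an exception discards the queued work.

```python
from contextlib import contextmanager


class ToySession:
    """Collects index operations and flushes them once, at the outermost exit."""

    def __init__(self):
        self.depth = 0
        self.pending = []
        self.flushed = []  # each entry represents one bulk request

    @contextmanager
    def begin(self):
        self.depth += 1
        try:
            yield self
        except Exception:
            # Reset on errors: discard everything queued so far
            self.pending.clear()
            raise
        finally:
            self.depth -= 1
            if self.depth == 0 and self.pending:
                self.flushed.append(list(self.pending))
                self.pending.clear()

    def track(self, guid):
        self.pending.append(guid)


session = ToySession()
with session.begin():
    session.track("a")
    with session.begin():  # nested: nothing is flushed when this block exits
        session.track("b")
    session.track("c")

print(session.flushed)
```

All three tracked items end up in a single bulk entry, which is the point of batching: one request instead of three.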
### Background Celery Updates & Configuration

Adding Elasticsearch to all database transactions adds roughly 40% overhead during automated testing. To keep this from becoming a performance hit on the production web server, Houston supports background Celery workers that process all indexing operations. This happens automatically with each Elasticsearch session when the `app/extensions/elasticsearch` extension is enabled. By default, all changes are made in batches and in the background.
The base config supports the following new values:

- `ELASTICSEARCH_BLOCKING` - Defaults to `False`, which performs all bulk Elasticsearch operations in the background (non-blocking).
- `ELASTICSEARCH_BUILD_INDEX_ON_STARTUP` - Defaults to `False`. Controls whether the entire Elasticsearch index is rebuilt for all registered models on start-up. The rebuild happens in the background if `ELASTICSEARCH_BLOCKING` is `False`.
Furthermore, two additional background Celery tasks run on a schedule:

- `es_task_refresh_index_all` - Runs every hour to update any out-of-date objects in the database (which may happen if the on-demand updates fail for any reason). The frequency can be set with `es.ELASTICSEARCH_UPDATE_FREQUENCY`.
- `es_task_invalidate_indexed_timestamps` - Runs every 12 hours to re-index the entire Elasticsearch index for all registered models. This ensures that all objects are refreshed at a known cadence and that any cruft in the index is automatically deleted. The frequency can be set with `es.ELASTICSEARCH_FIREWALL_FREQUENCY`.
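
One way to picture how the two scheduled tasks interact is the sketch below. It assumes an object is out of date when its `indexed` timestamp is missing or older than its last modification, and that invalidation works by clearing `indexed`. The `updated` field, both function names, and the dictionary layout are hypothetical and only illustrate the idea.

```python
import datetime


def out_of_date(records):
    """Return GUIDs whose index time is missing or older than the last change."""
    return [
        guid
        for guid, rec in records.items()
        if rec["indexed"] is None or rec["indexed"] < rec["updated"]
    ]


def invalidate_indexed_timestamps(records):
    """Clear every `indexed` timestamp so the next refresh picks up everything."""
    for rec in records.values():
        rec["indexed"] = None


t0 = datetime.datetime(2022, 1, 1)
t1 = t0 + datetime.timedelta(hours=1)
records = {
    "fresh": {"updated": t0, "indexed": t1},
    "stale": {"updated": t1, "indexed": t0},
}

stale_before = out_of_date(records)  # only the object changed after indexing
invalidate_indexed_timestamps(records)
stale_after = out_of_date(records)  # now every object qualifies
```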
## REST APIs

When provided with a search body, we query Elasticsearch for matching results. The response from Elasticsearch is a list of GUIDs; the results are then sorted, paginated, and loaded directly from the Houston database. While the search results themselves may be slightly outdated, the returned schemas are based on the local DB version of each object. Furthermore, Elasticsearch API errors are passed back to the user as a `BAD_REQUEST` with the same error message.
You may search supported models with Elasticsearch using the APIs listed below:
Additional APIs are available to interact with Elasticsearch as a service:

- `[GET] /api/v1/search/` to list all of the available index names in Elasticsearch
- `[GET] /api/v1/search/<index>/mappings` to list all available attributes in Elasticsearch
- `[GET] /api/v1/search/status` to show how outdated the Elasticsearch database is relative to Houston. This API also shows the number of active background Celery jobs in use by Elasticsearch
- `[GET] /api/v1/search/sync` to force an Elasticsearch re-index on demand
## Pagination & List Filtering

All of the Elasticsearch APIs support pagination and sorting. By default, all requests return a maximum of 100 search results. If a tie is encountered during sorting, the table's primary key is used to break the tie.

Furthermore, all search APIs support a basic listing filter through a model's `query_search_term_hook(cls, term)` function. This function is called automatically by the `@api.paginate()` decorator and supports basic attributes in the local Houston database. This list filtering is NOT intended to replace Elasticsearch; it is simply a way to provide some basic searching on GUIDs, etc.
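
The sorting and paging rules can be sketched in a few lines. This example works on in-memory dictionaries rather than a SQL query, and the function `paginate` together with its `limit`, `offset`, and `primary_key` parameters is hypothetical; only the default of 100 results and the primary-key tie-break come from the description above.

```python
def paginate(rows, sort, limit=100, offset=0, primary_key="guid"):
    """Sort by an attribute, break ties on the primary key, then slice a page."""
    ordered = sorted(rows, key=lambda row: (row[sort], row[primary_key]))
    return ordered[offset:offset + limit]


rows = [
    {"guid": "b", "name": "x"},
    {"guid": "a", "name": "x"},  # ties with "b" on name, wins on guid
    {"guid": "c", "name": "a"},
]

first_page = [row["guid"] for row in paginate(rows, sort="name")]
second_slice = [row["guid"] for row in paginate(rows, sort="name", limit=2, offset=1)]
```

Breaking ties on the primary key keeps the ordering deterministic, so consecutive pages never repeat or skip a row.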
## Web Server Start-up

On start-up, the logs show the registered models and confirm that the database listeners have been attached. After this, the start-up process re-indexes the entire Houston database and (by default) asks Celery to update the index with fresh copies of all data.
## Docker Services

The Elasticsearch service uses a multi-node setup and runs version 7.17.0. Kibana is also available for visualizing and manipulating the index, and Flower is available to visualize Celery (3 workers by default).
## Testing

When testing, all Elasticsearch index names are given a prefix of `testing.`. This means that the index names used for testing are disjoint from those of the main application, and the automated Celery tasks will not conflict while tests are running.
Testing utilities:

- `prep_randomized_tus_dir()` creates a Tus test directory with random-noise images (128 by 128 pixels, RGB)
- `wait_for_elasticsearch_status()` waits for Elasticsearch to catch up
The benchmarking of tests will automatically report all tests that take longer than 3 seconds.
Lastly, use the `pytest --no-elasticsearch` flag to disable Elasticsearch globally during testing (this makes the test setup much faster).
## Known Issues

### 1. SYSCTL Resource Limits

On Linux, the Docker service needs to be configured with more resources for Elasticsearch. Add the line `vm.max_map_count=270000` to `/etc/sysctl.conf` and then reboot your operating system.

Alternatively, you can apply this immediately by running `sudo sysctl -w vm.max_map_count=270000`.
### 2. Partial Support for Attributes when Sorting

Sorting for all paginated APIs only supports attributes listed in the corresponding Houston model. Sorting is done in the SQL SELECT query for efficiency and therefore does not support derived attributes. Support for derived attributes could be added later, possibly by using a hybrid property in SQLAlchemy, but it will likely be slow.