
# File To Elasticsearch Indexer

This project contains code that monitors a file system and indexes the files into an Elasticsearch index so that they can be searched by client applications (or via APIs). It can be configured to crawl a directory (also called a 'repository' or 'repo'), and it can also listen for messages from a RabbitMQ server, looking for events in the format generated by the fsmon-handler project. It will crawl the specified directory (or index the file referenced by a message) and sync information about the files to an Elasticsearch database. It uses file stats from the operating system in conjunction with other information extracted using Tika, CSV, NetCDF, and other libraries. There are two ways to configure this application: the first is a configuration file placed in the classpath used when running the application; the second is by defining environment variables.

## Data Flow

```mermaid
graph LR
  repo[repo<br>directory] --> fsmon
  repo --> FileCrawler
  fsmon -- json --> uri[fsmon-handler]
  uri -- amqp --> exchange[Fanout Exchange]
  subgraph RabbitMQ
    exchange --> queue
  end
  subgraph File to ES Indexer
    queue --> fsmonevent-consumer[FSMonEvent<br>Consumer]
    FileCrawler
  end
  subgraph GitHub
    mbari-data-model --> FileCrawler
    mbari-data-model --> fsmonevent-consumer
  end
  subgraph Elasticsearch
    FileCrawler --> repo-db[Repo<br>Database]
    fsmonevent-consumer --> repo-db
  end
```

## Environment Variables

Below are the environment variables that can be set to configure this application.

**Note**

The Elasticsearch properties are required if either (or both) of the fileCrawler and fsMonEventConsumer properties are set to true. In addition, the RabbitMQ properties are required if you set the fsMonEventConsumer property to true. They are documented as required because we are assuming you want to use the application to index documents. If all indexing is disabled using the first two properties, you don't need those either, but that seems kind of silly.

| Variable | Description | Required? | Default |
| --- | --- | --- | --- |
| FTESI_FILE_CRAWLER | A boolean to indicate if a FileCrawler should be launched. | false | true |
| FTESI_FSMON_EVENT_CONSUMER | A boolean to indicate if a FSMonEventConsumer should be launched. | false | true |
| FTESI_DATA_REPO | The local directory that contains the files to be indexed into Elasticsearch. | true | |
| FTESI_REPO_NAME | A short name that is used in a couple of ways: it will be the name of the Elasticsearch repository and also the name of the RabbitMQ exchange that will be subscribed to. It needs to conform to the requirements for an Elasticsearch database name:<br>• Lowercase only<br>• Cannot include `\`, `/`, `*`, `?`, `"`, `<`, `>`, `\|`, the space character, `,`, or `#`<br>• Indices prior to 7.0 could contain a colon (`:`), but that has been deprecated and won't be supported in 7.0+<br>• Cannot start with `-`, `_`, or `+`<br>• Cannot be `.` or `..`<br>• Cannot be longer than 255 bytes (note it is bytes, so multi-byte characters count towards the 255 limit faster) | true | |
| FTESI_NETWORK_MOUNT | A flag to indicate that the files in the dataRepo are mounted over a slow network. This will trigger the indexer to make a copy of the file locally so that indexing is quicker. This is still somewhat experimental and it's not clear if this really speeds things up or not, which is why it defaults to false. | false | false |
| FTESI_DATA_REPO_BASE_URL | The base URL segment that points to the files in the repository being indexed; it will be prepended to the path of each file to create an HTTP link to the file. | true | |
| FTESI_INTERVAL_BETWEEN_REPO_CRAWLS_IN_MINUTES | An integer that determines how long (in minutes) the FileCrawler will pause between repository indexing crawls. If files don't change much, this can be made larger. | false | 5 |
| FTESI_ELASTICSEARCH_HOST | The name of the host that is running the Elasticsearch server. | true | |
| FTESI_ELASTICSEARCH_PORT | The port number that the Elasticsearch server is listening on. | true | 9200 |
| FTESI_ELASTICSEARCH_USERNAME | The username that will be used by the indexers to connect to the Elasticsearch server. | true | elastic |
| FTESI_ELASTICSEARCH_PASSWORD | The password that will be used by the indexers to connect to the Elasticsearch server. | true | changeme |
| FTESI_RABBITMQ_HOST | The hostname of the RabbitMQ server that will be connected to as a consumer of FSMON events. | true | |
| FTESI_RABBITMQ_PORT | The port the application will use to connect and communicate with the RabbitMQ server. | true | 5672 |
| FTESI_RABBITMQ_USERNAME | The username that will be used to connect to the RabbitMQ server. | true | |
| FTESI_RABBITMQ_PASSWORD | The password that will be used to connect to the RabbitMQ server. | true | |
| FTESI_RABBITMQ_VHOST | The name of the virtual host on the RabbitMQ server that this application will connect to. | true | |
| FTESI_RABBITMQ_EXCHANGE_NAME | The name of the exchange that will be used by the indexing application. Usually it's the same as the repoName. | true | |
| FTESI_RABBITMQ_ROUTING_KEY | The routing key to use when binding the queue created by the indexing application to the exchange. | true | `*` |
| FTESI_RABBITMQ_QUEUE_NAME | A name given to the queue that is automatically created by the indexing application. You don't need to specify one as it will be generated automatically, but it's here if you need it. | false | |
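
To give a concrete sense of how these variables fit together, here is a minimal sketch of what you might export before launching the indexer as a crawler only (so the RabbitMQ variables are not needed, per the note above). The host, repository name, URL, and credentials are placeholder values, not defaults shipped with this project:

```bash
# Minimal example configuration (placeholder values) for a crawler-only setup
export FTESI_FILE_CRAWLER=true
export FTESI_FSMON_EVENT_CONSUMER=false   # no RabbitMQ consumer; crawl only
export FTESI_DATA_REPO=/data/repo
export FTESI_REPO_NAME=my-test-repo
export FTESI_DATA_REPO_BASE_URL=https://example.org/repo
export FTESI_ELASTICSEARCH_HOST=localhost
export FTESI_ELASTICSEARCH_PORT=9200
export FTESI_ELASTICSEARCH_USERNAME=elastic
export FTESI_ELASTICSEARCH_PASSWORD=changeme
```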

## Configuration File (config.json)

You can use a configuration file named config.json placed in the classpath to configure this indexing service. An example is shown below and you can consult the section on Environment Variables for a description of each property.

```json
{
  "fileCrawler": true,
  "fsMonEventConsumer": true,
  "dataRepo": "/path/to/repo/directory",
  "repoName": "your-repo-name",
  "networkMount": false,
  "dataRepoBaseURL": "https://data-catalog-test.local.com/repo",
  "intervalBetweenRepoCrawlsInMinutes": 1,
  "elasticsearch": {
    "host": "localhost",
    "port": 9200,
    "username": "elastic",
    "password": "password"
  },
  "rabbitmq": {
    "host": "localhost",
    "port": 5672,
    "username": "dcuser",
    "password": "password",
    "vhost": "fsmon",
    "exchangeName": "your-repo-name",
    "routingKey": "*"
  }
}
```
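
As a sketch of what "placed in the classpath" can look like in practice, you can prepend the directory holding config.json to the classpath when running the assembled jar and name the main class explicitly. The jar file name below is an assumption about what the assembly build produces; substitute the actual artifact name from your Maven build:

```bash
# Hypothetical invocation: the directory containing config.json is put ahead of the jar
# on the classpath so the application finds it. Jar name is a placeholder assumption.
java -cp /path/to/config/dir:target/file-to-es-indexer-x.x.x-jar-with-dependencies.jar \
  org.mbari.resources.RepositoryIndexer
```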

## Running in Docker

The best way to run this application is to pull the latest image from DockerHub and run it using Docker. The Docker image and instructions can be found here.

If you want to run it from the image, copy `.env.template` to a file named `.env` and edit it to configure the necessary properties. Decide what local directory you want indexed, for example `/path/to/my/host/directory`, and add it to the `.env` file.
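
As an illustration, here is a sketch of a few repository-related entries such a `.env` file might contain. The variable names come from the Environment Variables table above; the values are placeholders, and the idea that FTESI_DATA_REPO should point at the in-container mount path `/data/repo` is an assumption based on the docker run command below; treat the `.env.template` shipped with the image as the authoritative reference:

```bash
# Fragment of a .env file (placeholder values). /data/repo matches the container
# mount point used in the docker run command below.
FTESI_DATA_REPO=/data/repo
FTESI_REPO_NAME=my-test-repo
FTESI_DATA_REPO_BASE_URL=https://example.org/repo
```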

You should then be able to run the indexer by running:

```bash
docker run --env-file .env -v /path/to/my/host/directory:/data/repo mbari/file-to-es-indexer:X.X.X
```

## Development

### Preparation

For development purposes, there is a docker-compose.yml that can be used to spin up a test environment for you to run the code against during development. It will create a RabbitMQ server (so that the event consumer can connect to something and listen) and an Elasticsearch/Kibana combination to serve as the database that will hold the indexed information. There is some preparation that needs to be done before starting up the Docker environment.

1. First, decide what directory on your machine you want to use as the test directory for indexing. For this example, I will use `/my/path/to/index`.
2. Second, decide what you want to name your repository; see the naming conventions described for the FTESI_REPO_NAME environment variable above. For this example, I will use `my-test-repo`.
3. Third, decide what directory on your machine you want to use as storage for the files that are generated by the RabbitMQ, Elasticsearch, and Kibana servers. For this example, I will use `/my/services/data/directory`. Now, create a `.env` file at the same level as the `docker-compose.yml` file and enter your services data directory as a variable named FTESI_BASEDIR_HOST. Your file should look something like this:

   ```bash
   FTESI_BASEDIR_HOST=/my/services/data/directory
   ```

### Start the Docker Compose network

Once the preparation steps are done, you can start up the supporting services using the following steps:

1. Make sure you have Docker (and Docker Compose) installed on your machine and, from the command line, run:

   ```bash
   docker-compose up
   ```

With the supporting services up and running, you can now start to develop and test the application code. There are many different ways to run the code, but essentially you either define a config.json file (see the section above) or create environment variables in your development environment to configure the application (see the Environment Variables section above). As an example, I use IntelliJ for development and create a run/debug configuration that uses org.mbari.resources.RepositoryIndexer as the main class, and then I set the list of environment variables in that configuration. Then I can step through, debug, etc. using the IDE.
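
At any point during development, a quick way to confirm that the Elasticsearch container from the compose environment is reachable is to hit its HTTP endpoint. This assumes the default port 9200 documented above; adjust the port, and add credentials (for example `-u elastic:changeme`), to match whatever your docker-compose.yml actually configures:

```bash
# Basic "is it up" check against the Elasticsearch container
curl http://localhost:9200

# Cluster health should report green or yellow once the node has started
curl "http://localhost:9200/_cluster/health?pretty"
```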

### Once your changes are done and ready to be committed

The process for building the image is to run a Maven build, which constructs a jar file with the proper classes and resources. Then, a Docker image can be built using the Dockerfile, which adds the Maven target artifact to the image. You can then run the application by defining and mounting a config.json file or by defining environment variables.

When your development changes are ready to be released, do the following steps (a consolidated command sketch follows the list):

1. Come up with a version number that makes sense for the scope of changes you have made to this codebase. If they are bug fixes, you typically increment the third number in the version; if they are feature enhancements or additions that are not breaking changes for clients of this code, you can increment the second number. If the code changes are substantial and/or will break client usage, you should increment the first number in the version.
1. Add a line at the bottom of this file describing the changes that are associated with this new version number.
1. Update the pom.xml version (around line 9) to be the version number you came up with.
1. Update the Dockerfile to have the correct version in the name/version of the jar file
1. Commit your changes into your local Git repository.
1. Create a Git tag in your local repository with the version number you came up with.
1. Push the HEAD to the remote repository.
1. Push the tag you created to the remote repository.
1. Run the Maven build using:

   ```bash
   mvn clean compile assembly:single
   ```

1. Then build the Docker image using:

   ```bash
   docker build -t mbari/file-to-es-indexer:x.x.x .
   ```

1. You can push the image using (you will need to run `docker login` first):

   ```bash
   docker push mbari/file-to-es-indexer:x.x.x
   ```
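
For convenience, here is the whole release sequence collected into a single sketch. The version placeholders, branch name, remote name, and v-prefixed tag format are assumptions; replace them with whatever your repository actually uses:

```bash
# Steps 5-8: commit, tag, and push (branch, remote, and tag format are assumptions)
git commit -am "Prepare release x.x.x"
git tag vx.x.x
git push origin main
git push origin vx.x.x

# Step 9: build the assembly jar
mvn clean compile assembly:single

# Steps 10 and 11: build and push the Docker image (run 'docker login' first)
docker build -t mbari/file-to-es-indexer:x.x.x .
docker push mbari/file-to-es-indexer:x.x.x
```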

## Release Notes

* v0.0.1 - Initial tag on 4/10/2019
* v0.0.2 - Attempt to fix a bug that would cause indexing to fail if a symbolic link was hit.
* v0.0.3 - The actual fix for the file indexing failure when a file does not exist.
* v0.0.4 - Replaced the default content handler in Tika with a custom one that creates a more compact and clean content body to store in Elasticsearch. The default one was grabbing all kinds of junk.
* v0.0.5 - Added a function to look at the data in CSV files and guess at the data types for each variable. This was necessary so that database tables could be created based on the metadata about the data in the CSV file. Also replaced custom date parsing with the Joda-Time library.
* v0.0.6 - Fixed a NullPointerException bug introduced in v0.0.5.
* v0.0.7 - Put dataType as a direct property of variables instead of as an attribute.
* v0.0.8 - Fixed bugs in CSV variable parsing that were causing failures due to the way I thought CSVParser worked. I thought it could reset the iterator to roll through records, but it does not, so parsing failed on short files. Also added the ability to detect float data types from columns that started with decimal numbers but had something else (like units) mixed in with the data itself (not sure I should be accommodating bad CSV structure and this may come back to bite me).
* v0.0.9 - Added a CRC field to the file metadata to try and fix checksum problems I had with using MD5 hashes. They were very slow on large files and I am hoping the CRC calculations are quicker. I also added a 'contentTree' field to the Resource objects so that we can try and make the contents of tree-structured files, like JSON and XML, searchable.
* v0.0.10 - Did a complete refactor of how to index the metadata and content_tree fields based on this article. This allows for searching in arbitrary JSON/XML hierarchies while not generating huge mapping files like with dynamic mapping.