Json Search Server

v0.1.0

This application provides a fuzzy search server for data stored in JSON format.

It was developed for use with a website built with Jekyll.


License

Copyright (C) 2019, Maxim Lihachev, <envrm@yandex.ru>

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation, version 3.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

Algorithm

The logic of this application is fairly simple:

  • All fields in the file are divided into separate words.
  • Each unique word in the search string is compared to the words in the file.
  • If the match is exact, the highest score is awarded.
  • If the match is inexact, the Levenshtein distance is calculated and the word score is derived from it.
  • Results are sorted in descending order of accuracy.
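
The steps above can be sketched in Python as follows. This is only an illustration of the general technique, not the server's actual code: the tokenization rule, the constants `EXACT_SCORE`, `MAX_DISTANCE` and `PENALTY`, and all function names are assumptions made for this example.

```python
import re

EXACT_SCORE = 100  # hypothetical score for an exact match
MAX_DISTANCE = 2   # hypothetical cut-off for inexact matches
PENALTY = 25       # hypothetical score lost per unit of edit distance


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (assumed tokenization rule)."""
    return re.findall(r"\w+", text.lower())


def word_score(query_word: str, words: list[str]) -> int:
    """Score one query word against all words of a record."""
    if not words:
        return 0
    if query_word in words:
        return EXACT_SCORE
    distance = min(levenshtein(query_word, w) for w in words)
    if distance > MAX_DISTANCE:
        return 0
    return EXACT_SCORE - PENALTY * distance


def search(records: list[dict], query: str) -> list[tuple[dict, int]]:
    """Return (record, score) pairs, best match first."""
    query_words = set(tokenize(query))
    scored = []
    for record in records:
        words = [w for value in record.values() for w in tokenize(str(value))]
        total = sum(word_score(q, words) for q in query_words)
        if total > 0:
            scored.append((record, total))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these assumed weights, a record containing the word "article" ranks above one that only contains "articles" when the query is "article".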

How to use

Build

$ make build
or
$ make static

A Vagrantfile is provided that starts a CentOS 7 VM. This VM is used to build a statically linked binary for the target server.

host:~$ vagrant up
host:~$ vagrant ssh

vm:~$ cd /vagrant
vm:~$ make static

The static executable is placed in the bin/ directory.

Run

$ make exec

json-search-server v0.1.0: seach server for Json

json-search-server [OPTIONS]

Common flags:
  -p    --port=3000              Search server port
  -l    --logs=apache            apache | simple | json | disable | full
  -c    --cached                 Store Json data into memory
  -j -f --json=data.json --file  Json file name
  -?    --help                   Display help message
  -V    --version                Print version information
        --numeric-version        Print just the version number

Install

$ make install

Requests

Get Server Settings

$ curl -sq http://localhost:3000/info

{
  "cached": false,
  "logs": "full",
  "file": "sample.json",
  "port": 3000
}

Health Check

$ curl -sq http://localhost:3000/health

{
  "status": "ok",
  "message": "2019-06-19 09:03:26.402385 UTC"
}

If the JSON file becomes unreadable, the health check reports a failure:

$ chmod a-r sample.json
$ curl -sq http://localhost:3000/health

{
  "status": "fail",
  "message": "2019-06-19 09:05:32.950256 UTC sample.json: openFile: permission denied (Permission denied)"
}
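
A monitoring script might interpret these responses roughly as shown below. This is a sketch that relies only on the `status` and `message` fields seen in the examples above; the function names are hypothetical, and since the HTTP status code returned on failure is not documented here, the body is read in either case.

```python
import json
import urllib.error
import urllib.request


def parse_health(body: str) -> tuple[bool, str]:
    """Return (is_healthy, message) for a /health response body."""
    payload = json.loads(body)
    return payload.get("status") == "ok", payload.get("message", "")


def check(base_url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Query <base_url>/health and interpret the answer."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as reply:
            body = reply.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # The server may answer with an error status; the body can still be read.
        body = error.read().decode("utf-8")
    return parse_health(body)
```

For example, `check("http://localhost:3000")` would return the health flag together with the server's message.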

Search

$ curl -sq http://localhost:3000/search/article
$ curl -sq http://localhost:3000/search?query=article

[
  [
    {
      "url": "/tags/foo.html",
      "authors": "Author I, Author II",
      "content": "This is article about Foo and Bar.",
      "year": "1990",
      "title": "Page one"
    },
    1,
    -100
  ]
]
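
A client could build the two request forms and unpack the response along these lines. This is a sketch with hypothetical function names. Each result is assumed to be an array whose first element is the matching record, as in the sample above; the two trailing numbers are ignored here because their meaning is not documented in this README.

```python
import json
from urllib.parse import quote, urlencode


def search_url(base: str, query: str, use_query_string: bool = False) -> str:
    """Build either /search/<query> or /search?query=<query>."""
    if use_query_string:
        return f"{base}/search?{urlencode({'query': query})}"
    return f"{base}/search/{quote(query, safe='')}"


def records_from_response(body: str) -> list[dict]:
    """Extract the record (first element) from every result entry."""
    return [entry[0] for entry in json.loads(body)]
```

The order of the returned list is the order the server sent, so the best match stays first.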

Logs

Several log formats are available. Logging can be disabled completely by passing the argument --log disable.

--log apache (default)

127.0.0.1 - - [19/Jun/2019:12:14:16 +0300] "GET /info HTTP/1.1" 200 - "" "curl/7.54.0"
127.0.0.1 - - [19/Jun/2019:12:14:18 +0300] "GET /health HTTP/1.1" 200 - "" "curl/7.54.0"

--log simple

GET /health
  Accept: */*
  Status: 200 OK 0.00003s
GET /info
  Accept: */*
  Status: 200 OK 0.000018s

--log json

{"time":"19/Jun/2019:12:20:06 +0300","response":{"status":200,"size":null,"body":null},"request":{"httpVersion":"1.1","path":"/health","size":0,"body":"","durationMs":7.0e-2,"remoteHost":{"hostAddress":"127.0.0.1","port":63436},"headers":[["Host","localhost:3000"],["User-Agent","curl/7.54.0"],["Accept","*/*"]],"queryString":[],"method":"GET"}}
{"time":"19/Jun/2019:12:20:07 +0300","response":{"status":200,"size":null,"body":null},"request":{"httpVersion":"1.1","path":"/info","size":0,"body":"","durationMs":4.0e-2,"remoteHost":{"hostAddress":"127.0.0.1","port":63438},"headers":[["Host","localhost:3000"],["User-Agent","curl/7.54.0"],["Accept","*/*"]],"queryString":[],"method":"GET"}}

Or pretty-printed:

{
  "time": "19/Jun/2019:12:20:06 +0300",
  "response": {
    "status": 200,
    "size": null,
    "body": null
  },
  "request": {
    "httpVersion": "1.1",
    "path": "/health",
    "size": 0,
    "body": "",
    "durationMs": 0.07,
    "remoteHost": {
      "hostAddress": "127.0.0.1",
      "port": 63436
    },
    "headers": [
      [
        "Host",
        "localhost:3000"
      ],
      [
        "User-Agent",
        "curl/7.54.0"
      ],
      [
        "Accept",
        "*/*"
      ]
    ],
    "queryString": [],
    "method": "GET"
  }
}
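
Because each JSON log entry is a single line, the log is easy to process with a script. The sketch below uses only fields visible in the sample entries (`request.method`, `request.path`, `response.status`, `request.durationMs`); the function names are hypothetical.

```python
import json


def summarize(line: str) -> tuple[str, str, int, float]:
    """Return (method, path, status, duration_ms) for one JSON log line."""
    entry = json.loads(line)
    request = entry["request"]
    return (request["method"], request["path"],
            entry["response"]["status"], request["durationMs"])


def mean_duration_by_path(lines: list[str]) -> dict[str, float]:
    """Average durationMs per request path over the given log lines."""
    durations: dict[str, list[float]] = {}
    for line in lines:
        _, path, _, duration = summarize(line)
        durations.setdefault(path, []).append(duration)
    return {path: sum(values) / len(values) for path, values in durations.items()}
```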

--log full

"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
"2019-06-19 09:05:22.226024 UTC"
"------------------------------------------------------------------------------------------------------------------------"
"Query: /health"
"------------------------------------------------------------------------------------------------------------------------"
Request
    { requestMethod = "GET"
    , httpVersion = HTTP/1.1
    , rawPathInfo = "/health"
    , rawQueryString = ""
    , requestHeaders =
        [
            ( "Host"
            , "localhost:3000"
            )
        ,
            ( "User-Agent"
            , "curl/7.54.0"
            )
        ,
            ( "Accept"
            , "*/*"
            )
        ]
    , isSecure = False
    , remoteHost = 127.0.0.1:63212
    , pathInfo = [ "health" ]
    , queryString = []
    , requestBody = <IO ByteString>
    , vault = <Vault>
    , requestBodyLength = KnownLength 0
    , requestHeaderHost = Just "localhost:3000"
    , requestHeaderRange = Nothing
    }

Caching

It is possible to store all data in RAM and use it even if the file is not readable. To do this, specify the argument --cached.
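
The difference between the two modes can be illustrated with a small sketch. It is not the server's implementation: the class name `JsonSource` is hypothetical, and it assumes that in cached mode the file is read once at startup, whereas otherwise it is read on every request.

```python
import json


class JsonSource:
    """Provides the JSON data either from memory or from disk."""

    def __init__(self, path: str, cached: bool = False):
        self.path = path
        self.cached = cached
        # In cached mode the file is read once, up front (assumed behaviour).
        self._data = self._read() if cached else None

    def _read(self):
        with open(self.path, encoding="utf-8") as handle:
            return json.load(handle)

    def load(self):
        """Return the data; raises OSError if an uncached file is unreadable."""
        return self._data if self.cached else self._read()
```

In this model, a cached source keeps answering after the file becomes unreadable, while an uncached one fails on the next request.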

NGINX

To run this service behind an nginx web server, the following configuration can be used:

upstream search_backend {
        server 127.0.0.1:3000;
}

server {
        listen 8080;
        server_name search.server www.search.server;

        # ...

        location / {
                add_header 'Access-Control-Allow-Origin' "$http_origin";
                add_header 'Access-Control-Allow-Methods' 'GET, POST';
                add_header 'Access-Control-Allow-Credentials' 'true';
                add_header 'Access-Control-Allow-Headers' 'User-Agent,Keep-Alive,Content-Type';

                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Host $host;
                proxy_pass http://search_backend$uri?$args;
        }

        # This prevents intruders from obtaining information
        # about the internal structure of the server.
        location /info {
                proxy_pass http://search_backend/health;
        }
}

Jekyll

A JSON file can be generated with Jekyll using a template like this:

---
layout: none
search: none
---

{% assign all_pages = site.pages | where_exp: 'p', 'p.search != "none"' | where_exp: 'p', 'p.layout != "none"' %}

[

  {% for p in all_pages %}
    {% capture all_authors %}{{ p.authors | join: ',' }}, {{ p.translators | join: ',' }}, {{ p.editors | join: ','  }}{% endcapture %}
      {
        "title": "{{ p.title | split: '<br />' | join: ' ' | xml_escape }}{% if p.tag %} «{{ p.tag }}» {% endif %}",
        "authors": "{{ p.authors | join: ',' }}",
        "persons": "{{ all_authors }}",
        "content": {{ p.content | strip_html | jsonify }},
        "tags": "{{ p.tags | join: ', ' }}",
        "year": "{{ p.year }}",
        "url": "{{ p.url | xml_escape }}"
      }
      {% unless forloop.last %},{% endunless %}
  {% endfor %}

]
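
Since the template interpolates some values directly between quotes (for example `authors`), a stray double quote or backslash in a page's front matter can produce invalid JSON. A quick check of the generated file before handing it to the server might look like the sketch below. The field list is taken from the template above; whether the server actually requires all of these fields is an assumption, and the function name is hypothetical.

```python
import json

# Field names as they appear in the template; all are emitted as strings.
EXPECTED_FIELDS = ("title", "authors", "persons", "content", "tags", "year", "url")


def validate_index(text: str) -> list[str]:
    """Return a list of problems found in the generated text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error}"]
    if not isinstance(data, list):
        return ["top-level value is not an array"]
    problems = []
    for index, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not an object")
            continue
        for field in EXPECTED_FIELDS:
            if not isinstance(record.get(field), str):
                problems.append(f"record {index}: field {field!r} missing or not a string")
    return problems
```

Reading the generated file and passing its contents to `validate_index` returns an empty list when the file is well formed.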