
Merge pull request #1322 from j-frei/master

Add offline deployment support
pull/1334/head
Hiroki Nakayama committed via GitHub 3 years ago
commit 026470c4bc
8 changed files with 375 additions and 0 deletions

1. docs/advanced/offline_deployment.md (+29)
2. tools/offline_deployment/offline_01_1-optional_use_https.sh (+28)
3. tools/offline_deployment/offline_01_2-patch_and_extract_Docker_images.sh (+40)
4. tools/offline_deployment/offline_01_3-download_APT_packages.sh (+22)
5. tools/offline_deployment/offline_02_1-install_APT_packages.sh (+30)
6. tools/offline_deployment/offline_02_2-import_Docker_images.sh (+16)
7. tools/offline_deployment/offline_03_1-runDoccano.sh (+7)
8. tools/offline_deployment/offline_patcher.py (+203)

29
docs/advanced/offline_deployment.md

@@ -0,0 +1,29 @@
# Doccano Offline Deployment
## Use Case
These offline deployment scripts are suited for deploying Doccano on an air-gapped Ubuntu 18.04/20.04 virtual machine (VM 2) with no internet connectivity (such as in clinical environments).
The preparation requires another machine (VM 1) with internet access, with `docker`/`docker-compose` preinstalled (and `$USER` in the `docker` group), running the same Ubuntu distribution as VM 2.
The focus is primarily on the `docker-compose`-based production deployment.
The files mentioned in this document are located in the `tools/offline_deployment/` directory.
## Setup Steps
Run the following steps on VM 1:
1. Clone this repository
2. Run the scripts `offline_01_*.sh` in ascending order
   - Either skip the script `offline_01_1-optional_use_https.sh` or modify it before running
   - Do NOT run these scripts as `sudo`! They will ask for sudo permissions when needed.

Now move over to VM 2:

3. Copy the repository folder from VM 1 to VM 2
4. Run the scripts `offline_02_*.sh` in ascending order
   - Do NOT run these scripts as `sudo`! They will ask for sudo permissions when needed.
5. Edit `docker-compose.prod.yml` to change the admin credentials
6. Run `docker-compose -f docker-compose.prod.yml up` in the repository root directory, or use the script `offline_03_*.sh`
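Step 5 is deliberately brief; the sketch below illustrates the kind of edit it means, run against a throwaway sample so nothing real is touched. The environment keys `ADMIN_USERNAME`/`ADMIN_PASSWORD` are assumptions here; check your `docker-compose.prod.yml` for the actual keys.

```shell
# Hypothetical compose fragment; the real file may use different key names.
cat > /tmp/compose-snippet.yml <<'EOF'
    environment:
      ADMIN_USERNAME: "admin"
      ADMIN_PASSWORD: "password"
EOF
# Swap the default password for your own, as step 5 asks:
sed -i 's|ADMIN_PASSWORD: "password"|ADMIN_PASSWORD: "my-new-secret"|' /tmp/compose-snippet.yml
cat /tmp/compose-snippet.yml
```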
## Remarks
The setup was tested on Ubuntu 18.04 machines.

28
tools/offline_deployment/offline_01_1-optional_use_https.sh

@@ -0,0 +1,28 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
cd ../..
unset DIR
# create certificate pair
sudo apt-get install -y openssl
openssl req -new -newkey rsa:4096 -sha256 -nodes -x509 -keyout ./nginx/cert.key -out ./nginx/cert.crt \
-subj "/C=US/ST=StateCode/L=LocationName/O=OrganizationName/OU=OrganizationUnit/CN=doccano.herokuapp.com"
# define cert paths inside container
ssl_cert="/certs/cert.crt"
ssl_cert_key="/certs/cert.key"
# edit nginx.conf
sed -i "s|listen 80;|listen 443 ssl;\n ssl_certificate $ssl_cert;\n ssl_certificate_key $ssl_cert_key;|g" nginx/nginx.conf
# edit nginx Dockerfile
echo "RUN mkdir -p /certs/" >> nginx/Dockerfile
echo "COPY nginx/cert.key /certs/cert.key" >> nginx/Dockerfile
echo "COPY nginx/cert.crt /certs/cert.crt" >> nginx/Dockerfile
# edit published port
sed -i "s|- 80:80|- 443:443|g" docker-compose.prod.yml
echo "Switched to HTTPS"
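The `sed` substitution above can be tried safely against a throwaway config before touching a real `nginx.conf`; this sketch applies the same edit to a sample fragment (GNU sed is assumed for the `\n` escapes in the replacement):

```shell
# Sample nginx server block, standing in for the real nginx/nginx.conf:
cat > /tmp/nginx-sample.conf <<'EOF'
server {
    listen 80;
    server_name _;
}
EOF
ssl_cert="/certs/cert.crt"
ssl_cert_key="/certs/cert.key"
# Same substitution as the script: swap the plain listener for TLS directives.
sed -i "s|listen 80;|listen 443 ssl;\n    ssl_certificate $ssl_cert;\n    ssl_certificate_key $ssl_cert_key;|g" /tmp/nginx-sample.conf
cat /tmp/nginx-sample.conf
```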

40
tools/offline_deployment/offline_01_2-patch_and_extract_Docker_images.sh

@@ -0,0 +1,40 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
unset DIR
# WORKAROUND: Downgrade docker-compose version to match Ubuntu 18.04 default compose package
echo "Patching docker-compose to match Ubuntu 18.04 compose package"
sed -i 's|version: "3.7"|version: "3.3"|g' ../../docker-compose.prod.yml
sed -i 's^dockerfile: backend/Dockerfile.prod^dockerfile: backend/Dockerfile.prod\n image: doccano-backend:custom^g' ../../docker-compose.prod.yml
sed -i 's^dockerfile: nginx/Dockerfile^dockerfile: nginx/Dockerfile\n image: doccano-nginx:custom^g' ../../docker-compose.prod.yml
# Modify Dockerfile for nginx to add python3 and offline patch
sed -i 's|FROM nginx|COPY tools/offline_deployment/offline_patcher.py /patch.py\
RUN apk add -U --no-cache py3-requests \\\
\&\& mkdir -p /app/dist/offline \&\& python3 /patch.py /app/dist /app/dist/offline /offline\
\
FROM nginx|' ../../nginx/Dockerfile
# Modify Dockerfile for backend to add python3 and offline patch
# TODO: Remark: Not needed due to SPA frontend
#sed -i 's|COPY ./Pipfile\* /backend/|COPY ./Pipfile\* /backend/\
#COPY tools/offline_deployment/offline_patcher.py /patch.py\
#RUN apt-get update \
# \&\& apt-get install -y --no-install-recommends \
# python3 python3-requests \
# \&\& apt-get clean \\\
# \&\& rm -rf /var/lib/apt/lists/\*\
# \&\& mkdir -p /backend/server/static/offline \&\& python3 /patch.py /backend/server /server/static/offline\
#\
#|' ../../backend/Dockerfile.prod
docker-compose -f ../../docker-compose.prod.yml pull
docker-compose -f ../../docker-compose.prod.yml build
docker image save -o doccano-backend.tar doccano-backend:custom
docker image save -o doccano-nginx.tar doccano-nginx:custom
docker image save -o postgres.tar postgres:13.1-alpine
docker image save -o rabbitmq.tar rabbitmq:3.8

22
tools/offline_deployment/offline_01_3-download_APT_packages.sh

@@ -0,0 +1,22 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
unset DIR
# Prepare and download packages
pdir="/offline_packages"
mkdir -p "$(pwd)${pdir}"
cd "$(pwd)${pdir}"
SELECTED_PACKAGES="wget unzip curl tar docker.io docker-compose"
apt-get download $(apt-cache depends --recurse --no-recommends --no-suggests \
--no-conflicts --no-breaks --no-replaces --no-enhances \
--no-pre-depends ${SELECTED_PACKAGES} | grep "^\w")
# Build package index
sudo apt-get install -y dpkg-dev
dpkg-scanpackages "." /dev/null | gzip -9c > Packages.gz
echo "Packages extracted to: $(pwd)${pdir}"
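The `grep "^\w"` filter above relies on `apt-cache depends --recurse` printing package names flush-left while relation lines (`Depends:`, `Recommends:`, …) are indented. A toy illustration with canned output (the package list is made up):

```shell
# Simulated `apt-cache depends --recurse` output: names at column 0,
# relations indented by two spaces.
printf '%s\n' \
  'curl' \
  '  Depends: libcurl4' \
  'libcurl4' \
  '  Depends: libc6' > /tmp/depends-sample.txt
# Keep only the flush-left package names, exactly as the download step does:
grep "^\w" /tmp/depends-sample.txt
# → curl
# → libcurl4
```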

30
tools/offline_deployment/offline_02_1-install_APT_packages.sh

@@ -0,0 +1,30 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
unset DIR
# Import APT packages
pdir="/offline_packages"
abs_pdir="$(pwd)${pdir}"
sudo mv /etc/apt/sources.list /etc/apt/sources.list.bak
cat <<EOF > sources.list
deb [trusted=yes] file://${abs_pdir} ./
EOF
sudo mv sources.list /etc/apt/sources.list
# Install APT packages
sudo apt-get update
SELECTED_PACKAGES="wget unzip curl tar docker.io docker-compose"
sudo apt-get install -y $SELECTED_PACKAGES
# Cleanup
sudo apt-get clean
sudo mv /etc/apt/sources.list.bak /etc/apt/sources.list
# Setup Docker
sudo usermod -aG docker $(whoami)
sudo systemctl enable docker.service
echo "Packages were installed. We need to reboot!"

16
tools/offline_deployment/offline_02_2-import_Docker_images.sh

@@ -0,0 +1,16 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
unset DIR
# Info: Docker image name is already set in previous scripts
## Set image tag in Compose to avoid image build
#sed -i 's^dockerfile: backend/Dockerfile.prod^dockerfile: backend/Dockerfile.prod\n image: doccano-backend:custom^g' ../../docker-compose.prod.yml
#sed -i 's^dockerfile: nginx/Dockerfile^dockerfile: nginx/Dockerfile\n image: doccano-nginx:custom^g' ../../docker-compose.prod.yml
# Load docker images
docker image load -i doccano-backend.tar
docker image load -i doccano-nginx.tar
docker image load -i postgres.tar
docker image load -i rabbitmq.tar

7
tools/offline_deployment/offline_03_1-runDoccano.sh

@@ -0,0 +1,7 @@
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$DIR"
unset DIR
docker-compose -f ../../docker-compose.prod.yml up -d

203
tools/offline_deployment/offline_patcher.py

@@ -0,0 +1,203 @@
#!/usr/bin/env python3
import sys, os, re
import uuid
from urllib.parse import urljoin
import requests
"""Script Information
This script scans all files of a given directory [1] for URL addresses and
hyperlink references.
All found URLs are requested for Content-Type.
For certain Content-Types (like js, css, or fonts), the file is downloaded and
stored locally into a given directory [2] and the existing URLs are altered
to a local URL location (with a given URL prefix [3]).
Downloaded files are scanned for URLs recursively.
Relative references in CSS files are an edge case that is
handled separately by a specific regex pattern.
Arguments:
1. <root directory [1]>
2. <local offline storage directory [2]>
3. <HTTP URL location prefix [3]>
Example:
- Given:
- File ./webspace/index.html, containing URL: https://example.com/library.js
- Directory ./webspace/static, containing static files,
serving content on HTTP location: /staticfiles
- Call:
$> python3 offline_patcher.py webspace/ webspace/static /staticfiles
- Result:
- Library from https://example.com/library.js is stored as file:
webspace/static/offline_<uuid>.js
- Link in file webspace/index.html is replaced to:
/staticfiles/offline_<uuid>.js
- File webspace/static/offline_<uuid>.js is scanned recursively for URLs
Author: Johann Frei
"""
def main():
    # root folder to scan for URLs
    root_folder = sys.argv[1]
    # offline folder to store static offline files
    offline_folder = sys.argv[2]
    # offline link prefix
    offline_prefix = sys.argv[3]

    offline_file = os.path.join(offline_folder, "offline_{}.{}")
    offline_link = offline_prefix + "/offline_{}.{}"

    mime_ptn = re.compile(r"(?P<mime>(?P<t1>[\w^\/]+)\/(?P<t2>[\S\.^\;]+))(\;|$)", re.IGNORECASE)
    # regex to find matches like: "https://<host>[:<port>]/a/link/location.html"
    link_ptn = re.compile(r"[\(\'\"\ ](?P<link>https?:\/\/(?P<host>(?P<h_host>((?=[^\(\)\'\"\ \:\/])(?=[\S]).)+))(?P<port>\:[0-9]+)?\/[^\(\)\'\"\ ]+)(?P<encl_stop>[\(\)\'\"\ ])")
    # regex to find matches like: url(../relative/parent_directory/links/without/quotes/are/hard)
    link_ptn_url = re.compile(r"url\([\"\']?(?P<link>((?=[^\)\"\'])(?=[\S]).)+)[\"\']?\)")

    # block special hosts
    forbidden_hosts = [
        re.compile(r"^.*registry\.npmjs\.org$"),  # no yarnpkg repository
        re.compile(r"^.*yarnpkg\.com$"),  # no yarnpkg repository
        re.compile(r"^[0-9\.]+$"),  # avoid IP addresses
        re.compile(r"^[^\.]+$"),  # needs a dot in host
    ]

    # only support certain content types
    supported_mime_types = [
        # (filter function -> bool, file extension -> str)
        (lambda m: m["t2"] == "javascript", lambda m: "js"),
        (lambda m: m["t2"] == "css", lambda m: "css"),
        (lambda m: m["t1"] == "font", lambda m: m["t2"]),
    ]

    # load all initial files
    files_to_check = []
    for cur_dir, n_dir, n_files in os.walk(root_folder):
        files_to_check += [os.path.join(cur_dir, f) for f in n_files]

    cached_urls = {}
    valid_urls = {}
    file_origins = {}

    i = 0
    while i < len(files_to_check):
        file_i = files_to_check[i]
        try:
            print("Inspect", file_i)
            with open(file_i, "r", encoding="utf-8") as f:
                t = f.read()

            link_findings_default = [{
                "abs": match.group("link"),
                "found": match.group("link"),
                "host": match.group("host")
            } for match in link_ptn.finditer(t)]

            # extract relative urls and convert them to absolute http urls
            link_findings_url_prefix = []
            for match in link_ptn_url.finditer(t):
                if os.path.abspath(file_i) in file_origins and not match.group("link").startswith("http"):
                    link_abs = urljoin(file_origins[os.path.abspath(file_i)], match.group("link"))
                    link_findings_url_prefix.append({
                        "abs": link_abs,
                        "found": match.group("link"),
                        "host": link_ptn.match("\"" + link_abs + "\"").group("host")
                    })

            for spot in link_findings_default + link_findings_url_prefix:
                absolute_link = spot["abs"]
                found_link = spot["found"]
                found_host = spot["host"]
                if absolute_link not in valid_urls:
                    # check the host (not the full link) against the blocklist
                    if any(fh.match(found_host) for fh in forbidden_hosts):
                        # host is forbidden
                        valid_urls[absolute_link] = False
                    else:
                        # host is not forbidden -> check mime type
                        response = requests.head(absolute_link, allow_redirects=True)
                        mime = response.headers.get("Content-Type", None)
                        if mime is None:
                            valid_urls[absolute_link] = False
                        else:
                            mime_match = mime_ptn.match(mime)
                            if mime_match is None:
                                valid_urls[absolute_link] = False
                            else:
                                final_fext = None
                                # try supported content types
                                for smt, get_fext in supported_mime_types:
                                    if smt(mime_match):
                                        final_fext = get_fext(mime_match)
                                        break
                                if final_fext is None:
                                    # mime not supported
                                    valid_urls[absolute_link] = False
                                else:
                                    # mime is supported -> store and remember file
                                    valid_urls[absolute_link] = True
                                    file_unique = uuid.uuid4()
                                    target_link = offline_link.format(file_unique, final_fext)
                                    target_file = offline_file.format(file_unique, final_fext)
                                    # download file
                                    try:
                                        file_response = requests.get(absolute_link, allow_redirects=True)
                                        file_response.raise_for_status()
                                        with open(target_file, "wb") as download_file:
                                            for chunk in file_response.iter_content(100000):
                                                download_file.write(chunk)
                                        # also check downloaded file for links later
                                        files_to_check.append(target_file)
                                        print("Downloaded file:", absolute_link)
                                    except requests.RequestException:
                                        print("Link could not be downloaded:", absolute_link)
                                    # register downloaded file; the common append
                                    # below records the first occurrence
                                    cached_urls[absolute_link] = {
                                        "input_link": absolute_link,
                                        "target_link": target_link,
                                        "file": target_file,
                                        "fext": final_fext,
                                        "found": []
                                    }
                                    # store reverse lookup for recursive url("../rel/link") patterns
                                    file_origins[os.path.abspath(target_file)] = absolute_link
                if valid_urls[absolute_link]:
                    # add to cached urls entries
                    cached_urls[absolute_link]["found"].append({"file": file_i, "found_link": found_link})
            print("Checked file:", file_i)
        except UnicodeDecodeError:
            print("Skip file (not UTF-8):", file_i)
        except Exception as err:
            print("Unknown error... Skip file:", file_i, err)
        # look at next file
        i += 1

    # replace found links with their offline counterparts
    for _, cached in cached_urls.items():
        for edit_file in cached["found"]:
            with open(edit_file["file"], "r", encoding="utf-8") as f:
                file_content = f.read()
            with open(edit_file["file"], "w", encoding="utf-8") as f:
                f.write(file_content.replace(edit_file["found_link"], cached["target_link"]))
        print("Patched", len(cached["found"]), "file(s) with link:", cached["target_link"])
    print("Done")


if __name__ == "__main__":
    main()
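A rough shell analogue of what `link_ptn` extracts can help when eyeballing a file by hand. This is a deliberate simplification: the Python pattern also captures host, port, and the enclosing quote or parenthesis, and excludes more delimiters.

```shell
# Sample file standing in for webspace/index.html:
cat > /tmp/index-sample.html <<'EOF'
<script src="https://example.com/library.js"></script>
<link href='https://cdn.example.org:8080/style.css' rel="stylesheet">
EOF
# Pull out absolute http(s) URLs, stopping at quotes, parens, and spaces:
grep -oE "https?://[^\"'() ]+" /tmp/index-sample.html
# → https://example.com/library.js
# → https://cdn.example.org:8080/style.css
```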