
commit k8s branch

Branch: k8s · zhangshy, 3 years ago · commit 63187a6314
100 changed files with 1588 additions and 9272 deletions

  1. +19 -0  .gitattributes
  2. +21 -0  .gitignore
  3. +14 -15  README.md
  4. +13 -14  README_zh.md
  5. +0 -0  cambricon-integration/.gitkeep
  6. +6 -0  cambricon-integration/cambricon-image-upload
  7. +0 -0  cambricon-integration/cambricon-k8s/.gitkeep
  8. +26 -0  cambricon-integration/cambricon-k8s/cambricon-k8s-environment
  9. +6 -0  cambricon-integration/cambricon-k8s/cambricon-k8s-framework-image
  10. +65 -0  cambricon-integration/cambricon-k8s/cambricon-k8s-frameworkController
  11. +0 -0  cambricon-integration/cambricon-neuware-image/.gitkeep
  12. +67 -0  cambricon-integration/cambricon-neuware-image/cambricon-image
  13. +0 -0  cambricon-integration/cambricon-operate/.gitkeep
  14. +33 -0  cambricon-integration/cambricon-operate/cambricon-operate-manual
  15. +0 -130  cluster-configuration/cluster-configuration.yaml
  16. +0 -83  cluster-configuration/cluster-configuration.yaml.template
  17. +0 -85  cluster-configuration/k8s-role-definition.yaml
  18. +0 -55  cluster-configuration/kubernetes-configuration.yaml
  19. +0 -182  cluster-configuration/services-configuration.yaml
  20. +4 -0  deploy-script/.gitignore
  21. +97 -0  deploy-script/build.py
  22. +64 -0  deploy-script/config/dev.yaml
  23. +35 -0  deploy-script/config/prod.yaml
  24. +97 -0  deploy-script/deploy.py
  25. +0 -0  deploy-script/services/__init__.py
  26. +39 -0  deploy-script/services/image_factory_agent.py
  27. +36 -0  deploy-script/services/image_factory_shield.py
  28. +39 -0  deploy-script/services/log_service_bee.py
  29. +40 -0  deploy-script/services/log_service_queen.py
  30. +47 -0  deploy-script/services/rest_server.py
  31. +57 -0  deploy-script/template/image-factory-agent.yaml
  32. +44 -0  deploy-script/template/image-factory-shield.yaml
  33. +57 -0  deploy-script/template/log-service-bee.yaml
  34. +54 -0  deploy-script/template/log-service-queen.yaml
  35. +84 -0  deploy-script/template/rest-server.yaml
  36. +0 -0  deploy-script/utils/__init__.py
  37. +14 -0  deploy-script/utils/dir.py
  38. +14 -0  deploy-script/utils/docker.py
  39. +19 -0  deploy-script/utils/k8s.py
  40. +29 -0  deploy-script/utils/setting.py
  41. +89 -0  efk/README_zh.md
  42. +19 -0  efk/es-external-service.yaml
  43. +15 -0  efk/es-ingress.yaml
  44. +152 -0  efk/es-statefulset.yaml
  45. +173 -0  efk/filebeat-kubernetes.yaml
  46. +0 -94  frameworklauncher/README-zh.md
  47. +0 -81  frameworklauncher/README.md
  48. +0 -30  frameworklauncher/bin/start.bat
  49. +0 -54  frameworklauncher/build-internal.bat
  50. +0 -21  frameworklauncher/build.bat
  51. +0 -32  frameworklauncher/conf/frameworklauncher.yml
  52. +0 -914  frameworklauncher/doc/USERMANUAL.md
  53. +0 -729  frameworklauncher/doc/USERMANUAL_zh.md
  54. +0 -19  frameworklauncher/doc/example/ExampleFramework.json
  55. BIN  frameworklauncher/doc/img/Architecture.png
  56. BIN  frameworklauncher/doc/img/Pipeline.png
  57. +0 -227  frameworklauncher/pom.xml
  58. +0 -1465  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/ApplicationMaster.java
  59. +0 -25  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/Bootstrap.java
  60. +0 -186  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/Configuration.java
  61. +0 -131  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/FrameworkInfoPublisher.java
  62. +0 -64  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/NMClientCallbackHandler.java
  63. +0 -162  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/Node.java
  64. +0 -66  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/RMClientCallbackHandler.java
  65. +0 -84  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/RMResyncHandler.java
  66. +0 -426  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/RequestManager.java
  67. +0 -416  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/SelectionManager.java
  68. +0 -78  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/SelectionResult.java
  69. +0 -998  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/StatusManager.java
  70. +0 -88  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/TaskEvent.java
  71. +0 -71  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/applicationmaster/TaskStatusLocator.java
  72. +0 -256  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/client/LauncherClient.java
  73. +0 -70  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/GlobalConstants.java
  74. +0 -74  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/definition/FrameworkStateDefinition.java
  75. +0 -72  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/definition/TaskStateDefinition.java
  76. +0 -54  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/AggregateException.java
  77. +0 -38  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/AuthorizationException.java
  78. +0 -38  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/BadRequestException.java
  79. +0 -54  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/LauncherClientException.java
  80. +0 -42  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/NonTransientException.java
  81. +0 -38  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/NotAvailableException.java
  82. +0 -38  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/NotFoundException.java
  83. +0 -38  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/ThrottledRequestException.java
  84. +0 -42  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exceptions/TransientException.java
  85. +0 -450  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exit/ExitDiagnostics.java
  86. +0 -82  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exit/ExitStatusKey.java
  87. +0 -39  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exit/ExitStatusValue.java
  88. +0 -136  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exts/CommonExts.java
  89. +0 -66  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/exts/HadoopExts.java
  90. +0 -86  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/log/ChangeAwareLogger.java
  91. +0 -131  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/log/DefaultLogger.java
  92. +0 -25  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AMType.java
  93. +0 -60  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AccessControlList.java
  94. +0 -59  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AclConfiguration.java
  95. +0 -53  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AggregatedFrameworkRequest.java
  96. +0 -51  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AggregatedFrameworkStatus.java
  97. +0 -43  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AggregatedLauncherRequest.java
  98. +0 -43  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AggregatedLauncherStatus.java
  99. +0 -41  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AggregatedTaskRoleStatus.java
  100. +0 -28  frameworklauncher/src/main/java/com/microsoft/frameworklauncher/common/model/AntiAffinityLevel.java

.gitattributes (+19 -0)

@@ -0,0 +1,19 @@
# Auto detect text files and perform LF normalization
* text=auto
*.cs text diff=csharp
*.java text diff=java
*.html text diff=html
*.py text diff=python
*.pl text diff=perl
*.pm text diff=perl
*.css text eol=lf
*.js text eol=lf
*.sql text
*.sh text eol=lf
*.mustache text eol=lf
*.bat text eol=crlf
*.cmd text eol=crlf
*.vcxproj text merge=union eol=crlf
*.csproj text merge=union eol=crlf
*.sln text merge=union eol=crlf
*.tar.gz filter=lfs diff=lfs merge=lfs -text
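The attribute lines above map glob patterns to per-file settings, with later matching lines taking precedence. A minimal sketch of that lookup (real git merges attributes per-key across all matching lines; this illustration only keeps the last matching line):

```python
from fnmatch import fnmatch

# A few of the .gitattributes lines above, as (pattern, attributes) pairs.
RULES = [
    ("*", "text=auto"),
    ("*.sh", "text eol=lf"),
    ("*.bat", "text eol=crlf"),
    ("*.tar.gz", "filter=lfs diff=lfs merge=lfs -text"),
]

def attributes_for(path):
    """Return the attributes of the last rule whose pattern matches the path."""
    matched = ""
    for pattern, attrs in RULES:
        if fnmatch(path, pattern):
            matched = attrs  # later rules override earlier ones
    return matched

print(attributes_for("deploy.bat"))  # a .bat file gets CRLF endings
```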

.gitignore (+21 -0)

@@ -0,0 +1,21 @@
*.iml
*.gv
*.ipr
*.iws
*.orig
*.rej
*.sdf
*.suo
*.vcxproj.user
*.log
.idea
.svn
.classpath
.project
.settings
.vscode
target/
build/
out/
tmp/
dist/

README.md (+14 -15)

@@ -18,9 +18,11 @@ OPENI supports GPU scheduling, a key requirement of deep learning job.
For better performance, OPENI supports fine-grained topology-aware job placement that can request for the GPU with a specific location (e.g., under the same PCI-E switch).

OPENI embraces a [microservices](https://en.wikipedia.org/wiki/Microservices) architecture: every component runs in a container.
The system leverages [Kubernetes](https://kubernetes.io/) to deploy and manage static components in the system.
The more dynamic deep learning jobs are scheduled and managed by [Hadoop](http://hadoop.apache.org/) YARN with our [GPU enhancement](https://issues.apache.org/jira/browse/YARN-7481).
The training data and training results are stored in Hadoop HDFS.
The system leverages [Kubernetes](https://kubernetes.io/) to deploy and manage system services.
In the latest version of OPENI, the more dynamic deep learning jobs are also scheduled by Kubernetes,
so that both system services and deep learning jobs are scheduled and managed through Kubernetes.
The storage of training data and results can be customized according to platform/equipment requirements.
Job logs are collected by [Filebeat](https://www.elastic.co/cn/products/beats/filebeat) and stored in an [Elasticsearch](https://www.elastic.co/cn/products/elasticsearch) cluster.

## An Open AI Platform for R&D and Education

@@ -44,7 +46,7 @@ OPENI operates in an open model: contributions from academia and industry are al
### Prerequisite

The system runs in a cluster of machines each equipped with one or multiple GPUs.
Each machine in the cluster runs Ubuntu 16.04 LTS and has a statically assigned IP address.
Each machine in the cluster runs Ubuntu 18.04 LTS and has a statically assigned IP address.
To deploy services, the system further relies on a Docker registry service (e.g., [Docker hub](https://docs.docker.com/docker-hub/))
to store the Docker images for the services to be deployed.
The system also requires a dev machine that runs in the same environment that has full access to the cluster.
@@ -53,11 +55,10 @@ And the system need [NTP](http://www.ntp.org/) service for clock synchronization
### Deployment process
To deploy and use the system, the process consists of the following steps.

1. Build the binary for [Hadoop AI](./hadoop-ai/README.md) and place it in the specified path*
2. [Deploy kubernetes and system services](./openi-management/README.md)
1. [Deploy Kubernetes 1.13 and system services](./openi-management/README.md)
2. Use Kubernetes to deploy [FrameworkController](https://github.com/microsoft/frameworkcontroller)
3. Access [web portal](./webportal/README.md) for job submission and cluster management

\* If step 1 is skipped, a standard Hadoop 2.9.0 will be installed instead.

#### Kubernetes deployment

@@ -72,7 +73,7 @@ Please refer to service deployment [readme](./openi-management/README.md) for de
#### Job management

After system services have been deployed, user can access the web portal, a Web UI, for cluster management and job management.
Please refer to this [tutorial](job-tutorial/README.md) for details about job submission.
Please refer to this [tutorial](./user%20manual.pdf) for details about job submission.

#### Cluster management

@@ -88,12 +89,10 @@ The system architecture is illustrated above.
User submits jobs or monitors cluster status through the [Web Portal](./webportal/README.md),
which calls APIs provided by the [REST server](./rest-server/README.md).
Third party tools can also call REST server directly for job management.
Upon receiving API calls, the REST server coordinates with [FrameworkLauncher](./frameworklauncher/README.md) (short for Launcher)
to perform job management.
The Launcher Server handles requests from the REST Server and submits jobs to Hadoop YARN.
The job, scheduled by YARN with [GPU enhancement](https://issues.apache.org/jira/browse/YARN-7481),
can leverage GPUs in the cluster for deep learning computation. Other type of CPU based AI workloads or traditional big data job
Upon receiving API calls, the REST server submits the job to the Kubernetes ApiServer, and the Kubernetes scheduler places the job on a node with the requested CPU, GPU, and other resources.
[FrameworkController](https://github.com/microsoft/frameworkcontroller) monitors the job's life cycle in the Kubernetes cluster.
The REST server retrieves job status from the Kubernetes ApiServer, and this status is displayed on the web portal.
Other types of CPU-based AI workloads or traditional big data jobs
can also run on the platform, coexisting with the GPU-based jobs.
The platform leverages HDFS to store data. All jobs are assumed to support HDFS.
All the static services (blue-lined box) are managed by Kubernetes, while jobs (purple-lined box) are managed by Hadoop YARN.
The storage of training data and results can be customized according to platform/equipment requirements.
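The submission path described above amounts to the REST server building a Framework custom resource and POSTing it to the Kubernetes ApiServer. The sketch below illustrates that flow using the FrameworkController CRD group named in this README; the resource fields, function names, and image are illustrative assumptions, not the actual REST server code:

```python
import json

# CRD coordinates of FrameworkController's Framework resource.
GROUP, VERSION, PLURAL = "frameworkcontroller.microsoft.com", "v1", "frameworks"

def build_framework(name, image, gpus=1):
    """Build a minimal single-task Framework custom-resource body."""
    return {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "Framework",
        "metadata": {"name": name},
        "spec": {
            "executionType": "Start",
            "taskRoles": [{
                "name": "worker",
                "taskNumber": 1,
                "task": {"pod": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }}},
            }],
        },
    }

def submit_path(namespace):
    # ApiServer URL path the REST server would POST the JSON body to.
    return f"/apis/{GROUP}/{VERSION}/namespaces/{namespace}/{PLURAL}"

body = build_framework("demo-job", "tensorflow/tensorflow:1.14.0-gpu", gpus=1)
print(submit_path("default"))
print(json.dumps(body)[:60])
```

The scheduler then binds the resulting pods to nodes, and FrameworkController watches the Framework object's status, which the REST server reads back for the web portal.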


README_zh.md (+13 -14)

@@ -18,9 +18,10 @@ OPENI支持在GPU集群中运行AI任务作业(比如深度学习任务作业
为了能得到更好的性能,OPENI支持细粒度的拓扑感知任务部署,可以获取到指定位置的GPU(比如获取在相同的PCI-E交换机下的GPU)。

启智采用[microservices](https://en.wikipedia.org/wiki/Microservices) 结构:每一个组件都在一个容器中运行。
平台利用[Kubernetes](https://kubernetes.io/) 来部署和管理系统中的静态组件。
其余动态的深度学习任务使用[Hadoop](http://hadoop.apache.org/) YARN和[GPU强化](https://issues.apache.org/jira/browse/YARN-7481)进行调度和管理。
训练数据和训练结果储存在Hadoop HDFS上。
平台利用[Kubernetes](https://kubernetes.io/) 来部署和管理系统服务。
平台的最新版本,动态的深度学习任务的调度引擎也使用Kubernetes,使得系统服务和深度学习任务都使用Kubernetes进行调度和管理。
训练数据和训练结果储存可根据平台/设备需求自定义。任务日志采用[Filebeat](https://www.elastic.co/cn/products/beats/filebeat)收集,
[Elasticsearch](https://www.elastic.co/cn/products/elasticsearch)集群存储。

## 用于研发及教育的开源AI平台

@@ -44,18 +45,16 @@ OPENI以开源的模式运营:来自学术和工业界的贡献我们都非常
### 前提要求

该系统在一组机器集群上运行,每台机器都配有一块或多块GPU。
集群中的每台机器都运行Ubuntu 16.4 LTS,并有一个静态分配的IP地址。为了部署服务,系统进一步使用Docker注册服务 (例如[Docker hub](https://docs.docker.com/docker-hub/)) 来存储要部署的服务的Docker镜像。系统还需要一台可以完全访问集群的、运行有相同环境的开发机器。系统还需要[NTP](http://www.ntp.org/)服务进行时钟同步。
集群中的每台机器都运行Ubuntu 18.04 LTS,并有一个静态分配的IP地址。为了部署服务,系统进一步使用Docker注册服务 (例如[Docker hub](https://docs.docker.com/docker-hub/)) 来存储要部署的服务的Docker镜像。系统还需要一台可以完全访问集群的、运行有相同环境的开发机器。系统还需要[NTP](http://www.ntp.org/)服务进行时钟同步。

### 部署过程

执行以下几个步骤来部署和使用本系统。

1. 为[Hadoop AI](./hadoop-ai/README.md)构造二进制文件并将其放在指定路径中*
2. [部署kubernetes和系统服务](./openi-management/README.md)
1. [部署kubernetes 1.13和系统服务](./openi-management/README.md)
2. 使用kubernetes部署[FrameworkController服务](https://github.com/microsoft/frameworkcontroller)
3. 访问[web门户页面](./webportal/README.md) 进行任务提交和集群管理

\* 如果跳过步骤1,则将会安装标准版Hadoop 2.9.0。

#### Kubernetes部署

平台使用Kubernetes(k8s)来部署和管理系统服务。
@@ -69,7 +68,7 @@ OPENI以开源的模式运营:来自学术和工业界的贡献我们都非常
#### 作业管理

系统服务部署完成后, 用户可以访问Web门户页面(一个Web UI界面)来进行集群和作业管理。
关于任务作业的提交,请参阅[指南](job-tutorial/README.md)。
关于任务作业的提交,请参阅[指南](./user%20manual.pdf)。

#### 集群管理

@@ -78,12 +77,12 @@ Web门户上也提供了Web UI进行集群的管理。
## 系统结构

<p style="text-align: left;">
<img src="./sysarch-zh.png" title="System Architecture" alt="System Architecture" />
<img src="./sysarch.png" title="System Architecture" alt="System Architecture" />
</p>


系统的整体结构如上图所示。
用户通过[Web门户](./webportal/README.md)提交了任务作业或集群状态监视的申请,该操作会调用[REST服务器](./rest-server/README.md)提供的API。
第三方工具也可以直接调用REST服务器进行作业管理。收到API调用后,REST服务器与[FrameworkLauncher](./frameworklauncher/README.md)(简称Launcher)协同工作来进行作业管理。Launcher服务器处理来自REST服务器的请求,并将任务作业提交到Hadoop YARN。由YARN和[GPU强化](https://issues.apache.org/jira/browse/YARN-7481)调度的作业, 可以使用集群中的GPU资源进行深度学习运算。其他基于CPU的AI工作或者传统的大数据任务作业也可以在平台上运行,与那些基于GPU的作业共存。
平台使用HDFS来存储数据。我们假设所有任务作业都支持HDFS。 所有静态服务(蓝色框)都由Kubernetes管理,而任务作业(紫色框)则由Hadoop YARN管理。
用户通过[Web门户](./webportal/README.md)提交了任务作业或集群状态监视的申请,该操作会调用[Restserver服务](./rest-server/README.md)提供的API。
第三方工具也可以直接调用Restserver服务进行作业管理。收到API调用后,Restserver服务会将任务作业提交到k8s ApiServer,k8s的调度引擎负责对任务作业进行调度,调度完成后任务就可以使用集群节点中的GPU资源进行深度学习运算。
[FrameworkController服务](https://github.com/microsoft/frameworkcontroller)负责监控任务作业在K8s集群中的生命周期。Restserver服务向k8s ApiServer获取任务的状态,并将任务状态展示在Web界面上。
其他基于CPU的AI工作或者传统的大数据任务作业也可以在平台上运行,与那些基于GPU的作业共存。平台训练数据和训练结果储存可根据平台/设备需求自定义。


frameworklauncher/src/main/resources/webapps/frameworklauncher/.keep → cambricon-integration/.gitkeep


cambricon-integration/cambricon-image-upload (+6 -0)

@@ -0,0 +1,6 @@
Upload images to Harbor
- docker login -u openi -p OpenI 192.168.202.102:5000
- docker tag cambricon/test/ubuntu:v4.1 192.168.202.74:5000/openi/cambricon-office-ubuntu:v0.4
- docker push 192.168.202.74:5000/openi/cambricon-office-ubuntu:v0.4
- docker tag cambricon-test2:v0.4 192.168.202.74:5000/openi/cambricon-neuware:v0.4
- docker push 192.168.202.74:5000/openi/cambricon-neuware:v0.4

model-client/todo.md → cambricon-integration/cambricon-k8s/.gitkeep


cambricon-integration/cambricon-k8s/cambricon-k8s-environment (+26 -0)

@@ -0,0 +1,26 @@
Set up a local k8s environment (using kubeadm)
- Turn off the firewall
  - systemctl stop firewalld
  - systemctl disable firewalld
- Disable SELinux
  - apt install selinux-utils
  - setenforce 0
- Install the required version of Docker
- Start the docker service
  - systemctl enable docker
  - systemctl start docker
  - systemctl status docker
- Install kubectl, kubelet, kubeadm
  - cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
    deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main
    EOF
- Install the packages
  - apt-get update && apt-get install -y kubelet kubeadm kubectl
  - systemctl enable kubelet
- Configure the master
  - export KUBECONFIG=/etc/kubernetes/admin.conf
- Restart kubelet
  - systemctl daemon-reload
  - systemctl restart kubelet
- Run on the master node
  - kubeadm init --pod-network-cidr=192.168.202.102/16 --apiserver-advertise-address=192.168.202.102 --kubernetes-version=v1.14.1 --ignore-preflight-errors=Swap
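The bulleted steps above can be collected into one script. This is a sketch that only echoes each command unless RUN=1 is set, using the same example values as the manual (adjust the CIDR and addresses for your cluster):

```shell
#!/bin/sh
# Sketch of the kubeadm setup steps above; prints commands unless RUN=1.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

run systemctl stop firewalld
run systemctl disable firewalld
run setenforce 0
run systemctl enable docker
run systemctl start docker
run apt-get update
run apt-get install -y kubelet kubeadm kubectl
run systemctl enable kubelet
run systemctl daemon-reload
run systemctl restart kubelet
run kubeadm init \
    --pod-network-cidr=192.168.202.102/16 \
    --apiserver-advertise-address=192.168.202.102 \
    --kubernetes-version=v1.14.1 \
    --ignore-preflight-errors=Swap
```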

cambricon-integration/cambricon-k8s/cambricon-k8s-framework-image (+6 -0)

@@ -0,0 +1,6 @@
1. Deploy the FrameworkController environment
  1. Open https://github.com/Microsoft/frameworkcontroller and click "Run Controller".
  2. Choose "Run By Kubernetes StatefulSet" to deploy frameworkcontroller; see https://github.com/microsoft/frameworkcontroller/tree/master/example/run for the detailed steps.
2. Fetch the FrameworkController images
  1. From https://hub.docker.com/r/yyrdl/frameworkcontroller, fetch the image with docker pull yyrdl/frameworkcontroller.
  2. From https://hub.docker.com/r/frameworkcontroller/frameworkbarrier, fetch the image with docker pull frameworkcontroller/frameworkbarrier.

cambricon-integration/cambricon-k8s/cambricon-k8s-frameworkController (+65 -0)

@@ -0,0 +1,65 @@
# Manage the pod with frameworkcontroller; the generated YAML file is as follows:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: nputest
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 2
  taskRoles:
  - name: ps
    taskNumber: 1
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          hostNetwork: false
          containers:
          - name: nvidiatest
            image: cambricon-test2:v0.4
            command: [
              "sh", "-c",
              "/mnt/frameworkbarrier/injector.sh && sleep 10d"]
            resources:
              limits:
                cambricon.com/mlu: 1
            volumeMounts:
            - name: cambricon-datasets
              mountPath: /Cambricon-MLU100/datasets
            - name: model
              mountPath: /Cambricon-MLU100/models
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
            - name: nvidia-runfile
              mountPath: /home
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            image: frameworkcontroller/frameworkbarrier
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
          - name: nvidia-runfile
            hostPath:
              path: /home/amax
          - name: cambricon-datasets
            hostPath:
              path: /home/cambricon/V7.3.2/Cambricon-MLU100/datasets
          - name: model
            hostPath:
              path: /home/cambricon/V7.3.2/Cambricon-MLU100/models
# Create the pod with kubectl create -f ...yaml
# Enter the pod and run ./run_all.sh
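The frameworkAttemptCompletionPolicy above (minFailedTaskCount: 1, minSucceededTaskCount: -1) means a single failed task fails the whole attempt, while success is never triggered by a task count. A sketch of that rule, based on my reading of the FrameworkController documentation rather than its code:

```python
def attempt_completed(failed, succeeded, min_failed=1, min_succeeded=-1):
    """Return 'Failed', 'Succeeded', or None (attempt still running).

    A threshold of -1 disables that trigger, mirroring the YAML above.
    """
    if min_failed > 0 and failed >= min_failed:
        return "Failed"
    if min_succeeded > 0 and succeeded >= min_succeeded:
        return "Succeeded"
    return None
```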

cambricon-integration/cambricon-neuware-image/.gitkeep (+0 -0)


cambricon-integration/cambricon-neuware-image/cambricon-image (+67 -0)

@@ -0,0 +1,67 @@
FROM cambricon/test/ubuntu:v4.1

WORKDIR /home/Cambricon-Test
COPY Cambricon-MLU100.tar.gz /home/Cambricon-Test/Cambricon-MLU100.tar.gz
RUN tar zvxf /home/Cambricon-Test/Cambricon-MLU100.tar.gz -C /home/Cambricon-Test \
    && rm /home/Cambricon-Test/Cambricon-MLU100.tar.gz \
    && mv /home/Cambricon-Test/Cambricon-MLU100 ../ \
    && rm -rf /home/Cambricon-Test \
    && mv /home/Cambricon-MLU100 /home/Cambricon-Test

# GLOG_minloglevel Set log level which is output to stderr, 0: INFO/WARNING/ERROR/FATAL, 1: WARNING/ERROR/FATAL, 2: ERROR/FATAL, 3: FATAL,
ENV ROOT_HOME="/home/Cambricon-Test"
ENV NEUWARE_HOME=${ROOT_HOME} \
    NEUWARE_PATH=${ROOT_HOME} \
    CAMBRICON_HOME=${ROOT_HOME} \
    TENSORFLOW_HOME=${ROOT_HOME}/tensorflow
ENV TENSORFLOW_MODEL_HOME=${TENSORFLOW_HOME}/models/online \
    TENSORFLOW_OFFLINE_MODEL_HOME=${TENSORFLOW_HOME}/models/offline \
    TENSORFLOW_MODELS_MODEL_HOME=${ROOT_HOME}/models/tensorflow_models/ \
    TENSORFLOW_MODELS_DATA_HOME=${ROOT_HOME}/datasets/tensorflow_models/ \
    PB_TO_CAMBRICON_PATH=${TENSORFLOW_HOME}/tools/pb_to_cambricon/ \
    tensorflow=${TENSORFLOW_HOME}/src/tensorflow-v1.10 \
    TF_SET_ANDROID_WORKSPACE=0 \
    MXNET_HOME=${ROOT_HOME}/mxnet \
    CAFFE_HOME=${ROOT_HOME}/caffe \
    ONNX_HOME=${ROOT_HOME}/onnx \
    CNRT_HOME=${CAMBRICON_HOME} \
    CNML_HOME=${CAMBRICON_HOME} \
    CNPERF_HOME=${CAMBRICON_HOME}/cnperf \
    CNMON_HOME=${CAMBRICON_HOME}/cnmon
ENV CNDEV_HOME=${CNMON_HOME}/sdk \
    CNSTREAM_HOME=${CAMBRICON_HOME}/cnstream \
    CNCODEC_HOME=${CAMBRICON_HOME}/cncodec \
    DRV_HOME=${CAMBRICON_HOME}/driver \
    DATASET_HOME=${ROOT_HOME}/datasets \
    PYTHONPATH=${PYTHONPATH}:${CAFFE_HOME}/src/caffe/python:${MXNET_HOME}/src/cambricon_mxnet/python \
    PATH=${PATH}:${CAMBRICON_HOME}/bin
ENV LD_LIBRARY_PATH=${CNRT_HOME}/lib:${CNML_HOME}/lib:${CNDEV_HOME}/lib:${CAFFE_HOME}/lib:${CNSTREAM_HOME}/lib:${CNCODEC_HOME}/lib:${MXNET_HOME}/lib:${LD_LIBRARY_PATH} \
    MXNET_ENGINE_TYPE="NaiveEngine" \
    MXNET_EXEC_FUSE_MLU_OPS=true \
    GLOG_alsologtostderr=true \
    GLOG_minloglevel=0 \
    MXNET_MODELS_DIR=${MXNET_HOME}/models \
    MXNET_DATA_DIR=${DATASET_HOME} \
    ONNX_MODELS_DIR=${ONNX_HOME}/models \
    ONNX_SRC_DIR=${ONNX_HOME}/src/onnx \
    ONNX_DATA_DIR=${DATASET_HOME}/imagenet \
    OS_VERSION="ubuntu16.04"

RUN UNAME_V=`cat /etc/issue | head -n 1`
WORKDIR "${ROOT_HOME}/bin"
RUN if [ ! -L "${ROOT_HOME}/bin/cnmon" ]; then find ${OS_VERSION} -type f -exec ln -s {} \; ; fi

WORKDIR "${ROOT_HOME}/lib"
RUN if [ ! -L "${ROOT_HOME}/lib/libcnrt.so" ]; then \
        find ${OS_VERSION} -name '*.so' -exec ln -s {} \; ; \
        ln -s ${OS_VERSION}/libcnrt.so* libcnrt.so ; \
        ln -s ${OS_VERSION}/libcnml.so* libcnml.so ; fi

WORKDIR ${ROOT_HOME}
RUN if [ ! -L lib64 ]; then ln -s lib lib64; fi

ADD configure.sh ${ROOT_HOME}/configure.sh
RUN chmod +x ${ROOT_HOME}/configure.sh && ${ROOT_HOME}/configure.sh

WORKDIR /home/Cambricon-Test
CMD ["/bin/bash"]

cambricon-integration/cambricon-operate/.gitkeep (+0 -0)


cambricon-integration/cambricon-operate/cambricon-operate-manual (+33 -0)

@@ -0,0 +1,33 @@
1. 在docker环境下运行寒武纪软件栈:
- 进入docker环境。
./run-cambricon-test-docker.sh
- 环境变量初始化。
source env.sh
- Caffe example 的编译与运行
- online example:
- 注释:online example模型中分类模型的运行,数据类型是float16
- cd Cambricon-Test/caffe/examples/online/c++/classification
./run_fp16.sh
- offline example:
- 注释: offline example模型中分类模型的运行,数据类型是float16
- cd Cambricon-Test/caffe/examples/offline/c++/classification
./run_fp16.sh
- Tensorflow example 的编译与运行
- online example:
- cd Cambricon-Test/tensorflow/examples/online/c++/classification
- ./tensorflow-v1.10_online_block.sh alexnet mlu float16 dense 2 4 1 0 1000
- offline example:
- cd Cambricon-Test/tensorflow/examples/offline/classification
- ./tensorflow-v1.10_online_block.sh alexnet mlu float16 dense 2 4 1 0 1000
- MXNet example 的编译与运行
- online example:
- ./run_all.sh
- offline example:
- ./run_all_pipe.sh
- ONNX example 的编译与运行
- online example:
- cd Cambricon-Test/onnx/examples/online/classification
- ./run_online.sh model is_sparse device_option other_option
- offline example:
- cd Cambricon-Test/onnx/examples/offline/classification
- ./run_offline.sh model option channel_num

cluster-configuration/cluster-configuration.yaml (+0 -130)

@@ -1,130 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
#
# Copyright (c) Peking University 2018
#
# The software is released under the Open-Intelligence Open Source License V1.0.
# The copyright owner promises to follow "Open-Intelligence Open Source Platform
# Management Regulation V1.0", which is provided by The New Generation of
# Artificial Intelligence Technology Innovation Strategic Alliance (the AITISA).


# If corresponding values aren't set in the machine list, the default value will be filled in.
default-machine-properties:
  # Account with root permission
  username: username
  password: password
  sshport: port


machine-sku:

  NC24R:
    mem: 224
    gpu:
      # type: gpu{type}
      type: teslak80
      count: 4
    cpu:
      vcore: 24
    #dataFolder: "/mnt"
    #Note: Up to now, the only supported os version is Ubuntu16.04. Please do not change it here.
    os: ubuntu16.04

  D8SV3:
    mem: 32
    cpu:
      vcore: 8
    #dataFolder: "/mnt"
    #Note: Up to now, the only supported os version is Ubuntu16.04. Please do not change it here.
    os: ubuntu16.04


machine-list:

  - hostname: hostname (echo `hostname`)
    hostip: IP
    machine-type: D8SV3
    etcdid: etcdid1
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: master
    dashboard: "true"
    zkid: "1"
    openi-master: "true"

  - hostname: hostname
    hostip: IP
    machine-type: D8SV3
    etcdid: etcdid2
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: master
    node-exporter: "true"

  - hostname: hostname
    hostip: IP
    machine-type: D8SV3
    etcdid: etcdid3
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: master
    node-exporter: "true"

  - hostname: hostname
    hostip: IP
    machine-type: NC24R
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: worker
    openi-worker: "true"

  - hostname: hostname
    hostip: IP
    machine-type: NC24R
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: worker
    openi-worker: "true"

  - hostname: hostname
    hostip: IP
    machine-type: NC24R
    #sshport: PORT (Optional)
    #username: username (Optional)
    #password: password (Optional)
    k8s-role: worker
    openi-worker: "true"

cluster-configuration/cluster-configuration.yaml.template (+0 -83)

@@ -1,83 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
#
# Copyright (c) Peking University 2018
#
# The software is released under the Open-Intelligence Open Source License V1.0.
# The copyright owner promises to follow "Open-Intelligence Open Source Platform
# Management Regulation V1.0", which is provided by The New Generation of
# Artificial Intelligence Technology Innovation Strategic Alliance (the AITISA).

# If corresponding values aren't be set in the machine list, the default value will be filled in.
default-machine-properties:
{%- for key in root['default-machine-properties'] %}
{{key}}: {{root['default-machine-properties'][key]}}
{%- endfor %}


machine-sku:
{% for machine in root['machine-sku'] %}
{{machine}}:
mem: {{root['machine-sku'][machine]['mem']}}
{% if 'gpu' in root['machine-sku'][machine] -%}
gpu:
type: {{root['machine-sku'][machine]['gpu']['type']}}
count: {{root['machine-sku'][machine]['gpu']['count']}}
{% endif -%}
{% if 'cpu' in root['machine-sku'][machine] -%}
cpu:
vcore: {{root['machine-sku'][machine]['cpu']['vcore']}}
{% endif -%}
os: {{root['machine-sku'][machine]['os']}}
{% endfor %}


machine-list:
{% for host in root['machine-list'] %}
- hostname: {{ host['hostname'] }}
hostip: {{ host['hostip'] }}
machine-type: {{ host['machine-type']}}
{% if 'etcdid' in host -%}
etcdid: {{ host['etcdid'] }}
{% endif -%}
{% if 'username' in host -%}
username: {{ host['username'] }}
{% endif -%}
{% if 'password' in host -%}
password: {{ host['password'] }}
{% endif -%}
{% if 'sshport' in host -%}
sshport: {{ host['sshport'] }}
{% endif -%}
k8s-role: {{ host['k8s-role'] }}
{% if 'dashboard' in host -%}
dashboard: "{{ host['dashboard'] }}"
{% endif -%}
{% if 'zkid' in host -%}
zkid: "{{ host['zkid'] }}"
{% endif -%}
{% if 'openi-master' in host -%}
openi-master: "{{ host['openi-master'] }}"
{% endif -%}
{% if 'openi-worker' in host -%}
openi-worker: "{{ host['openi-worker'] }}"
{% endif -%}
{% if 'watchdog' in host -%}
watchdog: "{{ host['watchdog'] }}"
{% endif -%}
{% endfor %}
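The template above guards each optional key (`etcdid`, `sshport`, `dashboard`, …) with `{% if ... in host %}`, so hosts may omit fields they don't need. A minimal sketch of how such a fragment renders, assuming Jinja2 (which this commit's deploy scripts also use) and made-up host data:

```python
from jinja2 import Template  # third-party; the deploy scripts in this commit use it

# Illustrative excerpt only, not the project's full template.
tmpl = Template(
    "machine-list:\n"
    "{% for host in root['machine-list'] %}"
    "  - hostname: {{ host['hostname'] }}\n"
    "    hostip: {{ host['hostip'] }}\n"
    "{% if 'sshport' in host %}"
    "    sshport: {{ host['sshport'] }}\n"
    "{% endif %}"
    "{% endfor %}"
)
root = {"machine-list": [
    {"hostname": "node1", "hostip": "10.0.0.1", "sshport": 2222},
    {"hostname": "node2", "hostip": "10.0.0.2"},
]}
out = tmpl.render(root=root)
print(out)  # node2 gets no sshport line: the 'if' guard skips absent keys
```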

+ 0
- 85
cluster-configuration/k8s-role-definition.yaml View File

@@ -1,85 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
#
# Copyright (c) Peking University 2018
#
# The software is released under the Open-Intelligence Open Source License V1.0.
# The copyright owner promises to follow "Open-Intelligence Open Source Platform
# Management Regulation V1.0", which is provided by The New Generation of
# Artificial Intelligence Technology Innovation Strategic Alliance (the AITISA).


# The components below are bootstrapped remotely.
component-list:

apiserver:
- src: apiserver.yaml
# the full dst path will be "template/generated/${hostip}/ .... "
dst: src/etc/kubernetes/manifests

controller-manager:
- src: controller-manager.yaml
dst: src/etc/kubernetes/manifests

etcd:
- src: etcd.yaml
dst: src/etc/kubernetes/manifests

scheduler:
- src: scheduler.yaml
dst: src/etc/kubernetes/manifests

kubelet:
- src: kubelet.sh
dst: src/

kubeconfig:
- src: config
dst: src/etc/kubernetes

haproxy:
- src: haproxy.yaml
dst: src/etc/kubernetes/manifests
- src: haproxy.cfg
dst: src/haproxy



k8s-role:

master:
component:
- name: apiserver
- name: controller-manager
- name: etcd
- name: scheduler
- name: kubelet
- name: kubeconfig


worker:
component:
- name: kubelet
- name: kubeconfig


proxy:
component:
- name: kubelet
- name: haproxy
- name: kubeconfig

+ 0
- 55
cluster-configuration/kubernetes-configuration.yaml View File

@@ -1,55 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
#
# Copyright (c) Peking University 2018
#
# The software is released under the Open-Intelligence Open Source License V1.0.
# The copyright owner promises to follow "Open-Intelligence Open Source Platform
# Management Regulation V1.0", which is provided by The New Generation of
# Artificial Intelligence Technology Innovation Strategic Alliance (the AITISA).

kubernetes:
# Find the nameserver in /etc/resolv.conf
cluster-dns: IP
# To support k8s HA, set a load-balancer address here.
# If deploying k8s with a single master node, set the master's IP address here.
load-balance-ip: IP

# Specify an IP range that is not in the same network segment as the host machines.
service-cluster-ip-range: 169.254.0.0/16
# According to the etcd version, fill in the corresponding backend name.
# If you are not familiar with etcd, please don't change it.
storage-backend: etcd3
# The docker registry used in the k8s deployment. If you can access gcr, we suggest using it.
docker-registry: gcr.io/google_containers
# http://gcr.io/google_containers/hyperkube. Or the tag in your registry.
hyperkube-version: v1.9.4
# http://gcr.io/google_containers/etcd. Or the tag in your registry.
# If you are not familiar with etcd, please don't change it.
etcd-version: 3.2.17
# http://gcr.io/google_containers/kube-apiserver. Or the tag in your registry.
apiserver-version: v1.9.4
# http://gcr.io/google_containers/kube-scheduler. Or the tag in your registry.
kube-scheduler-version: v1.9.4
# http://gcr.io/google_containers/kube-controller-manager
kube-controller-manager-version: v1.9.4
# http://gcr.io/google_containers/kubernetes-dashboard-amd64
dashboard-version: v1.8.3
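The `service-cluster-ip-range` comment above asks for a range disjoint from the host network. A quick stdlib sanity check of that constraint; the host network below is a made-up example value:

```python
import ipaddress

service_range = ipaddress.ip_network("169.254.0.0/16")   # from the config above
host_network = ipaddress.ip_network("192.168.1.0/24")    # example host segment

# Kubernetes service IPs must not collide with real host addresses.
if service_range.overlaps(host_network):
    raise ValueError("service-cluster-ip-range overlaps the host network")
print("ranges are disjoint")
```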




+ 0
- 182
cluster-configuration/services-configuration.yaml View File

@@ -1,182 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
#
# Copyright (c) Peking University 2018
#
# The software is released under the Open-Intelligence Open Source License V1.0.
# The copyright owner promises to follow "Open-Intelligence Open Source Platform
# Management Regulation V1.0", which is provided by The New Generation of
# Artificial Intelligence Technology Innovation Strategic Alliance (the AITISA).

cluster:

clusterid: openi-example

# Choose a proper NVIDIA driver version from this URL: http://www.nvidia.com/object/linux-amd64-display-archive.html
nvidia-drivers-version: 384.111

# static docker-version
# https://download.docker.com/linux/static/stable/x86_64/docker-17.06.2-ce.tgz
# Docker client used by hadoop NM (node manager) to launch Docker containers (e.g., of a deep learning job) in the host env.
docker-verison: 17.06.2

# HDFS, zookeeper data path on your cluster machine.
data-path: "/datastorage"

# the docker registry to store docker images that contain system services like frameworklauncher, hadoop, etc.
docker-registry-info:

# If using a public registry, set this to the same value as your username
docker-namespace: your_registry_namespace

# E.g., gcr.io. If public, fill docker_registry_domain with the word "public"
# docker_registry_domain: public
docker-registry-domain: your_registry_domain
# If the docker registry doesn't require authentication, please leave docker_username and docker_password empty
docker-username: your_registry_username
docker-password: your_registry_password

docker-tag: your_image_tag

# The name of the Kubernetes secret that will be created in your cluster
# Must be lower case, e.g., regsecret.
secret-name: your_secret_name


hadoop:
# If custom_hadoop_binary_path is None, the script will download a standard Hadoop binary for you
# hadoop-version
# http://archive.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
custom-hadoop-binary-path: None
hadoop-version: 2.9.0
# Step 1 of 4 to set up Hadoop queues.
# Define all virtual clusters, equivalent concept of Hadoop queues.
# The capacity of each virtual cluster is specified as the percentage of the whole resources in the system.
# All un-configured resources will go to an auto-generated virtual cluster called 'default'.
virtualClusters:
vc1:
description: VC for Alice's team.
capacity: 20
vc2:
description: VC for Bob's team.
capacity: 20
vc3:
description: VC for Charlie's team.
capacity: 20

volumeMounts:
- mountPath: /gpai
name: scriptdir
- mountPath: /ghome
name: userhome
- mountPath: /gshare
name: share
- mountPath: /gmodel
name: model

volumes:
- name: scriptdir
hostPath:
path: /gpai
- name: userhome
hostPath:
path: /ghome
- name: share
hostPath:
path: /gshare
- name: model
hostPath:
path: /gmodel


frameworklauncher:
frameworklauncher-port: 9086


restserver:
# port for rest api server
server-port: 9186
# secret for signing authentication tokens, e.g., "Hello OPENI!"
jwt-secret: your_jwt_secret
# database admin username
default-openi-admin-username: your_default_openi_admin_username
# database admin password
default-openi-admin-password: your_default_openi_admin_password
# openi database
openi_db_host : "db-host-ip"
openi_db_port : 3308
openi_db_user : "db-user"
openi_db_pwd : "db-user-password"
openi_db_database : "db-database"
templates_store_path: "/var/openi/rest-server/templates"
# iptable path
nat-path: "/var/pai/rest-server/natconfig.json"
volumeMounts:
- mountPath: /gpai
name: scriptdir

volumes:
- name: scriptdir
hostPath:
path: /gpai
webportal:
# port for webportal
server-port: 9286


grafana:
# port for grafana
grafana-port: 3000


prometheus:
# port for prometheus
prometheus-port: 9091
# port for node exporter
node-exporter-port: 9100


pylon:
# port of pylon
port: 80

model-exchange:
port: 6023

volumeMounts:
- mountPath: /gmodel
name: scriptdir

volumes:
- name: scriptdir
hostPath:
path: /gmodel
model-hub:
server_port: 6024
mysql: root:root@tcp(192.168.113.221:3308)/modelhub
file_storage_path: /gmodel

volumeMounts:
- mountPath: /gmodel
name: scriptdir

volumes:
- name: scriptdir
hostPath:
path: /gmodel
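The `virtualClusters` capacities in this file are percentages of the whole cluster; whatever is left un-configured goes to the auto-generated `default` virtual cluster. The arithmetic, using the capacities from the config above:

```python
# Capacities copied from the virtualClusters section above.
virtual_clusters = {"vc1": 20, "vc2": 20, "vc3": 20}

configured = sum(virtual_clusters.values())
assert configured <= 100, "virtual cluster capacities must not exceed 100%"

# The remainder is what the auto-generated 'default' queue receives.
default_capacity = 100 - configured
print(default_capacity)  # 40
```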

+ 4
- 0
deploy-script/.gitignore View File

@@ -0,0 +1,4 @@

utils/__pycache__

services/__pycache__

+ 97
- 0
deploy-script/build.py View File

@@ -0,0 +1,97 @@
# -*- coding: UTF-8 -*-
import os
import argparse
import yaml
import utils.docker
import utils.dir
import utils.setting
import services.rest_server
import services.image_factory_agent
import services.image_factory_shield
import services.log_service_bee
import services.log_service_queen

workdir_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

def load_config(env):
config = None
if not env:  # argparse yields None when -e is omitted
env = "dev"

return utils.setting.load(env,os.path.join(workdir_root,"deploy-script/config"))

def get_service_ctx(service_name,config):

ctx = dict()
ctx["workdir"] = ""

if service_name == "rest-server":
ctx["workdir"] = os.path.join(workdir_root,"rest-server/")
ctx["tag"] = services.rest_server.getTag(config)
ctx["buildPrepare"] = services.rest_server.buildPrepare
ctx["buildEnd"] = services.rest_server.buildEnd
if service_name == "image-factory-agent":
ctx["workdir"] = os.path.join(workdir_root,"image-factory/agent")
ctx["tag"] = services.image_factory_agent.getTag(config)
ctx["buildPrepare"] = services.image_factory_agent.buildPrepare
ctx["buildEnd"] = services.image_factory_agent.buildEnd
if service_name == "image-factory-shield":
ctx["workdir"] = os.path.join(workdir_root,"image-factory/shield")
ctx["tag"] = services.image_factory_shield.getTag(config)
ctx["buildPrepare"] = services.image_factory_shield.buildPrepare
ctx["buildEnd"] = services.image_factory_shield.buildEnd

if service_name == "log-service-bee":
ctx["workdir"] = os.path.join(workdir_root,"log-service/bee")
ctx["tag"] = services.log_service_bee.getTag(config)
ctx["buildPrepare"] = services.log_service_bee.buildPrepare
ctx["buildEnd"] = services.log_service_bee.buildEnd
if service_name == "log-service-queen":
ctx["workdir"] = os.path.join(workdir_root,"log-service/queen")
ctx["tag"] = services.log_service_queen.getTag(config)
ctx["buildPrepare"] = services.log_service_queen.buildPrepare
ctx["buildEnd"] = services.log_service_queen.buildEnd
return ctx


def build_and_push_docker_image():

parser = argparse.ArgumentParser()
parser.add_argument('-s', '--service', required=True, help="the service to be deployed")
parser.add_argument('-e', '--env', required=False, help="deployment environment (dev or prod)")

workdir_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
args = parser.parse_args()

config = load_config(args.env)
service_name = args.service

service = get_service_ctx(service_name,config)

if service["workdir"] == "" or service["tag"] == "":
print("Unknown service: {}".format(service_name))
return 1
if service["buildPrepare"] is not None:
service["buildPrepare"](workdir_root,config)
print("build",service_name)
utils.docker.build(service["tag"],service["workdir"])
utils.docker.push(service["tag"],service["workdir"])

if service["buildEnd"] is not None:
service["buildEnd"](workdir_root,config)

print("Successfully built and pushed {}".format(service_name))



build_and_push_docker_image()
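`get_service_ctx` in build.py repeats one if-branch per service name. The same mapping can be table-driven with a single dict lookup; this is a stdlib-only sketch with an invented `SERVICES` table, not the project's actual module layout:

```python
# Service name -> build directory relative to the repo root.
SERVICES = {
    "rest-server":          "rest-server/",
    "image-factory-agent":  "image-factory/agent",
    "image-factory-shield": "image-factory/shield",
    "log-service-bee":      "log-service/bee",
    "log-service-queen":    "log-service/queen",
}

def get_workdir(service_name, root="/repo"):
    """Return the service's build directory, or None for unknown services."""
    subdir = SERVICES.get(service_name)
    if subdir is None:
        return None
    return "{}/{}".format(root, subdir)

print(get_workdir("log-service-bee"))   # /repo/log-service/bee
print(get_workdir("no-such-service"))   # None
```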

+ 64
- 0
deploy-script/config/dev.yaml View File

@@ -0,0 +1,64 @@
env: "dev"

cluster: "openi-test"

common:
mysql:
host: '192.168.202.73'
port: 3308
user: "root"
pwd: "root"
dockerRegistry:
host: "192.168.202.74"
port: "5000"
user: "admin"
pwd: "harboradmin"
prometheus: "http://192.168.202.73:9091"

restServer:
jwtSecret: "helloworld"
serverPort: 9186
logService: "/log-service"
volumes:
- name: ghome
mountPath: /ghome
hostPath: /ghome
- name: gmodel
mountPath: /gmodel
hostPath: /gmodel
- name: kube-config
mountPath: /kube
hostPath: /home/amax/.kube
k8sApiServer:
host: "https://192.168.202.71:6443"
kubeConfigPath: "/kube/config"

imageFactory:
shield:
port: 9001
agent:
port: 9002
shield: "http://192.168.202.71:9001"
volumes:
- name: docker-run
mountPath: /var/run
hostPath: /var/run
- name: docker
mountPath: /var/lib/docker
hostPath: /var/lib/docker

logService:
bee:
port: 9003
containers: "/var/lib/docker/containers"
volumes:
- name: "container"
mountPath: "/var/lib/docker/containers"
hostPath: "/var/lib/docker/containers"
queen:
port: 9004
restServer:
host: "http://192.168.202.71:9186"
user: "test123"
pwd: "123456"

+ 35
- 0
deploy-script/config/prod.yaml View File

@@ -0,0 +1,35 @@
env: "prod"

cluster: "openi-test"

common:
mysql:
host: ""
port: ""
user: ""
pwd: ""
influxdb:
host: ""
port: ""
user: ""
pwd: ""
dockerRegistry:
host: ""
port: ""
user: ""
pwd: ""
prometheus: ""


restServer:
jwtSecret: ""
volumes:
- name: "ghome"
mountPath: "/ghome"
hostPath: "/ghome"
- name: "gpai"
mountPath: "/gpai"
hostPath: "/gpai"
k8sApiServer:
host: ""
kubeConfigPath: ""

+ 97
- 0
deploy-script/deploy.py View File

@@ -0,0 +1,97 @@
# -*- coding: UTF-8 -*-
import os
import codecs
import argparse
import yaml
import utils.k8s
import utils.dir
import utils.setting
import services.rest_server
import services.image_factory_agent
import services.image_factory_shield
import services.log_service_bee
import services.log_service_queen

from jinja2 import Template

workdir_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

def load_config(env):
config = None
if not env:  # argparse yields None when -e is omitted
env = "dev"

return utils.setting.load(env,os.path.join(workdir_root,"deploy-script/config"))


def get_service_ctx(service_name,config):
ctx = dict()
ctx["workdir"] = ""

if service_name == "rest-server":
ctx["workdir"] = os.path.join(workdir_root,"rest-server/")
ctx["deployTemplateName"] = services.rest_server.deployTemplateName
ctx["historyClean"] = services.rest_server.historyClean
ctx["getDeployConfig"] = services.rest_server.getDeployConfig

if service_name == "image-factory-agent":
ctx["workdir"] = os.path.join(workdir_root,"image-factory/agent")
ctx["deployTemplateName"] = services.image_factory_agent.deployTemplateName
ctx["historyClean"] = services.image_factory_agent.historyClean
ctx["getDeployConfig"] = services.image_factory_agent.getDeployConfig

if service_name == "image-factory-shield":
ctx["workdir"] = os.path.join(workdir_root,"image-factory/shield")
ctx["deployTemplateName"] = services.image_factory_shield.deployTemplateName
ctx["historyClean"] = services.image_factory_shield.historyClean
ctx["getDeployConfig"] = services.image_factory_shield.getDeployConfig
if service_name == "log-service-bee":
ctx["workdir"] = os.path.join(workdir_root,"log-service/bee")
ctx["deployTemplateName"] = services.log_service_bee.deployTemplateName
ctx["historyClean"] = services.log_service_bee.historyClean
ctx["getDeployConfig"] = services.log_service_bee.getDeployConfig

if service_name == "log-service-queen":
ctx["workdir"] = os.path.join(workdir_root,"log-service/queen")
ctx["deployTemplateName"] = services.log_service_queen.deployTemplateName
ctx["historyClean"] = services.log_service_queen.historyClean
ctx["getDeployConfig"] = services.log_service_queen.getDeployConfig
return ctx

def deploy_service():

parser = argparse.ArgumentParser()
parser.add_argument('-s', '--service', required=True, help="the service to be deployed")
parser.add_argument('-e', '--env', required=False, help="deployment environment (dev or prod)")
args = parser.parse_args()
config = load_config(args.env)

service_name = args.service

service = get_service_ctx(service_name,config)

if service["workdir"] == "" or service["deployTemplateName"] == "" or service["historyClean"] is None:
print("Unknown service: {}".format(service_name))
return 1
deploy_template = codecs.open(os.path.join(workdir_root,"deploy-script","template",service["deployTemplateName"]),"r","utf-8").read()

deploy_config = service["getDeployConfig"](config)

deploy_yaml = Template(deploy_template).render(deploy_config)

codecs.open(os.path.join(service["workdir"],"deploy.yaml"),"w","utf-8").write(deploy_yaml)

service["historyClean"]()

utils.k8s.deploy("deploy.yaml",service["workdir"])

print("Successfully deployed {}".format(service_name))



deploy_service()
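deploy.py's core step in miniature: render a deployment template against a config dict, after which the result is written to `deploy.yaml` and handed to kubectl. The two-line template below is illustrative, not one of the real template files:

```python
from jinja2 import Template  # the same dependency deploy.py imports

deploy_template = "metadata:\n  name: {{DAEMONSET_NAME}}"
deploy_yaml = Template(deploy_template).render({"DAEMONSET_NAME": "rest-server-ds"})
print(deploy_yaml)
```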

+ 0
- 0
deploy-script/services/__init__.py View File


+ 39
- 0
deploy-script/services/image_factory_agent.py View File

@@ -0,0 +1,39 @@
# -*- coding: UTF-8 -*-

import utils.k8s

service_name = "image-factory-agent"

deployTemplateName = "image-factory-agent.yaml"

def getTag(config):
docker_registry_host = config.get("common").get("dockerRegistry").get("host")
docker_registry_port = config.get("common").get("dockerRegistry").get("port")
return "{}:{}/openi/{}:v1".format(docker_registry_host,docker_registry_port,service_name)

def getDaemonsetName():
return "{}-ds".format(service_name)

def getDeployConfig(config):
agent_config = config.get("imageFactory").get("agent")
return {
"ENV":config.get("env"),
"DAEMONSET_NAME":getDaemonsetName(),
"IMAGE_ADDRESS":getTag(config),
"VOLUME_MOUNTS":agent_config.get("volumes"),
"PORT": agent_config.get("port"),
"SHIELD_ADDRESS": agent_config.get("shield")
}


def historyClean():
ds_name = getDaemonsetName()
utils.k8s.removeDaemonset(ds_name)


def buildPrepare(root,config):
pass


def buildEnd(root,config):
pass
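`getTag` above builds a `host:port/openi/<service>:v1` image reference from the registry settings; a standalone re-statement with made-up registry values:

```python
# Example registry settings; the real values come from config/dev.yaml or prod.yaml.
config = {"common": {"dockerRegistry": {"host": "192.0.2.10", "port": "5000"}}}

def get_tag(config, service_name):
    registry = config["common"]["dockerRegistry"]
    return "{}:{}/openi/{}:v1".format(registry["host"], registry["port"], service_name)

print(get_tag(config, "image-factory-agent"))
# 192.0.2.10:5000/openi/image-factory-agent:v1
```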

+ 36
- 0
deploy-script/services/image_factory_shield.py View File

@@ -0,0 +1,36 @@
# -*- coding: UTF-8 -*-

import utils.k8s

service_name = "image-factory-shield"

deployTemplateName = "image-factory-shield.yaml"

def getTag(config):
docker_registry_host = config.get("common").get("dockerRegistry").get("host")
docker_registry_port = config.get("common").get("dockerRegistry").get("port")
return "{}:{}/openi/{}:v1".format(docker_registry_host,docker_registry_port,service_name)

def getDaemonsetName():
return "{}-ds".format(service_name)

def getDeployConfig(config):
agent_config = config.get("imageFactory").get("shield")
return {
"ENV":config.get("env"),
"DAEMONSET_NAME":getDaemonsetName(),
"IMAGE_ADDRESS":getTag(config),
"PORT": agent_config.get("port")
}

def historyClean():
ds_name = getDaemonsetName()
utils.k8s.removeDaemonset(ds_name)


def buildPrepare(root,config):
pass


def buildEnd(root,config):
pass

+ 39
- 0
deploy-script/services/log_service_bee.py View File

@@ -0,0 +1,39 @@
# -*- coding: UTF-8 -*-

import utils.k8s

service_name = "log-service-bee"

deployTemplateName = "log-service-bee.yaml"

def getTag(config):
docker_registry_host = config.get("common").get("dockerRegistry").get("host")
docker_registry_port = config.get("common").get("dockerRegistry").get("port")
return "{}:{}/openi/{}:v1".format(docker_registry_host,docker_registry_port,service_name)

def getDaemonsetName():
return "{}-ds".format(service_name)

def getDeployConfig(config):
bee_config = config.get("logService").get("bee")
return {
"ENV":config.get("env"),
"DAEMONSET_NAME":getDaemonsetName(),
"IMAGE_ADDRESS":getTag(config),
"VOLUME_MOUNTS":bee_config.get("volumes"),
"PORT": bee_config.get("port"),
"CONTAINERS": bee_config.get("containers")
}


def historyClean():
ds_name = getDaemonsetName()
utils.k8s.removeDaemonset(ds_name)


def buildPrepare(root,config):
pass


def buildEnd(root,config):
pass

+ 40
- 0
deploy-script/services/log_service_queen.py View File

@@ -0,0 +1,40 @@
# -*- coding: UTF-8 -*-

import utils.k8s

service_name = "log-service-queen"

deployTemplateName = "log-service-queen.yaml"

def getTag(config):
docker_registry_host = config.get("common").get("dockerRegistry").get("host")
docker_registry_port = config.get("common").get("dockerRegistry").get("port")
return "{}:{}/openi/{}:v1".format(docker_registry_host,docker_registry_port,service_name)

def getDaemonsetName():
return "{}-ds".format(service_name)

def getDeployConfig(config):
queen_config = config.get("logService").get("queen")
return {
"ENV":config.get("env"),
"DAEMONSET_NAME":getDaemonsetName(),
"IMAGE_ADDRESS":getTag(config),
"PORT": queen_config.get("port"),
"REST_SERVER": queen_config.get("restServer").get("host"),
"REST_SERVER_USER": queen_config.get("restServer").get("user"),
"REST_SERVER_PWD": queen_config.get("restServer").get("pwd")
}


def historyClean():
ds_name = getDaemonsetName()
utils.k8s.removeDaemonset(ds_name)


def buildPrepare(root,config):
pass


def buildEnd(root,config):
pass

+ 47
- 0
deploy-script/services/rest_server.py View File

@@ -0,0 +1,47 @@
# -*- coding: UTF-8 -*-

import utils.k8s

service_name = "rest-server"

deployTemplateName = "rest-server.yaml"

def getTag(config):
docker_registry_host = config.get("common").get("dockerRegistry").get("host")
docker_registry_port = config.get("common").get("dockerRegistry").get("port")
return "{}:{}/openi/{}:v1".format(docker_registry_host,docker_registry_port,service_name)

def getDaemonsetName():
return "{}-ds".format(service_name)

def getDeployConfig(config):
rest_server = config.get("restServer")
common_config = config.get("common")
return {
"ENV":config.get("env"),
"DAEMONSET_NAME":getDaemonsetName(),
"IMAGE_ADDRESS":getTag(config),
"VOLUME_MOUNTS":rest_server.get("volumes"),
"SERVER_PORT": rest_server.get("serverPort"),
"JWT_SECRET": rest_server.get("jwtSecret"),
"MYSQL_HOST": common_config.get("mysql").get("host"),
"MYSQL_PORT": common_config.get("mysql").get("port"),
"MYSQL_USER": common_config.get("mysql").get("user"),
"MYSQL_PWD": common_config.get("mysql").get("pwd"),
"K8S_API_SERVER":rest_server.get("k8sApiServer").get("host"),
"K8S_CONFIG": rest_server.get("k8sApiServer").get("kubeConfigPath"),
"LOG_SERVICE": rest_server.get("logService")
}


def historyClean():
ds_name = getDaemonsetName()
utils.k8s.removeDaemonset(ds_name)


def buildPrepare(root,config):
pass


def buildEnd(root,config):
pass

+ 57
- 0
deploy-script/template/image-factory-agent.yaml View File

@@ -0,0 +1,57 @@
# Copyright (c) PCL
# All rights reserved.
#
# MIT License
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: {{DAEMONSET_NAME}}
spec:
selector:
matchLabels:
app: image-factory-agent
template:
metadata:
labels:
app: image-factory-agent
name: image-factory-agent
spec:
hostNetwork: true
hostPID: true
containers:
- name: image-factory-agent
image: {{IMAGE_ADDRESS}}
env:
- name: SHIELD_ADDRESS
value: {{SHIELD_ADDRESS}}
{% if VOLUME_MOUNTS %}
volumeMounts:
{% for volumeinfo in VOLUME_MOUNTS %}
- mountPath: {{ volumeinfo['mountPath'] }}
name: {{ volumeinfo['name'] }}
{% endfor %}
{% endif %}
ports:
- name: agent-port
containerPort: {{PORT}}
hostPort: {{PORT}}
{% if VOLUME_MOUNTS %}
volumes:
{% for volumeinfo in VOLUME_MOUNTS %}
- name: {{ volumeinfo['name'] }}
hostPath:
path: {{ volumeinfo['hostPath'] }}
{% endfor %}
{% endif %}

+ 44
- 0
deploy-script/template/image-factory-shield.yaml View File

@@ -0,0 +1,44 @@
# Copyright (c) PCL
# All rights reserved.
#
# MIT License
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: {{DAEMONSET_NAME}}
spec:
selector:
matchLabels:
app: image-factory-shield
template:
metadata:
name: image-factory-shield
labels:
app: image-factory-shield
spec:
hostNetwork: false
hostPID: false
nodeSelector:
noderole: "master"
containers:
- name: image-factory-shield
image: {{IMAGE_ADDRESS}}
imagePullPolicy: Always
ports:
- name: shield-port
containerPort: {{PORT}}
hostPort: {{PORT}}

+ 57
- 0
deploy-script/template/log-service-bee.yaml View File

@@ -0,0 +1,57 @@
# Copyright (c) PCL
# All rights reserved.
#
# MIT License
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: {{DAEMONSET_NAME}}
spec:
selector:
matchLabels:
app: log-service-bee
template:
metadata:
labels:
app: log-service-bee
name: log-service-bee
spec:
hostNetwork: false
hostPID: false
containers:
- name: log-service-bee
image: {{IMAGE_ADDRESS}}
env:
- name: CONTAINERS
value: {{CONTAINERS}}
{% if VOLUME_MOUNTS %}
volumeMounts:
{% for volumeinfo in VOLUME_MOUNTS %}
- mountPath: {{ volumeinfo['mountPath'] }}
name: {{ volumeinfo['name'] }}
{% endfor %}
{% endif %}
ports:
- name: bee-port
containerPort: {{PORT}}
hostPort: {{PORT}}
{% if VOLUME_MOUNTS %}
volumes:
{% for volumeinfo in VOLUME_MOUNTS %}
- name: {{ volumeinfo['name'] }}
hostPath:
path: {{ volumeinfo['hostPath'] }}
{% endfor %}
{% endif %}

+ 54
- 0
deploy-script/template/log-service-queen.yaml View File

@@ -0,0 +1,54 @@
# Copyright (c) PCL
# All rights reserved.
#
# MIT License
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: {{DAEMONSET_NAME}}
spec:
selector:
matchLabels:
app: log-service-queen
template:
metadata:
name: log-service-queen
labels:
app: log-service-queen
spec:
hostNetwork: false
hostPID: false
nodeSelector:
noderole: "master"
containers:
- name: log-service-queen
image: {{IMAGE_ADDRESS}}
imagePullPolicy: Always
ports:
- name: port
containerPort: {{PORT}}
env:
- name: ENV
value: prod
- name: REST_SERVER
value: "{{REST_SERVER}}"
- name: REST_SERVER_USER
value: {{REST_SERVER_USER}}
- name: REST_SERVER_PWD
value: "{{REST_SERVER_PWD}}"
- name: PORT
value: "{{PORT}}"


+ 84
- 0
deploy-script/template/rest-server.yaml View File

@@ -0,0 +1,84 @@
# Copyright (c) PCL
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{DAEMONSET_NAME}}
spec:
  selector:
    matchLabels:
      app: rest-server
  template:
    metadata:
      name: rest-server
      labels:
        app: rest-server
    spec:
      hostNetwork: false
      hostPID: false
      nodeSelector:
        noderole: "master"
      containers:
      - name: rest-server
        image: {{ IMAGE_ADDRESS }}
        imagePullPolicy: Always
        securityContext:
          privileged: true
{% if VOLUME_MOUNTS %}
        volumeMounts:
{% for volumeinfo in VOLUME_MOUNTS %}
        - mountPath: {{ volumeinfo['mountPath'] }}
          name: {{ volumeinfo['name'] }}
{% endfor %}
{% endif %}
        env:
        - name: EGG_SERVER_ENV
          value: prod
        - name: ENV
          value: {{ENV}}
        - name: SERVER_PORT
          value: "{{SERVER_PORT}}"
        - name: JWT_SECRET
          value: {{JWT_SECRET}}
        - name: MYSQL_HOST
          value: "{{ MYSQL_HOST }}"
        - name: MYSQL_PORT
          value: "{{ MYSQL_PORT }}"
        - name: MYSQL_USER
          value: {{ MYSQL_USER }}
        - name: MYSQL_PWD
          value: {{ MYSQL_PWD }}
        - name: K8S_API_SERVER
          value: "{{K8S_API_SERVER}}"
        - name: K8S_CONFIG
          value: {{K8S_CONFIG}}
        - name: LOG_SERVICE
          value: {{LOG_SERVICE}}
        ports:
        - name: rest-server
          containerPort: {{SERVER_PORT}}
          hostPort: {{SERVER_PORT}}
{% if VOLUME_MOUNTS %}
      volumes:
{% for volumeinfo in VOLUME_MOUNTS %}
      - name: {{ volumeinfo['name'] }}
        hostPath:
          path: {{ volumeinfo['hostPath'] }}
{% endfor %}
{% endif %}
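The `{{ ... }}` placeholders and `{% if %}`/`{% for %}` blocks in the templates above are Jinja2 syntax (the deploy scripts import `jinja2`). A minimal sketch of how such a template could be rendered, assuming illustrative values; `render_template` and the registry address are hypothetical names, not taken from this repo:

```python
from jinja2 import Template


def render_template(template_text, values):
    # Substitutes {{ ... }} placeholders and expands {% if %}/{% for %} blocks
    return Template(template_text).render(**values)


# Illustrative values in the shape the rest-server template expects
values = {
    "DAEMONSET_NAME": "rest-server",
    "IMAGE_ADDRESS": "registry.example.com/rest-server:latest",
    "SERVER_PORT": 9186,
    "VOLUME_MOUNTS": [
        {"name": "kube-config", "mountPath": "/root/.kube", "hostPath": "/root/.kube"},
    ],
}

snippet = "image: {{ IMAGE_ADDRESS }}\nports:\n- containerPort: {{SERVER_PORT}}"
rendered = render_template(snippet, values)
print(rendered)
```

Unknown placeholders left in `values` are simply ignored by `render`, so one config dict can drive every template.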

deploy-script/utils/__init__.py (+0, -0)


deploy-script/utils/dir.py (+14, -0)

@@ -0,0 +1,14 @@
# -*- coding: UTF-8 -*-
import subprocess


def copy(src, target, workdir):
    cmd = "cp -r {} {}".format(src, target)
    subprocess.check_call(cmd, shell=True, cwd=workdir)


def rm(target, workdir):
    cmd = "rm -rf {}".format(target)
    subprocess.check_call(cmd, shell=True, cwd=workdir)
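The wrappers above shell out to `cp -r` and `rm -rf`. A hedged alternative sketch using the stdlib `shutil`, which avoids spawning a shell entirely; this is not the repo's implementation, just an option:

```python
import os
import shutil
import tempfile


def copy_tree(src, target):
    # Recursive copy without invoking /bin/cp through a shell
    shutil.copytree(src, target)


def remove_tree(target):
    # Stdlib analogue of "rm -rf": recursive, silent on missing paths
    shutil.rmtree(target, ignore_errors=True)


# tiny usage demo on a throwaway directory
base = tempfile.mkdtemp()
src = os.path.join(base, "src")
os.makedirs(src)
open(os.path.join(src, "a.txt"), "w").close()
copy_tree(src, os.path.join(base, "dst"))
copied = os.path.exists(os.path.join(base, "dst", "a.txt"))
remove_tree(base)
print(copied)
```

Unlike `cp -r`, `shutil.copytree` requires that the target directory not already exist.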

deploy-script/utils/docker.py (+14, -0)

@@ -0,0 +1,14 @@
# -*- coding: UTF-8 -*-
import subprocess


def build(tag, workdir):
    cmd = "docker build -t {} ./".format(tag)
    subprocess.check_call(cmd, shell=True, cwd=workdir)


def push(tag, workdir):
    cmd = "docker push {}".format(tag)
    subprocess.check_call(cmd, shell=True, cwd=workdir)


deploy-script/utils/k8s.py (+19, -0)

@@ -0,0 +1,19 @@
# -*- coding: UTF-8 -*-
import subprocess

def isDaemonsetExist(name):
    cmd = "kubectl get daemonset"
    # check_output returns bytes on Python 3; decode before searching for the name
    output = subprocess.check_output(cmd, shell=True).decode("utf-8")
    return output.find(name) > -1


def removeDaemonset(name):
    if isDaemonsetExist(name):
        cmd = "kubectl delete daemonset {}".format(name)
        subprocess.check_call(cmd, shell=True)


def deploy(yaml, workdir):
    cmd = "kubectl create -f {}".format(yaml)
    subprocess.check_call(cmd, shell=True, cwd=workdir)
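Because these helpers build shell commands by string interpolation, a name containing shell metacharacters would alter the command. A small sketch of quoting with the stdlib `shlex` (the function name `delete_daemonset_cmd` is illustrative, not from the repo):

```python
import shlex


def delete_daemonset_cmd(name):
    # shlex.quote wraps the argument in single quotes when needed,
    # so spaces or metacharacters cannot break out of the command
    return "kubectl delete daemonset {}".format(shlex.quote(name))


print(delete_daemonset_cmd("log-service-queen"))
print(delete_daemonset_cmd("bad name; rm -rf /"))
```

Safe names pass through unchanged; suspicious ones come back single-quoted.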

deploy-script/utils/setting.py (+29, -0)

@@ -0,0 +1,29 @@
# -*- coding: UTF-8 -*-
import os
from jinja2 import Template
import socket
import yaml

def getHostIp():
    # create the socket before the try block, so close() cannot hit an unbound name
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # no packet is sent; connect() on a UDP socket just picks the outbound interface
        s.connect(('8.8.8.8', 80))
        ip = s.getsockname()[0]
    finally:
        s.close()
    return ip


def load(env, config_template_path):
    file = env + ".yaml"
    path = os.path.join(config_template_path, file)
    with open(path, "r") as f:
        # yaml.load without an explicit Loader is unsafe and deprecated
        return yaml.safe_load(f)
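`load()` returns the parsed config dict for the chosen environment. A sketch of the round trip on an inline config; the keys shown here are assumptions for illustration, not copied from `dev.yaml`/`prod.yaml`:

```python
import yaml

# Illustrative config in roughly the shape the env yaml files might use
config_text = """
image_factory:
  registry: registry.example.com
rest_server:
  SERVER_PORT: 9186
"""

config = yaml.safe_load(config_text)
# nested mappings come back as plain dicts, scalars as native types
print(config["rest_server"]["SERVER_PORT"])
```

The resulting dict can be passed straight into a Jinja2 `render(**config)` call for the templates.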




efk/README_zh.md (+89, -0)

@@ -0,0 +1,89 @@
# Introduction

OPENI Octopus uses filebeat + elasticsearch to collect the logs of the containers backing each sub-task.

# Approach

1. Deploy the elasticsearch distributed text storage service on every node, forming an ES cluster.

2. Configure an ingress for the ES service, so the ES cluster is reachable through the gateway service.

3. Deploy filebeat to collect the container logs on every node, with the ES cluster configured as filebeat's log backend.

4. The webportal queries the ES cluster; given a container ID, it can search out that specific container's logs.

Example request:

POST http://$GatewayIP/es/_search
```
{
  "query": {
    "match": {
      "log.file.path": "/var/lib/docker/containers/$containerId/$containerId-json.log"
    }
  },
  "size": $pageSize,
  "from": $logIndex,
  "sort": "log.offset"
}
```
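The request above can also be built programmatically. A minimal sketch of constructing the search body for a given container ID; the function name `build_es_query` is illustrative, not part of this repo:

```python
import json


def build_es_query(container_id, page_size, log_index):
    # Matches the json-file log Docker writes for the container,
    # pages with size/from, and sorts by byte offset within the log file
    log_path = "/var/lib/docker/containers/{0}/{0}-json.log".format(container_id)
    return {
        "query": {"match": {"log.file.path": log_path}},
        "size": page_size,
        "from": log_index,
        "sort": "log.offset",
    }


body = build_es_query("abc123", page_size=20, log_index=0)
print(json.dumps(body))
```

POSTing this body to `http://$GatewayIP/es/_search` returns the matching log lines in offset order.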

# Images

1. es-$nodename-statefulset.yml => elastic/elasticsearch:7.1.0

2. filebeat-kubernetes.yaml => docker.elastic.co/beats/filebeat:7.1.0


# Prerequisites

1. Kubernetes version >= 1.13

2. Set the hostname of every node; this document assumes a two-node cluster.

* Set the hostname of the master node

[root@host1 ~]# hostname xp001

* Set the hostname of the second, joining node

[root@host1 ~]# hostname v001

* Restart kubelet

[root@host1 ~]# systemctl restart kubelet


# [Deployment](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)

1. On each node, prepare the user and group that own the ES data directory mapped out of the container:

```
# sudo su
# mkdir /usr/share/elasticsearch
# chmod 0775 /usr/share/elasticsearch
# chown -R 1000:0 /usr/share/elasticsearch
```

If no user with uid 1000 exists, create one:

```
# adduser -u 1000 -G 0 -d /usr/share/elasticsearch elasticsearch
# chown -R 1000:0 /usr/share/elasticsearch
```

2. cd openi

3. kubectl apply -f ./efk

efk/es-external-service.yaml (+19, -0)

@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
  name: es-external-service
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
spec:
  ports:
  - name: es-db
    port: 9200