Dockerized collabora suddenly extremely slow

Dear all, I have been using dockerized Collabora on our server for years without issues, but a while ago initialization started to take ages. This is a bare-metal octa-core server with 32 GB RAM running Ubuntu 20.04 LTS and the docker-ce package from docker.com.

I have also described the issue in the Nextcloud Collabora forum, but I was hoping that the experts would be here.

All has been working just fine until some weeks ago. But something changed and now it takes several minutes to initialize each time I try to open a document in Nextcloud.

The Docker instance is running behind an NGINX reverse proxy. I can’t pinpoint the exact moment it changed. I initially took it for a temporary hiccup, so I don’t know whether it correlated with any particular upgrade of the Collabora Docker image or of Docker itself. (Nothing else should have changed except for security fixes, as the server is running Ubuntu 20.04 LTS.)

Now, both the external demo server available as a Nextcloud app and the CODE app run fine. Only the Docker version has these problems. In the web server log for the dockerized Collabora instance I see these messages:


2021/08/29 12:00:09 [error] 3381458#3381458: *12115 connect() failed (111: Connection refused) while connecting to upstream, client: *****, server: office.****.org, request: "GET /hosting/capabilities HTTP/1.1", upstream: "http://[::1]:9980/hosting/capabilities", host: "office.uferwerk.org"
2021/08/29 12:00:09 [warn] 3381458#3381458: *12115 upstream server temporarily disabled while connecting to upstream, client: *****, server: office.****.org, request: "GET /hosting/capabilities HTTP/1.1", upstream: "http://[::1]:9980/hosting/capabilities", host: "office.***.org"

The output of docker container logs collabora looks like this, which is not very revealing:

frk-00044-00044 2021-08-29 09:53:22.369268 [ forkit ] ERR  Failed to unmount [/opt/lool/child-roots/DZprHqJzWhKFWr88].| common/JailUtil.cpp:68
wsd-00007-00048 2021-08-29 09:53:39.584546 [ websrv_poll ] WRN  client - server version mismatch, disabling browser cache. Expected: d12ab86| wsd/FileServer.cpp:288
wsd-00007-00048 2021-08-29 09:53:39.986969 [ websrv_poll ] WRN  client - server version mismatch, disabling browser cache. Expected: d12ab86| wsd/FileServer.cpp:288
wsd-00007-00048 2021-08-29 09:53:41.927800 [ websrv_poll ] WRN  client - server version mismatch, disabling browser cache. Expected: d12ab86| wsd/FileServer.cpp:288
wsd-00007-00048 2021-08-29 09:53:42.365004 [ websrv_poll ] WRN  client - server version mismatch, disabling browser cache. Expected: d12ab86| wsd/FileServer.cpp:288
wsd-00007-00095 2021-08-29 09:56:07.675928 [ docbroker_004 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682
wsd-00007-00095 2021-08-29 09:56:07.860668 [ docbroker_004 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682

I have double and triple checked that the reverse proxy config is identical to the official instructions. I also tried older docker images, so far no change. Does anyone have an idea what might be causing the slowdown? Again, it has worked fine for years before.

I’d start by looking at network issues such as: name resolution for both IPv4 and IPv6, connectivity on both protocols, availability of IPv6 in Docker (or not), a firewall, SELinux, or AppArmor blocking one of the protocols, and the nginx proxy configuration for both protocols. Maybe you can even reproduce something using other services on your Collabora machine.
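
One detail in the nginx log above: the refused upstream is http://[::1]:9980, i.e. the IPv6 loopback address. A possible explanation (an assumption, nothing in the logs confirms it) is that the proxy target is a hostname such as localhost that resolves to both ::1 and 127.0.0.1, while the container's port is only reachable on one of them. Here is a minimal Python sketch to check this from the host. Port 9980 is taken from the log; the probe helper and the list of hosts are only illustrative.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> dict:
    """Try a TCP connect to every address `host` resolves to.

    Returns a dict mapping "IPv4 <addr>" / "IPv6 <addr>" to "ok" or "failed: ...".
    """
    results = {}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {host: f"resolution failed: {exc}"}
    for family, socktype, proto, _canon, sockaddr in infos:
        version = "IPv6" if family == socket.AF_INET6 else "IPv4"
        label = f"{version} {sockaddr[0]}"
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
            results[label] = "ok"
        except OSError as exc:
            results[label] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # 9980 is the upstream port shown in the nginx error log.
    for host in ("localhost", "127.0.0.1", "::1"):
        for label, outcome in probe(host, 9980).items():
            print(f"{host:10} -> {label:20} {outcome}")
```

If ::1 is refused while 127.0.0.1 connects, pointing the proxy at 127.0.0.1 explicitly would be one thing to try.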

Thanks, that looks like quite a programme of work. :wink:

Strangely though, for no discernible reason it has just now resumed working flawlessly (after several weeks of dysfunction). I am completely dumbfounded as to why that is. So I’ll have to wait and see whether the issue returns.

UPDATE: I rejoiced prematurely. The issue is back already. Opening a document takes about 5 minutes.

First I see these messages in the log:

wsd-00008-00277 2021-08-29 12:08:02.118837 [ docbroker_011 ] WRN  Tracker tileID 0:7680:7680:3840:3840:0 was dropped because of time out (5338ms). Tileprocessed message did not arrive in time.| wsd/ClientSession.cpp:1969
wsd-00008-00277 2021-08-29 12:08:02.118844 [ docbroker_011 ] WRN  Tracker tileID 0:11520:7680:3840:3840:0 was dropped because of time out (5338ms). Tileprocessed message did not arrive in time.| wsd/ClientSession.cpp:1969

After the long wait, when it has finished loading, this is what I see:

wsd-00008-00283 2021-08-29 12:13:25.938298 [ docbroker_012 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682
wsd-00008-00283 2021-08-29 12:13:26.058314 [ docbroker_012 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682

As for the server’s connectivity, we host a ton of websites there, and the server has a static public IP (v4 and v6). The issue only affects the dockerized Collabora instance, and it only started occurring a few weeks ago. It does not affect a native Collabora instance installed as a Nextcloud app; there, documents open instantaneously.

I wonder if running Collabora without a reverse proxy would change anything, but for that to happen, I would have to get the Docker container to use the Let’s Encrypt certificate. (As I understand it, it does not have the capability (yet?!?) of obtaining certificates from Let’s Encrypt by itself.)

I seem to have the same issue, currently on NC 22.2.0 and Collabora 6.4.11.3 (I also tested the latest, 6.4.13.2).
Opening a doc takes ages and I see two warnings after it finally loads:

wsd-00008-00045 2021-10-16 15:03:33.111494 [ docbroker_001 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682
wsd-00008-00045 2021-10-16 15:03:33.192734 [ docbroker_001 ] WRN  Waking up dead poll thread [HttpSynReqPoll], started: false, finished: false| ./net/Socket.hpp:682

Are there more logs somewhere? It sounds nasty. Slowness on load depends on which stage you’re at =) It would be nice to have a picture of the browser side: what shows up, do we get the iframe, does it have content, did loleaflet.html load, that sort of thing. Is it possible that the slow part is fetching the document from your Nextcloud?

Thanks!
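
To narrow down which stage is slow, one option is to time the plain HTTP endpoints separately from the document load. Below is a rough Python sketch. office.example.org is a placeholder for the proxied hostname, /hosting/capabilities is the path from the nginx log earlier in the thread, and /hosting/discovery is the WOPI discovery document that Collabora Online serves. time_get is a hypothetical helper name, not part of any tool.

```python
import time
import urllib.error
import urllib.request


def time_get(url: str, timeout: float = 600.0):
    """Do one GET and return (outcome, seconds).

    outcome is the HTTP status code, or an "error: ..." string if the
    request did not get an HTTP response at all.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            outcome = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        outcome = exc.code
    except OSError as exc:
        # Covers URLError, refused connections and timeouts.
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    base = "https://office.example.org"  # placeholder, replace with your host
    for path in ("/hosting/capabilities", "/hosting/discovery"):
        outcome, seconds = time_get(base + path)
        print(f"{path:25} {outcome!s:30} {seconds:8.2f}s")
```

If both endpoints answer quickly, the delay is more likely in fetching the document from Nextcloud or in the later websocket stage than in reaching the container.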