I was working an on-call shift when one of our websites started returning error 520 from CloudFlare. I received a notification and started trying to work out what was going on, since there had been no deploy on that project for a really long time. What was weirder was that I could actually open the website, but other people couldn't. The only change we had made was adding the OneTrust cookie policy bar to the website: one JS snippet a few days earlier, and that was all. Naturally, I started by following the logs:
OK, so the problem had to be somewhere in the NGINX config, but where? I started reading posts about similar issues and eventually came across someone mentioning that an oversized Cookie header can cause exactly this kind of failure. I asked the people who had the problem to send me a .har dump of the request. From it I saw that the Cookie really was huge, and that this was because of OneTrust. Unfortunately, there was no way for me to change anything about OneTrust, and no way to clear the users' cookies so that the requests could go through.
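For anyone who wants to check a .har dump the same way, here is a minimal sketch of how the Cookie size could be measured. It assumes the standard HAR layout (log.entries[].request.headers) and is not the exact tool used during the incident; the function name cookie_header_sizes is made up for illustration.

```python
import json
import sys


def cookie_header_sizes(har):
    """Return (url, cookie_bytes) per request that sends cookies, largest first.

    `har` is a parsed HAR document (a dict). If a request carries several
    Cookie headers, their value sizes are summed.
    """
    sizes = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        total = 0
        for header in request.get("headers", []):
            # Header names are case-insensitive; HTTP/2 captures use lowercase.
            if header.get("name", "").lower() == "cookie":
                total += len(header.get("value", "").encode("utf-8"))
        if total > 0:
            sizes.append((request.get("url", ""), total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python har_cookies.py dump.har  (script name is hypothetical)
    with open(sys.argv[1], encoding="utf-8") as fh:
        for url, size in cookie_header_sizes(json.load(fh))[:10]:
            print(f"{size:>8} bytes  {url}")
```

Sorting by size puts the worst requests at the top, which makes it easy to compare the numbers against whatever header limits the proxy in front of the site enforces.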
I had to change the NGINX settings to fix it. I read some more articles and found that I could fix this by raising these 2 limits:
http2_max_field_size 16k;
http2_max_header_size 64k;
I changed those 2 settings and the requests started going through NGINX.
In the end it turned out that everything was caused by OneTrust in combination with the ads we run. Our legal department had set up dynamic vendors in OneTrust for cookies and ads, but with dynamic ads like ours there were hundreds of different providers, so the cookie kept getting bigger and bigger. It was also individual to each visitor: the more time you spent on the site without an ad blocker, the bigger the cookie got, until eventually the requests stopped going through.
A couple of weeks later the same problem occurred again while I was on call, but this time I had an idea of exactly what to do. So naturally I went to tweak those 2 settings again and doubled the limits, but this didn't help :(
After that I went back to the logs and noticed that this time the request was passing through NGINX just fine, but something after it was messing it up. The upstream for NGINX was Varnish, so I went to check the Varnish logs for traces. Here I hit a roadblock: there were no log entries from my IP in Varnish at all. That cleared things up for me and I concluded that the problem was in Varnish, which is fine, but what should I do next? I read questions on Stack Overflow, Reddit and so on, but found nothing really useful. Finally I found a parameter that I could change and that might fix my issue. It is set with the varnishadm command, so it is applied immediately to the running instance. The command was: varnishadm param.set vsl_reclen 4084b
I put a big number at the end (actually the maximum allowed), just to see if it was gonna work.
IT DIDN'T
Since I already knew the cause of the problem (the Cookie size), and our website doesn't have a login and doesn't use cookies for anything useful on the backend, I simply stripped the cookies from the request in NGINX, so that no cookies were passed to Varnish, and there were no more problems :)
I did it with the following directives in the NGINX config:
proxy_hide_header Set-Cookie;
proxy_ignore_headers Set-Cookie;
proxy_set_header Cookie "";