Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
After change to http.sys, errors on production servers after a few days of operation
#1
Hi,

let me give you some background info first: We have a couple of production servers powering a SaaS solution, one of them serving as a master and the other a slave (continuously getting data backups from the master). These both contain three IntraWeb standalone applications (running as services) as well as a number of other backend services (some of which expose APIs). Previously, these services listened on different ports (for https traffic), with an external firewall translating calls on the standard 443 port for different external IP addresses (connected to unique URLs) to the port used internally on the server. The three IW services as well as a couple of APIs written in Delphi were all using OpenSSL, whereas one .NET service used Windows own SSL functionality for its API. As we were running out of IP addresses on the hosting site, we decided to refactor our applications to all using http.sys, so that all services could share the same port and external IP address (being distinguished by different paths in the URL, such as '/api', '/m' etc.), . A secondary reason was for not being stuck with OpenSSL 1.0.2 as this no longer receives updates.

This all worked well in test environments and also initially after upgrading our production services to the new version using http.sys. However, after a few days of operation the master server started getting various errors, which could not be resolved by anything else than a restart of the server. This repeated itself every 2 or 3 days. We tried switching operation to the other (slave) server, but then the same errors occurred there (after running a few days). We also tried upgrading the OS of one of the servers to Windows Server 2019 (previously both were running Windows Server 2012), but it did not solve the problem. The errors we saw in our maintenance logs had not occurred before the upgrade. They affect different services and you do not see any obvious connection between them, but once one of the errors started occurring, the others soon followed. My guess is that they are different symptoms of the same underlying issue, but not necessarily a clue to what the root problem is. These are the errors we have seen:


  1. Exception EMenuError with message 'Out of system resources' raised when trying to set Enabled to False for menu item (from an IntraWeb application)
  2. A command-line application sox.exe used for audio file conversions fails, and you could (at least in Win 2012) see a corresponding error for a kernel dll in the Windows Event log.
  3. Parsing of XML files using MSXML (Microsoft XDOM) fails.
  4. Failing to set up a new connection to a NexusDB server using a so called SharedMemory transport.
  5. Seeing error event 10010 in Windows event log (timeout error for DistributedCOM), however not seemingly connected to an event in any of our services.   

We have improved error handling in our code, removed unnecessary usage of the sox application, replaced MSXML with OXML, using another type of transport towards nxServer etc. making our applications more robust against the errors, but the errors 1, 2 and 5 still do occur typically after 2-10 days of operation since the last restart and then we need to restart again. Has anyone else experienced similar problems? Any insights to as to why this is happening, and possible solutions or workarounds would be much appreciated.

Best regards

Magnus Oskarsson
Reply
#2
Hi Magnus,

My guess regarding 2, 3 and 4 seems related to user mode x kernel mode or even to different permissions required to run the processes.

Can you describe what error do you get in (2) ? I'd guess that the sox.exe application needs access to something that is not available to the kernel mode app (or to the user running it).

Is this sox.exe the same as this (https://sox.sourceforge.net/) ?

How are you executing sox.exe? Can you try, for instance, to run another external application (it can be something simple as xcopy c:\temp\file.txt c:\temp2\) and see what you get?
Also, can you create a simple test application using the sox app so I can test it in my environment?

Regarding (1), it's hard to tell. As it seems, it always happens from the exact same point in your code, is that correct?

Finally, regarding (5), I don't think it is related to IntraWeb at all, unless the sox.exe process (or some other secondary process) uses some form of COM (or DCOM).
Reply
#3
Hi Alexandre and thanks for your feedback!

For your question on 1: Yes, it is when we set the Enabled property on the menu items after login to our system (based on e.g. user access level). After I added better error handling and logging for this, I could see that it seems like if you get this for one menu item in a particular (typically newly created) session, you will get it for the rest of the menu items we do this for as well. But then it is still possible that you can have a login from another (newer) session without this error.

As for 2, we have minimized its usage to when it is really needed, so I doubt that it is the actual source of the issues. Also, an important note is that the usage is not from within the web applications, but lies in a separate service. And this service was not modified as part of the http.sys upgrade we made. Still, I went to the code in this service (for which I did add some extra error handling and logging earlier when these problems first came), and I see that it is using TStSpawnApplication (from the old SysTools library).

As I said, I suspect 1-5 are symptoms rather than causes for the problems.

I could also mention that we besides the web applications also updated a couple of services with APIs powered by RemObjects, by changing their server components from Indy/OpenSSL-based to Windows-based (so that they also use http.sys). But I doubt that they are the source of the problems, I have made simulations in our test environment where I have stressed the APIs much harder than in the real production environment (with https calls, using the same kind of setup with regards to SSL certificate, ciphers etc.), without getting any issues. It is harder to simulate the actual usage over time done by real users of the web applications in the production environment.

So still, what stands out with this software upgrade we made is the changes to the web applications with Indy/OpenSSL replaced by http.sys. As part of this we migrated IW to version 15.2.49 (We also had the CGDevTools package upgraded to a matching version, but this is not used in our main web application).
Reply
#4
Hi again. I have tried to dig deeper and looked at memory usage for different services in our maintenance logs. Interestingly, our main web application came up to around the same level (350 MB) before restart in an interval that contains 5 server restart (after errors ocurred). When I look at the graph and check a few steep "jumps" in memory usage (Working Set) I found at least 3 places where a new session followed by an 'Invalid page context' exception leads to an instant increase in 50 MB. No other activity in the web application at this time. Extract from our logs:

Code:
"23141";"2022-11-30 04:56:57.119";"EmWebClient";"\Information";"\GenClient\ServerController";"";"BrowserCheck: ChromeMobile 94,00, fully supported by IW framework. UserAgent = Mozilla/5.0 (Linux; Android 7.0; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Mobile Safari/537.36 Chrome-Lighthouse. IP = ?"
"23142";"2022-11-30 04:56:57.135";"EmWebClient";"\Information";"\GenClient\ServerController";"";"Updating StatusInfo, setting SessionCount to 1"
"23143";"2022-11-30 04:56:57.135";"EmWebClient";"\Information";"\BaseClient\UserSession";"";"New session: AppID=IjDBkYnqhJz4WK5pQAFvOR7l31y, SessionGUID={6B0E4C9F-36B7-48BD-8B21-F3DE31346D95}"
"23144";"2022-11-30 04:56:57.431";"EmWebClient";"\Debug\UI";"\BaseClient\Web\Form";"";"Form created: BaseClientLoginWebDialog"
"23145";"2022-11-30 04:57:02.466";"EmWebClient";"\Error\Exception";"\GenClient\ServerController";"";"Exception EIWSecurityException with message 'Invalid page context' raised when processing in TBaseClientLoginWebDialog"
"23146";"2022-11-30 04:57:02.466";"EmWebClient";"\Error";"\GenClient\ServerController";"";"Stack trace for exception has been output to IWExceptions directory"

In another case, the time between session creation and the exception is much longer (5 minutes). In this case it is followed by a second exception 'Session context is not valid':

Code:
"10047";"2022-12-03 16:48:04.413";"EmWebClient";"\Information";"\GenClient\ServerController";"";"BrowserCheck: SafariMobile 16,10, fully supported by IW framework. UserAgent = Mozilla/5.0 (iPhone; CPU iPhone OS 16_1_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Mobile/15E148 Safari/604.1. IP = ?"
"10048";"2022-12-03 16:48:04.413";"EmWebClient";"\Information";"\GenClient\ServerController";"";"Updating StatusInfo, setting SessionCount to 1"
"10049";"2022-12-03 16:48:04.429";"EmWebClient";"\Information";"\BaseClient\UserSession";"";"New session: AppID=YDSZtQ38npJA-5LcIQBABdjTW~a, SessionGUID={C3799D99-46D2-4575-8A6B-6B5A9EE6FF03}"
"10050";"2022-12-03 16:48:04.726";"EmWebClient";"\Debug\UI";"\BaseClient\Web\Form";"";"Form created: BaseClientLoginWebDialog"
"10051";"2022-12-03 16:53:26.079";"EmWebClient";"\Error\Exception";"\GenClient\ServerController";"";"Exception EIWSecurityException with message 'Invalid page context' raised when processing in TBaseClientLoginWebDialog"
"10052";"2022-12-03 16:53:26.079";"EmWebClient";"\Error";"\GenClient\ServerController";"";"Stack trace for exception has been output to IWExceptions directory"
"10053";"2022-12-03 16:53:26.376";"EmWebClient";"\Hint";"\GenClient\ServerController";"";"Exception EInvalidAppID with message 'Session context is not valid' raised when processing in TBaseClientLoginWebDialog"

Do you have any explanation for this? We see the 'Invalid page context' exceptions now and then in our maintenance logs, from all three web applications, but we have never been able to reproduce them ourselves. Let me know if the stack traces are of interest, and/or if the IntraWeb session log could contain anything of interest (I am not sure how far back they go, also, you need some kind of tool to read them, right?)

/Magnus
Reply
#5
I'll split my response into more than one post because I think it will be simpler.

2)

I have never used myself TStSpawnApplication. The code looks solid, although it uses ShellExecuteEx() instead of the (IMO the preferred way to start another application) CreateProcess() Windows API.

In the old days we used to recommend ShellExecute(Ex) to start other processes, but a long time ago we found that it used to fail (mainly in ISAPI apps) under some special circumstances. Instead of banging our heads against the wall trying to fix it we just replaced it by CreateProcess() and the issues went away.
That's how the routine ExecProcess() in IWUtils was born. We don't use it ourselves in IW but we recommend our uses to use it whenever they want to run other processes from within an IW app. It's simple and don't have all the bells and whistles but works great.

AFAIK, ShellExecuteEx() does much more than CreateProcess() (including some Registry checks) and in the end it calls CreateProcess to do the actual job. The fact that it does some extra stuff (including Registry access) suggests me that it is more prone to fail.

Maybe it's worth a try and see if the problems go away. If you want, feel free to copy our ExecProcess() routine to the other application codebase and use it from there independently from IntraWeb.
Reply
#6
Regarding the "Invalid Page Context" error.

It is caused by a mismatch when checking the FormID received by the IntraWeb application. I find it hard that it will suffer any influence from the type of application (Http.sys, Indy, ISAPI...) because it is just a content field like any other. If there was something wrong receiving content fields you would see an error right before starting the IW session...

From the top of my head I can't tell you if there was any issue in IW 15.2.49 regarding FormID validation, but we indeed improved it in subsequent releases. The FormID validation will just raise an exception and interrupt the response but this is a treated exception and shouldn't cause any memory increase. Can you tell me if this is a Sync or Async request?

You can turn off the FormID check (in ServerController.SecurityOptions) and see how it behaves. The error should be completely gone then. We would need to investigate what is the request that is causing that and find out why it is happening. But, again, I would expect that Indy and Http.sys behave 100% the same in that regard.
Reply
#7
Hi Alexandre and thanks for your feedback!

As for ExecProcess, I might try that later, although I will give not give it the highest priority. The reason is as I said earlier that it's within a (non-Intraweb) service that did not undergo any particular changes for the upgrade, plus that we have also already minimized the usage of the sox application.

As for the "Invalid page context" exception, one observation is that we in most (but not all) cases get it on the first form loaded for the session. Now that I think of it, I have actually gotten it myself once. That was when I went to the login page of our main web application and noticed that an image containing our logo was not properly loaded. I then hit F5 or perhaps the refresh button in the browser, and I think this is when the exception came. If I recall correctly, the page was properly loaded after the refresh, and I think I could log on to the application after that. So one theory could be that it occurs when a page for some reason is not properly loaded in the browser. Another observation is that in the two log examples I provided, there should normally be another log line after "Form created:..." triggered by an async timer event (a one-time thing striking after 50 msec) and it is not there. Probably the user has tried to refresh the page later, and that's when the exception comes. My guess is that the request that triggers the exception is synchronous.

Also, I checked some other cases and found one completely different exception where there also was a huge jump in memory (this time 100 MB). Could it be the call stack generation that can take a lot of memory (rather than the particular exception)?. It looks like we have a number of known exception types (such as EExpiredSession) in our code for ServerController.OnException where we don't log the occurrence as an error, and don't generate a call stack (Is that done by setting the Handled parameter to False? I have not written this code and don't fully understand it. Is it the same parameter that controls if you go to the exception renderer or not?). If you think the 'Invalid page context' case is to be considered as harmless, I could do the same for this. Or I could do it for the entire exception class EIWSecurityException (but I am guessing this is not so wise?). Of course I could also try with CheckFormId = False if you think this is better (although we do want to run with maximum web security if possible).

But you are right that this exception probably has little to do with http.sys, we had seen it also before the upgrade of IntraWeb version and the change to using http.sys instead of Indy. I just noted the memory jump associated to it. As it seems, our main web application is the only service that looks a bit suspicious when you look at the memory usage just before the system restarts: My next step could be to try to detect memory and other resource leeks.

Best regards

Magnus Oskarsson
Reply
#8
Quote:Could it be the call stack generation that can take a lot of memory (rather than the particular exception)?


I don't think the call stack generation would cause a jump in memory usage that could be perceived like that


Quote:It looks like we have a number of known exception types (such as EExpiredSession) in our code for ServerController.OnException where we don't log the occurrence as an error, and don't  generate a call stack (Is that done by setting the Handled parameter to False? I have not written this code and don't fully understand it. Is it the same parameter that controls if you go to  the exception renderer or not?)


When an exception happens, IW will do this:
1- Log the exception using the standard ExceptionLogger (if enabled)
2- Call any other registered 3rd party exception handlers (E.g. EurekaLog). If any exist and returns TRUE, then IW considers the exception as Handled (see 3 below)
3- Generate the ServerController.OnException event. This event contains the Handled var parameter as you mentioned. If Handled returns TRUE, IW will stop any processing related to this exception, otherwise it will:
4- Call the registered exception renderer class which is responsible for generating that standard error page. The call stack is obtained during this stage.

So, you are correct. If Handled somehow returns True (because of some 3rd party exception handler or your own code) the call stack is not obtained.

Quote:If you think the 'Invalid page context' case is to be considered as harmless, I could do the same for this

It is harmless as long as you only get it on this especial circumstances (i.e. a user refreshes the page after the page loading somehow got broken). If you get several exceptions of that type during normal usage it may indicate a problem with your code, a 3rd party library (e.g. TMS, CG DevTools) or a bug in IntraWeb itself which hasn't been detected.

I would suggest you to force an "Invalid Page Context" error in your dev environment (while debugging) and see if FastMM can detect any memory leak.

It is kind of simple to force this. Run the application, open the browser dev tools (F12). Scroll to the bottom of the "inspector" view and find the "IW_FormID_" hidden field. Edit its content and change the value. Now you just need to post this page (any button with an associated event should do it) or even refresh it pressing F5.
Now, after generating the "Invalid Page Context" error, all you need to do is to terminate the application and see if FastMM reports anything.

If you need some help on using FastMM4 to find memory leaks we have created a nice how to here: 

https://www.atozed.com/2021/07/detecting...-fastmm-4/

And also with Deleaker, which is another nice tool that works with any memory manager: https://www.atozed.com/2022/07/detecting...-deleaker/
Reply
#9
Hi Alexandre and thanks for the feedback! I will look deeper into the things you mention. My suspicions regarding our main web application has strengthened. Besides the memory usage pattern I saw, by checking all times this service was restarted (e.g. due to a patch being applied) and putting it on the same timeline as occurrence of errors and server restarts, I can see that when we have had longer periods of operation without errors, this coincides with having at least one restart of the main web application service in this time interval. So it might be that some resources are returned to the system when the process is terminated (but also that it might be too late if errors have started to occur in the system). Actually, I have already started memory and resource leak analysis (using CodeWatch from the Nexus Quality Suite). Although I see some that looks to be purely IW related, I am focusing on leaks in our own code to begin with. I have found and fixed a few cases so far, but all of those should have been present also before the http.sys upgrade (and thus likely not the cause of the problems we had).

I will post an update here when I have more information.
Reply
#10
Hi again, I did some tests with F12 developer tools, changing FormID and then refreshing, and yes, it was easy to reproduce 'Invalid page context' errors this way. I have tested adding the following to our code to avoid call stacks being generated for this error:

Code:
  ExceptionLogger.RegisterIgnoreException(EIWSecurityException);

This seems to work and in the tests with developer tools, the page recovered nicely and "silently" after such a case. But I want to make sure this only affects logging, and does not introduce any security risks?


I deployed a version to the production server a couple of days ago with this version plus some fixed memory leaks in our own code, but I have to wait longer to see if it had any effect (since the web application was restarted which I believe will postpone the error state in the system). I still see the effect of a 50 MB jump in memory when an exception occurs (this time it was a previously unseen exception in our own code). Possibly the memory jump comes only for the first exception thrown and handled since the web application was started, but I cannot say that for certain without putting a lot of time into analyzing logs.
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)