Summary: USB endpoints built around the USB 2.0 high-speed Cypress FX2LP (EZ-USB FX2LP™) suffer FX2LP buffer overflows when using bulk transfers from the endpoint to an Intel PCH, due to an insufficient polling rate by the PCH. The previous generation of Intel processors, and also "less powerful" solutions (such as Broadcom SoC-based host controllers), did not present this issue. The product is already on the market.
More Detail: The USB endpoint is a data streaming device. The expectation is that no data will be lost, since even "small" 16-bit microcontroller-based USB hosts have been shown to consume the data fast enough to prevent loss. The theory is that some power, thermal, or other management process is preventing the PCH from polling fast enough to keep up with the required data rate (approx. 20 MB/s). I have LeCroy USB protocol analyzer traces available that show a request with lost data, and I can provide more traces if needed.
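To put the required rate in perspective, here is a rough back-of-the-envelope sketch of how long the host can go without servicing the endpoint before the device-side buffer fills. The 20 MB/s figure is from the description above; the 4 KiB FIFO size and 512-byte bulk packet size are assumptions for illustration and should be replaced with the values for the actual endpoint configuration.

```python
# Rough stall budget for a streaming bulk IN endpoint.
# Assumed for illustration (not stated in the post): 4 KiB of device-side
# FIFO and 512-byte high-speed bulk packets.

MICROFRAME_S = 125e-6  # USB 2.0 high-speed microframe period, in seconds


def stall_budget_us(fifo_bytes: int, rate_bytes_per_s: float) -> float:
    """Microseconds until a producer at the given rate fills an empty FIFO."""
    return fifo_bytes / rate_bytes_per_s * 1e6


def packets_per_microframe(rate_bytes_per_s: float, packet_bytes: int) -> float:
    """Average number of packets the host must drain per microframe."""
    return rate_bytes_per_s * MICROFRAME_S / packet_bytes


if __name__ == "__main__":
    rate = 20e6          # approx. 20 MB/s, from the description
    fifo = 4096          # assumed FIFO size in bytes
    packet = 512         # assumed bulk packet size in bytes
    budget = stall_budget_us(fifo, rate)
    print(f"stall budget: {budget:.1f} us "
          f"({budget / (MICROFRAME_S * 1e6):.2f} microframes)")
    print(f"packets per microframe: {packets_per_microframe(rate, packet):.2f}")
```

Under these assumptions the budget is only on the order of 200 microseconds, so a host that stops issuing IN tokens for even a couple of microframes would be enough to cause an overflow.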
Our endpoint is sold as a "bundled" solution: Microsoft Surface + endpoint. The issue was first found when the (Skylake-based) Surface Pro 4 was tested (the Surface Pro 3, which is based on Gen5 Intel, is fine). A second Skylake system, the Lenovo P70 (Xeon), was tested and found to have the same issue. Otherwise, the solution has been tested on everything from SoC/ARM to previous-generation Intel hosts, and similar failures did not occur. Therefore, the issue has been identified as specific to Skylake.
Over time, reproducing the issue has gone from trivial (failures "left and right" with earlier firmware) to more difficult. Changes in the SP4 firmware/BIOS/OS/etc. have likely improved performance. The Lenovo is not updated as often, so some regression testing may be called for to see whether it still fails more readily.
The request is for some guided debug that allows shutting off the throttling, state changes, etc. that are unique to Skylake and that affect the USB polling frequency. Since my product is the endpoint and not the host controller, I do not have XDP access on the host side to force register changes before the BIOS takes control.
Note that "bonehead" issues have been eliminated: we have been working on this issue for several months, including with Cypress. The channel is good (not an SI issue), and the problem has been proven to be data not being extracted fast enough by the host controller, resulting in data loss, rather than any other cause. Using known test patterns sent by our endpoint, we can identify which data is lost.
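As an illustration of how a known test pattern pinpoints lost data, here is a minimal sketch that scans a captured stream for discontinuities. It assumes the pattern is a 32-bit little-endian incrementing counter and that losses fall on 4-byte boundaries; both are assumptions for the example, not a description of our actual pattern, and `find_gaps` is a hypothetical helper name.

```python
import struct

WORD = 4  # bytes per counter value (assumed 32-bit little-endian pattern)


def find_gaps(data: bytes, start: int = 0):
    """Return (byte_offset, expected, got, missing_words) for each jump.

    Assumes the stream carries consecutive 32-bit counter values and that
    any loss is word-aligned; trailing partial words are ignored.
    """
    gaps = []
    expected = start
    usable = len(data) - len(data) % WORD
    for offset in range(0, usable, WORD):
        (value,) = struct.unpack_from("<I", data, offset)
        if value != expected:
            missing = (value - expected) & 0xFFFFFFFF
            gaps.append((offset, expected, value, missing))
        # Resynchronise on the value actually received.
        expected = (value + 1) & 0xFFFFFFFF
    return gaps


if __name__ == "__main__":
    # Simulate a capture where counter values 4, 5 and 6 were dropped.
    received = [0, 1, 2, 3, 7, 8, 9]
    stream = b"".join(struct.pack("<I", v) for v in received)
    for offset, expected, got, missing in find_gaps(stream):
        print(f"offset {offset}: expected {expected}, got {got}, "
              f"{missing} words missing")
```

Correlating the reported offsets with the analyzer trace timestamps shows which host-side polling gap corresponds to each block of lost data.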