In their most raw form, camera sensors only see illumination, not color.
In front of the sensor is a Bayer filter, which results in each physical pixel seeing illumination filtered to R, G, or B.
From there, the software onboard the camera or in your RAW converter interpolates to create RGB values at every pixel. For example, if the local pixel is R-filtered, its G & B values are interpolated from nearby pixels behind those filters.
This is also why Leica's B&W-sensor cameras have apparently higher sharpness & ISO sensitivity than the related color-sensor models: there is no filter in front and no software interpolation happening.
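A minimal sketch of that interpolation step, assuming an RGGB Bayer layout and a naive neighborhood average (real converters use much smarter, edge-aware algorithms):

    import numpy as np

    def demosaic_naive(mosaic):
        # mosaic: 2D array, one measured value per sensor pixel, RGGB tiling assumed.
        h, w = mosaic.shape
        r_mask = np.zeros((h, w), dtype=bool); r_mask[0::2, 0::2] = True
        b_mask = np.zeros((h, w), dtype=bool); b_mask[1::2, 1::2] = True
        g_mask = ~(r_mask | b_mask)

        def fill(mask):
            # Keep measured samples, fill the gaps with the 3x3 average of measured neighbors.
            known = np.where(mask, mosaic, 0.0)
            count = mask.astype(float)
            total = np.zeros_like(known)
            hits = np.zeros_like(count)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    total += np.roll(np.roll(known, dy, axis=0), dx, axis=1)
                    hits += np.roll(np.roll(count, dy, axis=0), dx, axis=1)
            return np.where(mask, mosaic, total / np.maximum(hits, 1e-9))

        # One full-resolution plane per color, stacked into an H x W x 3 image.
        return np.dstack([fill(r_mask), fill(g_mask), fill(b_mask)])

Real converters do edge-aware interpolation rather than this box average, which is why fine detail survives better than you might expect.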
That's how the earliest color photography worked. "Making color separations by reloading the camera and changing the filter between exposures was inconvenient", notes Wikipedia.
I think they are both more asking about 'per-pixel color filters'; that is, something like a sensor filter/glass, but where the color separators could change (at least 'per-line') fast enough to get a proper readout of the color information.
AKA imagine a camera with R/G/B filters being quickly rotated out for 3 exposures, then imagine it again but with the technology integrated right into the sensor (and, ideally, the sensor and switching mechanism fast enough to read out with a rolling shutter competitive with modern ILCs).
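In code terms, the rotating-filter idea is just three monochrome exposures stacked into one full-resolution color image; a toy sketch, where frame_r/frame_g/frame_b are hypothetical captures through each filter:

    import numpy as np

    # Hypothetical monochrome exposures, one per filter position (same scene, same framing).
    frame_r = np.random.rand(480, 640)   # stand-in for a real capture through the R filter
    frame_g = np.random.rand(480, 640)   # ...through the G filter
    frame_b = np.random.rand(480, 640)   # ...through the B filter

    # Full color at every pixel, no interpolation needed -- but only valid
    # if nothing in the scene moved between the three exposures.
    rgb = np.dstack([frame_r, frame_g, frame_b])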
Works for static images, but if there's motion the "changing the filters" part is never fast enough: there will always be colour fringing somewhere.
Edit: or maybe it does work? I've watched at least one movie on a DLP-type video projector with sequential colour and not noticed colour fringing. But still photos have much higher demands here.
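A rough way to see the fringing effect: simulate subject motion between sequential exposures by shifting the later channels of an otherwise perfect capture, then check where the channels disagree (toy numbers, not from any real camera):

    import numpy as np

    # A grey square on a black background, 'captured' three times: R, then G, then B.
    scene = np.zeros((100, 100))
    scene[40:60, 40:60] = 1.0

    r = scene                        # exposure 1
    g = np.roll(scene, 2, axis=1)    # subject moved 2 px before exposure 2
    b = np.roll(scene, 4, axis=1)    # ...and 2 px more before exposure 3

    rgb = np.dstack([r, g, b])
    # Wherever the channels disagree, the pixel is no longer neutral grey: that's the colour fringe.
    fringe = rgb.max(axis=2) - rgb.min(axis=2)
    print("fringed pixels:", int((fringe > 0).sum()))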
You can use sets of exotic mirrors and/or prisms to split the incoming image into separate R, G, and B beams onto three independent monochrome sensors, through the same single lens and all at once. That's what "3CCD" cameras and their predecessors did.
The sensor outputs a single value per pixel. A later processing step is needed to interpret that data, given knowledge about the color filter (usually a Bayer pattern) in front of the sensor.
The raw sensor output is a single value per sensor pixel, each of which sits behind a red, green, or blue color filter. So to get a usable image (where each pixel has a value for all three colors), we have to somehow combine the values from some number of these sensor pixels to fill in the two missing colors at each location. This is the "debayering" (demosaicing) process.
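If you want to see both stages yourself, something like rawpy (a Python wrapper around LibRaw) exposes the one-value-per-pixel mosaic as well as the demosaiced result; a quick sketch, with 'photo.dng' as a placeholder path to any RAW file LibRaw can open:

    import rawpy

    with rawpy.imread('photo.dng') as raw:
        mosaic = raw.raw_image       # 2D array: one value per sensor pixel
        print(mosaic.shape)          # e.g. (4024, 6048) -- no color axis
        rgb = raw.postprocess()      # demosaiced ("debayered") RGB image
        print(rgb.shape)             # e.g. (4024, 6048, 3) -- R, G, B at every pixel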
And this is important because our perception is more sensitive to luminance changes than to color, and since our eyes are most sensitive to green, luminance is dominated by green. So: higher perceived spatial resolution by using more green [1]. This is also why JPG stores the color (chroma) channels at lower resolution than luminance, and why modern OLEDs usually use a pentile layout, with only green at full resolution [2].
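The green bias shows up directly in the standard luma weights, and chroma subsampling (what JPG does) keeps luma at full resolution while halving the color planes; a small sketch using the Rec. 601 coefficients:

    import numpy as np

    def rgb_to_ycbcr(rgb):
        # Rec. 601 conversion: note how heavily luma (Y) weights the green channel.
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299 * r + 0.587 * g + 0.114 * b          # green dominates luma
        cb = -0.168736 * r - 0.331264 * g + 0.5 * b
        cr =  0.5 * r - 0.418688 * g - 0.081312 * b
        return y, cb, cr

    img = np.random.rand(8, 8, 3)             # stand-in image
    y, cb, cr = rgb_to_ycbcr(img)

    # 4:2:0-style subsampling: full-resolution luma, half-resolution chroma in each direction.
    cb_sub = cb[::2, ::2]
    cr_sub = cr[::2, ::2]
    print(y.shape, cb_sub.shape, cr_sub.shape)  # (8, 8) (4, 4) (4, 4)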
Pentile displays are acceptable for photos and videos, but look really horrible when displaying text and fine detail, which ends up looking almost like what you'd see on an old triad-shadow-mask colour CRT.
Is the output produced by the sensor RGB or a single value per pixel?