Discussion:
[FFmpeg-devel] Performance of P010LE/BE pixel conversion
Ali KIZIL
2016-09-01 05:52:07 UTC
Permalink
Hi all,

I tested P010LE pixel format conversion from YUV420P for NVENC Main 10 HEVC
UHD 50 fps encoding on an Nvidia Pascal Titan X GPU.

The Nvidia Pascal Titan X GPU cannot reach 50 fps for Main 10 P010LE HEVC
encoding:

ffmpeg -loglevel verbose -i
/media/usb1/4k_sampels/Samsung_SUHD_Picture_Quality\ Demo_Nano_Crystal\
Display_UK-Version.mp4 -c:v:0 nvenc_hevc -preset hp -cbr 1 -2pass 0 -r 50
-vb 28000k -minrate 28000k -maxrate 28000k -bufsize 28000k -muxrate 30000k
-c:a:0 aac -b:a:0 192k -pix_fmt p010le 'udp://233.33.33.1:5001'

FPS fluctuates around 41-43 fps. With the same command and YUV420P, it
reaches 120-130 fps.

GPU NVENC Load:
nvidia-smi dmon -i 0
gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
Idx     W     C     %     %     %     %   MHz   MHz
  0    81    67     9     2    41     0  4513  1809
  0    80    66     9     2    41     0  4513  1809
  0    81    67     9     2    42     0  4513  1809
  0    80    67    10     2    41     0  4513  1809
  0    81    67     9     2    44     0  4513  1809

I think the bottleneck is not on the GPU side; the pixel format conversion
may need a speed improvement.


If the codec is changed to rawvideo to test pixel format conversion
performance on its own, FPS again fluctuates around 39-40 fps.


I wanted to report these test results.


Kind Regards,
Oliver Collyer
2016-09-01 06:56:52 UTC
Permalink
What CPU are you using? It's presumably going to vary wildly from one CPU to another?
_______________________________________________
ffmpeg-devel mailing list
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Ali KIZIL
2016-09-01 10:35:03 UTC
Permalink

The test was done on an "Intel(R) Core(TM) i7-4960X CPU @ 3.60GHz" with 32 GB
DDR3 (8pcs. x 4GB Kingston KHX2133C11D3), running Linux kizil105
3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015
x86_64 x86_64 x86_64 GNU/Linux.

For memory, dmidecode shows 1333 MHz. I will check whether the RAM speed is
set wrong in my BIOS and update the mailing list.
sudo dmidecode --type memory
# dmidecode 2.12
# SMBIOS entry point at 0x000f04c0
SMBIOS 2.7 present.

Handle 0x0029, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 96 GB
Error Information Handle: Not Provided
Number Of Devices: 4

Handle 0x002B, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: Node0_Dimm0
Bank Locator: Node0_Bank0
Type: DDR3
Type Detail: Unbuffered (Unregistered)
Speed: 1333 MHz
Manufacturer: Kingston
Serial Number: 6A2AF2C5
Asset Tag: Dimm0_AssetTag
Part Number: KHX2133C11D3/
Rank: 2
Configured Clock Speed: 1333 MHz

As a final note, I will test the same settings on a server with dual Xeon
E5-2630 v4 CPUs and 64 GB (4 pcs x 16 GB) of DDR4 2133 MHz RAM. I will report
those results as well.
Ali KIZIL
2016-09-01 11:00:33 UTC
Permalink
Hi Oliver,

I just set my DDR3 RAM speed to 2133 MHz on the i7-4960X server. It doesn't
make much difference. FPS still fluctuates around 41-44 fps for UHD P010LE
HEVC Main 10 encoding.

Also, rawvideo P010LE encoding fluctuates around 39-42 fps. For your
information: while FPS fluctuates between 39 and 42 fps for YUV420P to P010LE,
YUV420P to YUV420P10LE runs at about 75-76 fps:

Stream #0:0: Video: rawvideo, 1 reference frame (Y3[11][10] /
0xA0B3359), yuv420p10le, 3840x2160 [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 60
fps, 61440 tbn, 60 tbc (default)
Metadata:
creation_time : 2013-12-17T16:40:26.000000Z
X-Language : und
handler_name : GPAC ISO Video Handler
encoder : Lavc57.54.101 rawvideo
Stream #0:1: Audio: pcm_s16le (PSD[16] / 0x10445350), 48000 Hz,
5.1(side), s16, 4608 kb/s (default)
Metadata:
creation_time : 2013-12-17T16:40:28.000000Z
X-Language : und
handler_name : GPAC ISO Audio Handler
encoder : Lavc57.54.101 pcm_s16le
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> rawvideo (native))
Stream #0:2 -> #0:1 (ac3 (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
[h264 @ 0x1c4e940] Reinit context to 3840x2160, pix_fmt: yuv420p
frame= 2289 fps= 76 q=-0.0 size=55279649kB time=00:00:38.16
bitrate=11865083.7kbits/s speed=1.26x

So the bottleneck for P010LE encoding is not the RAM. I will run the tests
on a more powerful server, as I mentioned before (dual Xeon E5-2630 v4
with 64 GB of DDR4 RAM).
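
A back-of-the-envelope estimate points the same way. The sketch below computes the size of one tightly packed 4:2:0 frame and the resulting output rate; the helper names are hypothetical and row padding is ignored. A UHD P010 frame comes to 24,883,200 bytes, so 50 fps means roughly 1.24 GB/s of converted output, which is small next to the theoretical peak of about 10.6 GB/s for a single DDR3-1333 channel. This counts only the written frames, not the source reads or any other memory traffic, so treat it as a rough lower bound rather than a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes in one tightly packed 4:2:0 frame.
 * Luma plus two quarter-size chroma planes gives 1.5 samples per pixel. */
static uint64_t frame_bytes_420(uint64_t width, uint64_t height,
                                uint64_t bytes_per_sample)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * 3 / 2 * bytes_per_sample;
}

/* Hypothetical helper: output bytes per second at a given frame rate. */
static uint64_t output_bytes_per_second(uint64_t frame_bytes, uint64_t fps)
{
    return frame_bytes * fps;
}
```

Calling `frame_bytes_420(3840, 2160, 2)` gives the P010 or yuv420p10le frame size, and passing 1 as the last argument gives the 8-bit yuv420p size.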

And is it mandatory for the Nvidia SDK to use P010LE for 10-bit HEVC
encoding?

Regards,
Oliver Collyer
2016-09-01 11:27:10 UTC
Permalink
Hi Ali
Post by Ali KIZIL
Also, rawvideo P010LE encoding fluctuates around 39-42 fps. For your
information: while FPS fluctuates between 39 and 42 fps for YUV420P to P010LE,
YUV420P to YUV420P10LE runs at about 75-76 fps.
At a guess this is because YUV420P to P010 involves interleaving the UV planes, which is slightly more complex?

Seems a pretty big difference with your results though. Hopefully someone more knowledgeable can comment.
Post by Ali KIZIL
And is it mandatory for the Nvidia SDK to use P010LE for 10-bit HEVC
encoding?
Here is what is defined in the SDK for 10-bit encoding formats:

/**
 * Input buffer formats
 */
typedef enum _NV_ENC_BUFFER_FORMAT
{
    NV_ENC_BUFFER_FORMAT_UNDEFINED    = 0x00000000, /**< Undefined buffer format */

    NV_ENC_BUFFER_FORMAT_NV12         = 0x00000001, /**< Semi-Planar YUV [Y plane followed by interleaved UV plane] */
    NV_ENC_BUFFER_FORMAT_YV12         = 0x00000010, /**< Planar YUV [Y plane followed by V and U planes] */
    NV_ENC_BUFFER_FORMAT_IYUV         = 0x00000100, /**< Planar YUV [Y plane followed by U and V planes] */
    NV_ENC_BUFFER_FORMAT_YUV444       = 0x00001000, /**< Planar YUV [Y plane followed by U and V planes] */
    NV_ENC_BUFFER_FORMAT_YUV420_10BIT = 0x00010000, /**< 10 bit Semi-Planar YUV [Y plane followed by interleaved UV plane]. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data. */
    NV_ENC_BUFFER_FORMAT_YUV444_10BIT = 0x00100000, /**< 10 bit Planar YUV444 [Y plane followed by U and V planes]. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data. */
    NV_ENC_BUFFER_FORMAT_ARGB         = 0x01000000, /**< 8 bit Packed A8R8G8B8 */
    NV_ENC_BUFFER_FORMAT_ARGB10       = 0x02000000, /**< 10 bit Packed A2R10G10B10. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data. */
    NV_ENC_BUFFER_FORMAT_AYUV         = 0x04000000, /**< 8 bit Packed A8Y8U8V8 */
    NV_ENC_BUFFER_FORMAT_ABGR         = 0x10000000, /**< 8 bit Packed A8B8G8R8 */
    NV_ENC_BUFFER_FORMAT_ABGR10       = 0x20000000, /**< 10 bit Packed A2B10G10R10. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data. */
} NV_ENC_BUFFER_FORMAT;


Of the 10-bit formats, only NV_ENC_BUFFER_FORMAT_YUV420_10BIT and NV_ENC_BUFFER_FORMAT_YUV444_10BIT are currently implemented in FFmpeg.

Also NV_ENC_BUFFER_FORMAT_YUV420_10BIT above is actually P010 according to the description (i.e. 10-bit samples in the most significant bits of 16-bit words, with interleaved UV planes).
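
To make that description concrete, here is a minimal C sketch of how samples sit in a P010 buffer. It assumes a tightly packed frame with no row padding and even dimensions; the function names are hypothetical and are not part of the NVENC SDK or FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 10-bit sample in the most significant 10 bits of a 16-bit word. */
static uint16_t p010_pack(uint16_t sample10)
{
    assert(sample10 < 1024);
    return (uint16_t)(sample10 << 6);
}

/* Recover the 10-bit sample from a P010 word (the low 6 bits are padding). */
static uint16_t p010_unpack(uint16_t word)
{
    return (uint16_t)(word >> 6);
}

/* Byte offset of the luma sample at (x, y); each sample takes 2 bytes. */
static size_t p010_luma_offset(size_t width, size_t x, size_t y)
{
    return (y * width + x) * 2;
}

/* Byte offset of the U sample at chroma position (cx, cy).
 * The interleaved UV plane follows the Y plane, and each chroma row holds
 * width/2 U/V pairs, i.e. `width` 16-bit samples. V sits 2 bytes after U. */
static size_t p010_u_offset(size_t width, size_t height, size_t cx, size_t cy)
{
    size_t luma_bytes = width * height * 2;
    assert(width % 2 == 0 && height % 2 == 0);
    return luma_bytes + (cy * width + 2 * cx) * 2;
}
```

A planar format such as yuv420p10le differs in both respects: the samples sit in the low 10 bits, and U and V live in separate planes.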

So if you want to avoid P010 then you can try encoding with YUV444 10 bit by using the FFmpeg pixel format yuv444p16; what sort of encoding speeds do you get with this?
Ronald S. Bultje
2016-09-01 11:23:20 UTC
Permalink
Hi,
Post by Ali KIZIL
Also, rawvideo P010LE encoding fluctuates around 39-42 fps. For your
information: while FPS fluctuates between 39 and 42 fps for YUV420P to P010LE,
YUV420P to YUV420P10LE runs at about 75-76 fps.
I think this is expected, the p010le conversion is C (no SIMD). The
yuv420p10le conversion is using x86 SIMD (probably AVX).

To fix this, add x86 SIMD implementations of the p010le conversions in
swscale. Better yet, add direct conversions from yuv420p10 (which I assume
is the internal format of your actual source after decoding?) to p010le,
first C and then later x86 SIMD.
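
As an illustration only, a direct yuv420p10 to P010 conversion in plain C could look roughly like the sketch below. It assumes host-endian 16-bit samples (so it matches the "le" formats only on a little-endian machine), even dimensions, and strides counted in samples rather than bytes. The function name and signature are hypothetical; this is not swscale code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direct converter: planar 10-bit 4:2:0 -> semi-planar P010.
 * Each sample moves from the low 10 bits to the high 10 bits (<< 6), and
 * the separate U and V planes are interleaved into one UV plane. */
static void yuv420p10_to_p010(const uint16_t *src_y, const uint16_t *src_u,
                              const uint16_t *src_v,
                              size_t src_y_stride, size_t src_c_stride,
                              uint16_t *dst_y, uint16_t *dst_uv,
                              size_t dst_y_stride, size_t dst_uv_stride,
                              size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    /* Luma: one shift per sample, no reordering. */
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            dst_y[y * dst_y_stride + x] =
                (uint16_t)(src_y[y * src_y_stride + x] << 6);

    /* Chroma: write U and V alternately into the shared plane. */
    for (size_t y = 0; y < height / 2; y++) {
        for (size_t x = 0; x < width / 2; x++) {
            dst_uv[y * dst_uv_stride + 2 * x] =
                (uint16_t)(src_u[y * src_c_stride + x] << 6);
            dst_uv[y * dst_uv_stride + 2 * x + 1] =
                (uint16_t)(src_v[y * src_c_stride + x] << 6);
        }
    }
}
```

The per-sample work is a single shift and store, which is why this kind of loop is a natural candidate for SIMD later.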

I have no idea why you would want to convert from yuv420p to p010le or
yuv420p10le. I understand swscale supports it (it should) but I doubt
that's how you want to generate 10 bits content.

Ronald
Timo Rothenpieler
2016-09-01 11:34:03 UTC
Permalink
Post by Ronald S. Bultje
I think this is expected, the p010le conversion is C (no SIMD). The
yuv420p10le conversion is using x86 SIMD (probably AVX).
To fix this, add x86 SIMD implementations of the p010le conversions in
swscale. Better yet, add direct conversions from yuv420p10 (which I assume
is the internal format of your actual source after decoding?) to p010le,
first C and then later x86 SIMD.
I think 40-50 FPS is quite a nice result for UHD with the plain stupid C
implementation.

Also, isn't the internal representation of YUV 10bit in swscale
essentially yuv420p10 anyway, so the conversion already is as direct as
it gets?
Post by Ronald S. Bultje
I have no idea why you would want to convert from yuv420p to p010le or
yuv420p10le. I understand swscale supports it (it should) but I doubt
that's how you want to generate 10 bits content.
P010 is the only YUV420 10bit format NVENC supports.
Ronald S. Bultje
2016-09-01 11:44:09 UTC
Permalink
Hi Timo,
Post by Timo Rothenpieler
I think 40-50 FPS is quite a nice result for UHD with the plain stupid C
implementation.
I agree. I didn't mean to offend you for writing bad C code, or for not
writing SIMD code. I simply meant to point out that if you want to go from
40-50fps to 100+fps, SIMD is probably the easiest way to move in that
direction.

Post by Timo Rothenpieler
Also, isn't the internal representation of YUV 10bit in swscale
essentially yuv420p10 anyway, so the conversion already is as direct as
it gets?
There is probably no conversion at all, right. But given that there's also
a video being decoded, which is much more CPU-intensive than colorspace
conversion, you wouldn't expect the colorspace conversion to slow it down
by >2x. (Unless it's C, of course. :-).)
Post by Timo Rothenpieler
P010 is the only YUV420 10bit format NVENC supports.
His source in the given example was yuv420p. If your source is 8bit, encode
8bits, not 10bits. For 10bit encoding, use 10bit source.

Right?

So even if this is only a performance test, we need to think about whether
the test tells us something meaningful. In particular, to repeat what I
said earlier, if the source is represented as yuv420p10le after decoding, a
direct yuv420p10le to p010le conversion in C and SIMD is probably going to
be even-more-efficient than a SIMD implementation of the p010le (or be)
input/output that you wrote earlier, since that's the "slow" conversion
path.

If this is confusing, poke me at VDD (QtCon) and I'll explain in more
detail.

Ronald
Oliver Collyer
2016-09-01 11:52:49 UTC
Permalink
Post by Ronald S. Bultje
His source in the given example was yuv420p. If your source is 8bit, encode
8bits, not 10bits. For 10bit encoding, use 10bit source.
Right?
When I did some tests of this a week or so ago I found that taking an 8-bit source, converting to 10-bit and encoding as 10-bit could actually save space. I posted my results to this list.

I tried it after reading this...

http://x264.nl/x264/10bit_02-ateme-why_does_10bit_save_bandwidth.pdf

…and was curious to see if it applied to NVENC HEVC.

I only tried one sample file, a yuv420p Slingbox capture but when I set global quality constant I saved a fair bit on the output file size.

Interestingly (or not) I couldn’t reproduce anything similar with x265 using a similar approach.
Timo Rothenpieler
2016-09-01 11:59:36 UTC
Permalink
Post by Ronald S. Bultje
I agree. I didn't mean to offend you for writing bad C code, or for not
writing SIMD code. I simply meant to point out that if you want to go from
40-50fps to 100+fps, SIMD is probably the easiest way to move in that
direction.
Didn't take it like that, was more a general remark.
The C implementation is as straightforward as it gets.
I wonder if rearranging the code could make it more efficient though.
Stuff like moving some if() checks out of the loop and duplicating the
loop instead, or other tricks that lead to gcc generating faster code.
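
The kind of rearrangement meant here (often called loop unswitching) can be sketched as below. This is a made-up example rather than the actual swscale loop: the function names, the `swap_bytes` flag and the byte-swap helper are all hypothetical. Compilers can often perform this transformation themselves at higher optimization levels, so any gain would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap the two bytes of a 16-bit word. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Straightforward version: the loop-invariant check runs for every sample. */
static void shift_row_branchy(const uint16_t *src, uint16_t *dst,
                              size_t n, int swap_bytes)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t v = (uint16_t)(src[i] << 6);
        if (swap_bytes)
            v = bswap16(v);
        dst[i] = v;
    }
}

/* Unswitched version: test the flag once and duplicate the loop instead. */
static void shift_row_unswitched(const uint16_t *src, uint16_t *dst,
                                 size_t n, int swap_bytes)
{
    assert(src != NULL && dst != NULL);
    if (swap_bytes) {
        for (size_t i = 0; i < n; i++)
            dst[i] = bswap16((uint16_t)(src[i] << 6));
    } else {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint16_t)(src[i] << 6);
    }
}
```

Both functions produce identical output; only the placement of the branch differs.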
Oliver Collyer
2016-09-01 12:08:16 UTC
Permalink
Post by Timo Rothenpieler
Didn't take it like that, was more a general remark.
The C implementation is as straight forward as it gets.
I wonder if re-arranging the code, could make it more efficient though.
Stuff like moving some if() checks out of the loop, and duplicating the
loop instead, or other tricks that lead to gcc generating faster code.
I’m not sure it’ll make much difference. You may recall my original patch had code in nvenc.c that took a YUV420P input and converted it to P010 as it fed the frames into the encoder. Out of curiosity I did some quick testing of this versus the code that has since been added in swscale to support P010 conversions and could find no difference in the time it took to encode my 60s sample. It was not an exhaustive test by any means, but if there were any obvious inefficiency in the swscale code I’d have expected to see some difference. I tested my sample three times with each version of the code, and the time taken to encode was virtually identical every time.

Oliver
Ronald S. Bultje
2016-09-01 15:08:36 UTC
Permalink
Hi Timo,
Post by Timo Rothenpieler
Didn't take it like that, was more a general remark.
The C implementation is as straight forward as it gets.
I wonder if re-arranging the code, could make it more efficient though.
Stuff like moving some if() checks out of the loop, and duplicating the
loop instead, or other tricks that lead to gcc generating faster code.
So, partially. I just saw your other patch, and it indeed does very little,
but you'll still be able to get some speedups out of SIMD. SIMD is simply
faster because it allows you to do 8 or so pixels per
iteration-of-instructions (instead of just 1).
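
To illustrate the "8 pixels per iteration" point, here is a small sketch using SSE2 intrinsics, where one 128-bit register holds eight 16-bit samples. It is only an illustration of the idea: the function name is hypothetical, intrinsics are used here for brevity rather than hand-written assembly, and the code falls back to a scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical kernel: move n 10-bit samples into the top 10 bits of
 * 16-bit words (the per-sample operation of a yuv420p10 -> P010 luma copy). */
static void shift_left6(const uint16_t *src, uint16_t *dst, size_t n)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#if defined(__SSE2__)
    /* One load, one shift and one store handle 8 samples at a time. */
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        v = _mm_slli_epi16(v, 6);
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
#endif
    /* Scalar tail (and full fallback without SSE2): 1 sample per iteration. */
    for (; i < n; i++)
        dst[i] = (uint16_t)(src[i] << 6);
}
```

Wider registers (AVX2, for example) would process 16 samples per iteration with the same structure.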

If you're wondering how to get started with SIMD in ffmpeg, I highly
recommend x264 asm intro:
https://wiki.videolan.org/X264_asm_intro/

Ronald
Hendrik Leppkes
2016-09-01 12:52:13 UTC
Permalink
Post by Timo Rothenpieler
Also, isn't the internal representation of YUV 10bit in swscale
essentially yuv420p10 anyway, so the conversion already is as direct as
it gets?
The "generic" step using the internal format is still slower than
using a "special" converter that directly converts the input to the
output without the generic intermediate step.
This would probably be relatively easy to build for yuv420p10le ->
p010le and save some performance.

- Hendrik
Ali KIZIL
2016-09-01 11:54:31 UTC
Permalink
Hi All,

I want to give answers to some questions:

1) @Oliver, thank you for the explanations. I tried yuv444p16le; fps is a bit
lower, at 32-34 fps. Here is a short log:

Stream #0:0(und): Video: hevc (nvenc_hevc) (Main 10), 1 reference
frame, yuv444p16le, 3840x2160 [SAR 1:1 DAR 16:9], q=-1--1, 28000 kb/s, 60
fps, 90k tbn, 60 tbc (default)
Metadata:
creation_time : 2013-12-17T16:40:26.000000Z
handler_name : GPAC ISO Video Handler
encoder : Lavc57.54.101 nvenc_hevc
Side data:
cpb: bitrate max/min/avg: 28000000/0/28000000 buffer size: 28000000
vbv_delay: -1
Stream #0:1(und): Audio: mp2, 48000 Hz, stereo, s16, delay 481, padding
0, 384 kb/s (default)
Metadata:
creation_time : 2013-12-17T16:40:28.000000Z
handler_name : GPAC ISO Audio Handler
encoder : Lavc57.54.101 mp2
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> hevc (nvenc_hevc))
Stream #0:2 -> #0:1 (ac3 (native) -> mp2 (native))
Press [q] to stop, [?] for help
[h264 @ 0x1549cc0] Reinit context to 3840x2160, pix_fmt: yuv420p
[AVBSFContext @ 0x15da860] The input looks like it is Annex B already
frame= 1225 fps= 34 q=16.0 Lsize= 74899kB time=00:00:20.43
bitrate=30027.8kbits/s speed=0.565x

2) @Timo, I just wanted to share my test results on this work, to see whether
we can increase performance by removing the bottleneck. If you aim to do
real-time UHD HEVC 10-bit encoding on Nvidia Pascal GPUs, 50 fps needs to be
reached, since that is where the standard currently stands.

3) @Ronald, you are totally right: 8-bit to 10-bit conversion makes no
sense. I did it because the sample I had at hand was like that. Now I have
run the same test from YUV420P10LE to P010LE, as below. FPS fluctuates around
41-42 fps:

***@kizil105:/opt/ffmpeg# /opt/ffmpeg/bin/ffmpeg -loglevel verbose
-ignore_unknown -probesize 100000000 -async 1 -thread_queue_size 2048
-err_detect compliant -i
/media/usb1/4K_TS/SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv
-c:v:0 rawvideo -c:a:0 pcm_s16le -f nut -pix_fmt p010le -y /dev/null
ffmpeg version N-81508-g99882d0 Copyright (c) 2000-2016 the FFmpeg
developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
configuration: --prefix=/opt/ffmpeg --enable-shared --enable-static
--enable-nonfree --enable-gpl --extra-cflags='-I/opt/ffmpeg/include
-I/usr/local/include' --extra-ldflags=-L/opt/ffmpeg/lib
--bindir=/opt/ffmpeg/bin --extra-libs=-ldl --enable-libx264
--enable-libx265 --enable-nonfree --enable-gpl --enable-nvenc
--enable-vdpau --enable-libzvbi --enable-libfdk-aac --enable-libzimg
--enable-avresample --enable-libnpp --enable-cuda
libavutil 55. 29.100 / 55. 29.100
libavcodec 57. 54.101 / 57. 54.101
libavformat 57. 48.101 / 57. 48.101
libavdevice 57. 0.102 / 57. 0.102
libavfilter 6. 58.100 / 6. 58.100
libavresample 3. 0. 0 / 3. 0. 0
libswscale 4. 1.100 / 4. 1.100
libswresample 2. 1.100 / 2. 1.100
libpostproc 54. 0.100 / 54. 0.100
Routing option err_detect to both codec and muxer layer
Input #0, matroska,webm, from
'/media/usb1/4K_TS/SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv':
Metadata:
encoder : libebml v1.3.1 + libmatroska v1.4.2
creation_time : 2015-10-03T13:49:42.000000Z
Duration: 00:01:49.29, start: 0.816000, bitrate: 18484 kb/s
Stream #0:0: Video: hevc (Main 10), 1 reference frame, yuv420p10le(tv),
3840x2160 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 1k tbn, 60 tbc (default)
Metadata:
BPS : 18497251
BPS-eng : 18497251
DURATION : 00:01:48.450000000
DURATION-eng : 00:01:48.450000000
NUMBER_OF_FRAMES: 6507
NUMBER_OF_FRAMES-eng: 6507
NUMBER_OF_BYTES : 250753360
NUMBER_OF_BYTES-eng: 250753360
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die')
64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That I
Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp (default)
Metadata:
BPS : 124607
BPS-eng : 124607
DURATION : 00:01:49.267000000
DURATION-eng : 00:01:49.267000000
NUMBER_OF_FRAMES: 4669
NUMBER_OF_FRAMES-eng: 4669
NUMBER_OF_BYTES : 1701940
NUMBER_OF_BYTES-eng: 1701940
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die')
64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That I
Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
[graph 0 input from stream 0:0 @ 0x20822a0] w:3840 h:2160
pixfmt:yuv420p10le tb:1/1000 fr:60/1 sar:1/1 sws_param:flags=2
[auto-inserted scaler 0 @ 0x20836a0] w:iw h:ih flags:'bicubic' interl:0
[format @ 0x2082880] auto-inserting filter 'auto-inserted scaler 0' between
the filter 'Parsed_null_0' and the filter 'format'
[auto-inserted scaler 0 @ 0x20836a0] w:3840 h:2160 fmt:yuv420p10le sar:1/1
-> w:3840 h:2160 fmt:p010le sar:1/1 flags:0x4
[graph 1 input from stream 0:1 @ 0x2076080] tb:1/44100 samplefmt:fltp
samplerate:44100 chlayout:0x3
-async is forwarded to lavfi similarly to -af
aresample=async=1:min_hard_comp=0.100000:first_pts=0.
[graph 1 aresample for input stream 0:1 @ 0x20b4460] ch:2 chl:stereo
fmt:fltp r:44100Hz -> ch:2 chl:stereo fmt:s16 r:44100Hz
[nut @ 0x208f9a0] Using AVStream.codec to pass codec parameters to muxers
is deprecated, use AVStream.codecpar instead.
Last message repeated 1 times
Output #0, nut, to '/dev/null':
Metadata:
encoder : Lavf57.48.101
Stream #0:0: Video: rawvideo, 1 reference frame (RGB[15] / 0xF424752),
p010le, 3840x2160 [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 60 fps, 61440 tbn,
60 tbc (default)
Metadata:
BPS : 18497251
BPS-eng : 18497251
DURATION : 00:01:48.450000000
DURATION-eng : 00:01:48.450000000
NUMBER_OF_FRAMES: 6507
NUMBER_OF_FRAMES-eng: 6507
NUMBER_OF_BYTES : 250753360
NUMBER_OF_BYTES-eng: 250753360
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die')
64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That I
Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
encoder : Lavc57.54.101 rawvideo
Stream #0:1: Audio: pcm_s16le (PSD[16] / 0x10445350), 44100 Hz, stereo,
s16, 1411 kb/s (default)
Metadata:
BPS : 124607
BPS-eng : 124607
DURATION : 00:01:49.267000000
DURATION-eng : 00:01:49.267000000
NUMBER_OF_FRAMES: 4669
NUMBER_OF_FRAMES-eng: 4669
NUMBER_OF_BYTES : 1701940
NUMBER_OF_BYTES-eng: 1701940
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die')
64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That I
Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
encoder : Lavc57.54.101 pcm_s16le
Stream mapping:
Stream #0:0 -> #0:0 (hevc (native) -> rawvideo (native))
Stream #0:1 -> #0:1 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
[graph 1 aresample for input stream 0:1 @ 0x20b4460] [SWR @ 0x20efe60]
adding 1014 audio samples of silence
frame= 603 fps= 41 q=-0.0 Lsize=14654712kB time=00:00:10.30
bitrate=11644810.9kbits/s speed=0.702x
video:14652900kB audio:1776kB subtitle:0kB other streams:0kB global
headers:0kB muxing overhead: 0.000243%
Input file #0
(/media/usb1/4K_TS/SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv):
Input stream #0:0 (video): 617 packets read (21889901 bytes); 604 frames
decoded;
Input stream #0:1 (audio): 443 packets read (161636 bytes); 443 frames
decoded (453632 samples);
Total: 1060 packets (22051537 bytes) demuxed
Output file #0 (/dev/null):
Output stream #0:0 (video): 603 frames encoded; 603 packets muxed
(15004569600 bytes);
Output stream #0:1 (audio): 443 frames encoded (454646 samples); 443
packets muxed (1818584 bytes);
Total: 1046 packets (15006388184 bytes) muxed
Exiting normally, received signal 2.

(Sorry for cutting the earlier discussion text; I wanted to keep this
email from getting too long.)
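
As a sanity check on the log above: the muxed video byte count is consistent with the P010 layout, in which every sample occupies two bytes and the two chroma planes are stored interleaved at half resolution. The sketch below is only an illustration; the helper names are made up here and are not FFmpeg functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes in one P010 frame. The frame holds a
 * full-resolution luma plane plus one interleaved half-resolution UV
 * plane, with two bytes per sample. */
static uint64_t p010_frame_bytes(uint32_t width, uint32_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);   /* 4:2:0 needs even dimensions */
    uint64_t luma   = (uint64_t)width * height;                 /* Y samples     */
    uint64_t chroma = 2 * (uint64_t)(width / 2) * (height / 2); /* U + V samples */
    return 2 * (luma + chroma);
}

/* Hypothetical helper: bytes per second the converter has to write
 * at a given frame rate. */
static uint64_t p010_write_rate(uint32_t width, uint32_t height, uint32_t fps)
{
    return p010_frame_bytes(width, height) * fps;
}
```

For 3840x2160 this gives 24883200 bytes per frame, and 603 frames come to 15004569600 bytes, which is the figure reported for output stream #0:0. At about 41 fps the conversion writes roughly 1 GB per second, before counting the reads from the source planes.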
Andy Furniss
2016-09-01 17:22:59 UTC
Permalink
Post by Ali KIZIL
frame= 603 fps= 41 q=-0.0 Lsize=14654712kB time=00:00:10.30
Random thoughts from benching ffmpeg but not with nvenc.

For short tests fps may read low as (IME) it takes time to converge with
reality.

Maybe use time ffmpeg .... as a double check.

-f null on my old box is faster than -y /dev/null.

Slight chance forcing CPUs to perf will do better than cpufreq on_demand.
Ali KIZIL
2016-09-01 12:36:57 UTC
Permalink
Hi Timo,

On Thu, Sep 1, 2016 at 7:34 AM, Timo Rothenpieler <timo at rothenpieler.org>

Hi,

Hi Oliver,

I just set my DDR3 RAM speed to 2133 MHz on the i7 4960X server. It
doesn't make much difference. FPS still fluctuates between 41-44 fps
for UHD P010LE HEVC Main 10 encoding.

Also, rawvideo P010LE encoding fluctuates between 39-42 fps. For your
note: while FPS fluctuates between 39-42 fps for YUV420P to P010LE,
YUV420P to YUV420P10LE fps is

I think this is expected, the p010le conversion is C (no SIMD). The
yuv420p10le conversion is using x86 SIMD (probably AVX).

To fix this, add x86 SIMD implementations of the p010le conversions in
swscale. Better yet, add direct conversions from yuv420p10 (which I
assume is the internal format of your actual source after decoding?)
to p010le, first C and then later x86 SIMD.

I think 40-50 FPS is quite a nice result for UHD with the plain stupid
C implementation.

I agree. I didn't mean to offend you for writing bad C code, or for
not writing SIMD code. I simply meant to point out that if you want to
go from 40-50fps to 100+fps, SIMD is probably the easiest way to move
in that direction.

Didn't take it like that, was more a general remark.
The C implementation is as straightforward as it gets.
I wonder if re-arranging the code could make it more efficient though.
Stuff like moving some if() checks out of the loop, and duplicating the
loop instead, or other tricks that lead to gcc generating faster code.
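
To illustrate the "move the if() out of the loop and duplicate the loop" idea mentioned above, here is a minimal sketch. It is not the swscale code; the function names and the endianness flag are invented for the example. Both versions store 16-bit samples as little- or big-endian bytes and produce identical output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch inside the loop: the endianness test is evaluated per sample. */
static void store16_branchy(uint8_t *dst, const uint16_t *src,
                            size_t n, int big_endian)
{
    assert(dst && src);
    for (size_t i = 0; i < n; i++) {
        if (big_endian) {
            dst[2 * i]     = (uint8_t)(src[i] >> 8);
            dst[2 * i + 1] = (uint8_t)(src[i] & 0xff);
        } else {
            dst[2 * i]     = (uint8_t)(src[i] & 0xff);
            dst[2 * i + 1] = (uint8_t)(src[i] >> 8);
        }
    }
}

/* Branch hoisted: the test runs once and the loop body is duplicated. */
static void store16_hoisted(uint8_t *dst, const uint16_t *src,
                            size_t n, int big_endian)
{
    assert(dst && src);
    if (big_endian) {
        for (size_t i = 0; i < n; i++) {
            dst[2 * i]     = (uint8_t)(src[i] >> 8);
            dst[2 * i + 1] = (uint8_t)(src[i] & 0xff);
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            dst[2 * i]     = (uint8_t)(src[i] & 0xff);
            dst[2 * i + 1] = (uint8_t)(src[i] >> 8);
        }
    }
}
```

A compiler may already perform this transformation on its own at higher optimisation levels, so whether the manual version is any faster would have to be measured.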

I’m not sure it’ll make much difference - you may recall my original
patch had code in nvenc.c that took a YUV420P input and converted it
to P010 as it fed the frames into the encoder. Out of curiosity I did
some quick testing of this versus the code that has since been added
in swscale to support P010 conversions and could find no difference in
the time it took to encode my 60s sample. Not an exhaustive test by
any means, but if there was any obvious inefficiency in the swscale
code then I’d have expected to see some difference but I tested my
sample three times with each version of the code and the time taken to
encode was virtually identical every time.

Oliver

Hi Oliver,

I followed your comment and tried your original patch. It works much,
much better. FPS goes up to 88-92 fps for UHD HEVC Main10
YUV420P10LE.

I attached the nvenc.c file for you to check as well (I just added the
two conversion functions and changed the YUV420P10LE pixel format
selection part).


ffmpeg version N-81508-g99882d0 Copyright (c) 2000-2016 the FFmpeg developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
configuration: --prefix=/opt/ffmpeg --enable-shared --enable-static
--enable-nonfree --enable-gpl --extra-cflags='-I/opt/ffmpeg/include
-I/usr/local/include' --extra-ldflags=-L/opt/ffmpeg/lib
--bindir=/opt/ffmpeg/bin --extra-libs=-ldl --enable-libx264
--enable-libx265 --enable-nonfree --enable-gpl --enable-nvenc
--enable-vdpau --enable-libzvbi --enable-libfdk-aac --enable-libzimg
--enable-avresample --enable-libnpp --enable-cuda
libavutil 55. 29.100 / 55. 29.100
libavcodec 57. 54.101 / 57. 54.101
libavformat 57. 48.101 / 57. 48.101
libavdevice 57. 0.102 / 57. 0.102
libavfilter 6. 58.100 / 6. 58.100
libavresample 3. 0. 0 / 3. 0. 0
libswscale 4. 1.100 / 4. 1.100
libswresample 2. 1.100 / 2. 1.100
libpostproc 54. 0.100 / 54. 0.100
Routing option err_detect to both codec and muxer layer
Input #0, matroska,webm, from
'/media/usb1/4K_TS/SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv':
Metadata:
encoder : libebml v1.3.1 + libmatroska v1.4.2
creation_time : 2015-10-03T13:49:42.000000Z
Duration: 00:01:49.29, start: 0.816000, bitrate: 18484 kb/s
Stream #0:0: Video: hevc (Main 10), 1 reference frame,
yuv420p10le(tv), 3840x2160 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 1k tbn,
60 tbc (default)
Metadata:
BPS : 18497251
BPS-eng : 18497251
DURATION : 00:01:48.450000000
DURATION-eng : 00:01:48.450000000
NUMBER_OF_FRAMES: 6507
NUMBER_OF_FRAMES-eng: 6507
NUMBER_OF_BYTES : 250753360
NUMBER_OF_BYTES-eng: 250753360
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die') 64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That
I Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp (default)
Metadata:
BPS : 124607
BPS-eng : 124607
DURATION : 00:01:49.267000000
DURATION-eng : 00:01:49.267000000
NUMBER_OF_FRAMES: 4669
NUMBER_OF_FRAMES-eng: 4669
NUMBER_OF_BYTES : 1701940
NUMBER_OF_BYTES-eng: 1701940
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die') 64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That
I Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
[graph 0 input from stream 0:0 @ 0x17a00a0] w:3840 h:2160
pixfmt:yuv420p10le tb:1/1000 fr:60/1 sar:1/1 sws_param:flags=2
[scaler for output stream 0:0 @ 0x17cf980] w:3840 h:2160
flags:'bicubic' interl:0
[scaler for output stream 0:0 @ 0x17cf980] w:3840 h:2160
fmt:yuv420p10le sar:1/1 -> w:3840 h:2160 fmt:yuv420p10le sar:1/1
flags:0x4
[graph 1 input from stream 0:1 @ 0x17d0140] tb:1/44100 samplefmt:fltp
samplerate:44100 chlayout:0x3
-async is forwarded to lavfi similarly to -af
aresample=async=1:min_hard_comp=0.100000:first_pts=0.
[graph 1 aresample for input stream 0:1 @ 0x17d0be0] ch:2 chl:stereo
fmt:fltp r:44100Hz -> ch:2 chl:stereo fmt:s16 r:44100Hz
[nvenc_hevc @ 0x17dca80] This encoder is deprecated, use 'hevc_nvenc' instead
[nvenc_hevc @ 0x17dca80] Loaded Nvenc version 7.0
[nvenc_hevc @ 0x17dca80] Nvenc initialized successfully
[nvenc_hevc @ 0x17dca80] 1 CUDA capable devices found
[nvenc_hevc @ 0x17dca80] [ GPU #0 - < TITAN X (Pascal) > has Compute SM 6.1 ]
[nvenc_hevc @ 0x17dca80] supports NVENC
[mpegts @ 0x17e78c0] Using AVStream.codec to pass codec parameters to
muxers is deprecated, use AVStream.codecpar instead.
Last message repeated 1 times
[mpegts @ 0x17e78c0] muxrate 30000000, pcr every 398 pkts, sdt every
9973, pat/pmt every 1994 pkts
Output #0, mpegts, to '/tmp/test1.ts':
Metadata:
service_name : PikoEncoder
service_provider: PikoEncoder
encoder : Lavf57.48.101
Stream #0:0: Video: hevc (nvenc_hevc) (Main 10), 1 reference
frame, yuv420p10le, 3840x2160 [SAR 1:1 DAR 16:9], q=-1--1, 28000 kb/s,
60 fps, 90k tbn, 60 tbc (default)
Metadata:
BPS : 18497251
BPS-eng : 18497251
DURATION : 00:01:48.450000000
DURATION-eng : 00:01:48.450000000
NUMBER_OF_FRAMES: 6507
NUMBER_OF_FRAMES-eng: 6507
NUMBER_OF_BYTES : 250753360
NUMBER_OF_BYTES-eng: 250753360
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die') 64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That
I Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
encoder : Lavc57.54.101 nvenc_hevc
Side data:
cpb: bitrate max/min/avg: 28000000/0/28000000 buffer size:
28000000 vbv_delay: -1
Stream #0:1: Audio: mp2, 44100 Hz, stereo, s16, delay 481, padding
0, 384 kb/s (default)
Metadata:
BPS : 124607
BPS-eng : 124607
DURATION : 00:01:49.267000000
DURATION-eng : 00:01:49.267000000
NUMBER_OF_FRAMES: 4669
NUMBER_OF_FRAMES-eng: 4669
NUMBER_OF_BYTES : 1701940
NUMBER_OF_BYTES-eng: 1701940
_STATISTICS_WRITING_APP: mkvmerge v8.0.0 ('Til The Day That I Die') 64bit
_STATISTICS_WRITING_APP-eng: mkvmerge v8.0.0 ('Til The Day That
I Die') 64bit
_STATISTICS_WRITING_DATE_UTC: 2015-10-03 13:49:42
_STATISTICS_WRITING_DATE_UTC-eng: 2015-10-03 13:49:42
_STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
_STATISTICS_TAGS-eng: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
encoder : Lavc57.54.101 mp2
Stream mapping:
Stream #0:0 -> #0:0 (hevc (native) -> hevc (nvenc_hevc))
Stream #0:1 -> #0:1 (aac (native) -> mp2 (native))
Press [q] to stop, [?] for help
[graph 1 aresample for input stream 0:1 @ 0x17d0be0] [SWR @ 0x17ea060]
adding 1014 audio samples of silence
[AVBSFContext @ 0x4b67660] The input looks like it is Annex B already
frame= 818 fps= 89 q=16.0 Lsize= 50894kB time=00:00:13.91
bitrate=29968.0kbits/s speed=1.51x
video:19561kB audio:653kB subtitle:0kB other streams:0kB global
headers:0kB muxing overhead: 151.775177%
Input file #0 (/media/usb1/4K_TS/SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv):
Input stream #0:0 (video): 832 packets read (26952754 bytes); 819
frames decoded;
Input stream #0:1 (audio): 599 packets read (218347 bytes); 599
frames decoded (613376 samples);
Total: 1431 packets (27171101 bytes) demuxed
Output file #0 (/tmp/test1.ts):
Output stream #0:0 (video): 818 frames encoded; 818 packets muxed
(20030971 bytes);
Output stream #0:1 (audio): 533 frames encoded (614016 samples); 533
packets muxed (668316 bytes);
Total: 1351 packets (20699287 bytes) muxed
[nvenc_hevc @ 0x17dca80] Nvenc unloaded
Exiting normally, received signal 2.
Oliver Collyer
2016-09-01 13:01:23 UTC
Permalink
Post by Oliver Collyer
I’m not sure it’ll make much difference - you may recall my original
patch had code in nvenc.c that took a YUV420P input and converted it
to P010 as it fed the frames into the encoder. Out of curiosity I did
some quick testing of this versus the code that has since been added
in swscale to support P010 conversions and could find no difference in
the time it took to encode my 60s sample. Not an exhaustive test by
any means, but if there was any obvious inefficiency in the swscale
code then I’d have expected to see some difference but I tested my
sample three times with each version of the code and the time taken to
encode was virtually identical every time.
Oliver
Hi Oliver,
I followed your comment and tried your original patch. It works much,
much better. FPS goes up to 88-92 fps for UHD HEVC Main10
YUV420P10LE.
I attached the nvenc.c file for you to check as well (I just added the
two conversion functions and changed the YUV420P10LE pixel format
selection part).
Thanks Ali, that’s interesting as I’ve not been able to get any difference in my tests...unless I’ve made a mistake. I will try again to be sure. Maybe resolution becomes a factor, mine is a 1080i source.

It seems in your setup that allowing swscale to convert from yuv420p to yuv420p10 and then the code in nvenc.c to convert from yuv420p10 to p010 gives a better result, for whatever reason.
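
For readers unfamiliar with the formats, the yuv420p10 to P010 step amounts to shifting each 10-bit sample into the high bits of its 16-bit word and interleaving the U and V planes. The following is a rough sketch only, not the code from the patch or from swscale; the function name, the parameter layout, and the assumption of native-endian samples with strides counted in 16-bit elements are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter: planar 10-bit 4:2:0 (three planes, value in
 * the low 10 bits) to P010 (Y plane plus interleaved UV plane, value in
 * the high 10 bits). Strides are in uint16_t elements, not bytes. */
static void yuv420p10_to_p010(const uint16_t *src_y, size_t src_y_stride,
                              const uint16_t *src_u, const uint16_t *src_v,
                              size_t src_c_stride,
                              uint16_t *dst_y, size_t dst_y_stride,
                              uint16_t *dst_uv, size_t dst_uv_stride,
                              int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);

    /* Luma: same geometry, each sample moved up by 6 bits. */
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst_y[y * dst_y_stride + x] =
                (uint16_t)(src_y[y * src_y_stride + x] << 6);

    /* Chroma: half resolution in both directions, U and V interleaved. */
    for (int y = 0; y < height / 2; y++) {
        for (int x = 0; x < width / 2; x++) {
            dst_uv[y * dst_uv_stride + 2 * x] =
                (uint16_t)(src_u[y * src_c_stride + x] << 6);
            dst_uv[y * dst_uv_stride + 2 * x + 1] =
                (uint16_t)(src_v[y * src_c_stride + x] << 6);
        }
    }
}
```

The loops contain no branches and touch each sample once, so a step like this is mostly limited by memory bandwidth.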
Timo Rothenpieler
2016-09-01 15:04:38 UTC
Permalink
Can you test again with this patch applied:

https://github.com/BtbN/FFmpeg/commit/54cf5500720c9b701d4fe16c2c6ff2e3cc1508d7.patch