This article was written almost 2 years ago; its content may not reflect the latest state of the code which is currently available.
Please check https://net7mma.codeplex.com/ for the latest information and downloads.
https://net7mma.codeplex.com/SourceControl/latest
Preface
Before I start the main article, I will say that audio data seems to be fairly well covered in .NET: if a developer wants to perform tasks such as decoding, encoding and transcoding audio, there are definitely enough resources, both information and libraries with complete implementations, to do that. Things get utterly confusing when you start to deal with video - which is the real problem domain - both encoding and decoding it. This is for multiple reasons, the biggest myth being that managed code is simply not fast enough for working with video data.
The fact is that each codec is vastly different from the others, and the compression / decompression used in a codec is usually encumbered by some type of patent. This unfortunately means someone has to pay royalties for the code in use in that codec. The other problem is that the decoding of video is usually standardized, but the encoding is not. This means you are free to encode in any way you like as long as the end result conforms to the specification for decoding the stream. This gives developers a lot of freedom, but it also provides many ways to encode the data; some are more efficient than others, and the results vary both in the time taken and in the output produced.
There are a few libraries that can help, such as VLC or FFMPEG, both of which use LibAvCodec. However, that library is written in C++, and there are considerations to take into account with its license - not to mention it will introduce external dependencies and platform invocation into your code.
In short, if you need a quick analogy, you can compare video decoding and encoding to zipping and unzipping a file: when you decode you unzip the data, and when you encode you zip the data. It is compression and decompression; and just as Zip compression is not the same as Rar compression, MPEG4 compression is not H264 compression.
This is also true of audio encoding / decoding, for example the "mp3" vs "wav" formats. The big difference between audio and video is that audio data is far less complex, in the sense that there are far fewer bytes to parse. This introduces a separate problem for audio: a small 'glitch' or error in the data will result in a larger problem while decoding, whereas a 'glitch' or error in video data will usually NOT result in any substantial artifacts during playback.
If you are interested in the format of wave audio and how to manipulate it, check out this article on CodeProject, which has a bunch of great examples.
If you are interested in how audio and video data are similar yet different at a high level, check out this presentation.
Typically the biggest factor in decoding video for display on a computer screen is the color space conversion from YUV to RGB, which has to be performed for every pixel in the resulting video. At Quarter Common Intermediate Format, or more simply QCIF, resolution (which is 176x144), this calculation and conversion needs to be performed about 25,000 times per frame while decoding (once for every pixel, 176 * 144 = 25,344), or a bit less if you use a few tricks.
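To give a feel for the cost involved, here is a minimal sketch of that per-pixel conversion in C#, using a common integer approximation of the BT.601 equations. It is purely illustrative and not code from the library.

```csharp
// Illustrative sketch (not from the library): the per-pixel YUV (YCbCr, BT.601)
// to RGB conversion discussed above, using a common integer approximation.
static void YuvToRgb(byte y, byte u, byte v, out byte r, out byte g, out byte b)
{
    int c = y - 16, d = u - 128, e = v - 128;

    int rt = (298 * c + 409 * e + 128) >> 8;
    int gt = (298 * c - 100 * d - 208 * e + 128) >> 8;
    int bt = (298 * c + 516 * d + 128) >> 8;

    // Clamp each component to the valid 0 - 255 range.
    r = (byte)(rt < 0 ? 0 : rt > 255 ? 255 : rt);
    g = (byte)(gt < 0 ? 0 : gt > 255 ? 255 : gt);
    b = (byte)(bt < 0 ? 0 : bt > 255 ? 255 : bt);
}
```

Multiply that by every pixel of every frame and it becomes clear why the conversion dominates the cost of display.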
Single images are far less complex than a series of images or video, due to the fact that you have BMP, GIF, JPEG, PNG, etc. at your side to do the work for you. Each one has its own list of pros and cons, and each has a place and purpose.
The JPEG image format is not as encumbered by such intellectual property (anymore). It is also no longer the optimal way to transmit or even store pictures. Nonetheless, due to its wide adoption over the past 20-something years, Jpeg files are quite accessible in the sense that there are a lot of resources explaining how to work with Jpeg files and their contained data.
MPEG (among JPEG2000 and many other candidates) succeeds JPEG in the sense that it allows the data transmitted to be far smaller, and thus uses less bandwidth. This increases the complexity required to compress and finally view the image, which unfortunately results in higher CPU utilization if you want to convert the compressed data completely to a format like RGB. It is also encumbered by patents for the latest versions.
Typically conversion to RGB is performed because that is the format used for displaying images on a computer screen. In most set top boxes or televisions this is not the case: far less information is converted to RGB, and the data is simply rendered in the native format for that platform (e.g. YUV), which is why the processors of such devices can be far less powerful than a modern or even an old computer.
Some Graphics Processing Units (GPU), and even some modern Central Processing Units (CPU), can accelerate this processing by using advanced math and processor intrinsics known as SIMD, or through parallel execution. Others can work directly with YUV encoded data, or throw multiple cores at the decoding process using routines which have been highly optimized for such data; in some cases, specific codecs which require such advanced routines use specialized hardware. This is somewhat like how the networking stack in your own computer utilizes the NIC processor to perform checksum verification and other operations when allowed to by the operating system, and subsequently achieves better performance.
This article and library don't have much to do with encoding, decoding or transcoding, and they are not centered around audio or video, so let's find out exactly what this article is all about...
Introduction
This library provides packet-for-packet Rtp aggregation, agnostic of the underlying video or audio formats carried over Rtp. This means it does not depend on or expose details about the underlying media beyond what is obtained in the dialog necessary to play the media. It also makes it easy to derive from the given classes to support application specific features such as QOS and throttling, or other features which are required by your Video On Demand server.
It can be used to source a single (low bandwidth) media stream over Tcp/Udp to hundreds of users through the included RtspServer by aggregating the packets. (Rtcp packets are not aggregated; they are calculated independently and sent appropriately for each Rtsp/Rtp session in the server.) When I use the term aggregated I mean repeated (not forwarded) to another client, with the modifications necessary to relay the data to another EndPoint besides the one it was originally destined for.
It utilizes RFC2326 and RFC3550 compliant processes to provide this functionality among many others.
It can also be used if you want to broadcast from a device such as a Set Top Box without opening it up to the internet: connect to it through the LAN and establish a SourceMedia (at which point you could also add a password), and the server then broadcasts the stream on demand to anyone who connects.
This means that you can have many different types of source streams, such as JPEG, MPEG4, H263, H264, etc., and your clients will all receive the same video that the source stream is transmitting.
This also means that you can use popular tools such as FFMPEG, VLC, Quicktime, Live555, Darwin Streaming Media Server, etc. with this library's included RtspServer. This enables a developer to play back / transcode streams, save them to a file, or even extract frames using those tools against the included RtspServer, so you do not have to bog down the bandwidth or CPU of your actual source streams / devices.
You could also add a 3rd tier - for example, separate this process and do the work on a different server without having to worry about interop between libraries. Just use this library to create the RtspServer and source the stream. Additionally, from another process / server, use AForge or another wrapper library to communicate with the RtspServer. The RtspServer then communicates with the device. Run your processing and then create a RFC2435 frame for every image required in the stream. After you have completed that step, send it out via Rtp using the RtspServer. This approach also gives you added scaling and customization because your transcoding and transport are done on two separate servers. If that didn't make sense, hopefully this diagram will aid you in understanding.
Besides providing an RtspServer, the library also provides an RtspClient and an RtpClient, allowing a developer to connect to any RtspServer or EndPoint for which a SessionDescription is available, providing developers an easy way to consume a stream.
An Example in Brief
In the above diagram you will notice that the conference call has several participants on cell phones (some of them local callers). This is fine and usually requires no more than the hardware which is already present to send and receive the calls. However, if you add the interesting twist that remote users need to be able to view the call as well, you suddenly increase the complexity of the problem tenfold due to the bandwidth and processing requirements for the added remote users.
Immediately you think (or should think), "How can this small device handle more than the people who are already participating in this session?" The answer is this: the device's processor can only handle so many viewers per session, and only so much bandwidth is available for sending to participants - not to mention any end users...
You can replace the call medium in the above analogy with any other device, such as a web camera, and the topology still applies.
You may be thinking, "Why do I need to do this? My camera can already support X number of users..." I would reply that even if your current needs are met today, tomorrow you may suddenly grow exponentially and your user requirements will change; it is better to have more padding than not - especially when delivering media to end users.
This is where the RtspServer comes in: it provides a free, fast, and standards compliant implementation which allows you to repeat the call to the public (optionally adding a password), and it would also allow outside users to participate in the session if required.
Whereas a load balancer device would redirect network load, this software enables a server to act as a centralized source for consuming media and then reproduce it elsewhere, thus removing the load from the end device and allowing the processing to be aggregated across as many tiers as desired.
If you are saying, "I will never need to do anything like this, I will just use VLC or this software or that software," then this article is not for you.
You may also be thinking that you could write such a client / server yourself - as I said I could - and then all of a sudden realize there is a lot more to the standard(s) required to implement it than initially caught your eye... this article will probably help you.
You might also have come here because you have hit some type of barrier with another library and are hoping to find something more flexible to replace your current implementation. If this is the case, then this article will also help you!
Let's get some background on the problem domain...
Problem Domain
I needed to provide a way for multiple users to view Rtsp streams coming from a low bandwidth link such as modem or cell phone.
The sources already supported the Rtsp protocol, but when multiple users connected to view the resources the bandwidth was not sufficient to support them; for each user that connected to the camera the bandwidth was subsequently halved, and thus it was not sufficient to support more than 2 or 3 users at most - even with the quality settings tweaked.
The only solution was to aggregate the source media using a Media Server. The Media Server would have a better processor and more bandwidth to utilize, so the source stream would only have to be consumed by a single connection (the Media Server), and when clients wanted to consume the stream they would connect to the Media Server instead of the source media.
I researched around for a bit and found that there are other existing solutions such as DarwinStreamingServer, Live555, VideoLan or FFMPEG. Still, they are all written in C++ and are rather heavyweight for this type of project. Plus, they would require external dependencies from the managed code, which I did not really want, not to mention possible licensing issues involved with that scenario.
I then came up with a crazy idea to build my own RtspServer which would take in the source Rtsp streams and deliver the stream data to RtspClients using the RtspServer.
After all, I was familiar with socket communication and I had already built an HttpServer (among many others), so I knew I had the experience to do it. I just needed to actually start.
Before reading anything at all I searched around to see if there were any existing libraries out there that I could utilize and I found some partial implementations of Rtsp and Rtp, but nothing in C# that I could just take and use as it was... Fortunately, there were a few useful methods and concepts in the following libraries:
The problem with these implementations is that they are either not cross platform (they will not work on Linux), or they are made for a specific purpose and utilize a proprietary video codec, which makes them incompatible with the majority of standards based players, or they are made to only interface with specific systems, e.g. VoIP systems.
Having a task at hand and being confident in my abilities, I decided to go ahead and dissect the standards... I knew in advance that the video decoding and encoding would be the hardest part, only due to lack of domain experience, but I also realized that a good transport stack should be agnostic of such variances, and thus it was my duty to build a stack that would be reusable under all circumstances... One transport stack to rule them all! (Using 100% managed code, specifically C#.)
I read up on RFC2326, which describes the Real Time Streaming Protocol, or Rtsp. This is where everything starts; the Rtsp protocol's purpose is to get the details about the underlying media, such as the format and how it will be sent back to the client. It also enables the client to control the state of the stream, such as whether it is playing or recording.
It turns out Rtsp requests are similar to Http by design, but are not directly compatible with Http (unless tunneled over Http appropriately).
Take for example this 'OPTIONS' Rtsp request
C->S: OPTIONS rtsp://example.com/media.mp4 RTSP/1.0
      CSeq: 1
      Require: implicit-play
      Proxy-Require: gzipped-messages

S->C: RTSP/1.0 200 OK
      CSeq: 1
      Public: DESCRIBE, SETUP, TEARDOWN, PLAY, PAUSE
Some of the status codes may be familiar, as well as the way the data is formatted. You can see that this protocol is not much more difficult to work with than Http.
If you can implement an Http server then you can implement an Rtsp server; the main difference is that all Rtsp requests usually require some type of 'State', whereas some Http requests do not.
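To illustrate just how Http-like the exchange is, here is a rough sketch (not the library's API) that sends the 'OPTIONS' request above over a raw TcpClient and prints whatever the server answers. The host name and port are just the example values from the request above.

```csharp
// Rough illustration only: issuing an RTSP OPTIONS request with a raw socket,
// much as you would speak HTTP. The address and CSeq value are placeholders.
using System;
using System.Net.Sockets;
using System.Text;

class RtspOptionsExample
{
    static void Main()
    {
        using (TcpClient client = new TcpClient("example.com", 554))
        using (NetworkStream stream = client.GetStream())
        {
            string request = "OPTIONS rtsp://example.com/media.mp4 RTSP/1.0\r\n" +
                             "CSeq: 1\r\n" +
                             "\r\n";

            byte[] bytes = Encoding.ASCII.GetBytes(request);
            stream.Write(bytes, 0, bytes.Length);

            // Read the response (a real client would parse headers properly).
            byte[] buffer = new byte[4096];
            int read = stream.Read(buffer, 0, buffer.Length);
            Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, read));
        }
    }
}
```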
In fact, if you want to support all variations of Rtsp you must support Http, because Rtsp requests can be tunneled over Http. The included RtspServer supports Rtsp over Http, but if you want to know more about how it is tunneled you can check out this developer page from Apple.
Now, back to 'State'. When I say state I mean things like session variables in Http that persist with the connection even after close; in Rtsp an example of this would be the SessionId, which is assigned to clients by the server during the 'SETUP' Rtsp request.
C->S: SETUP rtsp://example.com/media.mp4/streamid=0 RTSP/1.0
      CSeq: 3
      Transport: RTP/AVP;unicast;client_port=8000-8001

S->C: RTSP/1.0 200 OK
      CSeq: 3
      Transport: RTP/AVP;unicast;client_port=8000-8001;server_port=9000-9001
      Session: 12345678
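As a hypothetical illustration of that state, the sketch below picks the port ranges out of the Transport header from the example above; a real implementation would of course parse the whole response, not just this one header.

```csharp
// Illustrative sketch: pulling the client/server port ranges out of a SETUP
// exchange like the one above. The header string is the example value.
using System;

class TransportHeaderExample
{
    static void Main()
    {
        string transport = "RTP/AVP;unicast;client_port=8000-8001;server_port=9000-9001";

        foreach (string part in transport.Split(';'))
        {
            string[] pair = part.Split('=');
            if (pair.Length == 2 && pair[0].EndsWith("_port"))
            {
                // Each port parameter carries a range: "rtpPort-rtcpPort".
                string[] range = pair[1].Split('-');
                Console.WriteLine(pair[0] + ": Rtp=" + range[0] + " Rtcp=" + range[1]);
            }
        }
    }
}
```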
The other main difference is that the requests can come over Udp or Tcp, requiring the server to have 2 or more listening sockets. The default port for Rtsp is 554, for both Tcp and Udp.
The Uri scheme for Tcp is 'rtsp://' and for Udp 'rtspu://'
The semantics of each scheme are the same; the only difference is the transport protocol carried above the IP header, TCP or UDP.
While getting through Rtsp, I discovered that I needed another protocol: RFC2326 also references the RFC for the Session Description Protocol, or Sdp.
Sdp is one of the smallest parts of this project, however it is equally important to understand in order to get a compliant server functioning. It is only used in the 'DESCRIBE' request of the Rtsp communication from server to client, but it is essential in providing the information necessary to decode the data in the stream.
C->S: DESCRIBE rtsp://example.com/media.mp4 RTSP/1.0
      CSeq: 2

S->C: RTSP/1.0 200 OK
      CSeq: 2
      Content-Base: rtsp://example.com/media.mp4
      Content-Type: application/sdp
      Content-Length: 460

      m=video 0 RTP/AVP 96
      a=control:streamid=0
      a=range:npt=0-7.741000
      a=length:npt=7.741000
      a=rtpmap:96 MP4V-ES/5544
      a=mimetype:string;"video/MP4V-ES"
      a=AvgBitRate:integer;304018
      a=StreamName:string;"hinted video track"
      m=audio 0 RTP/AVP 97
      a=control:streamid=1
      a=range:npt=0-7.712000
      a=length:npt=7.712000
      a=rtpmap:97 mpeg4-generic/32000/2
      a=mimetype:string;"audio/mpeg4-generic"
      a=AvgBitRate:integer;65790
      a=StreamName:string;"hinted audio track"
RFC4566 describes the Session Description Protocol; it is used in many other places besides Rtsp and is responsible for describing media. It usually provides information on the streams which are available and the information required to start decoding them.
In my opinion this protocol is a bit weird, in the sense that I believe XML would have served the purpose better given what it does and how it is validated; however, that is not relevant, and the format is required for streaming media, so it must be implemented as per the standard. (XML probably wasn't mature enough at the time, and it requires more characters in the output and thus more bandwidth, which may not be desired.) Another approach would have been to encapsulate this data in the SourceDescription RtcpPackets, however for whatever reason it was chosen to be what it is and must be implemented.
Sdp was designed as a "Line" based protocol, meant to be basically human readable and easy to parse; the meaning of each line is usually determined by its first character. Some "Line" items can carry a specific encoding which is different from the others in the document, but the included parser handles this well.
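A minimal sketch of what that line-based parsing looks like (illustrative only, not the included parser) follows; the sample lines are taken from the DESCRIBE example above.

```csharp
// Minimal sketch of the line-based nature of Sdp: each line is "<type>=<value>"
// where the single character before '=' determines what the value describes.
using System;

class SdpLineExample
{
    static void Main()
    {
        string sdp = "m=video 0 RTP/AVP 96\r\n" +
                     "a=control:streamid=0\r\n" +
                     "a=rtpmap:96 MP4V-ES/5544\r\n";

        foreach (string line in sdp.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            char type = line[0];              // 'm' = media description, 'a' = attribute, etc.
            string value = line.Substring(2); // everything after "x="
            Console.WriteLine(type + " -> " + value);
        }
    }
}
```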
Regardless of how you establish the Rtsp connection with the RtspServer e.g., TCP, HTTP, or UDP, the RtspServer you are connecting to must send the Media data using yet another protocol...
The Real-time Transport Protocol, or Rtp, a.k.a. RFC3550.
Again, this is not a complex protocol; it defines packets and a frame construct, various algorithms used to transmit them and calculate loss in the transmissions, as well as what ports to use. It also outlines how the stream data can be played out by the receiver.
The main format of an RTP packet is as follows (thanks to Wikipedia):
| bit offset | 0-1 | 2 | 3 | 4-7 | 8 | 9-15 | 16-31 |
|---|---|---|---|---|---|---|---|
| 0 | Version | P | X | CC | M | PT | Sequence Number |
| 32 | Timestamp (all 32 bits) | | | | | | |
| 64 | SSRC identifier (all 32 bits) | | | | | | |
| 96 | CSRC identifiers (if any) ... | | | | | | |
| 96+32×CC | Profile-specific extension header ID (16 bits) | | | | | | Extension header length (16 bits) |
| 128+32×CC | Extension header ... | | | | | | |
The RTP header has a minimum size of 12 bytes. After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application. The fields in the header are as follows:
- Version: (2 bits) Indicates the version of the protocol. Current version is 2.
- P (Padding): (1 bit) Used to indicate if there are extra padding bytes at the end of the RTP packet. Padding might be used to fill up a block of a certain size, for example as required by an encryption algorithm. The last byte of the padding contains the number of padding bytes that were added (including itself).
- X (Extension): (1 bit) Indicates presence of an Extension header between standard header and payload data. This is application or profile specific.
- CC (CSRC Count): (4 bits) Contains the number of CSRC identifiers (defined below) that follow the fixed header.
- M (Marker): (1 bit) Used at the application level and defined by a profile. If it is set, it means that the current data has some special relevance for the application.
- PT (Payload Type): (7 bits) Indicates the format of the payload and determines its interpretation by the application. This is specified by an RTP profile. For example, see RTP Profile for audio and video conferences with minimal control (RFC 3551).
- Sequence Number: (16 bits) The sequence number is incremented by one for each RTP data packet sent and is to be used by the receiver to detect packet loss and to restore packet sequence. RTP does not specify any action on packet loss; it is left to the application to take appropriate action. For example, video applications may play the last known frame in place of the missing frame. According to RFC 3550, the initial value of the sequence number should be random to make known-plaintext attacks on encryption more difficult. RTP provides no guarantee of delivery, but the presence of sequence numbers makes it possible to detect missing packets.
- Timestamp: (32 bits) Used to enable the receiver to play back the received samples at appropriate intervals. When several media streams are present, the timestamps are independent in each stream, and may not be relied upon for media synchronization. The granularity of the timing is application specific.
- SSRC: (32 bits) Synchronization source identifier uniquely identifies the source of a stream. The synchronization sources within the same RTP session will be unique. *This one is really important*
- CSRC: Contributing source IDs enumerate contributing sources to a stream which has been generated from multiple sources.
- Extension header: (optional) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension (EHL=extension header length) in 32-bit units, excluding the 32 bits of the extension header.
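To make the layout above concrete, here is a small sketch (not the library's RtpPacket class) that reads the fixed 12 byte header out of a raw buffer; the sample bytes are made up for illustration.

```csharp
// Illustrative sketch: reading the fixed 12 byte RTP header fields described
// above from a raw packet buffer (fields are in network byte order).
using System;

class RtpHeaderExample
{
    static void Main()
    {
        byte[] packet = { 0x80, 0x60, 0x1A, 0x2B, 0x00, 0x00, 0x3E, 0x80,
                          0x12, 0x34, 0x56, 0x78 /* payload would follow */ };

        int version        = packet[0] >> 6;            // 2 bits
        bool padding       = (packet[0] & 0x20) != 0;   // 1 bit
        bool extension     = (packet[0] & 0x10) != 0;   // 1 bit
        int csrcCount      = packet[0] & 0x0F;          // 4 bits
        bool marker        = (packet[1] & 0x80) != 0;   // 1 bit
        int payloadType    = packet[1] & 0x7F;          // 7 bits
        int sequenceNumber = (packet[2] << 8) | packet[3];
        uint timestamp     = (uint)((packet[4] << 24) | (packet[5] << 16) | (packet[6] << 8) | packet[7]);
        uint ssrc          = (uint)((packet[8] << 24) | (packet[9] << 16) | (packet[10] << 8) | packet[11]);

        Console.WriteLine("V={0} P={1} X={2} CC={3} M={4} PT={5} Seq={6} TS={7} Ssrc=0x{8:X8}",
                          version, padding, extension, csrcCount, marker,
                          payloadType, sequenceNumber, timestamp, ssrc);
    }
}
```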
After the RTP header there is a variable number of payload bytes, up to the maximum packet size - typically 1500 bytes minus the IP and TCP / UDP headers - and in some networks the packet size can exceed 1500 bytes if there is adequate support on the network.
These bytes make up the payload of the RtpPacket, also known as the Coefficients.
The important parts of the standard are obviously the packet structure and the frame concept, however there are also terms like JitterBuffer and Lip-synch which I will briefly explain.
A JitterBuffer and Lip-synch are just fancy words for ensuring there are no gaps in your RtpFrames, by making sure the sequence numbers of the contained RtpPackets (in a RtpFrame) increment one by one (without skipping) and are played with the proper delay; in a Translator, Mixer or Proxy implementation there is not much use for such things.
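For illustration, here is a tiny sketch of the gap check such a buffer performs; the 16 bit sequence number wraps around, so the arithmetic must wrap with it. (This is not the library's implementation, just the idea.)

```csharp
// Small sketch of detecting a gap in RTP sequence numbers, including the
// 16 bit wraparound at 65535 -> 0. Illustrative only.
using System;

class SequenceGapExample
{
    static void Main()
    {
        ushort[] received = { 65534, 65535, 0, 2 }; // note the wrap and the gap

        ushort expected = received[0];
        foreach (ushort seq in received)
        {
            if (seq != expected)
                Console.WriteLine("Gap detected: expected " + expected + " but got " + seq);

            expected = (ushort)(seq + 1); // wraps naturally at 65535 -> 0
        }
    }
}
```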
Not to minimize the work done in this area by others, but while their values are significant with regard to encoding or decoding the data, they are much less relevant in the transport area. For example, when encoding or decoding, the time-stamps are used to ensure audio and video packets are within reasonable playing distance of each other, resulting in the lips being synced - in other words, the audio matching the video with very little drift or lag.
I am not going to delve into all of these internals; I just wanted to explain how they work at a high level. You can read the RFCs if you are interested further.
The main thing to take away when continuing is that RtpPackets have a field called the 'Ssrc'. The fully typed name of this field is SynchronizationSourceIdentifier, and it identifies the stream AND who the stream is being sent from / to. This is an important distinction to make, as you will read below.
After becoming equipped with my understanding of the protocols, and having gotten as far as being able to create streams myself, I started next where any other reverse engineer would: I started going through a few WireShark dumps of existing players working with existing servers.
My goal was to compare the traffic of my client to other players to determine what exactly changes per client who connects to a server.
After some analysis of the dumps I realized that my client's traffic was almost exactly the same, and that the stream data was not changing - only the single field 'Ssrc' in the RtspServer's 'SETUP' request, and subsequently the 'Ssrc' field in all RtcpPackets and RtpPackets going out.
I realized that since the stream data (the RTP payload) was not changing - nor any bytes inside it - only the 'Ssrc' field (which represents the stream and who the packet is being sent to), this was going to be an easy task; I just needed to modify that field to effectively re-target the packet, sending it back out from the server to the client with a different 'Ssrc'.
I could have also kept the old 'Ssrc' by adding it to the 'ContributingSources' of the outgoing packet, but I leave this as an optional thing to do for people creating mixers with a specific purpose, because it increases the packet size by a few bytes for each RtpPacket you send out.
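The core of that idea can be sketched in a few lines (illustrative only, not the library's RtpPacket class): the payload is left untouched and only the four Ssrc bytes of the fixed header are rewritten before the packet is repeated to another client.

```csharp
// Sketch of the "re-targeting" described above: the payload bytes are untouched
// and only the 32 bit Ssrc field of the fixed header is rewritten before the
// packet is repeated to another client.
static void RewriteSsrc(byte[] rtpPacket, uint newSsrc)
{
    // The Ssrc occupies bytes 8, 9, 10 and 11 of the 12 byte fixed header,
    // in network (big endian) byte order.
    rtpPacket[8]  = (byte)(newSsrc >> 24);
    rtpPacket[9]  = (byte)(newSsrc >> 16);
    rtpPacket[10] = (byte)(newSsrc >> 8);
    rtpPacket[11] = (byte)(newSsrc);
}
```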
I originally was going to do the same with RtcpPackets, however the overhead to calculate their reports according to the standard was small, plus there was existing code in the other projects I could reference and utilize in my implementation, so I added the ability for my RtpClient to generate and respond with its own RtcpPackets per the standard.
The result was efficient and pleasing and is compatible with VLC, Quicktime, FFMPEG and any other standards compliant player. The code works as follows...
Any clients consuming a stream from the included RtspServer get an exact copy of the Video / Audio streams being transmitted, in the exact same format and at the exact same frames per second.
If your source stream drops a packet then all clients consuming that stream will likely drop a packet.
If a session drops a packet it will not affect other sessions on the server. Finally, no matter how many clients connect to the included server, the source stream only has as much bandwidth utilized as if a single stream were consuming it; the rest of the bandwidth is utilized by the RtspServer and the clients who connect to it.
Each RtpClient instance has a single thread to handle all audio and video transport, and thus the life of the buffer is shared between the audio and video streams. Packet instances are short lived and are usually completed in the buffer, however if a packet is larger than the space allocated in the buffer it will still be completed accurately, based on the information in the header fields.
Now I was in the home stretch. I would have been at home plate had it not been for Rtp over Tcp being slightly different from Rtp over Udp, and to complicate matters we have to deal with Rtsp concurrently (all on the same socket in TCP). With a bit of research I came across the following RFC, as it is not directly referenced by RFC 2326 for some reason.
When video is transmitted over the internet to home users or users in a workplace, it usually has problems due to Network Address Translation. A common example of this is a firewall or router blocking incoming traffic on certain UDP ports, which results in the player not receiving video. The only way to work around this issue is to have the player tunnel the connection over TCP. This allows the firewall to pass the traffic to and from the client and server because it is on a designated port.
In such a case there cannot be different ports for Rtp and Rtcp traffic, which additionally complicates matters because we are already using Rtsp to communicate on the TCP socket.
Enter RFC4571 / Independent TCP
RFC4571 – "Framing Real-time Transport Protocol (RTP) and RTP Control Protocol (RTCP) Packets
over Connection-Oriented Transport" Or more succinctly "Interleaving"
This RFC explains how data can be sent and received with RTP / RTCP communication when using TCP oriented sockets. It specifies that there should be a 2 byte length field preceding the Rtp or Rtcp packet.
Figure 1 defines the framing method.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------------------------------------------------------+
   |             LENGTH            |  RTP or RTCP packet ...       |
   +---------------------------------------------------------------+
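A minimal sketch of reading one such framed packet from a connection oriented stream follows (assuming a using System.IO; directive; this is not the library's code, and the interleaving described next adds a '$' and a channel byte in front of the length).

```csharp
// Minimal sketch of RFC4571 framing: a 16 bit, big endian length followed by
// exactly that many bytes of Rtp or Rtcp data. When interleaved with Rtsp (see
// below) a '$' and a channel byte precede this length field.
static byte[] ReadFramedPacket(Stream stream)
{
    int high = stream.ReadByte();
    int low = stream.ReadByte();
    if (high < 0 || low < 0) return null; // connection closed

    int length = (high << 8) | low;
    byte[] packet = new byte[length];

    // The stream may return fewer bytes than asked for; loop until complete.
    int offset = 0;
    while (offset < length)
    {
        int read = stream.Read(packet, offset, length - offset);
        if (read <= 0) return null;
        offset += read;
    }
    return packet;
}
```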
Combine this with Section 10.12 of RFC2326, which adds to this mechanism a framing character and a channel identifier preceding the length field, to delimit Rtsp and Rtp / Rtcp traffic.
Take this example:
C->S: PLAY rtsp://foo.com/bar.file RTSP/1.0
      CSeq: 3
      Session: 12345678

S->C: RTSP/1.0 200 OK
      CSeq: 3
      Session: 12345678
      Date: 05 Jun 1997 18:59:15 GMT
      RTP-Info: url=rtsp://foo.com/bar.file; seq=232433;rtptime=972948234

S->C: $ 00{2 byte length}{"length" bytes data, w/RTCP header}
S->C: $ 00{2 byte length}{"length" bytes data, w/RTP header}

C->S: GET_PARAMETER rtsp://foo.com/bar.file RTSP/1.0
      CSeq: 5
      Session: 12345678

S->C: RTSP/1.0 200 OK
      CSeq: 5
      Session: 12345678

S->C: $ 00{2 byte length}{"length" bytes data, w/RTP header}
S->C: $ 00{2 byte length}{"length" bytes data, w/RTP header}
As you can see, the example given => '$ 00{2 byte length}{"length" bytes data, w/RTP header}' was not really diagrammed out in the RFC, so I will do a bit of explaining for you here and give you a real world example.
When Rtp is being 'interleaved' on the same socket that Rtsp communication is sent and received on, we need a way to differentiate the start and end of the Rtp data and the Rtsp data within a contiguous allocation of memory (a buffer).
This is where RFC2326 adds the magic character '$' as a framing control character, to indicate RTP data is coming on the socket.
When '$' is encountered, the channel and length follow, along with the actual RTP data packet.
$ - is the control character.