The previous note showed a sketch of a Linux kernel module that creates an additional virtual network interface. It was a simplified fragment of a real project that has run for several years without failures or complaints, so it can well serve as a template for further improvement, correction and development.

But that approach to implementation is, first, not the only one and, second, may be unacceptable in some situations (for example, in an embedded system with a kernel older than 2.6.36, where the netdev_rx_handler_register() call does not exist yet). Below we consider an alternative with the same functionality, implemented at an entirely different layer of the TCP/IP network stack.
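
The version constraint mentioned above can be expressed as a simple comparison. The following is a user-space sketch rather than module code: it re-creates the classic one-byte-per-component encoding of KERNEL_VERSION under a MODEL_ prefix so that the check can be tried outside the kernel, and the helper name has_rx_handler_api is hypothetical.

```c
#include <stdbool.h>

/* Classic encoding of a kernel version as one integer, one byte per
 * component. Re-declared with a MODEL_ prefix (an assumption of this
 * sketch) so that it builds without kernel headers. */
#define MODEL_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: true if a kernel with this version code is new
 * enough to offer netdev_rx_handler_register(), which according to the
 * text is missing in kernels older than 2.6.36. */
bool has_rx_handler_api(unsigned int version_code)
{
    return version_code >= MODEL_KERNEL_VERSION(2, 6, 36);
}
```

In a real module the same decision would normally be made by the preprocessor, in the style of the LINUX_VERSION_CODE check that the module listing in this article uses for alloc_netdev().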

Network layer protocols


Not to repeat what has been written many times: the layers of the TCP/IP network stack do not map unambiguously onto the 7 layers of the OSI/ISO open systems interconnection model (or, to be fairer, the OSI model, so dear to academic circles, turned out to be inadequate to the TCP/IP networking that actually developed). In the previously discussed implementation the virtual interface was created at the interface level (L2, Layer 2, corresponding very roughly to the OSI data link layer). The current implementation uses the facilities of the network layer (L3).

It is worth covering a minimum about the facilities of the network layer, somewhat more broadly than the current task requires, to leave room for later extension. At the network layer of the protocol stack (TCP/IP, but not only: all other protocol families are supported here as well, although today they are of little practical relevance) protocols such as IP/IPv4/IPv6, IPX, IGMP, RIP, OSPF and ARP are processed, and custom user protocols can be added. The network layer provides an API for installing network-layer handlers (<linux/netdevice.h>):
struct packet_type { 
   __be16  type; /* This is really htons(ether_type). */ 
   struct net_device *dev; /* NULL is wildcarded here    */ 
   int (*func) (struct sk_buff*, struct net_device*, struct packet_type*, struct net_device*); 
...
   struct list_head list; 
}; 
extern void dev_add_pack( struct packet_type *pt ); 
extern void dev_remove_pack( struct packet_type *pt ); 

In effect, a well-behaved kernel module has to add a filter through which socket buffers from the interface's incoming flow pass (the outgoing flow is simpler to implement, as the previous implementation showed). The dev_add_pack() function adds one more handler for packets of the given type, implemented by the func() function. It adds a handler but does not replace the existing ones (including the default handler of the Linux network subsystem). The handler function receives those socket buffers that match the criteria set in struct packet_type (the protocol type and the network interface dev).
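
The matching rule described above can be tried out in user space. The sketch below is a simplified model, not kernel code: all model_* names are hypothetical, EtherTypes are kept in host byte order, and a single linked list stands in for the kernel's internal tables. It illustrates that adding a handler leaves earlier ones in place, and that a NULL dev and the ETH_P_ALL value act as wildcards.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_P_ALL 0x0003   /* same numeric value as ETH_P_ALL */

struct model_dev { const char *name; };

struct model_frame {
    uint16_t type;               /* EtherType, host byte order in this model */
    struct model_dev *dev;       /* interface the frame arrived on */
};

struct model_handler {
    uint16_t type;               /* EtherType to match, or MODEL_ETH_P_ALL */
    struct model_dev *dev;       /* NULL matches every interface */
    int (*func)(struct model_frame *, struct model_handler *);
    struct model_handler *next;
};

static struct model_handler *handlers;   /* head of the handler list */

/* Add a handler at the head of the list; existing handlers stay registered. */
void model_add_pack(struct model_handler *h)
{
    h->next = handlers;
    handlers = h;
}

/* Remove a previously added handler, if it is present. */
void model_remove_pack(struct model_handler *h)
{
    struct model_handler **p;
    for (p = &handlers; *p != NULL; p = &(*p)->next) {
        if (*p == h) {
            *p = h->next;
            h->next = NULL;
            return;
        }
    }
}

/* Offer the frame to every matching handler; return how many were called. */
int model_deliver(struct model_frame *f)
{
    int called = 0;
    struct model_handler *h;
    for (h = handlers; h != NULL; h = h->next) {
        int type_ok = (h->type == MODEL_ETH_P_ALL) || (h->type == f->type);
        int dev_ok = (h->dev == NULL) || (h->dev == f->dev);
        if (type_ok && dev_ok) {
            h->func(f, h);
            called++;
        }
    }
    return called;
}

/* Example handler for experiments: it only counts the frames it sees. */
int model_seen;
int model_count_rcv(struct model_frame *f, struct model_handler *h)
{
    (void)f;
    (void)h;
    return ++model_seen;
}
```

With one wildcard handler and one handler bound to IP on a single interface, an IP frame from that interface reaches both handlers, while any other frame reaches only the wildcard one.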

Note: New protocols are added by the same scheme (installing a filter function) at the higher, transport layer of the network stack (where, for example, UDP, TCP and SCTP are processed). Still higher layers (more or less analogous to the layers of the OSI model) are not provided in the kernel and are served in user space by BSD socket programming. All those details belonging to the higher layers are not considered further in this text.

If we wanted to add a new (proprietary) protocol, we would have to define its type:
#define PROTO_ID 0x1234 
static struct packet_type test_proto = { 
   __constant_htons( PROTO_ID ), 
   ...
};

The problem then would be that the standard IP stack knows nothing about such a protocol, and we would have to take on all of its processing ourselves. Our goal, however, is only to redefine the processing of some packets; for that we use the constant ETH_P_ALL, which specifies that all protocols must pass through the filter (and, if the dev field is NULL, all network interfaces as well).

For comparison and clarity: a large number of protocol identifiers (Ethernet Protocol IDs) are defined in <linux/if_ether.h>; here are some of them as an example:
#define ETH_P_LOOP   0x0060  /* Ethernet Loopback packet  */
#define ETH_P_IP     0x0800  /* Internet Protocol packet  */
#define ETH_P_ARP    0x0806  /* Address Resolution packet */
#define ETH_P_PAE    0x888E  /* Port Access Entity (IEEE 802.1X) */
#define ETH_P_ALL    0x0003  /* Every packet (be careful!!!) */
...

Here the type field is not an abstract numeric value in program code: this value is written in binary form into the Ethernet header of the frame that is physically sent onto the transmission medium:
struct ethhdr { 
   unsigned char h_dest[ETH_ALEN];   /* destination eth addr */ 
   unsigned char h_source[ETH_ALEN]; /* source ether addr    */ 
   __be16       h_proto;             /* packet type ID field */ 
} __attribute__((packed)); 

(We need the same definition in our code when filling in struct packet_type in the module.)
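
To see how the type value ends up on the wire, here is a small user-space sketch. It copies the header layout shown above into a hypothetical struct model_ethhdr (the model_* names and the MODEL_ETH_ALEN constant are assumptions of this sketch), and it relies on the GCC/Clang packed attribute and on the POSIX htons()/ntohs() functions.

```c
#include <arpa/inet.h>   /* htons, ntohs */
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_ALEN 6

/* User-space copy of the Ethernet header layout. */
struct model_ethhdr {
    unsigned char h_dest[MODEL_ETH_ALEN];     /* destination MAC address */
    unsigned char h_source[MODEL_ETH_ALEN];   /* source MAC address */
    uint16_t      h_proto;                    /* EtherType, network byte order */
} __attribute__((packed));

/* Fill a header; proto is given in host byte order (e.g. 0x0806 for ARP). */
void model_fill_ethhdr(struct model_ethhdr *h, const unsigned char *dst,
                       const unsigned char *src, uint16_t proto)
{
    memcpy(h->h_dest, dst, MODEL_ETH_ALEN);
    memcpy(h->h_source, src, MODEL_ETH_ALEN);
    h->h_proto = htons(proto);    /* same conversion as __constant_htons() */
}

/* Read the EtherType back in host byte order. */
uint16_t model_ethertype(const struct model_ethhdr *h)
{
    return ntohs(h->h_proto);
}
```

Whatever the byte order of the host, the two bytes of the EtherType appear in the frame most significant byte first, which is why the module passes the constant through __constant_htons() before storing it in struct packet_type.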

The filter function (the func field), which we still have to write, could in its simplest form look something like this:
int test_pack_rcv( struct sk_buff *skb, struct net_device *dev, 
                   struct packet_type *pt, struct net_device *odev ) { 
   unsigned int len = skb->len;   /* read before dropping our reference */
   LOG( "packet received with length: %u\n", len );
   kfree_skb( skb ); 
   return len; 
}

The function is shown here mainly because of the mandatory kfree_skb() call. Unlike the seemingly equivalent dev_kfree_skb() on the transmit path, it does not destroy the socket buffer but only decrements its usage counter (the users field). With each additional protocol filter installed by a dev_add_pack() call, this field of the socket buffers is incremented. You can install several network-layer filters (in the same or in several loadable modules) and they will fire in the reverse order of their installation, but each of them must call kfree_skb(). Otherwise you get a slow but steady memory leak in the network stack, whose result, a system crash, shows up only after several hours of continuous operation.

This is a rather interesting and non-obvious point, so it makes sense to digress and look at the source code of kfree_skb() (file net/core/skbuff.c):
void kfree_skb(struct sk_buff *skb) {
   if (unlikely(!skb))
      return;
   if (likely(atomic_read(&skb->users) == 1))
      smp_rmb();
   else if (likely(!atomic_dec_and_test(&skb->users)))
      return;
   trace_kfree_skb(skb, __builtin_return_address(0));
   __kfree_skb(skb);
}

The kfree_skb() call actually frees the socket buffer only when skb->users == 1; for all other values it only decrements skb->users (the usage counter).
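
The counting behaviour described above can be modelled in user space. The following sketch is not the kernel implementation: struct model_buf and the model_* functions are hypothetical names, and C11 atomics stand in for the kernel's atomic_t. It shows that a buffer with two users survives the first release and is freed only by the last one.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_buf {
    atomic_int users;   /* usage counter, in the role of skb->users */
    unsigned int len;
};

int model_freed;        /* number of buffers really released (for the demo) */

/* Allocate a buffer owned by one user; returns NULL if allocation fails. */
struct model_buf *model_alloc(unsigned int len)
{
    struct model_buf *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    atomic_init(&b->users, 1);
    b->len = len;
    return b;
}

/* Take one more reference, as happens for each additional handler. */
void model_get(struct model_buf *b)
{
    atomic_fetch_add(&b->users, 1);
}

/* Drop one reference; free the memory only if it was the last one.
 * Returns true if the buffer was actually released. */
bool model_put(struct model_buf *b)
{
    if (b == NULL)
        return false;
    if (atomic_fetch_sub(&b->users, 1) != 1)
        return false;           /* other users remain */
    free(b);
    model_freed++;
    return true;
}
```

If one of the users never calls model_put(), the counter never reaches zero and the buffer is never freed, which is the slow leak the text warns about.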

We now have enough detail to build a working virtual interface, this time using the network layer of the IP stack.

Module of the virtual interface


Let's proceed as before and create two versions of the module: a simplified version, virtl.ko, whose network interface (virt0) replaces the parent network interface, and a full version, virt.ko, which analyzes network frames of the ARP and IPv4 protocols and touches only the traffic that belongs to its own interface. The difference is that while the simplified module is loaded the parent interface temporarily stops working (until virtl.ko is unloaded), whereas with the full version both interfaces can work in parallel and independently. The code of the full module is much bulkier and adds nothing to understanding the principles. So the simplified version, which shows the principles, is examined in detail next, and the full version is touched on only briefly afterwards (its code and test log are in the examples archive):
#include <linux/module.h> 
#include <linux/version.h> 
#include <linux/netdevice.h> 
#include <linux/etherdevice.h> 
#include <linux/inetdevice.h> 
#include <linux/moduleparam.h> 
#include <net/arp.h> 
#include <linux/ip.h> 

#define ERR(...) printk( KERN_ERR "! "__VA_ARGS__ ) 
#define LOG(...) printk( KERN_INFO "! "__VA_ARGS__ ) 
#define DBG(...) if( debug != 0 ) printk( KERN_INFO "! "__VA_ARGS__ ) 

static char* link = "eth0"; 
module_param( link, charp, 0 ); 

static char* ifname = "virt"; 
module_param( ifname, charp, 0 ); 

static int debug = 0; 
module_param( debug, int, 0 ); 

static struct net_device *child = NULL; 
static struct net_device_stats stats;  // static table of interface statistics 
static u32 child_ip; 

struct priv { 
   struct net_device *parent; 
}; 

static char* strIP( u32 addr ) {      // format an IP in dotted notation (for diagnostics) 
   static char saddr[ MAX_ADDR_LEN ]; 
   sprintf( saddr, "%d.%d.%d.%d", 
            ( addr ) &0xFF, ( addr >> 8 ) &0xFF, 
            ( addr >> 16 ) &0xFF, ( addr >> 24 ) &0xFF 
          ); 
   return saddr; 
} 

static int open( struct net_device *dev ) { 
   struct in_device *in_dev = dev->ip_ptr; 
   struct in_ifaddr *ifa = in_dev->ifa_list;      /* IP ifaddr chain */ 
   LOG( "%s: device opened", dev->name ); 
   child_ip = ifa->ifa_address; 
   netif_start_queue( dev ); 
   if( debug != 0 ) { 
      char sdebg[ 40 ] = ""; 
      sprintf( sdebg, "%s:", strIP( ifa->ifa_address ) ); 
      strcat( sdebg, strIP( ifa->ifa_mask ) ); 
      DBG( "%s: %s", dev->name, sdebg ); 
   } 
   return 0; 
} 

static int stop( struct net_device *dev ) { 
   LOG( "%s: device closed", dev->name ); 
   netif_stop_queue( dev ); 
   return 0; 
} 

static struct net_device_stats *get_stats( struct net_device *dev ) { 
   return &stats; 
} 

// transmit a frame 
static netdev_tx_t start_xmit( struct sk_buff *skb, struct net_device *dev ) { 
   struct priv *priv = netdev_priv( dev ); 
   stats.tx_packets++; 
   stats.tx_bytes += skb->len; 
   skb->dev = priv->parent;   // hand over to the parent (physical) interface 
   skb->priority = 1; 
   // log before dev_queue_xmit(): the skb must not be touched once it is queued 
   DBG( "tx: injecting frame from %s to %s with length: %u", 
        dev->name, skb->dev->name, skb->len ); 
   dev_queue_xmit( skb ); 
   return NETDEV_TX_OK; 
} 

static struct net_device_ops net_device_ops = { 
   .ndo_open = open, 
   .ndo_stop = stop, 
   .ndo_get_stats = get_stats, 
   .ndo_start_xmit = start_xmit, 
}; 

// receive a frame 
int pack_parent( struct sk_buff *skb, struct net_device *dev, 
                 struct packet_type *pt, struct net_device *odev ) { 
   unsigned int len = skb->len;  // read before dropping our reference 
   skb->dev = child;          // hand the frame over to the virtual interface 
   stats.rx_packets++; 
   stats.rx_bytes += len; 
   DBG( "rx: injecting frame from %s to %s with length: %u", 
        dev->name, skb->dev->name, len ); 
   kfree_skb( skb ); 
   return len; 
} 
 
static struct packet_type proto_parent = { 
   __constant_htons( ETH_P_ALL ), // intercept all packets: ETH_P_ARP & ETH_P_IP 
   NULL, 
   pack_parent, 
   (void*)1, 
   NULL 
}; 

int __init init( void ) { 
   void setup( struct net_device *dev ) { // nested function (GCC extension) 
      int j; 
      ether_setup( dev ); 
      memset( netdev_priv( dev ), 0, sizeof( struct priv ) ); 
      dev->netdev_ops = &net_device_ops; 
      for( j = 0; j < ETH_ALEN; ++j )     // fill the MAC with a dummy address 
         dev->dev_addr[ j ] = (char)j; 
   } 
   int err = 0; 
   struct priv *priv; 
   char ifstr[ 40 ]; 
   sprintf( ifstr, "%s%s", ifname, "%d" ); 
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3, 17, 0)) 
   child = alloc_netdev( sizeof( struct priv ), ifstr, setup ); 
#else 
   child = alloc_netdev( sizeof( struct priv ), ifstr, NET_NAME_UNKNOWN, setup ); 
#endif 
   if( child == NULL ) { 
      ERR( "%s: allocate error", THIS_MODULE->name ); return -ENOMEM; 
   } 
   priv = netdev_priv( child ); 
   priv->parent = dev_get_by_name( &init_net, link ); // parent interface 
   if( !priv->parent ) { 
      ERR( "%s: no such net: %s", THIS_MODULE->name, link ); 
      err = -ENODEV; goto err; 
   } 
   if( priv->parent->type != ARPHRD_ETHER && priv->parent->type != ARPHRD_LOOPBACK ) { 
      ERR( "%s: illegal net type", THIS_MODULE->name ); 
      err = -EINVAL; goto err; 
   } 
   memcpy( child->dev_addr, priv->parent->dev_addr, ETH_ALEN ); 
   memcpy( child->broadcast, priv->parent->broadcast, ETH_ALEN ); 
   if( ( err = dev_alloc_name( child, child->name ) ) ) { 
      ERR( "%s: allocate name, error %i", THIS_MODULE->name, err ); 
      err = -EIO; goto err; 
   } 
   register_netdev( child );         // register the new interface 
   proto_parent.dev = priv->parent; 
   dev_add_pack( &proto_parent );    // install the frame handler for the parent 
   LOG( "module %s loaded", THIS_MODULE->name ); 
   LOG( "%s: create link %s", THIS_MODULE->name, child->name ); 
   return 0; 
err: 
   free_netdev( child ); 
   return err; 
} 

void __exit virt_exit( void ) { 
   struct priv *priv= netdev_priv( child ); 
   dev_remove_pack( &proto_parent ); // remove the frame handler 
   unregister_netdev( child ); 
   dev_put( priv->parent ); 
   free_netdev( child ); 
   LOG( "module %s unloaded", THIS_MODULE->name ); 
   LOG( "=============================================" ); 
} 

module_init( init ); 
module_exit( virt_exit ); 

MODULE_AUTHOR( "Oleg Tsiliuric" ); 
MODULE_LICENSE( "GPL v2" ); 
MODULE_VERSION( "3.7" ); 


Everything is fairly transparent:
  • After registering the new network interface (virt0), the module calls dev_add_pack(), which installs a filter for packets received on the parent interface;
  • Beforehand, the dev field of the packet_type structure is set to point to the parent interface: only incoming traffic from that interface is intercepted by the pack_parent() function specified in the structure;
  • This function records the interface statistics and, most importantly, replaces the pointer to the parent interface in the socket buffer with a pointer to the virtual one;
  • The reverse substitution (virtual to physical) happens when a frame is sent, in start_xmit().
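
The two substitutions listed above can be condensed into a tiny user-space model. This is only an illustration under stated assumptions: the model_* types and functions are hypothetical stand-ins for struct net_device, struct sk_buff and the module's statistics, and no real networking takes place.

```c
#include <stddef.h>

struct model_netdev { const char *name; };

struct model_skb {
    struct model_netdev *dev;   /* interface the buffer is attributed to */
    unsigned int len;
};

struct model_stats {
    unsigned long rx_packets, rx_bytes, tx_packets, tx_bytes;
};

struct model_virt {
    struct model_netdev *parent;   /* physical interface */
    struct model_netdev *child;    /* virtual interface */
    struct model_stats stats;
};

/* Receive path: a frame that arrived on the parent is re-attributed
 * to the child, and the receive counters are updated. */
void model_rx(struct model_virt *v, struct model_skb *skb)
{
    skb->dev = v->child;
    v->stats.rx_packets++;
    v->stats.rx_bytes += skb->len;
}

/* Transmit path: a frame sent through the child is handed to the
 * parent, and the transmit counters are updated. */
void model_tx(struct model_virt *v, struct model_skb *skb)
{
    v->stats.tx_packets++;
    v->stats.tx_bytes += skb->len;
    skb->dev = v->parent;
}
```

The model makes the symmetry visible: the only per-frame work in either direction is one pointer assignment plus bookkeeping.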

Here is how it works:
  • On the computer under test, we load the module and configure it on a separate new subnet:
    $ ip address 
    ...
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 
        link/ether 08:00:27:52:b9:e0 brd ff:ff:ff:ff:ff:ff 
        inet 192.168.1.21/24 brd 192.168.1.255 scope global eth0 
        inet6 fe80::a00:27ff:fe52:b9e0/64 scope link 
           valid_lft forever preferred_lft forever 
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 
        link/ether 08:00:27:0f:13:6d brd ff:ff:ff:ff:ff:ff 
        inet 192.168.56.102/24 brd 192.168.56.255 scope global eth1 
        inet6 fe80::a00:27ff:fe0f:136d/64 scope link 
           valid_lft forever preferred_lft forever 
    $ sudo insmod virt.ko link=eth1 debug=1 
    $ sudo ifconfig virt0 192.168.50.19 
    $ sudo ifconfig virt0 
    virt0     Link encap:Ethernet  HWaddr 08:00:27:0f:13:6d 
              inet addr:192.168.50.19  Bcast:192.168.50.255  Mask:255.255.255.0 
              inet6 addr: fe80::a00:27ff:fe0f:136d/64 Scope:Link 
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0 
              TX packets:46 errors:0 dropped:0 overruns:0 carrier:0 
              collisions:0 txqueuelen:1000 
              RX bytes:0 (0.0 B)  TX bytes:8373 (8.1 KiB) 
    

    (Note that the statistics here show zero bytes received on the interface.)
  • On the computer from which we run the test, we create an alias IP for the new subnet (192.168.50.0/24) and can then exchange traffic with the created interface:
    $ sudo ifconfig vboxnet0:1 192.168.50.1
    $ ping 192.168.50.19 
    PING 192.168.50.19 (192.168.50.19) 56(84) bytes of data. 
    64 bytes from 192.168.50.19: icmp_req=1 ttl=64 time=0.627 ms 
    64 bytes from 192.168.50.19: icmp_req=2 ttl=64 time=0.305 ms 
    64 bytes from 192.168.50.19: icmp_req=3 ttl=64 time=0.326 ms 
    ^C 
    --- 192.168.50.19 ping statistics --- 
    3 packets transmitted, 3 received, 0% packet loss, time 2000ms 
    rtt min/avg/max/mdev = 0.305/0.419/0.627/0.148 ms 
    
  • On the same (testing) computer it is very informative to watch the traffic captured by tcpdump (in a separate terminal):
    $ sudo tcpdump -i vboxnet0 
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode 
    listening on vboxnet0, link-type EN10MB (Ethernet), capture size 65535 bytes 
    ... 
    18:41:01.740607 ARP, Request who-has 192.168.50.19 tell 192.168.50.1, length 28 
    18:41:01.741104 ARP, Reply 192.168.50.19 is-at 08:00:27:0f:13:6d (oui Unknown), length 28 
    18:41:01.741116 IP 192.168.50.1 > 192.168.50.19: ICMP echo request, id 8402, seq 1, length 64 
    18:41:01.741211 IP 192.168.50.19 > 192.168.50.1: ICMP echo reply, id 8402, seq 1, length 64 
    18:41:02.741164 IP 192.168.50.1 > 192.168.50.19: ICMP echo request, id 8402, seq 2, length 64 
    18:41:02.741451 IP 192.168.50.19 > 192.168.50.1: ICMP echo reply, id 8402, seq 2, length 64 
    18:41:03.741163 IP 192.168.50.1 > 192.168.50.19: ICMP echo request, id 8402, seq 3, length 64 
    18:41:03.741471 IP 192.168.50.19 > 192.168.50.1: ICMP echo reply, id 8402, seq 3, length 64 
    18:41:06.747701 ARP, Request who-has 192.168.50.1 tell 192.168.50.19, length 28 
    18:41:06.747715 ARP, Reply 192.168.50.1 is-at 0a:00:27:00:00:00 (oui Unknown), length 28 
    


Extending the capabilities


Now, briefly, how to make a full-fledged virtual interface that handles only its own traffic and does not disrupt the parent interface (this is what the full version of the module in the archive does). For this it is necessary:

  • To declare two separate protocol handlers (one for the ARP address resolution protocol and one for the IP protocol itself):
    // handler for ETH_P_ARP frames 
    int arp_pack_rcv( struct sk_buff *skb, struct net_device *dev, 
                      struct packet_type *pt, struct net_device *odev ) { 
       ...
       return skb->len; 
    } 
    
    static struct packet_type arp_proto = { 
       __constant_htons( ETH_P_ARP ), 
       NULL, 
       arp_pack_rcv,  // filter for the ETH_P_ARP protocol 
       (void*)1, 
       NULL 
    }; 
    
    // handler for ETH_P_IP frames 
    int ip4_pack_rcv( struct sk_buff *skb, struct net_device *dev, 
                      struct packet_type *pt, struct net_device *odev ) { 
       ...
       return skb->len; 
    } 
    
    static struct packet_type ip4_proto = { 
       __constant_htons( ETH_P_IP ), 
       NULL, 
       ip4_pack_rcv,    // filter for the ETH_P_IP protocol 
       (void*)1, 
       NULL 
    }; 
    
  • To register both of them in turn during module initialization:
       arp_proto.dev = ip4_proto.dev = priv->parent; // intercept only from the parent interface 
       dev_add_pack( &arp_proto ); 
       dev_add_pack( &ip4_proto ); 
    

  • Each of the installed filters must substitute the interface only for those frames whose destination IP matches the IP of the interface …
  • Two separate handlers are convenient because the headers of ARP and IP frames have entirely different formats, and the destination IP has to be extracted from them differently (the complete code is in the examples archive).
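
The difference between the two header formats can be illustrated with a user-space sketch. It is not the code from the archive, which is not reproduced here: the model_* names are hypothetical, the buffers are assumed to start right after the 14-byte Ethernet header, and the offsets are those of the standard ARP packet for Ethernet/IPv4 (28 bytes) and of the IPv4 header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets inside the payload that follows the Ethernet header. */
#define MODEL_ARP_LEN        28   /* ARP packet for Ethernet + IPv4 */
#define MODEL_ARP_TPA_OFF    24   /* target protocol (IP) address */
#define MODEL_IP4_MIN_LEN    20   /* IPv4 header without options */
#define MODEL_IP4_DADDR_OFF  16   /* destination address */

/* Copy the 4-byte target IP of an ARP packet into *ip (network byte
 * order). Returns 0 on success, -1 if the packet is too short. */
int model_arp_target_ip(const uint8_t *arp, size_t len, uint32_t *ip)
{
    if (len < MODEL_ARP_LEN)
        return -1;
    memcpy(ip, arp + MODEL_ARP_TPA_OFF, sizeof *ip);
    return 0;
}

/* Copy the destination address of an IPv4 header into *ip (network
 * byte order). Returns 0 on success, -1 on a short or non-IPv4 header. */
int model_ip4_dest_ip(const uint8_t *iph, size_t len, uint32_t *ip)
{
    if (len < MODEL_IP4_MIN_LEN || (iph[0] >> 4) != 4)
        return -1;
    memcpy(ip, iph + MODEL_IP4_DADDR_OFF, sizeof *ip);
    return 0;
}

/* A filter would claim the frame only if the address is its own. */
int model_is_for_us(uint32_t dest_ip, uint32_t our_ip)
{
    return dest_ip == our_ip;
}
```

In a module built along these lines, each handler would compare the extracted address with the address saved when the interface was opened (child_ip in the listing above) and substitute skb->dev only on a match.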

With such a full-fledged module it is possible, for example, to open two parallel SSH sessions to a host over different interfaces (using different IPs) that in reality share one common physical interface:
$ ssh olej@192.168.50.17 
olej@192.168.50.17's password: 
Last login: Mon Jul 16 15:52:16 2012 from 192.168.1.9 
...
$ ssh olej@192.168.56.101 
olej@192.168.56.101's password: 
Last login: Mon Jul 16 17:29:57 2012 from 192.168.50.1 
...
$ who 
olej     tty1         2012-07-16 09:29 (:0) 
olej     pts/0        2012-07-16 09:33 (:0.0) 
...
olej     pts/6        2012-07-16 17:29 (192.168.50.1) 
olej     pts/7        2012-07-16 17:31 (192.168.56.1) 


The last command shown (who) is executed inside the SSH session, that is, on the remote host, which records two independent connections from two different subnets (the last two lines of the output); these actually come from one and the same host, seen through its different network interfaces.

Further clarifications


In preparing and debugging the module examples, and in clarifying details, the following (fairly recent) book was actively used: Rami Rosen, "Linux Kernel Networking: Implementation and Theory", Apress, 650 pages, 2014, ISBN-13: 978-1-4302-6196-4.

The author kindly made it available for free download before the book went on sale (2013-12-22). It can be downloaded from this page.

Anyone interested in questions like those discussed in this article will find in this book many ideas for developing their own techniques for using network interfaces.

The archive of the code mentioned in the text, for experiments and further development, can be downloaded here or here.

This article is a translation of the original post at habrahabr.ru/post/270517/