📜 ⬆️ ⬇️

Performance comparison of C and C ++ using the example of Huffman compression

Introduction


When the IT forums ask the question "Is the X programming language of the Y language faster," this usually causes a flow of emotions and is considered incorrect. With related questions about religion or the preference of a particular political party. Indeed, language is a way of expressing thoughts, ideas. In this case, the ideas of the software system. He is neither fast nor slow. It may be more or less concise, more or less accurate. And the speed is determined not so much by the language as by the final code generated by the compiler of this language. Or the speed of the interpreter in the case of the interpreted language.

But this is all philosophy. But in practice there is usually a practical task of software development. And, indeed, you can implement this software in a dozen different programming languages. Therefore, even though this is a “religious issue” in the case of a public discussion, this question often arises in the head of an IT specialist who is facing a specific task. “How much time will it take for me to accomplish the task in the X language and what characteristics of the received software will have the speed characteristics. Compared with the implementation of this task in the language Y ". Understandably, there is no exact answer to this question, the specialist relies on his personal experience and responds somehow like "with a 95% probability, written in assembler, this task will work faster than in php". But, in all honesty, this experience is rarely based on the exact numbers of real-world tasks that this specialist himself realized. No, well, who in his right mind would write complex software first in php, and then rewrite it in assembler, only to measure the characteristics? Mainly limited to synthetic tests such as sorting an array, building and traversing a binary tree, and so on.

As a specialist writing 90% in C ++, I often come across “holly” topics of comparing this language with others. And one of them is the progenitor - the C language. On the same quora.com, this question is often raised “Is the C language of C ++ faster” (which is incorrect, as I explained above), or “Why is the Linux kernel or a ton of GNU utilities written in C and not in C ++ ”(which is a completely correct question). I answered the second question for myself:
')

Since C is part of C ++, I have to decide in my everyday tasks whether to express some part of the logic “more in C style” (with working with “raw” pointers, clearing memory through memset, passing context through void * ), or type safe in C ++ style (the pointers are wrapped in unique_ptr / shared_ptr, the memory is cleaned up with normally written constructors, the context is passed as a typed object: either with a pointer to the base class with virtual functions, or as a template in general).

Problem


In order to answer this question a little more thoroughly, I decided to write one more (yes, also a little synthetic) test - data coding using the Huffman method. The article “The Huffman algorithm on the fingers” ( https://habrahabr.ru/post/144200/ ) suggested the idea.

First, I implemented coding on pure C. If you remember, its implementation requires a priority queue, because to build a coding tree, you need to quickly find characters ordered by the number of repetitions. I omit the algorithmic details, referring the reader to the link above (sorry for the tautology). Actually, this would all end, and there would be no article, because the coding I implemented only as a workout in the algorithms. But in the course of the work, I noticed how quickly the C program is compiled compared to C ++ source code of similar size. And mentioned this to a colleague. Having suggested that compiling in C ++ includes, probably, many more optimization methods. So, similarly written code in C ++ should probably be faster - the magic of the most-most gurus in the field of writing optimizing compilers will work in the same place. Grinning, a colleague replied: "Check."

And then I rewrote Huffman coding in C ++. For the purity of the experiment, I did not change the fundamental principles, for example, did not insert a user memory allocator. This can be done both in C (more “custom”) and in C ++ (more “native”). What is then “C ++ - ness”?

Priority queue


The first thing that makes sense to express through patterns in C ++ is the priority queue. In C, it is represented as a structure, the main element of which is a pointer to an array of pointers to data nodes:

struct priority_queue
{
    // A number of active (carrying data) nodes currently in the queue
    unsigned int size;
    // A total number of nodes in "nodes" array
    unsigned int capacity;
    // An array of pointers to nodes
    struct priority_queue_node** nodes;
};

, :

struct priority_queue_node
{
    unsigned int weight;
};

, , . , , : ((struct priority_queue_node*) node_ptr)→weight. :

int priority_queue_push(struct priority_queue* queue, struct priority_queue_node* node)
{
    if (queue->size >= queue->capacity)
    {   
        int new_capacity = queue->capacity * 2;
        if (new_capacity == 0)
            new_capacity = 1;
        struct priority_queue_node** new_nodes = (struct priority_queue_node**) malloc(sizeof(struct priority_queue_node*) * new_capacity);
        if (! new_nodes)
        {
            return 0;
        }
        memcpy(new_nodes, queue->nodes, sizeof(struct priority_queue_node*) * queue->size);
        if (queue->capacity)
            free(queue->nodes);
        queue->nodes = new_nodes;
        queue->capacity = new_capacity;
    }

    queue->nodes[queue->size++] = node;
    heapify(queue);
    return 1;
}

— C++ , . , C++ ( — ):

template <class T> class priority_queue
{
    struct node
    {   
        unsigned int m_weight;
        T m_data;
    };

    using node_ptr = std::unique_ptr<node>;

    std::size_t m_capacity;
    std::size_t m_size;
    std::unique_ptr<node_ptr[]> m_nodes;

    void heapify() noexcept;
    void increase_capacity();
public:
    explicit priority_queue(std::size_t capacity = 16) ;
    // …
};

, C — , , , , — , , . — m_nodes .

— ( — increase_capacity, ):

template <class U>
push(unsigned int weight, U&& obj)
{   
    if (m_size >= m_capacity)
        increase_capacity();

    m_nodes[m_size++].reset(new node({weight, std::forward<U>(obj)}));
    heapify();
}

void increase_capacity()
{
    const auto new_capacity = m_capacity ? m_capacity * 2 : 1;
    std::unique_ptr<node_ptr[]> new_nodes(new node_ptr[new_capacity]);

    for (auto src = m_nodes.get(), dest = new_nodes.get(); src != m_nodes.get() + m_size; ++src, ++dest)
        *dest = std::move(*src);

    m_nodes = std::move(new_nodes);
    m_capacity = new_capacity;
}

( )


, , . C , :

#define NODE_TYPE_TERM 1
#define NODE_TYPE_NODE 2

struct char_node_base
{
    int type;
};

struct char_node_terminal
{
    struct char_node_base base;
    char c;
};

struct char_node
{
    struct char_node_base base;
    struct char_node_base* left;
    struct char_node_base* right;
};

, , , — :
struct char_node_root
{
    struct priority_queue_node pq_node;
    int height;
    struct char_node_base* node;
};

C++ :
struct char_node_base
{
    virtual ~char_node_base() = default;
};

using char_node_ptr = std::unique_ptr<char_node_base>;

struct char_node_terminal : char_node_base
{
    const unsigned char m_c;
    char_node_terminal(char c) noexcept : m_c(c) {}
};

struct char_node : char_node_base
{
    char_node_ptr m_left;
    char_node_ptr m_right;
};

struct nodes_root
{
    int m_height;
    char_node_ptr m_node;
};

C++ — . , . C .


C C++ . , 256 . -, . , , , .

C ( — ) :

static struct priority_queue* build_priority_queue(
    char* buffer, unsigned int size)
{
    unsigned char table[256];

    memset(table, 0, sizeof(table));

    for (unsigned int i = 0; i < size; ++i)
        if (table[(unsigned char)buffer[i]] != 255)
            ++table[(unsigned char)buffer[i]];

    struct priority_queue* queue = priority_queue_create(16);

    for (unsigned short i = 0; i < 256; ++i)
    {
        if (table[i])
        {
            struct char_node_root* node = (struct char_node_root*) malloc(sizeof(struct char_node_root));

            struct char_node_terminal* term = (struct char_node_terminal*) malloc(sizeof(struct char_node_terminal));

            term->base.type = NODE_TYPE_TERM;
            term->c = (char)i;
            node->node = (struct char_node_base*) term;
            node->height = 0;
            node->pq_node.weight = table[i];
            priority_queue_push(queue, (struct priority_queue_node*) node);
        }
    }
    return queue;
}

static struct char_node_root* queue_to_tree(struct priority_queue* queue)
{
    while (priority_queue_size(queue) > 1)
    {
        struct char_node_root* node1 = (struct char_node_root*) priority_queue_pop(queue);
        struct char_node_root* node2 = (struct char_node_root*) priority_queue_pop(queue);
        struct char_node_base* int_node1 = node1->node;
        struct char_node_base* int_node2 = node2->node;

        struct char_node* join_node = (struct char_node*) malloc(sizeof(struct char_node));
        join_node->base.type = NODE_TYPE_NODE;
        join_node->left = int_node1;
        join_node->right = int_node2;

        int new_weight = node1->pq_node.weight;
        if (new_weight + node2->pq_node.weight <= 65535)
            new_weight += node2->pq_node.weight;
        else
            new_weight = 65535;
        node1->pq_node.weight = new_weight;

        if (node1->height > node2->height)
            ++node1->height;
        else
            node1->height = node2->height + 1;
        free(node2);

        node1->node = (struct char_node_base*) join_node;
        priority_queue_push(queue, (struct priority_queue_node*) node1);
    }

    return (struct char_node_root*) priority_queue_pop(queue);
}

C++ — , :

void fill_priority_queue(
    const unsigned char* buffer,
    std::size_t buffer_size,
    queue_t& queue)
{
    unsigned char counts_table[256]{};

    for (auto ptr = buffer; ptr != buffer + buffer_size; ++ptr)
        if (counts_table[*ptr] != 255)
            ++counts_table[*ptr];

    for (unsigned short i = 0; i != 256; ++i)
        if (counts_table[i])
            queue.push(counts_table[i], nodes_root {0, char_node_ptr(new char_node_terminal(i))});
}

void queue_to_tree(queue_t& queue)
{
    while (queue.size() > 1)
    {
        auto old_root1_node = std::move(queue.top());
        const auto old_root1_weight = queue.top_weight();
        queue.pop();
        auto old_root2_node = std::move(queue.top());
        const auto old_root2_weight = queue.top_weight();
        queue.pop();

        auto joined_node = std::unique_ptr<char_node>(new char_node);
        joined_node->m_left = std::move(old_root1_node.m_node);
        joined_node->m_right = std::move(old_root2_node.m_node);

        const auto new_weight = std::min(old_root1_weight + old_root2_weight, 65535U);
        const auto new_height = std::max(old_root1_node.m_height, old_root2_node.m_height) + 1;
        queue.push(new_weight, nodes_root {new_height, std::move(joined_node)});
    }
}


, , , , . «» . , . , . , , .

C , . - . , , , . , — .

struct bits_line
{
    unsigned char bits_count;
    unsigned char* bits;
};

static int build_encoding_map_node(struct char_node_base* node, struct bits_line* bits_table, unsigned char* bits_pattern, int bits_count)
{
    if (node->type == NODE_TYPE_TERM)
    {
        unsigned char index = (unsigned char)((struct char_node_terminal*)node)->c;
        bits_table[index].bits_count = bits_count;
        bits_table[index].bits = (unsigned char*) malloc(bytes_count_from_bits(bits_count + 1));
        if (! bits_table[index].bits)
            return 0;
        memcpy(bits_table[index].bits, bits_pattern, bytes_count_from_bits(bits_count));
        return 1;
    }

    static const unsigned char bit_mask[] = {1, 2, 4, 8, 16, 32, 64, 128};
    bits_pattern[bits_count >> 3] &= ~bit_mask[bits_count & 7];
    if (! build_encoding_map_node(((struct char_node*)node)->left, bits_table, bits_pattern, bits_count + 1))
        return 0;
    bits_pattern[bits_count >> 3] |= bit_mask[bits_count & 7];
    if (! build_encoding_map_node(((struct char_node*)node)->right, bits_table, bits_pattern, bits_count + 1))
        return 0;

    return 1;
}

C++ , , , , .

using unique_bytes_ptr = std::unique_ptr<unsigned char[]>;

class bit_ostream
{
    std::size_t m_capacity;
    unsigned long m_bits_count = 0;
    unique_bytes_ptr m_data;
public:
    explicit bit_ostream(std::size_t initial_capacity = 0) noexcept
        : m_capacity(initial_capacity)
    {
    }

    bit_ostream& push(const unsigned char* bits, unsigned long const bits_count)
    {
        if (bits_count == 0)
            return *this;

        const auto new_bits_count = m_bits_count + bits_count;
        if (covered_bytes(new_bits_count) + 1 > m_capacity || m_bits_count == 0)
        {
            decltype(m_capacity) new_capacity = m_capacity * 2;
            const auto cov_bytes = static_cast<decltype(m_capacity)>(covered_bytes(new_bits_count) + 1);
            if (new_capacity < cov_bytes)
                new_capacity = cov_bytes;
            unique_bytes_ptr new_data(new unsigned char[new_capacity]);
            std::memcpy(new_data.get(), m_data.get(), covered_bytes(m_bits_count));
            m_capacity = new_capacity;
            m_data = std::move(new_data);
        }

        unsigned char* curr = m_data.get() + (m_bits_count >> 3);
        if ((m_bits_count & 7) == 0)
        {
            // All it's simple when current output data size is integer number of bytes
            std::memcpy(curr, bits, covered_bytes(bits_count));
        }
        else
        {
            const unsigned char shift = m_bits_count & 7;
            for (auto bytes_count = covered_bytes(bits_count); bytes_count > 0; ++curr, ++bits, --bytes_count)
            {
                unsigned short val = static_cast<unsigned short>(*bits) << shift;
                val |= static_cast<unsigned short>(*curr & g_bits_fill_mask[shift]);
                *curr = static_cast<unsigned char>(val & 0xff);
                *(curr + 1) = static_cast<unsigned char>(val >> 8);
            }
        }
        m_bits_count += bits_count;

        assert(covered_bytes(m_bits_count) <= m_capacity);
        return *this;
    }

    bit_ostream& push(const bit_ostream& other)
    {
        return push(other.data(), other.bits_count());
    }

    bit_ostream& clear_tail() noexcept
    {
        if (m_bits_count & 7)
            m_data.get()[m_bits_count >> 3] &= g_bits_fill_mask[m_bits_count & 7];

        return *this;
    }

    unsigned long bits_count() const noexcept { return m_bits_count; }
    bool empty() const noexcept { return ! m_bits_count; }
    unsigned char* data() noexcept { return m_data.get(); }
    const unsigned char* data() const noexcept { return m_data.get(); }
};

template <class T>
constexpr inline std::size_t covered_bytes(T bits_count) noexcept
{
    return (bits_count >> 3) + (bits_count & 7 ? 1 : 0); 
}


, . . , . , , — CPU rdtsc.

  1. ts1 rdtsc.
  2. .
  3. . .
  4. t1, ts1, ts2.
  5. , . .
  6. t2, ts2, ts3.
  7. , .
  8. t3, ts3.

, , posix clock_gettime.


, , , : « ?». , gcc-5.4.0 «O3», 31 . , . 64 . , 31 / 64 , .

> build-c/pack-c -m3 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1053432 (0.754 seconds), t1 = 209957, t2 = 31023, t3 = 811377.

> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1182005 (0.846 seconds), t1 = 228527, t2 = 52680, t3 = 894081

«-m3» , , .

-, - . , C++ , 12%. , C . , , 1 ?

> build-c/pack-c -m3 -b1024 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 9397894 (6.731 seconds), t1 = 5320910, t2 = 1943422, t3 = 2094688.

> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11586220 (8.3 seconds), t1 = 6399593, t2 = 3125111, t3 = 1663035

, , . C — 23%!

« »


C++ ? , . , , . bit_ostream . , , push? , . , 256 , , 256 bits_line C . .

class small_bit_ostream
{
    unique_bytes_ptr m_data;
    unsigned short m_bits_count = 0;
public:
    small_bit_ostream& push(const unsigned char* bits, const unsigned short bits_count)
    {   
        const auto cov_bytes {covered_bytes(bits_count)};
        m_data.reset(new unsigned char[cov_bytes]);
        std::memcpy(m_data.get(), bits, cov_bytes);
        m_bits_count = bits_count;
        return *this;
    }   

    unsigned long bits_count() const noexcept { return m_bits_count; }
    bool empty() const noexcept { return ! m_bits_count; }
    unsigned char* data() noexcept { return m_data.get(); }
    const unsigned char* data() const noexcept { return m_data.get(); }
};

. . . -? ( C .)

> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1173692 (0.84 seconds), t1 = 229942, t2 = 46677, t3 = 890323

> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11198578 (8.02 seconds), t1 = 6404650, t2 = 2752852, t3 = 1641317

, — . — C++ 19%. t2, .


, CPU. valgrind' «cachegrind». C .

==2794== I   refs:      2,313,382,347
==2794== I1  misses:           14,482
==2794== LLi misses:            1,492
==2794== I1  miss rate:          0.00%
==2794== LLi miss rate:          0.00%
==2794== 
==2794== D   refs:        601,604,444  (472,330,278 rd   + 129,274,166 wr)
==2794== D1  misses:        3,966,884  (  2,279,553 rd   +   1,687,331 wr)
==2794== LLd misses:            7,030  (      3,034 rd   +       3,996 wr)
==2794== D1  miss rate:           0.7% (        0.5%     +         1.3%  )
==2794== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
==2794== 
==2794== LL refs:           3,981,366  (  2,294,035 rd   +   1,687,331 wr)
==2794== LL misses:             8,522  (      4,526 rd   +       3,996 wr)
==2794== LL miss rate:            0.0% (        0.0%     +         0.0%  )
==2794== 
==2794== Branches:        299,244,261  (298,085,895 cond +   1,158,366 ind)
==2794== Mispredicts:       8,779,093  (  8,778,920 cond +         173 ind)
==2794== Mispred rate:            2.9% (        2.9%     +         0.0%   )

C++ :

==2994== I   refs:      2,464,681,889
==2994== I1  misses:            2,032
==2994== LLi misses:            1,888
==2994== I1  miss rate:          0.00%
==2994== LLi miss rate:          0.00%
==2994== 
==2994== D   refs:        633,267,329  (491,590,332 rd   + 141,676,997 wr)
==2994== D1  misses:        3,992,071  (  2,298,593 rd   +   1,693,478 wr)
==2994== LLd misses:            8,292  (      3,173 rd   +       5,119 wr)
==2994== D1  miss rate:           0.6% (        0.5%     +         1.2%  )
==2994== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
==2994== 
==2994== LL refs:           3,994,103  (  2,300,625 rd   +   1,693,478 wr)
==2994== LL misses:            10,180  (      5,061 rd   +       5,119 wr)
==2994== LL miss rate:            0.0% (        0.0%     +         0.0%  )
==2994== 
==2994== Branches:        348,146,710  (346,241,481 cond +   1,905,229 ind)
==2994== Mispredicts:       6,977,260  (  6,792,066 cond +     185,194 ind)
==2994== Mispred rate:            2.0% (        2.0%     +         9.7%   )

, C++ , , C . . ? , — 2 464 . 2 313. , .

«callgrind» , — malloc free. C++ , , new delete. ? , , unique_ptr, new[]. , C- malloc, . . ? bit_ostream , . . , . malloc / free, unique_ptr, .

struct free_deleter
{
    void operator()(void* p) const noexcept { std::free(p); }
};

template <class T> inline T* allocate_with_malloc(std::size_t size)
{
    T* res = static_cast<T*>(std::malloc(sizeof(T) * size));
    if (! res)
        throw std::bad_alloc();
    return res;
}

template <class T>
using unique_malloc_array_ptr = std::unique_ptr<T[], free_deleter>;

template <class T>
inline unique_malloc_array_ptr<T> unique_allocate_with_malloc(std::size_t size)
{
    return unique_malloc_array_ptr<T>(allocate_with_malloc<T>(size));
}

// Typedefs for byte arrays
using unique_bytes_ptr = unique_malloc_array_ptr<std::uint8_t>;

inline unique_bytes_ptr allocate_bytes(std::size_t size)
{
    return unique_bytes_ptr(unique_allocate_with_malloc<std::uint8_t>(size));
}

( , C , ).

> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1042665 (0.746 seconds), t1 = 250480, t2 = 45393, t3 = 740163

> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11068384 (7.93 seconds), t1 = 6488100, t2 = 2694562, t3 = 1501027

-, C++ C! , , . — 2 430 . 2 464. 633 . 536. , — , .


, C++. , , . m_nodes, . , , ptr1 = std::move(ptr2). «»? ptr1 , , . ptr2 , . , , . ! , . . (!) . .

> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1008990 (0.722 seconds), t1 = 221001, t2 = 44870, t3 = 736557

> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 10683068 (7.65 seconds), t1 = 6101534, t2 = 2689178, t3 = 1505929

-, C++ 4,3%, — 13,6%. 2 413 ., — 531 .


?

  1. « », - . , C , C++ «» , , . . . ( , «», «» .) , C « » , C++ .
  2. , C++, , C. , C. C ( C++) , , . C++ — , . — , ( ), « ».
  3. , - , . , . , «XYZ». , , ? . C++ , 31 0.85 , , !
  4. , . «», , C , / , , . . , « A, B, C », C++. « », . . « » , C . «» , , , , , . .


P.S. : http://corosan.ru/data-exp/haffman_pack_for_article.tar.bz2

Update 1. , clang-3.8.0. .

C :
> build-c/pack-c -m3 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1139373 (0.816 seconds), t1 = 206305, t2 = 29559, t3 = 902493.

C++ :
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1417576 (1.01 seconds), t1 = 223441, t2 = 53057, t3 = 1134400

C++ :
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1174028 (0.84 seconds), t1 = 215059, t2 = 59738, t3 = 892821

, clang - , gcc.

Source: https://habr.com/ru/post/347688/


All Articles