
Performance analysis of individual program subsystems with Linux perf report

Usually, to prepare a profiling report on Linux, I used only the simplest options for running perf report. The performance data itself must be collected beforehand with the perf record command.
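
A minimal sketch of that recording step (assuming the process is already running; the process name data_provider, the attach-by-pid mode, and the 60-second duration are my assumptions, and -g records the call chains needed later):


 $ perf record -g -p $(pidof data_provider) -o perf.data -- sleep 60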


Module report:


 $ perf report --stdio --kallsyms=/boot/System.map-$(uname -r) --sort=dso -g none 

Function report:


 $ perf report -g none --stdio --kallsyms=/boot/System.map-$(uname -r)

Function report with callgraph construction:


 $ perf report --stdio --kallsyms=/boot/System.map-$(uname -r) -g

For many situations, such reports were enough to find performance problems. However, some time ago I ran into a situation where the call-graph report showed a very large call tree that was difficult to understand.


To better understand the performance picture, I wanted to break the report into separate reports that would correspond exactly to the program's subsystems:


  1. Message receiving subsystem, main function: amqp_service::on_message()
  2. Message processing subsystem, main function: input_handler::execute()
  3. Notification distribution subsystem, main function: event_handler::on_msg()

It turned out that Linux perf has an option that allows you to select only the events related to one function in the call stack:


  -p, --parent=<regex>
         A regex filter to identify parent. The parent is a caller of this
         function and searched through the callchain, thus it requires
         callchain information recorded. The pattern is in the extended
         regex format and defaults to "^sys_|^do_page_fault"
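
Since the pattern is an extended regex, several parent functions can be matched at once, or simply iterated over. A small sketch of generating one report per subsystem (the loop and output file naming are my own illustration, not from the original article; perf.data in the current directory is assumed):


 $ for fn in "amqp_service::on_message" "input_handler::execute" "event_handler::on_msg"; do
       perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
           -g -p "$fn" > "report_${fn//:/_}.txt"
   done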

This option allowed me to analyze the performance separately for each function of interest. First, it should be noted that, according to the report, the total number of events in perf.data was:


 # Event count (approx.): 72784958041 

Analysis of amqp_service::on_message()


Building a report on amqp_service::on_message():


 $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
       -g -p "amqp_service::on_message"

The number of events with this filter:


 # Event count (approx.): 1590851860 
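
To put this number in context, the share of the total can be computed directly from the two event counts shown above (a trivial sketch with awk):


 $ awk 'BEGIN { printf "%.1f%%\n", 100 * 1590851860 / 72784958041 }'
 2.2%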

It turns out that 1590851860 events is approximately 2% of the total 72784958041 events. That is, this function did not generate many events even under heavy load. Just in case, I checked which functions were at the top of the profile when amqp_service::on_message() was in the call chain, and whether pthread_mutex_lock()/pthread_mutex_unlock() appeared among them:


 $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
       -g -p "amqp_service::on_message" | \
       grep -E -e "^[ ]+[0-9]" | sed -re 's/[ ]+data_provider::.*+$//' | head
    29.38%  data_provider  [kernel.kallsyms]   [k] clear_page_c
    11.09%  data_provider  libc-2.12.so        [.] memcpy
     6.01%  data_provider  [kernel.kallsyms]   [k] __alloc_pages_nodemask
     5.27%  data_provider  [kernel.kallsyms]   [k] compact_zone
     4.44%  data_provider  [kernel.kallsyms]   [k] page_fault
     4.08%  data_provider  [kernel.kallsyms]   [k] get_pageblock_flags_group
     4.05%  data_provider  [kernel.kallsyms]   [k] __reset_isolation_suitable
     3.55%  data_provider  [kernel.kallsyms]   [k] compaction_alloc
     1.78%  data_provider  libpthread-2.12.so  [.] pthread_mutex_lock
     1.61%  data_provider  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge

From this data and the full call-graph report for amqp_service::on_message() I can see that, under load, the function mostly spends its time copying messages. Locking on mutexes is not a problem here: only 1.8% under load, so according to the report there is no significant mutex contention either in this function or in the functions called from it.


Analysis of the input_handler::execute() function.


Building a report for this particular function:


 $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
       -g -p "input_handler::execute"

Number of events with this filter:


 # Event count (approx.): 16423187486 

It turns out that 16423187486 events is about 22% of the total 72784958041 events. That is already a significant share of the total work recorded by perf, so here I looked more closely at where the algorithm spends its time.


 $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
       -g -p "input_handler::execute" | \
       grep -E -e "^[ ]+[0-9]" | sed -re 's/[ ]+data_provider::.*+$//' | head
     7.31%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref<char, std::char_traits<char> >::empty() const
     7.19%  data_provider  data_provider.005.00  [.] dtl::string_parser::fail() const
     4.95%  data_provider  data_provider.005.00  [.] dtl::string_parser::eof() const
     4.87%  data_provider  data_provider.005.00  [.] dtl::string_parser::peek() const
     3.46%  data_provider  data_provider.005.00  [.] meta::json::decoder::parse_value(dtl::basic_string_ref<char, std::char_traits<char> >&)
     3.08%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref<char, std::char_traits<char> >::operator[](unsigned long) const
     2.53%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref<char, std::char_traits<char> >::remove_prefix(unsigned long)
     2.30%  data_provider  data_provider.005.00  [.] dtl::hash_seq(unsigned char const*, unsigned long)
     2.21%  data_provider  data_provider.005.00  [.] bool dtl::strtou<(unsigned char)10, unsigned char>(dtl::basic_string_ref<char, std::char_traits<char> >&, unsigned char&)

Judging by this data, the function does a lot of decoding work, but there are no scaling problems in this code: there is no significant number of pthread_mutex_lock() calls. That is, more threads can be added for input_handler::execute() with an expected increase in overall performance.


Analysis of the event_handler::on_msg() function.


I will skip the analysis of the number of events and just note that, among the calls made from this function, most of the time is spent on locking. This may require fixing, because such heavy locking is bad for scaling.


 $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
       -g -p "event_handler::on_msg" | \
       grep -E -e "^[ ]+[0-9]" | sed -re 's/[ ]+data_provider::.*+$//' | head
     6.73%  data_provider  libpthread-2.12.so    [.] pthread_mutex_lock
     4.90%  data_provider  libpthread-2.12.so    [.] pthread_mutex_unlock
     4.45%  data_provider  data_provider.005.00  [.] std::__atomic_base<unsigned int>::operator++()
     3.84%  data_provider  [kernel.kallsyms]     [k] _spin_lock
     3.54%  data_provider  data_provider.005.00  [.] std::__atomic_base<unsigned long>::fetch_add(unsigned long, std::memory_order)
     3.47%  data_provider  data_provider.005.00  [.] dtl::intrusive_ptr<meta::object>::intrusive_ptr(dtl::intrusive_ptr<meta::object> const&)
     2.92%  data_provider  data_provider.005.00  [.] (anonymous namespace)::page_allocator_base::allocate_obj(dtl::page_allocation_info&)
     2.70%  data_provider  data_provider.005.00  [.] std::char_traits<char>::length(char const*)
     2.18%  data_provider  data_provider.005.00  [.] dtl::string::size() const
     1.96%  data_provider  data_provider.005.00  [.] dtl::string::create_(dtl::basic_string_ref<char, std::char_traits<char> > const&)

As a result, building reports with the --parent option helped me analyze the performance of individual parts of my program. However, because the Lua library in use was apparently compiled without -fno-omit-frame-pointer, some of the functions executed inside input_handler::execute() were not included in the --parent-filtered report, even though they do appear in the general report.
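
If rebuilding such third-party code with frame pointers is not an option, one possible workaround (my assumption, not something tried in this article, and it requires a reasonably recent perf) is to record DWARF-based call chains instead, at the cost of a much larger perf.data:


 $ perf record --call-graph dwarf -p $(pidof data_provider) -- sleep 60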



Source: https://habr.com/ru/post/305188/

