Introduction
Routing security has received constant attention from the industry for more than a decade. BGP vulnerabilities were identified as early as 1988 and many of them still exist today. Despite its seeming fragility, the global inter-domain routing system of the Internet has demonstrated exceptional stability and robustness. Indeed, apart from a few high-profile cases in which intentional or unintentional configuration mistakes affected global routing for a limited time, threats from the routing system rarely endanger service providers’ commitments.
Long-term data and analysis on the frequency and types of BGP attacks are an important element in informing operators’ risk assessments and their selection of adequate tools and approaches. What level of attack has there been in the past? To what extent do security incidents happen but go unnoticed, or get dealt with inside a single network, possibly introducing collateral damage? Are the number and impact of service disruptions and malicious activity stable, increasing or decreasing? Can we understand why, and track it collectively?
This workshop was intended to address these and other questions related to routing resiliency. It was the next in a series of events organized by the Internet Society in our effort to foster improvements in the resiliency of the Internet routing system and to facilitate adoption of common solutions and best practices in this area.
Executive summary
The workshop was divided into three main sessions. The scope and main conclusions of these sessions are provided below:
- Measurement methodology and frameworks
The focus of this session was on different methodologies, their limitations and available data sets used for the analysis of suspicious events related to inter-domain routing in the Internet.
Several different methodologies of BGP and traffic data analysis were presented. They use different approaches and data sets (e.g. historical BGP data, registry data, traffic flow data, or their combinations) and might be complementary. Correlation of these data can yield deeper insight.
- Research analysis and operational data
In this session participants looked at data related to routing resilience coming both from research and from operational experience. Participants explored how these data relate to risks, vulnerabilities and threats.
Research data shows a relatively small number of routing incidents. This number goes up if a greater margin is allowed for false positives. It is difficult to say how much goes under the radar.
This corresponds well to operator experience. It is quite possible that the actual number of incidents is higher, but that they either go unnoticed or get attributed to other causes. Operators’ focus on routing security in a narrow sense is limited; more attention is paid to routing resilience in a wider sense.
- Metrics and long-term monitoring
In this session participants discussed which metrics could usefully represent routing resiliency in the Internet. How can these metrics be used to improve routing resiliency? What data is needed and what needs to be done to get these metrics collected? Is there a need for long-term monitoring and trend analysis, and what can be done in this area?
While the current state of routing resilience may look acceptable, long-term monitoring and trend analysis are necessary for better awareness and as an early warning system. Definition of useful and actionable metrics is crucial here.
Future Directions
- Measurement methodology and frameworks
The development of metrics that can usefully and consistently describe the state of routing resiliency would facilitate long-term monitoring and trend analysis. Part of the problem is that there is also no common vocabulary for describing various incidents. The development of such a vocabulary is crucial for the collection of consistent operational statistics.
- More research into “false negatives” – how much is going unnoticed
From an operator’s perspective, and for the utility of tools based on analysis of the routing system, the reduction of false positives is the main objective. However, from the point of view of an overall understanding of the resiliency of the routing system, it is important to estimate how many events, and what types of events, were filtered out or were not noticed at all.
- Raising awareness of the real safety level
If an operator is not monitoring violations of their routing policy (route hijacks, leaks, etc.), they are not aware of the real threats coming from the routing system. Consequently, there is little motivation to deploy additional controls. Also, the effectiveness of the deployed measures is hard to estimate.
- Focus on the low hanging fruit – existing best practices tailored to what needs to be most protected
Deployment of some of the best practices, including prefix filtering and limiting the number of prefixes received from a neighbor, as well as improvements in detection and mitigation techniques, can be the most effective first step in improving routing resilience. The application of these practices is well understood and, if coupled with the operator’s understanding of the critical elements of the infrastructure, can result in substantial protection.
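As an illustration of two of the practices named above, prefix filtering and a per-neighbor prefix limit, the following is a minimal Python sketch rather than a router configuration. The class `NeighborPolicy`, its fields, and the /24 maximum length for IPv4 are assumptions made for this example and do not come from the workshop. Real BGP implementations commonly tear down or warn on the session when the limit is exceeded; here the extra announcement is simply rejected to keep the sketch short.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class NeighborPolicy:
    """Hypothetical import policy for one BGP neighbor (IPv4 only)."""
    allowed: list            # prefixes the neighbor is expected to announce
    max_prefixes: int        # upper bound on prefixes accepted from the neighbor
    max_length: int = 24     # assumed longest acceptable IPv4 prefix length
    accepted: set = field(default_factory=set)

    def __post_init__(self):
        # Parse the filter entries once so later checks are cheap.
        self.allowed = [ipaddress.IPv4Network(p) for p in self.allowed]

    def receive(self, prefix: str) -> str:
        """Return 'accept' or a 'reject: ...' reason for one announcement."""
        net = ipaddress.IPv4Network(prefix)
        if net in self.accepted:
            return "accept"  # re-announcement of an already accepted prefix
        if net.prefixlen > self.max_length:
            return "reject: too specific"
        # Prefix filtering: the announcement must fall inside an allowed block.
        if not any(net.subnet_of(block) for block in self.allowed):
            return "reject: not in filter"
        # Prefix limit: cap how many distinct prefixes the neighbor may send.
        if len(self.accepted) >= self.max_prefixes:
            return "reject: max-prefix limit reached"
        self.accepted.add(net)
        return "accept"


if __name__ == "__main__":
    policy = NeighborPolicy(allowed=["192.0.2.0/24", "198.51.100.0/24"],
                            max_prefixes=1)
    for announced in ["192.0.2.0/24", "203.0.113.0/24", "198.51.100.0/24"]:
        print(announced, "->", policy.receive(announced))
```

The filter list is where the operator’s knowledge of what most needs protecting comes in: the tighter the `allowed` list and the prefix limit are matched to what a neighbor is actually expected to announce, the less room there is for a leak or hijack to propagate.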