CSIRT Management
M. E. Kabay, PhD, CISSP-ISSMP
Program Director, MSIA
School of
Table of Contents
2.3 Establishing Policies and Procedures
3 Responding to Computer Emergencies
3.3.1 Will this Have to be Done Again?
3.3.4 Advantages for Technical Support and CSIRTs
3.4.2 Distinguish Observation From Assumption
3.4.3 Distinguish Observation From Hearsay
3.4.4 Distinguish Observation From Hypothesis
3.4.5 Challenge Your Hypothesis
4 Securing the CSIRT: Walk the Talk
5.2 Setting the Rules for Triage
5.3 Triage, Process and Social Engineering
6.2 Continuous Process Improvement: Sharing Knowledge Within the Organization
6.3 Sharing Knowledge with the Security Community
In this overview, [1] I will summarize the key points in creating and managing a computer security incident response team (CSIRT), also sometimes known as a computer incident response team (CIRT) or a computer emergency response team (CERT).
No matter how good your security, at some point some security measure will fail. Knowing that helps you plan for security in depth, so that a single point of failure does not necessarily result in catastrophe. Furthermore, instead of trying to invent a response when every second counts, it makes sense to have a CSIRT in place, trained, and ready to act. As everyone should know, the value of time is not constant. Spending an hour or a day planning so that one’s emergency response is shortened by a few seconds may save a life or prevent a business disaster.
The CSIRT should include members from every sector of the organization; key members include operations, facilities, legal staff, public relations, information technology, and at least one respected and experienced manager with a direct line to top management. The CSIRT should establish good relations with law-enforcement officials and should be prepared to gather forensic evidence. The organization should have a policy in place on how to decide whether to prosecute malefactors if they can be identified. The CSIRT should be prepared to respond not only to external attacks but also to criminal activities by insiders. Proper logging at the operating system level and from intrusion-detection systems can be useful to the CSIRT. The CSIRT plays an important role in disaster prevention, mitigation, and recovery planning. [2]
Organizing people to respond to computer security incidents is worth the effort not only when you actually have an incident but also because the analysis and interactions leading to establishment of the CSIRT bring benefits even without an emergency. A CSIRT can provide opportunities for improving institutional knowledge, contributing to continuous process improvement, and offering challenging and satisfying work assignments to technical and managerial staff, thus contributing to reduced turnover. A well-trained, professional, courteous CSIRT can improve relations between the entire technical support infrastructure and the user community.
Not every organization has a CSIRT already in place; not all CSIRTs are structured and managed in the most appropriate ways for a specific organization’s needs. This section presents systematic approaches for rational design and implementation of a CSIRT.
Shortly after the infamous Morris
Worm incident of
West-Brown et al. describe the functions of the CSIRT as follows:
For a team to be considered a CSIRT, it must provide one or more of the incident handling services: incident analysis, incident response on site, incident response support, or incident response coordination.
They explain in detail all aspects of these functions and summarize their research on the range of services that CSIRTs actually provide, whether by themselves or in cooperation with other teams in the information technology sector, in a table [see page 25] which I have reformatted below:
- Alerts and warnings
- Incident handling
  - Incident analysis
  - Incident response on site
  - Incident response support
  - Incident response coordination
- Vulnerability handling
  - Vulnerability analysis
  - Vulnerability response
  - Vulnerability response coordination
- Artifact handling
  - Artifact analysis
  - Artifact response
  - Artifact response coordination
- Announcements
- Technology watch
- Security audits or assessments
- Configuration & maintenance of security tools, applications and infrastructures
- Development of security tools
- Intrusion detection services
- Security-related information dissemination
- Risk analysis
- Business continuity and disaster recovery planning
- Security consulting
- Awareness building
- Education / training
- Product evaluation or certification
The only problematic term in this list is “artifact,” which the authors define as “any file or object found on a system that might be involved in probing or attacking systems and networks or that is being used to defeat security measures. Artifacts can include but are not limited to computer viruses, Trojan horse programs, worms, exploit scripts, and toolkits.” [p. 28].
The specific combination of functions that your CSIRT provides will depend on personnel and budgetary resources and on the maturity of the team. It is wise to focus a completely new CSIRT on essential services such as incident handling and analysis as its first priority. With time and experience, the team can add functions such as coordinating with other security teams and with computer and network operations in the more proactive services and the security quality services that will lead to long-term reduction in security incidents and to lower damages and costs from such incidents.
When you start working on a CSIRT, you must manage expectations carefully to avoid disappointment, frustration and hostility from users who may want more than you can reasonably provide. Managing expectations is a general principle applicable in a wide range of projects, not just CSIRT management; for example, in planning a large-scale transaction processing system where the contract stipulated a maximum response time per transaction of three seconds, I remember that the programming team built a timer into the system so that responses would take exactly three seconds even during the initial test phases. We knew that only a few data entry clerks would be working on the system to try it out for the first few weeks, and the last thing we wanted was to get them used to sub-second response times that would climb as the databases became increasingly loaded and when several hundred users finally began using the system. At first, the client thought that this strategy was odd, but after thinking about it, they realized that it made sense.
As you establish your CSIRT, you may want to start small, as I mentioned before. Perhaps you can limit the scope of the CSIRT to a few of the smaller production systems to avoid plunging into a new area of expertise with enormous stakes riding on your success. You should decide whether to start with working-hours only, extended hours (e.g., early morning to late night) or 24-hour, seven-day operations. If software development is part of your environment and (as most people will recommend) is physically distinct from production systems, perhaps that could be a good start for the nascent CSIRT. Although many development staff may work long hours and on weekends, the effects of system emergencies may be less severe than attacks or breakdowns involving other systems such as, say, inventory, factory controls, customer service, sales and so on. When you are ready to tackle an even more significant production system, perhaps a system whose users tend to leave more-or-less at the end of the day might be a good candidate; e.g., the accounting system or support systems for any operation that does not run more than one shift per day.
In any case, be sure that you communicate your intentions for when your CSIRT services will be available to your customers (and yes, that’s a deliberate use of the word).
The other aspect of service levels is how fast you can respond to emergencies. That’s a much more complex issue and will be the subject of articles on triage and setting the rules for triage later in this series.
As the
Policies are the statements of the desired goals; procedures are the methods for attaining those goals. Policies tend to be global and relatively stable; procedures can and should be relatively specific and can be adapted quickly to meet changing conditions and to integrate knowledge from experience. Policies cannot be promulgated without the approval and support of appropriate authorities in the organization, so one of the first steps is to identify those authorities. Another step is to gain their support for the policy project.
All policies and especially CSIRT policies should be framed in clear, simple language so that everyone can understand them and should be made available in electronic form. In other works, I have pointed out that hypertext can make policies more understandable by providing pop-up comments or explanations of difficult sections or technical terms. [6]
Similarly, procedures show how to implement the policies in real terms. For example, a policy might stipulate, “All relevant information about the time and details of a computer incident shall be recorded with regard for the requirements of later analysis and for possible use in a legal proceeding.” That policy might spawn a dozen procedures describing exactly how the information is to be recorded, named, stored, and maintained through a proper chain of custody. For example, one procedure might start, “Using the Incident-Report form in the CSIRT Database accessible to all CSIRT members, fill in every required field. Use the pull-down menus wherever possible in answering the questions.” Again, as the DISA CD-ROM points out, these procedures should minimize ambiguity and help members of the team to provide a consistent level of service to the organization. A glossary of local acronyms and technical terms can be helpful as part of these procedures.
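A procedure of this kind can be sketched in code. The following is a minimal illustration only: the list of required fields, the function names and the custody log format are all hypothetical and do not come from the DISA course or from any particular product. It checks that a report has every required field filled in, fingerprints a piece of evidence when it is collected, and records each change of custody without rewriting earlier entries.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical required fields for an Incident-Report form; a real CSIRT
# would define its own list in its procedures.
REQUIRED_FIELDS = ("reporter", "contact", "observed_at", "description")


def missing_fields(report: dict) -> list:
    """Return the required fields that are absent or blank in a report."""
    return [name for name in REQUIRED_FIELDS
            if not str(report.get(name, "")).strip()]


@dataclass
class EvidenceItem:
    """One piece of evidence plus the record of everyone who has held it."""
    name: str
    sha256: str                       # fingerprint taken at collection time
    custody: list = field(default_factory=list)

    def transfer(self, holder: str, when: Optional[datetime] = None) -> None:
        """Append a custody entry; earlier entries are never edited."""
        when = when or datetime.now(timezone.utc)
        self.custody.append((when.isoformat(), holder))

    def is_unchanged(self, data: bytes) -> bool:
        """Check that the evidence still matches its original fingerprint."""
        return hashlib.sha256(data).hexdigest() == self.sha256


def collect(name: str, data: bytes, collector: str) -> EvidenceItem:
    """Fingerprint evidence when it is collected and open its custody log."""
    item = EvidenceItem(name=name, sha256=hashlib.sha256(data).hexdigest())
    item.transfer(collector)
    return item
```

A form handler could refuse to save a report while `missing_fields` returns anything, which is one way of enforcing "fill in every required field" rather than merely requesting it.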
Whenever policies and procedures are changed in a way that may affect users, it’s important to let people know about the changes so that their expectations can be adjusted. The DISA course recommends using several channels of communications to ensure that everyone gets the message; e.g., send e-mail, use phone and phone messages, send broadcast voicemail, announce the changes at staff meetings, and use posters and Web sites.
The computer security incident response team may be a permanent, full-time assignment for a fixed group of experts or it may be a part-time role assigned dynamically as conditions require. In either case, or for any of the intermediate arrangements, certain fundamentals will dictate your choice of staff members for the CSIRT. Cowens and Miora write,
“Maturity and the ability to work long hours under stress and intense pressure are crucial characteristics. Integrity in the response team members must be absolute, since these people will have access and authority exceeding that given them in normal operations.
Exceptional communications skills are required because, in an emergency, quick and accurate communications are needed. Inaccurate communications can cause the emergency to appear more serious than it is and therefore escalate a minor event into a crisis.” [7]
Information requests can be handled by team members in the 1 to 5 range. For example, a support staff person can send out publications, while someone with greater expertise would be required to address the question about identifying spoofed e-mail.
To handle incidents . . . team members in the 5 to 8 technical range are necessary. This response can involve technical analysis and communicating with compromise sites, law enforcement technical staff, and other CIRTs. In handling incidents that represent new attack types, you may need to call the wizards to help understand or analyze the activity.
Vulnerability handling requires your most proficient personnel, falling into the 8 to 10 range. These individuals must be able to work with software vendors, CIRTs, and other experts to identify and resolve vulnerabilities. Many CIRTs don’t have access to this level of technical expertise.” [5]
I want to add to these excellent comments that in my experience, CSIRT staff with the psychological flexibility to allow them to adapt quickly to changing requirements will do better than people who resist change or resent ambiguity. Ideally, the team will include problem-solvers with an intuitive grasp of the differences between observation and assumption, hypothesis and deduction. As always, team-players committed to getting the problem solved will contribute more than people interested in acquiring personal credit for achievements. I also think that having at least one person on the team with a penchant for meticulous note-taking is a real benefit; more about recordkeeping in another segment in this series.
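The expertise ranges quoted above from the DISA course can be sketched as a simple routing rule. This is an illustration under assumptions: the task-type names, the staff names and the choice to offer the least expert qualified person first (so that scarce experts are not tied up on routine requests) are hypothetical and are not taken from the course.

```python
# Expertise ranges on the approximate 1-to-10 scale quoted above.
SKILL_RANGES = {
    "information_request": (1, 5),
    "incident": (5, 8),
    "vulnerability": (8, 10),
}


def qualified_staff(task_type, staff_levels):
    """Return names of staff who meet the minimum level, least expert first.

    staff_levels maps a (hypothetical) staff name to a level from 1 to 10.
    """
    minimum, _maximum = SKILL_RANGES[task_type]
    eligible = [(level, name) for name, level in staff_levels.items()
                if level >= minimum]
    return [name for _level, name in sorted(eligible)]
```

With a team such as `{"Ana": 3, "Ben": 6, "Chi": 9}`, an incident would be offered to Ben before Chi, and only Chi would qualify for vulnerability handling.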
I turn next to some of the immediate issues in responding to computer emergencies:
So let’s start with triage. The word itself comes from a French root meaning to sort. In medicine, triage is “prioritization of patients for medical treatment: the process of prioritizing sick or injured people for treatment according to the seriousness of the condition or injury.” [8] Similarly, anyone receiving calls about computer security incidents must be able to classify the call right away so that the right resources can be called into play. As the DISA course on computer security incident response team management suggests, “The triage process recognizes and separates
I have altered the order of the original list to reflect a decreasing rank of importance for these factors in communicating and acting upon calls.
Triage is common to ordinary help desks as well as to emergency hotlines. In general, there are two models for staffing the phones for such front-line functions: the “dispatch” model and the “resolve” model. [9] The dispatcher has just enough technical knowledge to collect appropriate information about an incident and assign it to a team member for investigation; the alternative is to assign someone with more expertise to answer the phone so that response can be even faster. However, the resolve model risks wasting resources because the more experienced staff member may end up doing largely clerical work instead of focusing on applying his or her expertise to problem analysis and resolution.
To support triage, staff members need explicit training on data collection and priorities. They need to record who is calling, how to reach that person, what the caller thinks is happening, what the caller has observed, how serious the consequences are, how many people or systems are affected, whether the incident is in progress or is over as far as they know, and how the caller and others are responding. The CSIRT procedures should include guidance on assigning priorities to incidents; factors can include security classifications (e.g., SECRET or COMPANY CONFIDENTIAL data under attack), type of problem (e.g., breach of confidentiality, data corruption, loss of control, loss of authenticity, degradation of availability or utility), possible direct costs (e.g., personnel downtime, costs of recovery, loss of business), possible indirect costs (e.g., damage to business reputation, legal liability) and so on as appropriate for each organization.
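The intake questions and priority factors above can be sketched as a record with a scoring method. This is a minimal illustration, not a recommended scale: the field names, the categories and every weight are hypothetical, and each organization would substitute the values set out in its own procedures.

```python
from dataclasses import dataclass

# Hypothetical weights; each organization would set its own.
CLASSIFICATION_WEIGHT = {"public": 0, "company_confidential": 2, "secret": 4}
PROBLEM_WEIGHT = {
    "degraded_availability": 1,
    "data_corruption": 2,
    "breach_of_confidentiality": 3,
    "loss_of_control": 3,
}


@dataclass
class TriageCall:
    caller: str               # who is calling
    callback: str             # how to reach that person
    caller_belief: str        # what the caller thinks is happening
    caller_observation: str   # what the caller has actually observed
    classification: str       # key into CLASSIFICATION_WEIGHT
    problem_type: str         # key into PROBLEM_WEIGHT
    systems_affected: int
    in_progress: bool

    def priority(self) -> int:
        """Combine the factors into one number; higher means more urgent."""
        score = CLASSIFICATION_WEIGHT[self.classification]
        score += PROBLEM_WEIGHT[self.problem_type]
        score += min(self.systems_affected, 5)   # cap keeps the scale bounded
        if self.in_progress:
            score += 2
        return score
```

Keeping what the caller believes separate from what the caller observed means the record preserves that distinction from the first minute of the call.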
Readers may find the work of John Howard relevant for such analysis; Dr Howard has established a useful taxonomy for discussing computer security incidents that can serve as a framework for establishing priorities. [10], [11]
I recommend an automated system for capturing information on all calls to the CSIRT. Searching the Web for the keywords “helpdesk software” and “help desk software” brings up dozens of options for such programs. If you have modest skill in database design, you can also create your own using a program such as MS-Access. With appropriate locking strategies and automated reports, your CSIRT can know and control the priorities of all the open incidents under investigation at any time.
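For readers who do build their own, the following is a minimal sketch of such a tracker using Python's standard `sqlite3` module rather than MS-Access. The table layout, column names and function names are assumptions made for illustration. Each write runs inside a transaction, and SQLite itself serializes writers, which stands in here for the locking strategy mentioned above.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the incident database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incidents (
               id          INTEGER PRIMARY KEY,
               opened_at   TEXT NOT NULL,
               summary     TEXT NOT NULL,
               priority    INTEGER NOT NULL,
               assigned_to TEXT,
               status      TEXT NOT NULL DEFAULT 'open'
           )"""
    )
    return conn


def log_incident(conn, opened_at, summary, priority, assigned_to=None):
    """Record a new incident and return its identifier."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO incidents (opened_at, summary, priority, assigned_to)"
            " VALUES (?, ?, ?, ?)",
            (opened_at, summary, priority, assigned_to),
        )
    return cur.lastrowid


def close_incident(conn, incident_id):
    """Mark an incident closed; the row is kept for later analysis."""
    with conn:
        conn.execute("UPDATE incidents SET status = 'closed' WHERE id = ?",
                     (incident_id,))


def open_incidents(conn):
    """Report open incidents, highest priority first, oldest first within ties."""
    return conn.execute(
        "SELECT id, summary, priority, assigned_to FROM incidents"
        " WHERE status = 'open' ORDER BY priority DESC, opened_at ASC"
    ).fetchall()
```

Closed incidents are marked rather than deleted so that they remain available for the research and analysis that incident records support.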
For more about triage, see §5.2, Setting the Rules for Triage.
We can start as the DISA CD-ROM course does by classifying technical expertise in approximate ranges:
As the DISA writers point out, “Vulnerability handling requires your most proficient personnel. . . . These individuals must be able to work with software vendors, CIRTs, and other experts to identify and resolve vulnerabilities. Many CIRTs don’t have access to this level of technical expertise.”
If you would like to download PowerPoint presentations that cover many aspects of technical support management, you are welcome to visit my Web site. [12]
This section focuses on some of the advantages, requirements and tools for incident tracking. First I establish why I think documentation in general is so important.
When
I joined Hewlett‑Packard (
Pretty soon, people began asking me what I thought I was doing: writing a novel?
My colleagues may have been puzzled by what they perceived as a mania for record keeping, but I was equally astonished that record keeping was not a normal part of their way of doing work. The reason I automatically kept records was my years in scientific research, where logbooks with hard covers, numbered pages and even waterproof paper were just usual parts of doing serious work. The idea of doing anything of importance without keeping a concurrent record simply didn’t occur to anyone. One could not reproduce an experiment without knowing exactly what sequence one had used in accomplishing the steps. Even adding salts to solutions had to be done in a particular order.
So I just kept on keeping my little green logbooks.
By the time I left Hewlett‑Packard in 1984, I had trained a few of the younger support personnel to keep careful records, especially while solving problems. They had learned the advantages of documentation.
Documentation, far from being a sterile exercise done to conform to arbitrary requirements of nameless, faceless superiors, should be a vital part of any intellectual exercise. Documentation is simply writing down what we learn: the crucial step in human history that changed traditional cultures into civilizations. By keeping a record independent of any specific individual, we liberate our colleagues and our successors from dependence on our physical availability. Documentation is our assurance that work will continue without us; a kind of immortality, if you will.
We document what we do as a part of systematic problem solving. Writing forces us to identify the problem in words, instead of being content to define it in vague, unclear ideas. Writing down each idea we are in the process of testing helps us notice the ideas we missed the first time we tackled the problem. Keeping notes helps us pay attention to what we’re doing.
Documenting what we do also helps us during training–both our own and that of the people we are helping to learn technical skills. Trainees can review their own notes on how to do something instead of relying entirely on someone else’s description. If taking notes is viewed as a chance to engage one’s mind more thoroughly in what we’re learning, it can be fun. When I studied math, I had the habit of using a set of symbols entirely different from those the teacher used; it was harder than mere copying, but I sure learned what was going on.
Finally, accurate records can be a boon in legal wrangles. In one case I experienced, upper management seriously considered legal procedures against a supplier for supposed breach of contract. Careful records of exactly when meetings were held and with whom permitted us to analyze the problem and resolve the issues by collaboration instead of by confrontation. Such records, if kept consistently, in good times and bad, can be accepted in a court of law as evidence–but only if everything points to a steady pattern of record‑keeping as events unfold. Records made long after a problem occurs are worthless.
The best way of keeping records on specific problems is an easy‑to‑use database. Here is the layout of a simple file in I used for several years as director of technical support in a large corporate data center and ever since then to keep track of all my projects:
By the time I finished my task of setting up a self‑sufficient technical support team in the data center, we had two thousand entries archived. We had a similar file reserved for system failures and another for summaries of articles from INTEREX proceedings volumes and various other publications. Every software product we requested information about was logged in the files as well, with a pointer to the folder in which specification sheets and correspondence were stored for the particular entry. These records were and are still in constant use to find experiences which may help in solving new problems as they arise.
With easily accessible records, it became possible to solve problems without me. Stored, sharable knowledge meant that it was no longer necessary for staff to depend on my physical presence. I was able to take month‑long holidays without being called for help. Part of that involved intensive training of the staff, but a good deal was the direct result of proper documentation.
Liberate yourselves: share your knowledge.
Keeping track of all technical support calls is essential for effective incident handling. Having details available to all members of the CSIRT in real time and for research and analysis later serves many functions:
Some of the more obvious requirements of any incident-handling system are listed below. Most are self-explanatory but I’ve added comments to a few of them:
In an online discussion by someone called “DonaldA-M” I noted two additional points I hadn’t thought of:
There’s a wide range of software available for tracking incidents. You can build your own, but then you’ll have to provide proper documentation and training materials because turnover is a constant problem for CSIRTs. In addition, unless your analysts have experience with the CSIRT function, they are likely to miss useful features that have accumulated over the years in products used by thousands of people.
I have provided a short list of proprietary
(commercial) help desk products in the
There are also well-respected open-source tools listed below.
All such tools can be complex; since you don’t want people fumbling about in an emergency, be sure that you budget for adequate training for your staff as you implement the tool you select.
“DonaldA-M” (2003). Good, but there’s more… <http://tinyurl.com/4bcve>
In this section, I’m focusing on critical distinctions that your CSIRT members should keep in mind in addition to the administrative details I summarized in the last section. From my experience running technical support and operations over the years, I believe that the same principles that underlie effective technical support equally inform effective CSIRT management.
When gathering information about an incident, staff members should establish a clear picture of what people were doing when they realized that there was a problem. For example, it may be important to know that someone was accessing a rarely-used account and noticed that a file was not available because someone else had it open. Those details will help to characterize the attack and to provide clues that may lead to additional valuable data. However, my approach would include asking why my contact was accessing the rarely-used account; it takes only a minute, but getting a wider picture may give the analyst another perspective that can also lead to new clues. In the scenario I have sketched, one could imagine that a system administrator had become curious about some unexpected resource utilization in a supposedly dormant account. This simple fact might lead to additional exploration of system log files and questions about whether any other dormant accounts had sparked curiosity. So, in general, it is worth your while to explore the situation more broadly at first rather than driving down the very first avenue that presents itself in the initial questions.
As the CSIRT member listens to the observations of other staff members, it is critically important to distinguish facts (that is, personal observations) from assumptions. Assumptions are ideas taken for granted or statements that are accepted without proof. For example, imagine the serious consequences of hearing someone say, “And so then they exploited a flaw in the firewall and then they. . .” and simply writing that statement down as if it were a fact. Such an assumption could profoundly distort the investigation, putting people’s efforts onto the wrong track and diverting their attention from a more fruitful line of inquiry. Hearing such a statement, I would write down, “And so perhaps they exploited a flaw in the firewall….”
Everyone has played the child’s game of whispering a sentence to another person and then hearing the distorted version that comes out the other end of a long chain of transmission without error correction. CSIRT staff must always distinguish between first-person observations (“I read the log file and found…”) and hearsay (“Shalama read the log file and she found…”). Don’t trust hearsay: check it out yourself by tracking down the source of the information.
Sometimes when people are careless or untrained, they don’t distinguish between what they saw and an idea that might explain what they saw. In the previous example about a supposed flaw in a firewall, the person speaking seemed to take the flaw for granted; that was an assumption. A similar problem can occur when someone thinks that maybe there’s a flaw in the firewall and then proceeds as if that were true without testing the hypothesis. “And so maybe they exploited a flaw in the firewall, so we should patch all the holes right away.” Putting aside for the moment the advisability of patching holes in firewalls, merely hypothesizing an exploit doesn’t make it true. Maybe it’s a good thing to patch the firewall, but it doesn’t follow that it’s the top priority right now simply from having thought of the idea. CSIRT staff should be careful to think about what they’re hearing and note explicitly when people are proposing explanations rather than reporting facts.
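One way to make these distinctions habitual is to have the incident log force a label on every statement. The sketch below is hypothetical (the class and function names are invented for illustration), but it shows the idea: anything other than a first-person observation is flagged as needing verification before anyone acts on it as fact.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OBSERVATION = "observation"   # seen first-hand by the person speaking
    HEARSAY = "hearsay"           # reported by someone else; track down the source
    ASSUMPTION = "assumption"     # taken for granted without proof
    HYPOTHESIS = "hypothesis"     # proposed explanation, not yet tested


@dataclass
class Note:
    source: str
    text: str
    kind: Kind


def needs_verification(notes):
    """Return the notes that must not be treated as fact until checked."""
    return [note for note in notes if note.kind is not Kind.OBSERVATION]
```

In this scheme, “I read the log file and found…” would be logged as an observation, “Shalama read the log file and she found…” as hearsay, and “maybe they exploited a flaw in the firewall” as a hypothesis.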
I hope you will forgive me, Dear Reader, for a brief foray into the philosophy of science. I do have a reason for bringing it up.
In the 37 years (as of 2007) I have been teaching college courses, I’ve taught biology, genetics, biochemistry, embryology, physiology, applied statistics, programming, software engineering and information assurance. All of these subjects have involved a concept that some students have struggled to grasp: science depends on disproof, not proof. Empirical science (in contrast to logical systems such as mathematics) does not offer “proofs” in an absolute sense. Instead, a scientist formulates an hypothesis, defines a set of conditions and observations with predicted results and sees if there are grounds for rejecting randomness as a simple explanation of the deviation of the observations from the predictions. In many cases, scientists will assume the absence of a relationship or phenomenon (thus “null” hypothesis). Many experiments assume the absence of the interesting stuff and try to see if there are grounds for rejecting this simple explanation: “There’s nothing there.”
Science works by DISPROVING hypotheses. Explanations that cannot, by definition, be disproved are not part of a scientific effort.
Even more confusing for people who habitually think in terms of absolutes, even accepting the null hypothesis doesn’t necessarily mean that there’s nothing there. We may be measuring or counting too few occurrences to spot the cases that will challenge the non-existence of the phenomenon. There may also be confounding factors that obscure a real phenomenon.
Rejecting the null hypothesis does not, however, prove that any specific alternate hypothesis is necessarily correct. The evidence just restricts the range of reasonable hypotheses. We knock out explanation after explanation until what’s left is a smaller set of explanations. In science, the best we hope for is not truth in an absolute sense but an operational equivalent to truth: useful enough to use for now.
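As a worked illustration of testing a null hypothesis, suppose (hypothetically) that failed logins normally occur at a known baseline rate and that an analyst wants to know whether today's count could be chance alone. The sketch below computes an exact one-sided binomial tail probability. The scenario, the function names and the 0.05 threshold are assumptions chosen for the example, not part of the original discussion.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of k or more events in n trials if the null rate p holds."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def reject_null(k, n, p, alpha=0.05):
    """True if the count is too unlikely under the null to attribute to chance."""
    return binomial_tail(k, n, p) < alpha
```

A `False` result does not show that nothing is there, since the sample may simply be too small. A `True` result does not say which alternative explanation (an attack, a misconfigured script, a forgotten password) is the right one.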
OK, so now I want to bring this back to network management and the CSIRT. When your CSIRT members develop hypotheses, they have to try to shoot them down. Trying to show that an idea is correct is – ironically – the wrong approach to testing hypotheses. Just as in quality assurance, we have to come up with ways of showing that our explanation is wrong. If we fail enough times to disprove an hypothesis using genuine, thoughtful, intelligent tests of our ideas, maybe we’ve got something useful after all.
CSIRT members inevitably work with some users who are stressed by the problems they are facing. It is no help to have a technical wizard who so offends the users that they stop cooperating with the problem-resolution team. Sometimes, CSIRT staff forget that their job includes not only resolving a technical issue but also keeping the clients as happy as possible under the circumstances – and the use of the word “clients” is deliberate here.
Here are some of the most irritating responses to users I have run across in my 25 years of technical support followed by my comments in square brackets:
In this section, I want to expand on the importance of securing the CSIRT and more broadly, of using our own advice.
The course narrator in the DISA CD-ROM very properly notes, “Once the CIRT becomes known, it will be an attractive target for intruder attacks. A security breach at your CIRT site can be devastating to your reputation and have repercussions for the commands you support; in terms of security procedures, practice what you preach. You will need to provide solid physical, host, and network security in addition to appropriate staff training.” [5]
He continues,
“A compromise of any data related to incidents can have legal repercussions as well as financial and credibility consequences. What types of data need to be secured?
- Incident reports,
- electronic mail,
- vulnerability reports, and even
- briefing notes and slides.”
More generally, all security personnel should be scrupulous in respecting security regulations and best practices. Just before writing the original article on which this section is based, I was chatting with some security officers at a large corporation who were doing a due-diligence interview with me before approving enrollment for one of their employees in our graduate program. The questions centered around the confidentiality of company-specific information in the case study reports that the student would submit for grading during the 18-month program. I explained that no student is expected to reveal his or her employer’s name or even location; that students use an internal e-mail address defined by our teaching platform and used on our access-controlled extranet; and finally that all of our instructors are themselves security professionals. I said that it is a matter of course for security professionals to be under nondisclosure whether a contract is signed or not – at least, to maintain a professional reputation. We all agreed that working in security eventually affects our behavior in a reflex way; we laughed that it’s almost impossible not to look away when someone enters a password on a keyboard.
Another example of practicing what we preach is backups. For a security professional to lose data because of a lack of backups would be intensely embarrassing. I constantly urge my students to do backups of their school work so that they never have to repeat what they have already done in case of a disk failure or a human error. Personally, I can demonstrate that I do a differential backup every day, clone my main computer’s disk to my laptop at least once a week (actually daily when I’m teaching undergraduate courses) and create a full backup to DVDs once a month. I’ve only had a few occasions over these last decades when I needed those backups, but the minor effort involved was more than repaid by the ease of recovery and by the ability to look someone straight in the eye when telling them how to protect their data.
We have to walk what we talk.
All the work that goes into creating a CSIRT can be wasted if managers fail to lead. Sloppy management can result in degraded performance, alienation of the client base, staff frustration, sabotage and employee turnover.
The DISA course wisely emphasizes the importance of professional behavior by all members of the CSIRT. The authors write, “The survival of your CIRT may well depend upon using a Code of Conduct, which will earn the trust and respect of the commands you support. The conduct of any single team member reflects upon the entire CIRT organization. If the commands don’t trust your CIRT, they won’t report to you. It is important, therefore, not only to have a Code of Conduct, but to shake it out and dust it off every once in a while. Remind team members what it is and why it is important...and use it.” [5]
Here are some of the practical recommendations from that course (although I have put them in my own words for the most part):
I was a member and then team leader of the Phone-In Consulting Service (PICS) at Hewlett-Packard (Canada) Ltd in Montréal in the early 1980s and later was director of technical services at a big service bureau in that city in the mid-1980s. Those experiences support the correctness of DISA’s advice.
Notice how consistently DISA (and I) refer to clients; this usage emphasizes that technical support teams and CSIRTs alike should perceive users as people to whom we owe service. There is no benefit to allowing an adversarial relationship between the technical support team or a CSIRT and the client base. Don’t allow a gulf to develop between the CSIRT and the client community; clamp down on disparaging terms and derogatory comments about users. Ensure that team members understand why such language is harmful.
Identify CSIRT members with a chip on their shoulders: don’t let them adopt defensive, arrogant or aggressive attitudes toward the users. If a computer-security incident can be traced to procedural errors (i.e., the procedures themselves rather than user error are causing problems), the person reporting the problem should be thanked for the information, not criticized for having experienced or identified the problem. And never let anyone say with a sneer, “Well, you’re the first person to report that.”
No one in a CSIRT has ever regretted being professional. Go out there and be
As I mentioned in §3.1, “Triage” in French (my native language) means “sorting.” In emergency medicine, the term was applied to the process of prioritizing treatment for patients arriving at trauma hospitals near combat zones in World War I. The same concept has been applied to help desks. For example, the “Help desk triage policy” from Courtesy Computers illustrates how a help-desk team can categorize problems to ensure that important issues receive faster service than less important problems. [14] Importance is defined in terms of the number of users affected, the effects on mission-critical functions, and the costs of downtime or of less-than-optimal functions. The five priority levels suggested in the document mentioned above are typical of the kind of triage categories established in many help-desk departments (adapted from a table in the Courtesy Computers document):
Priority 1
· Issue of the highest importance–mission-critical systems with a direct impact on the organization (Examples: widespread network outage, payroll system, sales system, telecom system, etc.)
· Contact: Immediate–5 minutes
· Resolution: 30 minutes
Priority 2
· Single user or group outage that is preventing the affected user(s) from working (Examples: failed hard drive, broken monitor, continuous OS lockups, etc.)
· Contact: 15 minutes
· Resolution: 1 hour
Priority 3
· Single user or group outage that can be permanently or temporarily solved with a workaround (Examples: malfunctioning printer, PDA synchronization problem, PC sound problem, etc.)
· Contact: 30 minutes
· Resolution: Same Day
Priority 4
· Scheduled work (Examples: new workstation installation, new equipment/software order, new hardware/software installation)
· Contact: 1 hour
· Resolution: 1-4 days
Priority 5
· Nonessential scheduled work (Examples: office moves, telephone moves, equipment loaners, scheduled events)
· Contact: Same Day
· Resolution: 5 days.
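The five levels above can be captured as a small lookup table so that every staff member applies the same targets. The following is only a minimal sketch: the contact and resolution values are transcribed from the list above, but the class name, the field names and the four yes/no questions used by `classify` are hypothetical illustrations, not part of the Courtesy Computers policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriorityLevel:
    level: int
    summary: str
    contact_minutes: Optional[int]  # None stands for "same day"
    resolution: str


# Targets transcribed from the five-level list above.
# Level 1 contact is "immediate to 5 minutes"; 5 is used as the upper bound.
TRIAGE_LEVELS = {
    1: PriorityLevel(1, "mission-critical system affected", 5, "30 minutes"),
    2: PriorityLevel(2, "outage preventing user(s) from working", 15, "1 hour"),
    3: PriorityLevel(3, "outage with a workaround available", 30, "same day"),
    4: PriorityLevel(4, "scheduled work", 60, "1-4 days"),
    5: PriorityLevel(5, "nonessential scheduled work", None, "5 days"),
}


def classify(mission_critical: bool, outage: bool,
             workaround: bool, essential: bool) -> PriorityLevel:
    """Hypothetical decision rule: map four yes/no answers to a level."""
    if mission_critical:
        return TRIAGE_LEVELS[1]
    if outage:
        # An outage with a workaround ranks below one that stops work.
        return TRIAGE_LEVELS[3] if workaround else TRIAGE_LEVELS[2]
    # Anything that is not an outage is treated as scheduled work.
    return TRIAGE_LEVELS[4] if essential else TRIAGE_LEVELS[5]
```

A malfunctioning printer with a spare nearby would be classified with `classify(False, True, True, True)`, which yields level 3. A real CSIRT would replace the four questions with its own agreed criteria.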
In his helpful overview, “CIRT – Framework and Models,” Ajoy Kumar summarizes the functions of triage as follows: “Triage: The actions taken to categorize, prioritize, and assign incidents and events. [15] It includes following sub-processes:
The DISA training materials suggest three broader categories of interactions with help desks and CSIRTs: “incidents, vulnerabilities, and information requests.” [5] Incidents involve breaches of security; vulnerabilities are reports of security weaknesses (and may be reported as part of an incident); information requests are often managed using lists of frequently asked questions (FAQs).
The DISA instructors go on to define factors which can help CSIRTs prioritize incidents as follows:
On this last point, the DISA writers point out that the organizational rank of someone calling in an incident may bear on its priority – but that it may be wise to cross-check the report with a security expert who can speak to whether the report is sound.
In summary, it is important to establish a sound basis for staff members of the CSIRT to carry out triage effectively. Once the rules for evaluating incidents have been clarified, staff members should practice analyzing a number of cases to train themselves in applying the rules consistently. Role-playing exercises based on historical records or on made-up examples can provide an excellent and enjoyable mechanism for staff members to establish a common standard for this difficult and sensitive task.
Sometimes staff (or even managers) question the value of strict adherence to policy. Policy is sometimes seen as the expression of unnecessary rigidity – an inability to respond quickly to changing or unexpected circumstances. However, in CSIRT management, knowing and adhering to well-thought-out policies and following a reliable process are particularly valuable not only for information gathering, data recording and analysis but also for maintaining strict security.
One of the well-known tricks used by criminal hackers and spies is to simulate urgency that supports demands for violations of normal security restrictions. For example, criminals will call a relatively low-status employee such as a secretary and pressure him into violating standard protocols to obtain the password of his boss by claiming extreme circumstances of great urgency. The criminal may escalate the pressure to outright bullying by threatening the employee with punishment.
A criminal determined to penetrate security barriers can manufacture an incident that leads to involvement of the CSIRT. Allowing such a person to apply pressure for violations of protocol is an invitation to compromise. Worse, such deviations from well-tried and well-justified procedures can add to the embarrassment caused by the compromise: it’s bad enough to have someone breaking through our security without having to admit that we helped her.
Ironically, the better a CSIRT (or help-desk team, but I’ll continue by focusing on CSIRTs) becomes at handling problems, the more readily members of its community will turn to it to report problems or ask for help. Thus the better the CSIRT does its job, the heavier its workload can become, at least for a while. According to the DISA course, “As a new CSIRT grows and the workload increases, and especially on those teams that provide 24-hour emergency response, burnout becomes quite common. By studying the issue, one national CSIRT determined that a full-time team member could comfortably handle one new incident per day, with 20 incidents still open and actively being investigated.”
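The workload figures quoted from the DISA course can be turned into a rough staffing estimate. The sketch below is an illustration only: it assumes that the two figures (one new incident per day and 20 open incidents per full-time member) act as independent ceilings, which the course does not state, and the function name is hypothetical.

```python
import math

# Figures quoted from the DISA course for one full-time team member.
NEW_INCIDENTS_PER_MEMBER_PER_DAY = 1
OPEN_INCIDENTS_PER_MEMBER = 20


def members_needed(new_per_day: float, open_incidents: int) -> int:
    """Estimate the full-time staff needed to stay within both ceilings."""
    by_intake = math.ceil(new_per_day / NEW_INCIDENTS_PER_MEMBER_PER_DAY)
    by_backlog = math.ceil(open_incidents / OPEN_INCIDENTS_PER_MEMBER)
    # Assumption: the stricter of the two limits governs; never fewer than one.
    return max(by_intake, by_backlog, 1)
```

For example, a team receiving 2.5 new incidents a day with 90 cases open would need five members under these assumptions, because the backlog rather than the intake is the binding limit. An estimate that keeps rising is an early warning of the burnout the course describes.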
Staff members who face increasing workloads may become stressed. Working long periods of overtime, missing time with family and friends, perhaps even missing regular exercise and food – these factors may lead to increased errors and turnover if people are forced to accept increasingly demanding conditions for long periods.
One of the most valuable organizational approaches to preventing burnout is to rotate staff through the CSIRT function from your IT group on a predictable schedule. For example, you can assign people to the CSIRT for three- or six-month rotations.
Such rotations require especially good training programs and particularly good documentation to maintain efficiency as new people come on duty; in addition, the assignments must be staggered so that the CSIRT doesn’t have to cope with large numbers of newcomers all at once. Ideally, there wouldn’t be more than one switch of personnel a week.
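A staggered rotation of this kind is easy to lay out in advance. The following is a minimal sketch, assuming one newcomer per week and a 13-week (roughly three-month) tour of duty; the function name, its parameters and the staff names in the usage note are hypothetical.

```python
from datetime import date, timedelta


def staggered_rotation(members, first_start, weeks_between=1, rotation_weeks=13):
    """Return (name, start, end) tuples with start dates spaced apart.

    weeks_between=1 means at most one newcomer joins in any given week.
    """
    schedule = []
    for i, name in enumerate(members):
        start = first_start + timedelta(weeks=i * weeks_between)
        end = start + timedelta(weeks=rotation_weeks)
        schedule.append((name, start, end))
    return schedule
```

Calling `staggered_rotation(["Ana", "Ben", "Chris"], date(2024, 1, 1))` gives start dates of 1, 8 and 15 January 2024, so the team never absorbs more than one new member in the same week. Setting `rotation_weeks=26` gives the six-month variant.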
How should existing assignments be transferred within the CSIRT? I recommend that difficult existing cases be transferred to staff members who have been on duty for a few weeks, not to the incoming staff member (even if she has experience on the CSIRT). The incoming CSIRT member should be given a chance to get into (or get back into) the rhythm of the job before being hit with the most intractable problem or the orneriest client.
Every incident must have a case coordinator – the person who monitors the problem, aggregates information from varied resources and serves as the voice of the CSIRT for that incident. When transferring responsibility for a case from one case coordinator to another, be sure to have the previous coordinator prepare the clients for the transition and introduce the new coordinator to the key client contacts to ensure a smooth transition of control. Clients often come to depend on the person they have been working with to resolve an incident; an unexpected change can be unsettling and even disturbing.
The DISA course writers suggest, “Allow team members to allocate time away from high stress incident response assignments and pursue broader interests in areas such as tool development, public education and presentations, research, and other professional opportunities.” CSIRT members, by the nature of their work, will have a great deal to contribute to the awareness, training and education of their colleagues.
When I worked in technical support for Hewlett-Packard
As I mentioned above, the behavior of managers can greatly influence morale, motivation and dedication among team members. Later in the 1980s, another superb manager, Pierre Labelle, Vice President of Operations at Mathema, Inc. in Montréal, taught me a lesson I have never forgotten about upper-management commitment to employee performance. We had an extended series of acceptance tests running from
Making the CSIRT a stimulating and enjoyable duty that people want to be on is one of the best approaches to avoiding burnout and ensuring reliable response to computer-related problems.
As discussed in §5.4, rotating assignments among CSIRT members can be an excellent idea. However, frequent changes in work schedules that involve changes in sleep cycles are not a good idea; for example, weekly changes in shift from day to night schedules can seriously disrupt the natural circadian wake/sleep cycle and have been shown to increase the rate of errors and accidents. [16] One authoritative resource states that there are “adverse health and safety effects to working shifts.”
A shiftworker, particularly one who works nights, must function on a schedule that is not natural. Constantly changing schedules can:
· upset one's circadian rhythm (24-hour body cycle),
· cause sleep deprivation and disorders of the gastrointestinal and cardiovascular systems,
· make existing disorders worse, and
· disrupt family and social life. [17]
Scientific studies throughout the world have long shown that shiftwork, by its very nature, is a major factor in the health and safety of workers; LaDou (1982) writes in his abstract,
Daily physiologic variations termed circadian rhythms are interactive and require a high degree of phase relationship to produce subjective feelings of wellbeing. Disturbance of these activities, circadian desynchronization, whether from passage over time zones or from shift rotation, results in health effects such as disturbance of the quantity and quality of sleep, disturbance of gastrointestinal and other organ system activities, and aggravation of diseases such as diabetes mellitus, epilepsy and thyrotoxicosis. [18]
The US
One of the most important principles of management in general and operations management in particular is that fixing a problem has two aspects: the short term and the long term. One must be able to solve problems quickly enough to be effective; that is, the speed of solution must be appropriate to the consequential costs of delay. However, we should not figuratively wipe our hands in satisfaction and walk away from the problem resolution without thinking about why it happened, how we fixed it, and whether we can do better to avoid repeats and to improve our response. [20]
As a matter of standard operating procedure, every technical support team and CSIRT must schedule time to analyze the underlying factors that led to the problem they have just resolved. This analysis will likely involve operational staff outside the CSIRT; these are the people with line expertise who will be able to contribute their intimate knowledge of technical details that contributed to the security breach. These discussions can often lead to practical recommendations for improvement of our security architecture such as topology or firewall placement, operational procedures such as monitoring standards or vulnerability patching, and technical details such as configurations or parameter settings.
Similarly, it is a commonplace in discussions of disaster recovery and business continuity planning that every practice run or real-life incident should be analyzed to see where we have made errors or achieved less than our goals in performance. Managers must ensure that these analyses are not perceived as (or worse, really are) finger-pointing exercises for apportioning blame. In a column for Network World, I have explained the concept of “egoless work”; the postmortem analysis of an incident must be ego-free. [21] Managers can set the tone by responding positively to what might otherwise be perceived as criticism; “That’s a good point” and “Very good observation” are examples of positive, encouraging responses to observations such as “We were too slow in getting back to the initial caller given that she clearly stated that the entire department was off-line.” The meeting should focus on ways to improve the response given the insights resulting from detailed analysis of successes and failures during the recent incident.
The other aspect that sometimes gets lost in such postmortems is exploring the reasons for the problems. If we don’t pay attention to underlying causes, we may fix specific problems and we may improve particular procedures but we will likely encounter different consequences of the same fundamental errors that caused those particular problems. We must pursue the analysis deeply enough to identify structural flaws in our processes so that we can correct those flaws and thus reduce the likelihood of entire classes of problems. Readers interested in learning more about management style and small-group leadership tools may find some material of value in the Management Skills lectures and in the Leadership lectures on the MSIA section of my Web site. [22]
The
The authors also recommend the following (paraphrasing and summarizing):
On this last point, I must add that all action items should indicate clearly who intends to deliver precisely what operational result to whom in which form by when.
On page 3-23 of the Computer Security Incident Handling Guide, the authors make a series of recommendations on how to capitalize on the knowledge gained through systematic analysis of incidents. [23] I comment briefly on each of their suggestions (shown in quotation marks).