Detection of credential spearphishing attacks using email analysis (2024)

Embodiments of the disclosure relate to the field of cyber security. More specifically, embodiments of the disclosure relate to a system for detecting credential spearphishing attacks, particularly those carried out via email.

Over the last decade, malicious software has become a pervasive problem for Internet users as many networked resources include vulnerabilities that are subject to attack. In particular, persons looking to infiltrate a network or steal sensitive data have utilized a method known as phishing. Typically, a phishing attack comprises the transmission of an electronic communication, such as an email, to a broad group of recipients that purports to be from a known institution, such as a bank or credit card company, and that seems to have a legitimate intention. For example, a malware writer may transmit an email to a large group of recipients purporting to be from a social media platform and asserting a password change is required for continued use of the platform. The email may have the look and feel of a legitimate email sent by the social media platform and include a Uniform Resource Locator (URL) that directs the recipients to a website requesting the recipient to enter credential information in order to change the recipient's password. The URL will not be associated with the social media platform, although it likely has the look and feel of the social media platform's website. The phishing attack is completed when the recipient of the email submits credential information to the website, which is then delivered to the malware writer. As used herein, the terms “link” and “URL” are used interchangeably.

As the efficacy of broad scale phishing attacks has decreased, malware writers have turned to a more personalized method, known as spearphishing, or credential spearphishing, attacks. Spearphishing is a more targeted version of phishing attacks that combines tactics such as victim segmentation, email personalization, sender impersonation, and other techniques to bypass email filters and trick targeted recipients into clicking a URL within the email, or opening an attachment attached thereto.

Spearphishers, malware writers that generate and transmit electronic communications that include spearphishing attacks, may use social engineering methods to personalize an email for a targeted recipient or small group of targeted recipients. For example, a spearphisher may extract information from social media platforms or a corporate website to craft an email that includes personalized information attempting to impersonate an institution relevant to the recipient, or small group of recipients, such as a bank, a credit card company or an employer. The spearphishing email may request that the recipient download an attachment or click on a URL. The attachment may contain malicious content, such as a malicious embedded object within a PDF document or Microsoft® Excel® file. The embedded object may comprise, for example, an exploit kit or other malicious payload that either installs malicious software or initiates malicious, anomalous or unwanted behavior (e.g., initiating a callback to a compromised server). The URL within a spearphishing email may direct the recipient of the email to a web page that imitates a legitimate institution claiming to need the recipient to provide credential information (e.g., login) in order to change a password, verify their identity, read an important notice, etc. Submission of credential information through such a web page merely provides the credential information to the spearphisher, enabling the spearphisher to access sensitive information. An email that includes a URL directed to a web page that requests credential information may be referred to as a credential spearphishing attack.

These spearphishing attacks may be multi-vector, multi-stage attacks that current malware detection technology is unable to detect. For instance, the spearphishing attack may utilize email spoofing techniques to fool email filters. Additionally, spearphishing attacks may utilize zero-day (i.e., previously unknown) vulnerabilities in browsers or applications, or use multi-vector, multi-stage attacks or dynamic URLs to bypass current malware detection systems. Additionally, as spearphishing attacks are personalized, they often lack characteristics typical of spam and therefore usually go undetected by traditional spam filters.

Based on the problems presented by spearphishing attacks, and in particular, the credential spearphishing attacks set forth above, current malware detection systems, including field-based sandbox detection systems, contain numerous shortcomings and therefore fail to proactively detect spearphishing attacks. Credential spearphishing attacks may not include exploitation techniques but may instead rely on human interaction to input sensitive data into an input form (e.g., text box) and unknowingly submit that data to an unsecure server. The data may be passed to the unsecure server via an outbound POST request generated by the website on which a user is browsing. Therefore, credential spearphishing attacks present numerous detection challenges to current malware detection systems.

Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 is an exemplary block diagram of a credential spearphishing detection system 110 deployed inside an enterprise network 100.

FIG. 2 is an exemplary block diagram of the credential spearphishing detection system 110 deployed outside of the enterprise network 100.

FIG. 3 is an exemplary block diagram of the credential spearphishing detection system 110 deployed within cloud computing services 160.

FIG. 4 is an exemplary embodiment of a logical representation of the credential spearphishing detection system 110 of FIG. 1.

FIG. 5 is a flowchart illustrating an exemplary method for detecting a credential spearphishing attack through analysis of an email and associated links with the credential spearphishing detection system 110 of FIG. 1.

FIG. 6 is a flowchart illustrating an exemplary method for analyzing an email with the credential spearphishing detection system 110 of FIG. 1.

FIG. 7 is a flowchart illustrating an exemplary method for analyzing a web page directed to by a URL in an email with the credential spearphishing detection system 110 of FIG. 1.

FIG. 8 is a flowchart illustrating an exemplary method for dynamically processing the HTML source code of a web page directed to by a URL in an email with the credential spearphishing detection system 110 of FIG. 1.

FIG. 9 is a block diagram illustrating an exemplary email associated with a credential spearphishing attack.

FIG. 10 is a block diagram illustrating an exemplary web page associated with a credential spearphishing attack.

Various embodiments of the disclosure relate to a spearphishing detection system that improves detection of spearphishing attacks, particularly, credential spearphishing attacks. Herein, a credential spearphishing attack may lead to the recipient of a credential spearphishing email mistakenly providing a spearphisher with credential information via a web page directed to by a URL within the credential spearphishing email. Additional or alternative embodiments may include a spearphishing detection system that detects spearphishing electronic communications that include attachments, the downloading of which may lead to the infection of an endpoint device with malware, wherein “malware” may be broadly construed as including exploits that initiate malicious, anomalous or unwanted behaviors.

In one embodiment of the disclosure, the credential spearphishing detection system includes a communication interface, a scheduler, a data store, a static analysis logic, a dynamic analysis logic, a classification logic, and a reporting logic.

The credential spearphishing detection system may capture network traffic addressed to one or more endpoint devices within a network (e.g., an enterprise network), for example, Simple Mail Transfer Protocol (SMTP) traffic, and analyze the SMTP traffic, e.g., an email, using the static analysis logic and/or the dynamic analysis logic. The static analysis logic includes (i) an email analysis logic that extracts and analyzes the header and body of the email, (ii) a URL analysis logic to extract and analyze a URL included within the email, and (iii) a web page analysis logic to fetch the HTML code of the web page corresponding to the URL, subsequently extract and analyze the header and body, including images contained therein, and determine whether the web page is attempting to impersonate another domain (e.g., a victim domain). The dynamic analysis logic includes (a) at least one virtual machine (VM) to dynamically process the HTML source code of the web page to which the URL in the email directs, (b) a web browsing emulation logic to simulate human interaction within the web browser of the VM, and (c) an expert system to correlate the target domain with the victim domain and apply additional heuristics to determine if the web page is associated with spearphishing. One embodiment of a method of identifying the presence of a URL within an email message is described in a prior U.S. patent entitled “Electronic Message Analysis For Malware Detection,” U.S. Pat. No. 9,106,694, which issued Aug. 11, 2015, the contents of which are incorporated herein by reference.

The credential spearphishing detection system may accumulate information about the victim domain (impersonated domain) and the target domain (domain of the spearphisher) during analysis by the static analysis logic. The credential spearphishing detection system may accumulate a sufficient amount of information during analysis by the static analysis logic such that a determination may be made that the email is associated with a spearphishing attack. Alternatively, or in addition, the information garnered during the static analysis may be provided to the dynamic analysis logic to aid in the configuration of one or more VMs. The VMs may then be used to process the web page directed to by the URL detected in the email.

A classification logic includes logic to prioritize the results of the analyses performed by the static analysis logic and/or the dynamic analysis logic to determine whether the email is associated with a phishing attack, or in particular, a spearphishing attack. In some embodiments, the score determination logic of the classification logic may generate a score indicating a level of confidence that the email is associated with a spearphishing attack. Herein, a score may be a numerical value; one of a predefined set of categories such as “suspicious,” “malicious,” or “benign”; an electrical signal such as ‘1’ or ‘0’; or the like. In one embodiment, an email may be determined to be associated with a spearphishing attack when a score meets or exceeds a predefined threshold. Alternatively, an email may be determined to be associated with a spearphishing attack when the classification logic classifies the email as “suspicious” or “malicious,” or, possibly, only when the email is classified as “malicious.” Additionally, the classification logic may determine an email is associated with a phishing attack, or more particularly, a spearphishing attack, based on one or more of the analyses performed. After any analysis, the score determination logic may determine a first score according to one or more analyses that is above a first threshold indicating a phishing attack (e.g., based on the presence of a domain in a detected URL known to be associated with a generic phishing attack) or a second score that is above a second threshold indicating a credential spearphishing attack (e.g., based on the presence of a domain in a detected URL known to be associated with a spearphishing attack and/or the presence of input forms on a web page requesting credential information).
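By way of illustration only, the two-threshold determination described above may be sketched as follows. The threshold values, the `Verdict` names and the `classify` function are hypothetical and are not taken from the disclosure; the sketch uses a meets-or-exceeds comparison, which is only one of the embodiments described.

```python
from enum import Enum


class Verdict(Enum):
    BENIGN = "benign"
    PHISHING = "phishing"
    SPEARPHISHING = "credential spearphishing"


# Hypothetical values; the disclosure states only that thresholds are predefined.
PHISHING_THRESHOLD = 0.5
SPEARPHISHING_THRESHOLD = 0.8


def classify(phishing_score: float, spearphishing_score: float) -> Verdict:
    """Map two confidence scores to a verdict.

    The more specific verdict (credential spearphishing) is checked first,
    so it takes precedence when both thresholds are met.
    """
    if spearphishing_score >= SPEARPHISHING_THRESHOLD:
        return Verdict.SPEARPHISHING
    if phishing_score >= PHISHING_THRESHOLD:
        return Verdict.PHISHING
    return Verdict.BENIGN
```

In such a sketch, a score could equally be a category or a binary signal, as noted above; numerical scores are used here only because they make the threshold comparison explicit.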

A user of an endpoint that received, or was to receive, the email and/or a network administrator may be alerted to the results of the processing via an alert generated by a reporting logic. Such an alert may include various types of messages, which may include text messages and/or email messages, video or audio stream, or other types of information over a wired or wireless communication path. An alert may include an outline or summary of the phishing/spearphishing attack with which the email is associated. Additionally, when an email is determined to be associated with a phishing or spearphishing attack, the extracted characteristics of the email and the web page to which a URL directed may be stored in a data store and incorporated into future analyses by the credential spearphishing detection system. Furthermore, the extracted characteristics of an email determined to be associated with a spearphishing attack and the results of the corresponding processing may be uploaded to cloud computing services for use by other credential spearphishing detection systems.

As used herein, the transmission of data may take the form of transmission of electrical signals and/or electromagnetic radiation (e.g., radio waves, microwaves, ultraviolet (UV) waves, etc.).

In the following description, certain terminology is used to describe features of the invention. For example, in certain situations, both terms “logic” and “engine” are representative of hardware, firmware and/or software that is configured to perform one or more functions. As hardware, logic (or engine) may include circuitry having data processing or storage functionality. Examples of such circuitry may include, but are not limited or restricted to a microprocessor, one or more processor cores, a programmable gate array, a microcontroller, a controller, an application specific integrated circuit, wireless receiver, transmitter and/or transceiver circuitry, semiconductor memory, or combinatorial logic.

Logic (or engine) may be software in the form of one or more software modules, such as executable code in the form of an executable application, an application programming interface (API), a subroutine, a function, a procedure, an applet, a servlet, a routine, source code, object code, a shared library/dynamic link library, or one or more instructions. These software modules may be stored in any type of a suitable non-transitory storage medium, or transitory storage medium (e.g., electrical, optical, acoustical or other form of propagated signals such as carrier waves, infrared signals, or digital signals). Examples of non-transitory storage medium may include, but are not limited or restricted to a programmable circuit; a semiconductor memory; non-persistent storage such as volatile memory (e.g., any type of random access memory “RAM”); persistent storage such as non-volatile memory (e.g., read-only memory “ROM”, power-backed RAM, flash memory, phase-change memory, etc.), a solid-state drive, hard disk drive, an optical disc drive, or a portable memory device. As firmware, the executable code is stored in persistent storage.

According to one embodiment, the term “malware” may be construed broadly as any code or activity that initiates a malicious attack and/or operations associated with anomalous or unwanted behavior. For instance, malware may correspond to a type of malicious computer code that executes an exploit to take advantage of a vulnerability, for example, to harm or co-opt operation of a network device or misappropriate, modify or delete data. Malware may also correspond to an exploit, namely information (e.g., executable code, data, command(s), etc.) that attempts to take advantage of a vulnerability in software and/or an action by a person gaining unauthorized access to one or more areas of a network device to cause the network device to experience undesirable or anomalous behaviors. The undesirable or anomalous behaviors may include a communication-based anomaly or an execution-based anomaly, which, for example, could (1) alter the functionality of a network device executing application software in an atypical manner (e.g., a file is opened by a first process where the file is configured to be opened by a second process and not the first process); (2) alter the functionality of the network device executing that application software without any malicious intent; and/or (3) provide unwanted functionality which may be generally acceptable in another context. Additionally, malware may be code that initiates unwanted behavior which may be, as one example, uploading a contact list from an endpoint device to cloud storage without receiving permission from the user.

The term “processing” may include launching an application, wherein launching should be interpreted as placing the application in an open state and performing simulations of actions typical of human interactions with the application. For example, the application, e.g., an Internet browsing application, may be processed such that the application is opened and actions such as visiting a website, scrolling the website page, and activating a link from the website are performed (e.g., the performance of simulated human interactions).

The term “network device” should be construed as any electronic device with the capability of connecting to a network, downloading and installing mobile applications. Such a network may be a public network such as the Internet or a private network such as a wireless data telecommunication network, wide area network, a type of local area network (LAN), or a combination of networks. Examples of a network device may include, but are not limited or restricted to, a laptop, a mobile phone, a tablet, etc. Herein, the terms “network device,” “endpoint device,” and “mobile device” will be used interchangeably. The terms “mobile application” and “application” should be interpreted as software developed to run specifically on a mobile network device.

The term “malicious” may represent a probability (or level of confidence) that the object is associated with a malicious attack or known vulnerability. For instance, the probability may be based, at least in part, on (i) pattern matches; (ii) analyzed deviations in messaging practices set forth in applicable communication protocols (e.g., HTTP, TCP, etc.) and/or proprietary document specifications (e.g., Adobe PDF document specification); (iii) analyzed compliance with certain message formats established for the protocol (e.g., out-of-order commands); (iv) analyzed header or payload parameters to determine compliance, (v) attempts to communicate with external servers during processing in one or more VMs, (vi) attempts to access memory allocated to the application during virtual processing, and/or other factors that may evidence unwanted or malicious activity.

Lastly, the terms “or” and “and/or” as used herein are to be interpreted as inclusive or meaning any one or any combination. Therefore, “A, B or C” or “A, B and/or C” mean “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.

The invention may be utilized for detecting credential spearphishing attacks encountered as a result of receiving email. As this invention is susceptible to embodiments of many different forms, it is intended that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described.

Referring to FIG. 1, an exemplary block diagram of a credential spearphishing detection system 110 deployed inside an enterprise network 100 is shown. In the embodiment illustrated, the enterprise network 100 includes the credential spearphishing detection system 110, a router 150, an optional firewall 151, a network switch 152, and the endpoint device(s) 153. The network 100 may include a public network such as the Internet, a private network (e.g., a local area network “LAN”, wireless LAN, etc.), or a combination thereof. The router 150 serves to receive data, e.g., packets, transmitted via a wireless medium (e.g., a Wireless Local Area Network (WLAN) utilizing the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard) and/or a wired medium from the cloud computing services 160 and the endpoint devices 153. As is known in the art, the router 150 may provide access to the Internet for devices connected to the network 100.

In one embodiment, the network switch 152 may capture network traffic, make a copy of an email within the network traffic, pass the email to the appropriate endpoint devices 153 and pass the copy of the email to the credential spearphishing detection system 110. In a second embodiment, the network switch 152 may capture the email from the network traffic and pass the email to the credential spearphishing detection system 110 for processing prior to passing the email to the appropriate endpoint devices 153. In such an embodiment, the email will only be passed to the appropriate endpoint devices 153 if the analysis of the email does not indicate that the email is associated with a malicious attack, anomalous or unwanted behavior, or, in particular, a credential spearphishing attack.

The credential spearphishing detection system 110 includes a communication interface 111, a scheduler 112, a data store 113, a static analysis logic 120, a dynamic analysis logic 130, a classification logic 140, and a reporting logic 114.

As shown, the credential spearphishing detection system 110 is communicatively coupled with the cloud computing services 160, the Internet and one or more endpoint devices 153 via the communication interface 111, which directs at least a portion of the network traffic to the scheduler 112 and the static analysis logic 120.

The network traffic that is provided to the static analysis logic 120 by the communication interface 111 may include a portion of the received network traffic or the entirety of the network traffic. The parser 124 within the static analysis logic 120 parses the received network traffic, extracts SMTP traffic (e.g., an email) and provides the email to the email analysis logic 121. The email analysis logic 121 performs a first stage of analysis on the email which includes an analysis of the header and contents of the body of the email. The email is also provided to the URL analysis logic 122, which performs a second stage of analysis including parsing the email for a URL and, upon detection of a URL, performing an analysis of the URL itself. Additionally, when a URL is detected, the email is provided to the web page analysis logic 123, which performs a third stage of analysis including fetching the web page content (e.g., HTML source code and associated metadata) and analyzing the header and body contents of the web page. In one embodiment, the analyses may be performed sequentially (e.g., email analysis, URL analysis, web page analysis) or one or more of the analyses may be performed concurrently (e.g., at least partially overlapping in time). In some embodiments, information and results of one analysis may be used to assist in other analyses. For example, information and results of the email analysis and/or the URL analysis may aid the web page analysis by providing the web page analysis logic 123 with the domain of the sender of the email and/or a domain of the URL (e.g., prior to one or more redirects), which may assist the web page analysis logic 123 in narrowing its analysis.
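By way of illustration only, the extraction performed during the first and second stages (obtaining the sender domain from the header and any URL from the body) may be sketched as follows using the Python standard library. The function name `extract_features`, the URL pattern and the sample domains are hypothetical; the sketch handles only a single-part plain-text email and is not the parser 124 itself.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

# Simplified pattern for http/https links; real emails may need HTML parsing.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_features(raw_email: str) -> dict:
    """Return the sender domain and the URLs (with their domains) in an email."""
    msg = message_from_string(raw_email)

    # Header analysis input: domain portion of the From address.
    _, sender = parseaddr(msg.get("From", ""))
    sender_domain = sender.rpartition("@")[2].lower()

    # Body analysis input: only a single-part text payload is handled here.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""

    urls = URL_PATTERN.findall(body)
    url_domains = [(urlparse(u).hostname or "").lower() for u in urls]
    return {"sender_domain": sender_domain, "urls": urls, "url_domains": url_domains}
```

The resulting sender domain and URL domain are the kind of information that, per the paragraph above, may be passed on to the web page analysis to narrow its scope.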

The dynamic analysis logic 130 may also be supplied with the email and the results of the analyses performed by the static analysis logic 120 in order to perform a fourth stage of analysis. The results and information related to the static analysis may be used to assist the dynamic analysis logic 130 in the processing of the email in one or more VMs. In one embodiment, the scheduler 112 may configure one or more of VM 136₁-VM 136_M (M≥1) with selected software profiles. For instance, the results of the analyses by one or more of the email analysis logic 121, the URL analysis logic 122 and/or the web page analysis logic 123 may be used to determine which software images (e.g., application(s)) and/or operating systems are to be fetched from storage (e.g., the data store 113) for configuring operability of the VM 136₁-VM 136_M. Herein, the VM 136₁-VM 136_M may be provisioned with a guest image associated with a prescribed software profile. Each guest image may include a software application and/or an operating system (OS). Each guest image may further include one or more monitors, namely software components that are configured to observe and capture run-time behavior of an object under analysis during processing within the virtual machine.

Additionally, the results and information related to the static analysis may provide indications to the web browsing emulation logic 132 and the expert system 133 as to, inter alia, the domain of the sender of the email, the domain of the URL (e.g., the victim domain), suspicious text included in the email and/or the web page, etc. The static and dynamic analyses will be discussed below in further detail.

The classification logic 140 includes the score determination logic 141 and the prioritization logic 142 and receives the results and information related to the static analysis and the dynamic analysis. The prioritization logic 142 may be configured to associate weighting with one or more portions of the analyses. The score determination logic 141 determines a score indicative of the likelihood the email is associated with a phishing, or more particularly, a spearphishing attack. The score determination logic 141 may determine a first score indicating the likelihood that the email is associated with a phishing attack based on an analysis of the email (e.g., header and body) and a URL detected within the email. The score determination logic 141 may determine a second score indicating the likelihood that the web page directed to by the URL in the email, and thus the email, is associated with a phishing attack based on an analysis of the web page itself. Additionally, the score determination logic 141 may determine a third score indicating the likelihood that the email is associated with a phishing attack based on a dynamic analysis of the web page directed to by the URL in the email as well as the information collected during the static analysis, including the first and second scores.

When the first, second or third score is above a first, second or third predetermined threshold level, respectively, the email, and optionally the information collected during the static and/or dynamic analyses as well as the respective results, may be provided to a network administrator. In such a situation, when the email has not yet been provided to an endpoint device(s) 153, the email will not be provided to the endpoint device(s) 153. In the situation in which the email has been provided, an alert may be generated by the reporting logic 114 and transmitted to the endpoint device(s) 153 alerting the user of the phishing, or in particular, a spearphishing, attack.

When the first, second and third scores determined by the score determination logic 141 do not rise above one or more predetermined thresholds (i.e., the email is not associated with a phishing attack), the email is passed to the endpoint device(s) 153, if it had not previously been done.
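By way of illustration only, the per-stage threshold comparison and the resulting delivery decision described above may be sketched as follows. The `StageScores` container, the threshold values and the `should_deliver` function are hypothetical names introduced for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageScores:
    email_and_url: float  # first score: email header/body and detected URL
    web_page: float       # second score: static analysis of the web page
    dynamic: float        # third score: dynamic analysis plus static results


# Hypothetical threshold levels; the disclosure says only "predetermined".
THRESHOLDS = StageScores(email_and_url=0.7, web_page=0.7, dynamic=0.6)


def should_deliver(scores: StageScores, thresholds: StageScores = THRESHOLDS) -> bool:
    """Return True only if no score rises above its respective threshold."""
    exceeded = (
        scores.email_and_url > thresholds.email_and_url
        or scores.web_page > thresholds.web_page
        or scores.dynamic > thresholds.dynamic
    )
    return not exceeded
```

When `should_deliver` returns False in such a sketch, an email not yet delivered would be withheld and reported, while an email already delivered would instead trigger an alert from the reporting logic 114.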

The reporting logic 114 is adapted to receive information from the static analysis logic 120 and the dynamic analysis logic 130 and generate alerts that identify to a user of an endpoint device 153, network administrator or an expert network analyst the likelihood that an email is associated with a spearphishing attack. Other additional information regarding the analysis may optionally be included in the alerts.

Referring to FIG. 2, an exemplary block diagram of the credential spearphishing detection system 110 deployed outside of the enterprise network 100 is shown. In such an embodiment, network traffic received by the network 100 may be captured by the network switch 152, a copy generated by the network switch 152 and the copy provided to the credential spearphishing detection system 110 via the router 150. This embodiment may illustrate a situation in which the credential spearphishing detection system 110 is not located at the same location as the location covered by the network 100.

Referring to FIG. 3, an exemplary block diagram of the credential spearphishing detection system 110 deployed within cloud computing services 160 is shown. As with FIGS. 1 and 2, network traffic received by the network 100 may be captured by the network switch 152, a copy generated by the network switch 152 and the copy provided to the credential spearphishing detection system 110 via the router 150.

Referring to FIG. 4, an exemplary embodiment of a logical representation of the credential spearphishing detection system 110 of FIGS. 1-3 is shown. The credential spearphishing detection system 110 includes one or more processors 400 that are coupled to communication interface logic 410 via a first transmission medium 411. Communication interface logic 410 enables communication with network devices via the Internet, the cloud computing services 160 and the endpoint devices 153. According to one embodiment of the disclosure, communication interface logic 410 may be implemented as a physical interface including one or more ports for wired connectors. Additionally, or in the alternative, communication interface logic 410 may be implemented with one or more radio units for supporting wireless communications with other electronic devices.

Processor(s) 400 is further coupled to persistent storage 420 via a second transmission medium 412. According to one embodiment of the disclosure, persistent storage 420 may include (a) the static analysis logic 120 including the email analysis logic 121, the URL analysis logic 122 and the web page analysis logic 123; (b) the dynamic analysis logic 130 including the monitoring logic 131, the web browsing emulation logic 132, the expert system 133, the VMs 136₁-136_M and the VMM 135; and (c) the classification logic 140 including the score determination logic 141 and the prioritization logic 142. Of course, when implemented as hardware, one or more of these logic units could be implemented separately from each other.

The overall analysis performed by the credential spearphishing detection system 110 may be broken down into multiple stages: (i) email analysis, (ii) URL analysis, (iii) web page analysis, and (iv) dynamic analysis including web page emulation. As was discussed above, although the overall analysis is discussed herein in terms of “stages,” the overall analysis should not be limited to a specific sequential order. Rather, the stages illustrate one embodiment of the analyses, and portions of the overall analysis may proceed in orders other than those discussed below. Additionally, one or more portions of the overall analysis may be performed concurrently, with at least part of those portions overlapping in time.

Referring to FIG. 5, a flowchart illustrating an exemplary method for detecting a credential spearphishing attack through analysis of an email and an associated URL with the credential spearphishing detection system of FIG. 1 is shown. Each block illustrated in FIG. 5 represents an operation performed in the method 500 of detecting a credential spearphishing attack in an email and associated links found within the email. Referring to FIG. 5, an email is received by the credential spearphishing detection system that includes a URL to a web page (block 501). At block 502, the header and body of the email are analyzed. An email analysis logic included within a static analysis logic of the credential spearphishing detection system may perform an analysis of the header (e.g., correlating the domain of the sender of the email with a blacklist of domains known to be associated with spearphishing attacks).

At block 503, the URL included within the email is analyzed. The URL analysis may be performed by the URL analysis logic included within the static analysis logic, and may include a correlation of the domain directed to by the URL with a blacklist of domains known to be associated with spearphishing attacks. For example, a domain that may easily be mistaken for that of a well-known and respected institution (e.g., “www.bankofamerca.com”) may be included on such a blacklist.
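The blacklist correlation of block 503 can be sketched as follows. This is a minimal illustration only: the blacklist contents and the function names `url_host` and `is_blacklisted` are assumptions made for the example and are not part of the disclosure.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; a deployment would load this from a data store.
BLACKLISTED_DOMAINS = {"bankofamerca.com"}


def url_host(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_blacklisted(url: str) -> bool:
    """True when the host, or any parent domain of it, is on the blacklist."""
    labels = url_host(url).split(".")
    # For "login.bankofamerca.com" this yields the host itself,
    # "bankofamerca.com" and "com".
    candidates = {".".join(labels[i:]) for i in range(len(labels))}
    return bool(candidates & BLACKLISTED_DOMAINS)
```

Checking parent domains as well as the exact host means that a subdomain of a blacklisted domain is also flagged; whether that is desirable is a design choice of this sketch.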

At block 504, an analysis of the web page directed to by the URL included in the email is conducted by fetching the web page (e.g., requesting the HTML source code associated with the URL) and analyzing the contents of the web page. The analysis of the web page is performed by the web page analysis logic included within the static analysis logic. The analysis of the web page may include, inter alia, (i) a correlation of the attributes of the HTTP response header and/or the HTTP response body with one or more blacklists of attributes known to be associated with spearphishing attacks, and/or (ii) the application of heuristic, probabilistic and/or machine learning algorithms to the attributes of the HTTP response header and/or the HTTP response body.

At block 505, the email may be processed within a virtual machine included in a dynamic analysis logic of the credential spearphishing detection system. Herein, a web browser emulation logic of the dynamic analysis logic provides credential information and submits the credential information in order to generate a POST request. The attributes of the POST request are then correlated with the victim domain (e.g., the domain indicated in the email and/or web page as the domain to which the credential information will purportedly be provided). Additionally, the information extracted during virtual processing may be correlated with the information and attributes extracted during static analysis.

The specific stages included within the overall analysis illustrated by method 500 in FIG. 5 will be discussed in further detail below in FIGS. 6-8.

1. Stages 1 and 2: Email Analysis and URL Analysis

Referring now to FIG. 6, a flowchart illustrating an exemplary method for analyzing an email with the credential spearphishing detection system 110 of FIG. 1 is shown. Each block illustrated in FIG. 6 represents an operation performed in the method 600 of analyzing the content information and attributes, the body and one or more URLs included in an email with the credential spearphishing detection system 110. In the particular embodiment illustrated in FIG. 6, blocks 602 and 603 highlight at least a portion of the email analysis described as stage 1 above and blocks 602 and 604 highlight at least a portion of the URL analysis described as stage 2 above.

At block 601, an email is received by the credential spearphishing detection system that includes a URL to a web page. At block 602, the content information and attributes of the email are extracted. Herein, the content information may refer to the contents of the body of the email and include, but is not limited or restricted to, one or more URLs detected within the email, one or more input forms (e.g., text boxes, radio buttons, drop down menus, etc.) detected within the email, the location of URLs detected within the email, and/or text and/or images detected within the email. It should be noted that the content information may be displayed or may be “hidden” (e.g., white text located on a white background, text located behind an image, text positioned off-screen, text having a font size of ‘0’ and/or a link comprising a single character—e.g., a hyphen—within a paragraph of text). Additionally, header attributes may include, inter alia, the “from” address, the subject, and/or the “reply to” address.
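As an illustration of the extraction at block 602, the following sketch pulls a few header attributes and the URLs found in the body out of a raw email using the Python standard library. The function name, the returned dictionary keys and the simple URL pattern are assumptions made for this example; detection of hidden content and input forms is not shown.

```python
import re
from email import message_from_string

# Deliberately simple pattern for the example; real URL detection is more involved.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_email_attributes(raw_email: str) -> dict:
    """Extract header attributes and body URLs from a raw RFC 5322 message."""
    msg = message_from_string(raw_email)
    body_parts = []
    for part in msg.walk():
        # Only text parts (text/plain, text/html) contribute to the body text.
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)
    return {
        "from": msg.get("From"),
        "reply_to": msg.get("Reply-To"),
        "subject": msg.get("Subject"),
        "urls": URL_PATTERN.findall(body),
    }
```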

At block 603, the extracted information and attributes are correlated with known malicious actors. Herein, a database, stored in, for example, a data store, may include a representation of known malicious actors corresponding to one or more of the extracted content information elements and/or the header attributes. The correlation may determine whether a match occurs between extracted content information and/or header attributes and one or more known malicious actors. Additionally, the correlation may include a determination as to the percentage of the occurrence of a match in order to account for a mutation (e.g., a minor change to an element of a component of content information or a header attribute, such as a single changed letter in the “from” address). In such a situation, the correlation may determine that a match exists when the similarity is above a predetermined percentage threshold.
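One way to picture the percentage-based match of block 603 is a string-similarity ratio compared against a threshold. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the disclosure does not specify a particular measure, and the sender list, the function names and the 0.9 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical entries representing known malicious actors.
KNOWN_MALICIOUS_SENDERS = ["alerts@secure-login.example"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_known_actor(value: str,
                        known=KNOWN_MALICIOUS_SENDERS,
                        threshold: float = 0.9) -> bool:
    """True when `value` is similar to any known entry above the threshold."""
    return any(similarity(value, entry) > threshold for entry in known)
```

With this measure, a “from” address that differs from a known entry by a single letter still scores well above the threshold, whereas an unrelated address does not.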

At block 604, a URL detected in the email is analyzed for indications that the URL is associated with a phishing attack. The analysis of the URL detected in the email may include, but is not limited or restricted to, a determination of the existence of a typographical error in the URL relative to the URLs of well-known institutions or well-known web pages (e.g., predefined URLs), a correlation between the domain of the URL and extracted content information and/or header attributes, and/or a correlation between the domain and a subdomain of the URL. In one example, the email analysis may extract the “reply to” address from the header of the email and determine the email is coming from a well-known banking institution based on the content of the text located in the email. Additionally, the URL analysis may extract domain and subdomain information of a URL detected in the email.
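The typographical-error check of block 604 could, for instance, be approximated with an edit-distance comparison against a set of predefined domains. The following is a sketch under that assumption; the disclosure does not name an algorithm, and the domain set, the function names and the distance limit of 2 are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical set of predefined, well-known domains.
WELL_KNOWN_DOMAINS = {"bankofamerica.com"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lookalike_of(url: str, max_distance: int = 2):
    """Return the well-known domain that the URL's host resembles, if any."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    for known in WELL_KNOWN_DOMAINS:
        # A distance of zero is the genuine domain, not a lookalike.
        if 0 < edit_distance(host, known) <= max_distance:
            return known
    return None
```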

At block 605, a score indicating a level of suspiciousness is generated. Subsequent to the email analysis and URL analysis (e.g., blocks 602-604), the results of the analysis are provided to the classification logic. The classification logic of the credential spearphishing detection system may prioritize the extractions and determinations that occurred during the email analysis and the URL analysis and determine a score. For example, if the “reply to” address in the header and the subdomain of the detected URL match but neither matches the well-known banking institution portrayed by the content of the email, a score may be determined that indicates the email is likely associated with a spearphishing attack.

At block 606, a determination is made as to whether the score is above a first predefined threshold. As discussed above, the score determined by the classification logic, e.g., the score determination logic, may indicate the email is likely associated with a spearphishing attack by being above a first predetermined threshold. Alternatively, as is discussed previously herein, the score may not necessarily be a numerical score.

When the score is determined to be above the first predefined threshold (yes at block 606), the email is determined to be a phishing email (block 607). In one embodiment, the score may indicate that the email is likely associated with a phishing email (e.g., the “reply to” address and/or the subject line contents match malicious actors known to be associated with phishing attacks) or the score may indicate that the email is likely associated with a spearphishing attack. In one embodiment, a first score may indicate a phishing attack and a second score, being higher than the first score, may indicate a spearphishing attack.

When the score is not determined to be above the first predefined threshold (no at block 606), the web page directed to by the URL detected within the email is analyzed (block 608). The analysis of the web page is detailed below in association with FIG. 7.
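The decision flow of blocks 605 through 608 can be sketched with a numerical score, bearing in mind that the score need not be numerical. The point values, the threshold of 70 and the function names below are purely illustrative assumptions; the example mirrors the case in which the “reply to” domain and the URL domain agree with each other but not with the institution portrayed by the email.

```python
def email_stage_score(reply_to_domain: str,
                      url_domain: str,
                      portrayed_domain: str) -> int:
    """Score the email and URL analysis results (hypothetical weights)."""
    score = 0
    if reply_to_domain != portrayed_domain:
        score += 40
    if url_domain != portrayed_domain:
        score += 40
    # Sender and link agree with each other but not with the portrayed
    # institution: both are likely controlled by the same impersonator.
    if reply_to_domain == url_domain and url_domain != portrayed_domain:
        score += 20
    return score


def next_step(score: int, first_threshold: int = 70) -> str:
    """Block 606: above the threshold is phishing, otherwise analyze the page."""
    return "phishing_email" if score > first_threshold else "analyze_web_page"
```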

2. Stage 3: Web Page Analysis

Referring to FIG. 7, a flowchart illustrating an exemplary method for analyzing a web page directed to by a URL in an email with the credential spearphishing detection system of FIG. 1 is shown. In one embodiment, as will be discussed herein, method 700 is performed subsequent to method 600 of FIG. 6. Therefore, the discussion of FIG. 7 will often refer to one or more portions of FIG. 6. Each block illustrated in FIG. 7 represents an operation performed in the method 700 of activating a URL included in an email and analyzing the web page directed to by the URL with the credential spearphishing detection system. At block 701, the credential spearphishing detection system has determined that a score associated with the suspiciousness of the email is not above a first predefined threshold.

At block 702, the URL detected during parsing and/or analysis of the header and body contents of the email is activated. Herein, by activating the URL, the web page analysis logic of the credential spearphishing detection system initiates a request for the HTML source code corresponding to the URL.

At block 703, key attributes are extracted from the header of the packets comprising the web page directed to by the URL. Examples of key attributes that may be extracted include, but are not limited or restricted to, information indicating the server delivering the web page, the metadata of the server (e.g., whether the server runs as a Linux® or Windows® platform, the location hosting the domain, the length of time the domain has been hosted, etc.), and/or the use of a secure connection. Subsequent to extraction, the key attributes of the headers are correlated with attributes extracted during analysis of the email. The web page analysis logic may correlate the attributes extracted from the headers of the fetched web page with the attributes extracted from the email to determine the consistency between the sources. For example, if the attributes extracted from the email do not match the attributes extracted from the headers of the fetched web page, the email may have a high likelihood of being associated with a phishing attack.
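A simplified view of block 703 is given below: a few key attributes are taken from the URL and the HTTP response headers of the fetched page and checked for consistency with domains seen in the email. The function names and the chosen attributes are assumptions made for the example; server metadata such as hosting location or domain age would require additional lookups that are not shown.

```python
from urllib.parse import urlsplit


def extract_key_attributes(url: str, response_headers: dict) -> dict:
    """Collect a few key attributes of a fetched web page."""
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    parts = urlsplit(url)
    return {
        "server": headers.get("server"),
        "secure_connection": parts.scheme == "https",
        "host": parts.hostname,
    }


def inconsistent_with_email(page_attrs: dict, email_domains: set) -> bool:
    """True when the page host belongs to none of the domains seen in the email."""
    host = page_attrs["host"] or ""
    return not any(host == d or host.endswith("." + d) for d in email_domains)
```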

At block 704, attributes from the web page body are extracted. Following extraction, one or more of the following correlations may be performed: (1) a correlation between the images detected on the web page and images of well-known institutions and companies (e.g., logos of banks, credit card companies, stores) and/or (2) a correlation between the text detected on the web page and stored text known to be associated with well-known institutions and companies (e.g., names or slogans). These correlations may be referred to as a “screen shot analysis.” The correlations of the attributes from the web page body result in a determination of the institution or company the web page is portraying (e.g., in the case of a phishing email, attempting to impersonate). During extraction, the presence of hidden links may be recorded as well.

In some embodiments, the screen shot analysis may include a correlation of the extracted content (e.g., title, text content, input forms, the location of each, etc.) with the one or more entries within a database wherein each entry represents the attributes of web pages previously determined to be associated with phishing attacks. For example, attributes of web pages previously determined to be associated with a phishing attack may be stored in a data store and compared to the extracted attributes of the web page. Therefore, the screen shot analysis may also provide a determination as to how closely the web page directed to by the URL matches a web page known to be associated with a phishing attack. Additionally, machine learning techniques may be applied such that when a web page directed to by a URL in an email is determined to be associated with a phishing attack, the extracted attributes may be added to the data store for future analyses.

At block 705, the correlations performed in blocks 703 and 704 are used to determine a victim domain. The victim domain, as discussed above, is the domain the web page is attempting to impersonate. For example, the correlation of the images detected on the web page may indicate that the web page is attempting to portray a well-known institution such as Bank of America (e.g., the Bank of America logo was detected on the web page).

At block 706, the victim domain is correlated with the attributes extracted from the email. Herein, once the victim domain has been determined, the consistency between the victim domain and the attributes extracted during the analysis of the email (e.g., header, body, detected URLs, etc.) is evaluated. For example, a correlation revealing that the email portrayed a first company but the victim domain portrayed a second company different than the first company may indicate a high likelihood that the email is associated with a phishing attack.

At block 707, a score indicating a level of suspiciousness for the web page (e.g., the likelihood of association with a phishing attack) is generated based on one or more of the correlations performed in blocks 703, 704 and/or 706. At block 708, when the score is above a second predefined threshold, the email is determined to be associated with a credential spearphishing attack.

Alternatively, or in addition, the score indicating the level of suspiciousness of the web page may be combined with the scores generated as a result of the analyses of the email and the URL detected in the email, as discussed with FIG. 6.

3. Stage 4: Dynamic Analysis

Referring to FIG. 8, a flowchart illustrating an exemplary method for virtually processing the HTML source code of a web page directed to by a URL in an email with the credential spearphishing detection system 110 of FIG. 1 is shown. Each block illustrated in FIG. 8 represents an operation performed in the method 800 of analyzing a web page directed to by a URL included in an email with the dynamic analysis logic 130 of the credential spearphishing detection system.

When referring to method 800, the information extracted and collected by the static analysis logic within the credential spearphishing detection system should be kept in mind. The static analysis logic is communicatively coupled to the dynamic analysis logic such that information extracted and collected during static analysis, as well as results of that analysis, may be provided to the dynamic analysis logic to assist in the dynamic analysis.

At block 801, the dynamic analysis logic receives the HTML source code for the web page directed to by the URL in the email. Previously, as is discussed in methods 600 and 700 of FIGS. 6 and 7, two scores have been generated: (i) a score indicating a level of suspiciousness as to whether the email is associated with a phishing, or particularly, a spearphishing attack based on the analysis of the header and contents of the email and a URL included in the email, and (ii) a score indicating a level of suspiciousness as to whether the web page is associated with a spearphishing attack based on a static analysis of the header and contents of the body of the web page as set forth in the HTML source code of the web page.

At block 802, the web page is scanned for input fields that submit data to an external server via a request method supported by the Hypertext Transfer Protocol (HTTP), which may be, for example, an HTTP POST request. Herein, the analysis may look to detect a POST request because a POST request asks a web server to accept data enclosed in the request payload. By detecting a POST request, the web browser emulation logic may detect a domain to which the contents of the one or more input fields will be submitted from the link associated with the submission button (hereinafter referred to as the “target domain”). However, in one embodiment, the domain may be obfuscated and not discernible from the link associated with the submission button. In such an embodiment, the obfuscated content will be de-obfuscated by the web browsing emulation logic. Following the de-obfuscation of the content, the web browsing emulation logic may detect a POST request. In other embodiments, an HTTP GET request may be detected and the URL associated with the GET request may be analyzed to determine a target domain.

At block 803, a determination is made as to whether the target domain can be determined based on the detected POST request as set forth in the HTML source code. When the target domain cannot yet be determined from the HTML source code (no at block 803), the input fields are loaded with content and an outgoing POST request is generated upon submission of the contents via the link corresponding to a submission button associated with the input fields (block 804). Herein, the web browsing emulation logic captures the POST request generated by the web browser in the VM. The web browsing emulation logic parses the captured POST request to determine the domain to which the contents of the input fields are being transmitted (e.g., the target domain).
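For the case in which the target domain is discernible directly from the HTML source code (yes at block 803), the scan can be sketched with the Python standard library's `html.parser`. The class and function names are assumptions made for the example. The sketch covers only this static case; capturing the POST request generated inside the VM (block 804) and de-obfuscation are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FormTargetParser(HTMLParser):
    """Record the HTTP method and destination host of every <form>."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.targets = []  # list of (method, host) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        attr = dict(attrs)
        # HTML forms default to GET when no method is given.
        method = (attr.get("method") or "get").upper()
        # A relative or missing action resolves against the page URL.
        action = urljoin(self.page_url, attr.get("action") or "")
        self.targets.append((method, urlsplit(action).hostname))


def post_target_domains(html: str, page_url: str) -> list:
    """Return the hosts that forms on the page would POST their contents to."""
    parser = FormTargetParser(page_url)
    parser.feed(html)
    return [host for method, host in parser.targets if method == "POST" and host]
```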

As blocks 802-804 of the method 800 are being performed, blocks 805 and 806 may also be performed concurrently. At block 805, the links to each image detected on the web page are extracted from the HTML source code by the web browser emulation logic. For example, the extraction of the link to an image may be a result of the detection of the HTML syntax: <img src=“url”>. Once the links to each image detected on the web page have been extracted, the web browsing emulation logic utilizes an image search API (e.g., Image Search API, a Custom Search API, optical character recognition techniques (OCR), comparison of the image pointed to by the extracted link with images stored in a database) to determine, for each detected image, images similar to the detected image (block 806). The image search API is also used to determine the links to each image. The web browsing emulation logic determines the reputation of each link (e.g., the image search API results may be provided in order of highest to lowest reputation) and sets the domain of the highest ranking link as the domain the web page is attempting to portray (e.g., the victim domain). In one embodiment, a rank of an image may correspond to the placement of the image in the list of results ordered according to reputation. The reputation may be decided by the image search API and based on the image host's index within the image search API, global popularity as established by the image search API, etc. The domain may be obtained by parsing and analyzing the result properties of each image within the results (e.g., the field labeled “originalContextUrl” within a result of an image from an image search using an Image Search API or a Custom Search API returns the URL of the page containing the image; other fields provide similar information such as the raw URL (non-alphanumeric characters only) or an encoded URL).
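The image link extraction of block 805 can be sketched as follows, again with the standard library's `html.parser`. The class and function names are assumptions made for the example, and the subsequent image search of block 806 is not shown because it depends on an external service.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect the absolute URL of every <img src="..."> on a page."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img .../> tags are also routed through this handler.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.page_url, src))


def image_links(html: str, page_url: str) -> list:
    """Return the image links found in the HTML source, in document order."""
    parser = ImageLinkParser(page_url)
    parser.feed(html)
    return parser.links
```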

When multiple images are detected, the web browsing emulation logic prioritizes each image (e.g., may be by location of the image) and sets the domain of the highest ranked image as the victim domain. A correlation may be performed between the images wherein, in one embodiment, when a plurality of images are associated with one domain, that domain is set as the victim domain.
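The selection of a victim domain from several images may be pictured as a vote over the top-ranked result domain of each image. The sketch assumes that the image search results have already been reduced to one list of domains per image, ordered from highest to lowest reputation, and that the images themselves are listed in priority order; the function name and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def victim_domain(results_per_image: list):
    """Pick the domain most often ranked first across all detected images.

    `results_per_image` holds one list of result domains per image, each
    ordered from highest to lowest reputation. Ties are broken in favor
    of the image that appears earliest in the list.
    """
    top = [results[0] for results in results_per_image if results]
    if not top:
        return None
    counts = Counter(top)
    return max(counts, key=lambda domain: (counts[domain], -top.index(domain)))
```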

Blocks 805 and 806 illustrate one option for determining the victim domain. As an alternative method for determining the victim domain, the web page analysis logic of the static analysis logic may determine the victim domain (see blocks 704 and 705 of FIG. 7).

When the target domain and victim domain have been determined, a determination is made as to whether the victim domain is the same as the target domain (block 807). In one embodiment, when the target domain is the same as the victim domain (yes at block 807), the classification logic may determine that the web page is not associated with a phishing, and in particular a spearphishing, attack (block 808). In other embodiments, the results of the static analysis may also be taken into account via the prioritization logic and the score determination logic.

When the target domain and the victim domain are not the same (no at block 807), the expert system of the dynamic analysis logic is invoked to perform additional heuristics on the web page to determine whether the web page is associated with a phishing web page, and thus the email associated with a phishing attack (block 809). Examples of additional heuristics that may aid in the determination of a score, as discussed below, may include but are not limited or restricted to, the presence, or lack thereof, of: a redirection from a secured website (“HTTPS”) to an unsecured website (“HTTP”) or vice versa; POST request via HTTP or HTTPS; Captcha (a type of challenge-response test used in computing to determine whether or not the user is human), etc.
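The comparison at block 807 and one of the additional heuristics named above, a change between HTTPS and HTTP during redirection, can be sketched as follows. The function names are assumptions made for the example, and the redirect chain is assumed to have been recorded during virtual processing.

```python
from urllib.parse import urlsplit


def domains_match(target_domain: str, victim_domain: str) -> bool:
    """Block 807: compare the target and victim domains, ignoring case."""
    return target_domain.lower() == victim_domain.lower()


def scheme_changes(redirect_chain: list) -> list:
    """Return (from_scheme, to_scheme) pairs where a redirect changed scheme."""
    schemes = [urlsplit(url).scheme for url in redirect_chain]
    # Pair each URL's scheme with that of the URL it redirected to.
    return [(a, b) for a, b in zip(schemes, schemes[1:]) if a != b]
```

A result containing `("https", "http")` would indicate a redirection from a secured website to an unsecured one.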

Upon applying additional heuristics to the extracted attributes of the web page, the classification logic receives the results and information related to the static analysis and the dynamic analysis. The prioritization logic may be configured to associate weighting with one or more portions of the analyses and provide the weighting to the score determination logic. The score determination logic generates a third score, as mentioned above, indicating the likelihood the web page, and thus, the email, is associated with a phishing, or more particularly, a spearphishing attack. The third score may be based on one or more of (i) the virtual processing of the web page, (ii) the analysis of the email (e.g., header and body), (iii) the analysis of the URL detected within the email, and/or (iv) the analysis of the fetched web page.
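Where the third score is numerical, the interplay of the prioritization logic and the score determination logic might resemble a weighted average. The sketch below assumes per-analysis results on a 0 to 100 scale and integer weights; the keys, the weights, the threshold and the function names are hypothetical.

```python
def third_score(results: dict, weights: dict) -> float:
    """Combine per-analysis results into one weighted average."""
    total_weight = sum(weights[name] for name in results)
    weighted_sum = sum(results[name] * weights[name] for name in results)
    return weighted_sum / total_weight


def is_spearphishing(score: float, threshold: float) -> bool:
    """The email is flagged when the score meets or exceeds the threshold."""
    return score >= threshold
```

For example, results of 40 (email), 80 (URL), 60 (web page) and 100 (dynamic analysis) with weights of 1, 1, 2 and 4 give a third score of 640 / 8 = 80.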

In some embodiments, the third score may indicate a level of confidence that the email is associated with a spearphishing attack. As discussed above, the third score may be a numerical value; one of a predefined set of categories such as “suspicious,” “malicious,” or “benign”; an electrical signal such as ‘1’ or ‘0’, or the like. In one embodiment, the email may be determined to be associated with a spearphishing attack when the third score meets or exceeds a predefined threshold. Alternatively, the email may be determined to be associated with a spearphishing attack when the classification logic classifies the email as “suspicious” and “malicious,” or, possibly, just when classified as “malicious.”

Referring to FIG. 9, a block diagram illustrating an exemplary email associated with a credential spearphishing attack is shown. The email 900 illustrates an example of a spearphishing email that may be received by the credential spearphishing detection system of FIG. 1. Display area 910 illustrates a display of a portion of the email header. As illustrated, the typical portions of the email header displayed to a recipient are shown, including information detailing the sender of the email, the subject of the email and the date of transmission. The full email header, alternatively referred to as the raw header, may include numerous attributes including, but not limited or restricted to: return-path; x-spamcatcher-score; received from, by and with; date; message-ID; user-agent; x-accept-language; mime-version; to; from; subject; content-type; and/or content-transfer. Icon 920 illustrates an example icon for “Bank” as well as the location an icon may be placed to impersonate an email from a legitimate bank. Text 921 may be present to further the impersonation of an email from the legitimate bank.

Display area 930 comprises the body of the email and may include text that impersonates an email from a legitimate bank, and may even copy the text directly from an email from the legitimate bank. Display area 930 may include URL 931 and text 932. URL 931, as discussed above, may redirect to a credential spearphishing web page. Text 932 is highlighted as an example of a typographical error, which may be used by the credential spearphishing detection system to indicate an association with a credential spearphishing attack (e.g., “ . . . as we work together to protecting your account.”).

Referring to FIG. 10, a block diagram illustrating an exemplary web page associated with a credential spearphishing attack is shown. The web page 1000 illustrates an example of a spearphishing web page that may be directed to by a URL included in an email received by the credential spearphishing detection system of FIG. 1. The address bar 1010 illustrates a typical address bar that displays the URL of the web page 1000. At first glance, the URL seems legitimate but “bankwebsitem” is likely an attempt to impersonate “bankwebsite.” The credential spearphishing detection system will analyze the URL during, for example, stage 2 as discussed above.

The icon 1020 may be included on the web page 1000 to aid in the impersonation of the legitimate bank. The icon 1020 may be a copy of the logo used by the legitimate bank. The icon 1020 will be analyzed by the credential spearphishing detection system during, for example, stages 3 and/or 4 as discussed above.

The display area 1030 includes a plurality of input forms for submitting credential information (e.g., online ID and passcode). The presence of input forms may be taken into account when determining the suspiciousness of the web page 1000 during, for example, stage 3 as discussed above. Additionally, the web browsing emulation logic may analyze the POST request generated by submission of content into the input form during the virtual processing of the web page 1000.

In the foregoing description, the invention is described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims.
