Custom Extractors

DNIF allows users to add custom event properties that are not enabled or parsed by default. Ask your administrator to review the custom event property that you want to create to ensure that it does not already exist.

Method I

To create a custom extractor, click the plus icon on the extractor list page. A side panel will be displayed as follows.

  • Enter a name for the extractor you are about to create, then start writing directly in the YAML editor.

[Screenshot: custom extractor side panel with name field and YAML editor]

  • Click Submit after writing the parser. By default, the existing native extractor will be disabled and "custom" will be appended to the extractor's name.

The extractor will be listed on the extractor list page.

Method II

In this method, you create a custom extractor by cloning an existing native extractor.

  • To create a custom extractor by cloning, click the name of an existing native extractor; its YAML page will be displayed.

    [Screenshot: YAML page of the selected native extractor]
  • Make the required changes and click Submit at the end of the page; the following screen will be displayed.
    [Screenshot: confirmation screen displayed after clicking Submit]
  • Click Save to create a custom extractor.

How to write an extractor using a YAML file?

Extractors are built as .yaml files; the values for ExtractorID, SourceName, and SourceType are populated along with their assigned keys.

Basic Information

Each extractor will have the following basic information

Field Description
schema-version The version assigned to the extractor.
extractor-id The unique ID assigned to the extractor.
source-name The name assigned to the extractor as per the device. Example: Fortigate, Checkpoint, etc.
source-type The type of device. Example: Firewall, OS, switch, etc.
source-description A short description of the extractor.
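
As an illustration only, the basic information block at the top of an extractor YAML file might look like the following. The field names are the ones listed above; the values are placeholders and not taken from an actual extractor.

  # Illustrative values only
  schema-version: '1.0'                 # version assigned to the extractor
  extractor-id: 'fortigate-custom-001'  # unique ID assigned to the extractor
  source-name: 'Fortigate'              # device name, e.g. Fortigate, Checkpoint
  source-type: 'Firewall'               # device type, e.g. Firewall, OS, switch
  source-description: 'Custom extractor for Fortigate firewall logs'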

Stream

Stream is a domain-specific collection of data from different sources that contributes to a unique dataset and a unique set of use-cases. Each value in the Stream field within the extractor can be used to generate a search that returns that particular dataset.
[Screenshot: stream values defined in an extractor]

This is the section where the streams included in the extractor are defined. There are various streams, such as AUTHENTICATION, SYSMON-PROCESS, SYSMON-NETWORK, IAM, etc.
Example: AUTHENTICATION refers to login and logout activity events, while IAM refers to user management events such as create user and delete user.
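
For example, the stream definitions for an extractor covering login and user-management events might look like the sketch below. The exact key name and layout are assumptions based on the description above, since the original snapshot is not reproduced here.

  # Assumed key name and layout - shown for illustration only
  streams:
    - AUTHENTICATION   # login and logout activity events
    - IAM              # user management events such as create user, delete user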

Master Filters

The Master Filters and First Matches help identify the extractor to be applied to a given log source; this matching has been heavily optimized for performance.
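
As a rough illustration only (the key name and pattern syntax below are assumptions, not the documented schema), a master filter can be thought of as a coarse marker that quickly decides whether this extractor should be considered for an incoming log at all:

  # Assumed key name and syntax - shown for illustration only
  master-filters:
    - 'Fortigate'      # only logs containing this coarse marker are handed to this extractor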

Event Details:

The following configuration should be done under Event Details:

  • First Match
    First Matches will help us identify different patterns associated with a log source.

    • Each first match will be associated with a decoder.
    • First matches can now yield multiple events if used with decoder=json or custom-kv; decoder=regex is the legacy approach and remains relevant for older devices.
  • Decoder
    The Decoder section defines the type of decoder to be used based on the log format. Decoders are defined at the First Match level; therefore, multiple decoders can be used in an extractor file.
    There are three decoders available:

    • JSON: It is written as ‘decoder: json’ in the extractor files. Log samples in JSON format can be parsed using this decoder; it correctly parses all key-value pairs present in the log sample.
      Note: No regex is required to parse key-values.

    • Custom (key-value): It is written as ‘decoder: custom’ in the extractor files. Log samples in key-value format can be parsed using this decoder. Here, we have to write a generic regex that captures the keys and values from the log samples appropriately.
      Example: Refer to the below snapshot:
      [Screenshot: generic key-value regex in a custom decoder]

    • As per the snapshot, a generic regex is written to capture the keys and values in the log sample. This regex results in groups of keys and values, displayed in the image below. The keys can then be annotated as per the field annotations in the extractor.
      [Screenshot: key and value groups captured by the generic regex]

    • Regex: It is written as ‘decoder: regex’ in the yml files. Log samples in Syslog (values-only) format can be parsed using this decoder. Here, we have to write the regex and define the field names in it. These field names can then be mapped and annotated in the extractor accordingly.
      Example: Refer to the below snapshot
      [Screenshot: regex decoder with field names defined in the regex]
      In the snapshot, field names are defined in the regex. This is achieved by writing ?P<field_name> at the start of each capture group.
      [Screenshot: named capture groups using ?P<field_name> in the regex]

  • Event Key Format
    In the event-key-format section, we define the field on the basis of which events within a First Match can be accurately identified and segregated.
    For example: Refer to the snapshot below:
    [Screenshot: event-key-format section defined on EventID]

  • In the snapshot above, the First Match is defined on the basis of SourceName, and events are further segregated on the basis of EventID in the ‘event-key-format’ section.

  • Event Key Mapping:
    In the ‘event-key-mapping’ section, events are defined with their appropriate Streams. While specifying an event in this section, ensure that each event is associated with a Stream.
    [Screenshot: event-key-mapping section with Streams defined per event]
    In the snapshot, EventID is defined as the pointer that provides the maximum information about the log event. Refer to the table below to understand the annotate and translate fields; a rough combined sketch of these sections is also given at the end of this section.
    [Screenshot: annotate and translate fields for an event]

Field Description
annotate Static key values for Stream, Action, and Status, added as per the log event’s information. In the above snapshot, the relevant Stream, Action, and Status are defined as per the log event’s information.
  • Stream: Type of log
  • Action: Action performed in the log event. Eg: Login, Logout
  • Status: Status of the action performed in the log event. Eg: Passed, Failed
translate All the relevant fields as per the stream should be defined under the translate section. This allows you to rename fields as per DNIF terminology.
  • Fallback:
    Fallback is a mandatory field. All the events that are defined with a Stream will be parsed accurately, while the undefined events for that particular First Match will be parsed under the fallback section.
    Example: Suppose we have created the First Match on the basis of SourceName for a Windows extractor, with further division on the basis of EventID. Some EventIDs are defined with a proper Stream while others are not; these undefined EventIDs will then be parsed under the fallback section.

    Refer to the snapshot for the fallback event field definition:

    [Screenshot: fallback event field definition]
  • Globals

    Globals is a non-mandatory field. In this section, we can define generic fields that are present throughout the extractor.

    Refer to the snapshot for the globals definition:

    [Screenshot: globals definition]
  • Substitutions

    Most devices provide substitutions for some fields. These substitutions can be defined under the globals section as follows:

    [Screenshot: substitutions defined under the globals section]

To make this work for multiple samples (First Matches) in the extractor, the following procedure can be used.

For example:
[Screenshot: first First Match with subs anchored as &id001]

[Screenshot: second First Match reusing subs via *id001]

We have two first matches here.

Referring to the first occurrence of First Match, we have subs defined for it, the only addition being &id001 placed against subs. The ‘&’ character creates a YAML anchor, assigning the value of subs to the name id001. Once the value is assigned to an anchor, it can be reused wherever the same values are needed, as in the case of subs. (The ‘&’ sign is placed before the anchor name at the point of assignment.)

Referring to the second occurrence of First Match, we again have subs, but now we only use *id001, i.e. we are reusing the subs defined before. The symbol ‘*’ is placed before id001 to refer back to the anchor &id001.

For all further occurrences of subs, we can simply refer to the first occurrence in the same way, as illustrated in the sketch below. Just ensure that the anchor is assigned at the first occurrence of subs and only referenced afterwards, and that simple anchor names (alphabetic or alphanumeric) are used.
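
Putting the pieces together, the sketch below shows how the sections discussed above (first match, decoder, event-key-format, event-key-mapping with annotate and translate, fallback, and subs with the &id001 anchor and *id001 alias) might fit into one file. The key names are taken from the sections above, but the exact nesting, the event-key-format syntax, and the field and event values are assumptions for illustration and may differ from the actual DNIF schema.

  # Illustrative sketch only - the exact DNIF extractor layout may differ
  events:                                     # assumed wrapper key
    - first-match: 'MSWinEventLog'            # pattern identifying this log pattern
      decoder: regex
      event-key-format: 'EventID'             # field used to segregate events (syntax assumed)
      event-key-mapping:
        '4624':                               # a defined EventID
          annotate:
            Stream: AUTHENTICATION            # type of log
            Action: Login                     # action performed in the event
            Status: Passed                    # status of the action
          translate:
            TargetUserName: User              # illustrative mapping of a device field to DNIF terminology
        fallback:                             # undefined EventIDs for this First Match are parsed here
          annotate:
            Stream: AUTHENTICATION
      subs: &id001                            # '&' anchors this subs block under the name id001
        '0x0': 'Success'
        '0xc000006d': 'Failure'
    - first-match: 'Microsoft-Windows-Security-Auditing'
      decoder: regex
      subs: *id001                            # '*' reuses the subs block anchored above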

Pitfalls to avoid in the new way of building parsers
      The procedure for creating an extractor has been described in detail above. If any of the steps are not followed correctly, it will result in poor extractor performance on the setup, and this could also affect the EPS hits.