Here are some tips for working with log file extracts. Suppose we are looking at some Enterprise Splunk extracts. We can use Splunk to explore the data, or we can get a simple extract and fiddle with the data in Python.
Running different experiments in Python seems more effective than trying to perform this kind of exploratory work in Splunk, mainly because there is no limit to what we can do with the data. We can create very sophisticated statistical models all in one place.
Theoretically, we can do a lot of exploration in Splunk. It has various reporting and analysis features.
But...
Using Splunk presumes that we already know what we are looking for. In many cases, we don't: we are exploring. There might be a hint that some RESTful API processing is slow, but that's about all we know. How do we proceed?
The first step is to obtain the original data in CSV format. How do we do that?
Reading original data
We will start by wrapping a csv.DictReader object with some additional functions.
Object-oriented purists will object to this strategy. "Why not just extend DictReader?" they ask. I don't have a great answer. I lean toward functional programming and keeping components orthogonal. A purely object-oriented approach would force us to use more complex-seeming mixins to achieve the same thing.
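For comparison, here is a minimal, purely hypothetical sketch of what the subclassing alternative might look like, baking in only the filtering step, next to the composable generator-function style used below:

import csv

# Hypothetical subclassing alternative: one behavior (the perf_log filter)
# baked into a reader class. Each further behavior (projection, conversion)
# would need another subclass or mixin.
class PerfLogReader(csv.DictReader):
    def __next__(self):
        row = super().__next__()
        while row['source'] != 'perf_log':
            row = super().__next__()
        return row

# The functional alternative keeps each step a small, separately composable generator.
def only_perf_log(reader):
    return (row for row in reader if row['source'] == 'perf_log')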
The general framework we use to process logs is like this.
with open("somefile.csv") as source: rdr = csv.DictReader(source)
This allows us to read a CSV-format Splunk extract and iterate over the rows in the reader. This is trick #1. It's not very tricky, but I like it.
with open("somefile.csv") as source: rdr = csv.DictReader(source) for row in rdr: print("{host} {ResponseTime} {source} {Service}".format_map(row))
We can, to some extent, report the raw data in a useful format. If we want to spruce up the output, we can change the format string, perhaps to something like "{host:30s} {ResponseTime:8s} {source:s}".
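For example, a sprucier version of the report loop might look like this (a sketch; the widths are arbitrary and the column names are assumed to match the extract):

import csv

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    for row in rdr:
        # Pad host to 30 characters and ResponseTime to 8 for aligned columns.
        print("{host:30s} {ResponseTime:8s} {source:s}".format_map(row))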
Filtering
The common situation is that we have extracted too much, but we only need to look at a subset. We could change the Splunk filter, but it is annoying to over-commit to filters before we have finished exploring. Filtering is much easier in Python. Once we know what we need, we can push it back into Splunk.
with open("somefile.csv") as source: rdr = csv.DictReader(source) rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log') for row in rdr_perf_log: print("{host} {ResponseTime} {Service}".format_map(row))
We have added a generator expression to filter the source rows, giving us a meaningful subset to work with.
Projection
In some cases, the source data has additional columns that we do not want to use. We eliminate them by projecting each row.
In principle, Splunk never produces an empty column. However, RESTful API logs can introduce a large number of column headers into the dataset, based on surrogate keys that are part of request URIs. Each such column contains one row of data, from the one request that used that surrogate key; for every other row, the column is useless. These mostly-empty columns should be removed.
We can also do this with a generator expression, but it will become a bit long. The generator function is easier to read.
def project(reader):
    for row in reader:
        yield {k: v for k, v in row.items() if v}
We construct a new row dictionary containing only the non-empty items of the original row. We can use this to wrap the output of our filter.
with open("somefile.csv") as source: rdr = csv.DictReader(source) rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log') for row in project(rdr_perf_log): print("{host} {ResponseTime} {Service}".format_map(row))
This will reduce the number of unused columns visible within the for statement.
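As a quick illustration of what project() does, here is a made-up row with the kind of mostly-empty surrogate-key columns described above (the column names are hypothetical):

sample = {'host': 'app01', 'ResponseTime': '0.123', 'Service': 'lookup',
          'k9472ab': '', 'k9472ac': ''}
print(next(project(iter([sample]))))
# {'host': 'app01', 'ResponseTime': '0.123', 'Service': 'lookup'}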
Symbol Change
The row['source'] notation becomes quite cumbersome. Working with a types.SimpleNamespace is nicer than working with a dictionary: it allows us to write row.source.
This is a cool trick to create something more useful.
rdr_ns = (types.SimpleNamespace(**row) for row in reader)
We can fold this into our sequence of processing steps.
with open("somefile.csv") as source: rdr = csv.DictReader(source) rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log') rdr_proj = project(rdr_perf_log) rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj) for row in rdr_ns: print("{host} {ResponseTime} {Service}".format_map(vars(row)))
Note the small change to the format_map() call: we added the vars() function to extract a dictionary from the SimpleNamespace attributes.
We can also write this as a function, to keep it syntactically parallel with the other steps.
def ns_reader(reader):
    return (types.SimpleNamespace(**row) for row in reader)
Indeed, we can also write it as a lambda construct that is used like a function.
ns_reader = lambda reader: (types.SimpleNamespace(**row) for row in reader)
Although the ns_reader() function and the ns_reader() lambda are used in exactly the same way, it is slightly harder to attach a documentation string and doctest unit tests to a lambda. For that reason, it is better to avoid lambda constructs here.
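For example, a docstring and a doctest attach naturally to the def form; there is no equally tidy place to put them on the lambda. A minimal sketch:

import types

def ns_reader(reader):
    """
    Wrap each row dictionary in a SimpleNamespace.

    >>> rows = ns_reader(iter([{'host': 'app01', 'Service': 'lookup'}]))
    >>> next(rows).host
    'app01'
    """
    return (types.SimpleNamespace(**row) for row in reader)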
We could also use map(lambda row: types.SimpleNamespace(**row), reader). Some people prefer this to the generator expression.
We could write a full for statement with an internal yield statement, but writing a big statement for such a small thing does not seem beneficial.
We have many choices because Python offers so many functional programming features. We don't often see Python described as a functional language, yet there are multiple ways to handle a simple mapping.
Mapping: Transforming and Deriving Data
We often have a fairly obvious list of data transformations, plus a growing list of derived data items. The derived items are dynamic and based on the different hypotheses we are testing. Whenever we have an experiment or a question, we may change the derived data.
Each of these steps (filtering, projecting, transforming, and deriving) is the "map" stage of a map-reduce pipeline. We could create smaller functions and apply them with map(). Because we are updating a stateful object, we can't use the general map() function here. If we wanted a purer functional programming style, we would use an immutable namedtuple instead of a mutable SimpleNamespace.
def convert(reader):
    for row in reader:
        row._time = datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z")
        row.response_time = float(row.ResponseTime)
        yield row
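For contrast, here is a rough sketch of the purer, immutable namedtuple variant mentioned above. It is not the pipeline used in this article; the field names are assumptions, and note that namedtuple fields cannot start with an underscore, so _time becomes timestamp here.

import datetime
from collections import namedtuple

ConvertedRow = namedtuple('ConvertedRow', ['host', 'service', 'timestamp', 'response_time'])

def convert_immutable(reader):
    # Build a new immutable row instead of mutating a SimpleNamespace in place.
    for row in reader:
        yield ConvertedRow(
            host=row.host,
            service=row.Service,
            timestamp=datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z"),
            response_time=float(row.ResponseTime),
        )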
As our exploration proceeds, we will adjust the body of this conversion function. Perhaps we will start with a minimal set of conversions and derivations and extend it as we pursue questions like "are these correct?". When we find something that doesn't work, we will take it back out.
Our overall processing is as follows:
with open("somefile.csv") as source: rdr = csv.DictReader(source) rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log') rdr_proj = project(rdr_perf_log) rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj) rdr_converted = convert(rdr_ns) for row in rdr_converted: row.start_time = row._time - datetime.timedelta(seconds=row.response_time) row.service = some_mapping(row.Service) print( "{host:30s} {start_time:%H:%M:%S} {response_time:6.3f} {service}".format_map(vars(row)) )
Note the change in the body of the loop. The convert() function produces values we are sure about. We have added some extra variables inside the for loop that we can't yet be 100% certain about. Before updating the convert() function, we will see whether they are useful (and even correct).
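If they do turn out to be useful, the update might look like the sketch below, with the derivations folded into convert(); some_mapping() is the same hypothetical service lookup used in the loop above.

def convert(reader):
    for row in reader:
        row._time = datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z")
        row.response_time = float(row.ResponseTime)
        # Promoted from the exploration loop once they proved useful.
        row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
        row.service = some_mapping(row.Service)
        yield row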
Reduction
For the reduction, we take a slightly different approach. We need to refactor our previous example into a generator function.
def converted_log(some_file):
    with open(some_file) as source:
        rdr = csv.DictReader(source)
        rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
        rdr_proj = project(rdr_perf_log)
        rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
        rdr_converted = convert(rdr_ns)
        for row in rdr_converted:
            row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
            row.service = some_mapping(row.Service)
            yield row
Then we replaced print() with a yield.
Here is the other part of the refactoring.
for row in converted_log("somefile.csv"):
    print("{host:30s} {start_time:%H:%M:%S} {response_time:6.3f} {service}".format_map(vars(row)))
Ideally, all our programming is like this. We use generator functions to generate data. The final display of data remains completely separate. This allows us to refactor and change the processing more freely.
Now we can do things like collect the rows into a Counter() object, or possibly calculate some statistics. We can use defaultdict(list) to group the rows by service.
by_service = defaultdict(list)
for row in converted_log("somefile.csv"):
    by_service[row.service].append(row.response_time)
for svc in sorted(by_service):
    m = statistics.mean(by_service[svc])
    print("{svc:15s} {m:.2f}".format_map(vars()))
We decided to create concrete list objects here. We could use itertools to group the response times by service; it looks like proper functional programming, but the implementation points out some limitations of the Pythonic form of functional programming. Either we must sort the data (creating list objects), or we must create lists as we group the data. For computing several different statistics, it is usually easier to group the data by creating concrete lists.
Now we are doing two things rather than simply printing the row object:
Create some local variables, such as svc and m. We can easily add other measures or variations.
Use the vars() function with no arguments, which creates a dictionary out of the local variables.
Calling vars() with no arguments behaves like locals(), which is a handy trick. It lets us create whatever local variables we want and include them in the formatted output. We can hack in whatever statistical measures we think might be relevant.
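A tiny illustration of the trick (svc, m, and mx are just made-up locals):

import statistics

def report(svc, samples):
    m = statistics.mean(samples)
    mx = max(samples)  # a new measure is instantly available to the format string
    # vars() with no arguments behaves like locals() here.
    print("{svc:15s} {m:6.2f} {mx:6.2f}".format_map(vars()))

report("nominals", [0.25, 0.31, 0.28])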
Since our basic processing loop is for row in converted_log("somefile.csv"), we can explore many processing alternatives through a small, easily modified script. We can explore a number of hypotheses about why some RESTful API processing is slow while other processing is fast.
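As one example of such a variation, here is a sketch that reuses the converted_log() generator to bucket response times (rounded to the nearest tenth of a second) into the Counter mentioned earlier:

from collections import Counter

histogram = Counter(
    round(row.response_time, 1)
    for row in converted_log("somefile.csv")
)
for bucket, count in sorted(histogram.items()):
    print("{bucket:6.1f}s {count:6d}".format(bucket=bucket, count=count))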
Summary
That covers exploratory data analysis in Python using functional techniques. I hope it is helpful; if you have questions, please leave a comment.