
Add calculation of total information content and information temperature #5011

Open
notxvilka opened this issue Mar 18, 2025 · 6 comments

@notxvilka
Contributor

notxvilka commented Mar 18, 2025

Extend the information properties calculation and visualisation with two more fundamental quantities:

  • Total information content (or mutual information)
  • Information temperature

Definitions

Total information content or mutual information could be considered analogous to enthalpy in thermodynamics. For mutual information:
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

  • p(x,y): Joint probability of X and Y
  • p(x), p(y): Marginal probabilities of X and Y
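
Equivalently, mutual information can be written in terms of entropies, which is the form the example code below computes:

I(X;Y) = H(X) + H(Y) - H(X,Y)

where H(X,Y) is the joint entropy of the pair (X, Y).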

Note that mutual information can only be computed relative to something else, so it is not always applicable. What can be done is to compute mutual information against a reference, e.g. against a random uniform sequence, or between two data blocks compared with each other (see the snippet after the example output below).

In information theory, information temperature (T_info) is sometimes defined as the derivative of entropy with respect to an "informational energy" or complexity measure:
T_{\text{info}} = \frac{\partial H}{\partial E_{\text{info}}}

  • H: Entropy
  • E_info: Informational energy or complexity
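
There is no single standard choice of E_info. One simple option, used in the example code below, is to treat log2 of the sequence length N as the complexity measure, which gives the approximation

T_{\text{info}} \approx \frac{H}{\log_2 N}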

Entropy and mutual information relationship

  • In thermodynamics, entropy and enthalpy are connected through temperature and Gibbs free energy (the relation is recalled after this list), describing the balance between disorder and energy.
  • In information theory, entropy quantifies uncertainty, while mutual information (an enthalpy-like concept) quantifies shared information. The analogy highlights how both fields use these concepts to describe system behavior, though the mathematical relationships differ.
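
For reference, the thermodynamic relation the analogy draws on is the Gibbs free energy,

G = H - TS

where G is the Gibbs free energy, H the enthalpy, T the temperature, and S the entropy. As noted above, the information-theoretic counterparts do not satisfy the same exact relation.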

Applications

  • Code similarity
  • Data dependency calculation (e.g. between fields of a structure or file headers)
  • Detecting patterns in obfuscated data and code

See example (generated by an AI):

import numpy as np
from collections import Counter

def calculate_entropy(sequence):
    """Calculate Shannon entropy of a byte sequence."""
    length = len(sequence)
    counts = Counter(sequence)
    probabilities = np.array([count / length for count in counts.values()])
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy

def calculate_mutual_information(seq1, seq2):
    """Calculate mutual information between two equal-length sequences
    as I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    # Joint entropy of the paired symbols (x, y)
    joint_entropy = calculate_entropy(list(zip(seq1, seq2)))

    # Individual (marginal) entropies
    entropy_seq1 = calculate_entropy(seq1)
    entropy_seq2 = calculate_entropy(seq2)

    # Mutual information
    mutual_information = entropy_seq1 + entropy_seq2 - joint_entropy
    return mutual_information

def calculate_information_temperature(entropy, sequence_length):
    """Approximate information temperature as entropy divided by log2 of the sequence length."""
    # Treats log2(sequence_length) as the informational energy / complexity measure E_info
    temperature = entropy / np.log2(sequence_length)
    return temperature

# Generate a random sequence of 1024 bytes
np.random.seed(42)  # For reproducibility
sequence = np.random.randint(0, 256, 1024, dtype=np.uint8)

# Calculate entropy
entropy = calculate_entropy(sequence)
print(f"Entropy: {entropy:.4f} bits")

# Split sequence into two halves and calculate mutual information
half_length = len(sequence) // 2
seq1, seq2 = sequence[:half_length], sequence[half_length:]
mutual_information = calculate_mutual_information(seq1, seq2)
print(f"Mutual Information (Enthalpy Analogy): {mutual_information:.4f} bits")

# Calculate information temperature
information_temperature = calculate_information_temperature(entropy, len(sequence))
print(f"Information Temperature: {information_temperature:.4f}")

And an output example:

Entropy: 7.8007 bits
Mutual Information (Enthalpy Analogy): 6.1521 bits
Information Temperature: 0.7801
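
For the reference-sequence case mentioned in the definitions above, a minimal sketch could compare the data against an independent uniform random block of the same length (assuming the functions and the sequence from the example are still in scope):

# Mutual information against an independent random uniform reference
reference = np.random.randint(0, 256, len(sequence), dtype=np.uint8)
mi_vs_reference = calculate_mutual_information(sequence, reference)
print(f"Mutual Information vs. uniform reference: {mi_vs_reference:.4f} bits")

Note that with only about a thousand samples over a 256 x 256 joint alphabet, this plug-in estimator is strongly biased upward, which is also why the split-halves example above reports several bits of mutual information for independent random data; a real implementation would likely want a bias correction or coarser binning.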

See these sources for more context:

@notxvilka added the enhancement and good first issue labels on Mar 18, 2025
@yadunand-kamath

Hey! I would like to try and work on this.

@notxvilka
Contributor Author

> Hey! I would like to try and work on this.

Go ahead.

@yadunand-kamath

I am confused as to in which directory to add the "information_theory.py" file and in which test directory to add the "test_information_theory.py" file. Please advise.

@notxvilka
Contributor Author

> I am confused as to in which directory to add the "information_theory.py" file and in which test directory to add the "test_information_theory.py" file. Please advise.

  1. Rizin is written in pure C
  2. See https://github.com/rizinorg/rizin/blob/dev/DEVELOPERS.md
  3. Start from librz/core/cmd/cmd_hash.c and librz/core/cmd_descs/cmd_hash.yaml - see how entropy commands are implemented.

@varda0

varda0 commented Mar 25, 2025

Hello!
I’m working on the issue to extend information properties in Rizin.
After searching the codebase, I found that entropy is already handled in:

  • cmd_print.c (handle_entropy())
  • cmd_search.c (entropy-based searches)
  • bfile.c (file entropy analysis)

I plan to:

  1. Add a mutual information function in C.
  2. Modify handle_entropy() to compute mutual information & information temperature.
  3. Integrate these into the entropy commands.

Does this approach align with Rizin’s design? Any suggestions before I start?

@Rot127
Member

Rot127 commented Mar 25, 2025

Nah, a better approach is to:

  • Add a new command in librz/core/cmd_descs/cmd_hash.yaml
  • Implement a handler in librz/core/cmd/cmd_hash.c.
  • You can do the entropy calculations via the RzHash API (see rz_hash.h). You initialize a new RzHash instance, then initialize a config for entropy (with rz_hash_cfg_new_with_algo2()) and do rz_hash_cfg_update() and rz_hash_cfg_final() on it.

This is roughly how you can calculate the entropy on data.
Not sure which file the analysis should live in, but we can move it once you have opened the first draft PR.
