Gram Anchoring Stage #293

@youngtboy

Description

This is excellent work, but I have a few questions that I hope you can answer.

  1. According to the README, Stage 1 training should be run first with its configuration file, followed by Stage 2. In the Stage 2 configuration file, however, the gram loss is only enabled after 1M iterations, whereas the paper says the gram loss should be applied immediately once Stage 1 training is complete. With the current procedure, Stage 2 effectively re-trains from scratch for 1M iterations before the gram loss is activated. Why is this the case? (For what I understand the gram loss to compute, see the sketch after this list.)
  2. The paper shows that the dense features at 200K iterations, early in training, are better than those at 1M iterations. Yet the current practice is to use the checkpoint from the end of Stage 1 (1M iterations) as the gram teacher for Stage 2. This contradicts the findings presented in the paper.
  3. If that is the case, can I assume that Stage 1 (as defined in the configuration file) serves only to produce a gram teacher? And does Stage 2, as launched with the current configuration file, actually cover both the pre-training stage (Stage 1 in the paper) and the gram anchoring stage (Stage 2 in the paper) at once?
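
To make sure we are talking about the same loss in question 1, here is a minimal sketch of what I understand the gram anchoring loss to compute, based on my reading of the paper: the Gram matrix of the student's L2-normalized patch features is pulled toward that of a frozen gram teacher. This is my own illustration, not the repository's actual implementation; the function name and the `(B, N, D)` feature layout are assumptions.

```python
# Sketch only, not the repo's code: gram anchoring loss between student
# and a frozen gram teacher, on L2-normalized patch features.
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        teacher_patches: torch.Tensor) -> torch.Tensor:
    """student_patches, teacher_patches: (B, N, D) patch features.

    Returns the mean squared difference between the two (N x N) Gram
    matrices of L2-normalized features (pairwise cosine similarities).
    """
    s = F.normalize(student_patches, dim=-1)   # (B, N, D)
    t = F.normalize(teacher_patches, dim=-1)   # (B, N, D)
    gram_s = s @ s.transpose(1, 2)             # (B, N, N) student similarities
    gram_t = t @ t.transpose(1, 2)             # (B, N, N) teacher similarities
    # The gram teacher provides a fixed target, so no gradient flows into it.
    return ((gram_s - gram_t.detach()) ** 2).mean()
```

If this matches what the repository does, then the choice of which checkpoint supplies `teacher_patches` (200K vs. 1M iterations) is exactly what my questions 2 and 3 are about.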

I look forward to your reply.
