Reads the 20 newsgroups data and trains an adaptive logistic regression model on it.
The first command line argument gives the path of the directory holding the training
data. The optional second argument, leakType, defines which classes of features to use.
Importantly, leakType controls whether a synthetic date is injected into the data as
a target leak and if so, how.
The value of leakType % 3 determines whether the target leak is injected, according to
the following table:
leakType % 3 | Leak injected |
0 | No leak injected |
1 | Synthetic date injected in MMM-yyyy format. This will be a single token and is a perfect target leak since each newsgroup is given a different month |
2 | Synthetic date injected in dd-MMM-yyyy HH:mm:ss format. The day varies and thus there are more leak symbols that need to be learned. Ultimately this is just as big a leak as case 1. |
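The two date formats in the table can be illustrated with a small sketch. Only the pattern strings MMM-yyyy and dd-MMM-yyyy HH:mm:ss come from the description above. The class LeakDateSketch, its methods, the base year 2000 and the rule "month offset = newsgroup index" are hypothetical choices made for illustration and may not match how the trainer actually builds its dates.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch, not the trainer's own code.
final class LeakDateSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Assumption: newsgroup i gets month i counted from January 2000,
    // so every newsgroup ends up with a distinct month-year pair.
    static Date syntheticDate(int newsgroupIndex, int dayOfMonth) {
        Calendar cal = Calendar.getInstance(UTC, Locale.ENGLISH);
        cal.clear();
        cal.set(2000, Calendar.JANUARY, dayOfMonth, 12, 0, 0);
        cal.add(Calendar.MONTH, newsgroupIndex);
        return cal.getTime();
    }

    // Formats a date with a fixed locale and time zone so the output is stable.
    static String format(String pattern, Date date) {
        SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.ENGLISH);
        df.setTimeZone(UTC);
        return df.format(date);
    }

    public static void main(String[] args) {
        Date d = syntheticDate(1, 5);
        // Case 1: one token per newsgroup, regardless of the day.
        System.out.println(format("MMM-yyyy", d));
        // Case 2: the day varies, so one newsgroup produces several distinct symbols.
        System.out.println(format("dd-MMM-yyyy HH:mm:ss", d));
    }
}
```

In this sketch, two documents from the same newsgroup always share the MMM-yyyy token, while their dd-MMM-yyyy HH:mm:ss strings differ whenever the day differs. That is why case 2 has more symbols to learn even though the month still identifies the newsgroup.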
leakType also determines what other text is indexed. If leakType is greater than or
equal to 6, neither the headers nor the body text are used as features and the leak is the only
source of data. If leakType is at least 3 but less than 6, only the subject words are used as features.
If leakType is less than 3, both subject and body text are used as features.
A leakType of 0 gives no leak and all textual features.
The following table summarizes the commonly used values of leakType:
leakType | Leak?                | Subject? | Body? |
0        | no                   | yes      | yes   |
1        | MMM-yyyy             | yes      | yes   |
2        | dd-MMM-yyyy HH:mm:ss | yes      | yes   |
3        | no                   | yes      | no    |
4        | MMM-yyyy             | yes      | no    |
5        | dd-MMM-yyyy HH:mm:ss | yes      | no    |
6        | no                   | no       | no    |
7        | MMM-yyyy             | no       | no    |
8        | dd-MMM-yyyy HH:mm:ss | no       | no    |
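The decoding rules summarized in the table can be written down as a minimal sketch. The class LeakTypeSketch, the enum Leak and the method names are hypothetical and are not taken from the trainer itself. The sketch also assumes a non-negative leakType, since Java's % operator returns a negative remainder for negative operands.

```java
import java.util.Locale;

// Hypothetical helper that mirrors the rules described above.
final class LeakTypeSketch {
    enum Leak { NONE, MONTH_YEAR, FULL_TIMESTAMP }

    // leakType % 3 selects the kind of leak (assumes leakType >= 0).
    static Leak leak(int leakType) {
        switch (leakType % 3) {
            case 1:
                return Leak.MONTH_YEAR;      // MMM-yyyy
            case 2:
                return Leak.FULL_TIMESTAMP;  // dd-MMM-yyyy HH:mm:ss
            default:
                return Leak.NONE;
        }
    }

    // Subject words are used for leakType below 6.
    static boolean usesSubject(int leakType) {
        return leakType < 6;
    }

    // Body text is used only for leakType below 3.
    static boolean usesBody(int leakType) {
        return leakType < 3;
    }

    public static void main(String[] args) {
        // Prints one row per commonly used leakType value.
        for (int t = 0; t <= 8; t++) {
            System.out.printf(Locale.ROOT, "%d | %s | %b | %b%n",
                t, leak(t), usesSubject(t), usesBody(t));
        }
    }
}
```

Keeping the leak kind and the two text flags as separate functions matches how the description treats them: the remainder picks the leak, while the magnitude of leakType picks the text features.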