public class FileDataModel extends AbstractDataModel
A DataModel backed by a delimited file. This class expects a file where each line contains a user ID, followed by an item ID, followed by an optional preference value, followed by an optional timestamp. Commas or tabs delimit fields:
userID,itemID[,preference[,timestamp]]
Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).
The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs. The timestamp, if present, is assumed to be parseable as a long, though this can be overridden via readTimestampFromString(String).
The preference value may be empty, to indicate "no preference value", but the field itself cannot be omitted when a timestamp follows. That is, this is legal:
123,456,,129050099059
But this isn't:
123,456,129050099059
It is also acceptable for the lines to contain additional fields. Fields beyond the fourth will be ignored. An empty line, or one that begins with '#', will be ignored as a comment.
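The line format described above can be sketched in plain Java. This is only an illustration of the documented format, not Mahout's own parsing code; the class name ParsedLine and its parse method are hypothetical and use nothing beyond the JDK.

```java
import java.util.regex.Pattern;

/** Hypothetical holder for one parsed data line; not part of Mahout. */
final class ParsedLine {
  final long userID;
  final long itemID;
  final Float preference; // null when the field is absent or empty
  final Long timestamp;   // null when the field is absent or empty

  private static final Pattern DELIMITER = Pattern.compile("[,\t]");

  private ParsedLine(long userID, long itemID, Float preference, Long timestamp) {
    this.userID = userID;
    this.itemID = itemID;
    this.preference = preference;
    this.timestamp = timestamp;
  }

  /** Parses userID,itemID[,preference[,timestamp]]; returns null for blank lines and '#' comments. */
  static ParsedLine parse(String line) {
    if (line.isEmpty() || line.charAt(0) == '#') {
      return null;
    }
    // A limit of -1 keeps trailing empty fields such as the one in "123,456,"
    String[] fields = DELIMITER.split(line, -1);
    if (fields.length < 2) {
      throw new IllegalArgumentException("Need at least a user ID and an item ID: " + line);
    }
    long userID = Long.parseLong(fields[0]);
    long itemID = Long.parseLong(fields[1]);
    Float preference = null;
    if (fields.length > 2 && !fields[2].isEmpty()) {
      preference = (float) Double.parseDouble(fields[2]);
    }
    Long timestamp = null;
    if (fields.length > 3 && !fields[3].isEmpty()) {
      timestamp = Long.parseLong(fields[3]);
    }
    return new ParsedLine(userID, itemID, preference, timestamp);
  }
}
```

With this sketch, "123,456,,129050099059" yields no preference and a timestamp, whereas "123,456,129050099059" would treat 129050099059 as the preference value, which is why the empty field must be kept.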
This class will reload data from the data file when refresh(Collection) is called, unless the file has been reloaded very recently already.
This class will also look for update "delta" files in the same directory, with file names that start the same way (up to the first period). These files have the same format, and provide updated data that supersedes what is in the main data file. This is a mechanism that allows an application to push updates to FileDataModel without re-copying the entire data file.
One small format difference exists. Update files must also be able to express deletes. This is done by ending with a blank preference value, as in "123,456,".
Note that it's all-or-nothing -- either all of the lines in the file must express no preference value, or all of them must express one. These cannot be mixed. Put another way, there will always be the same number of delimiters on every line of the file!
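The update-file semantics above (later values supersede earlier ones, and a blank preference value expresses a delete) can be illustrated with a small JDK-only sketch. The class DeltaSketch and the nested-map representation are assumptions made for illustration; FileDataModel uses its own internal data structures.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of applying update ("delta") lines; not Mahout code. */
final class DeltaSketch {

  /** Applies update lines to prefs, a map of userID -> (itemID -> preference value). */
  static void applyUpdates(Map<Long, Map<Long, Float>> prefs, List<String> updateLines) {
    for (String line : updateLines) {
      if (line.isEmpty() || line.charAt(0) == '#') {
        continue; // blank lines and comments are skipped
      }
      String[] fields = line.split("[,\t]", -1); // -1 keeps the trailing empty field
      long userID = Long.parseLong(fields[0]);
      long itemID = Long.parseLong(fields[1]);
      String value = fields.length > 2 ? fields[2] : "";
      if (value.isEmpty()) {
        // A blank preference value, as in "123,456,", is treated as a delete
        Map<Long, Float> userPrefs = prefs.get(userID);
        if (userPrefs != null) {
          userPrefs.remove(itemID);
          if (userPrefs.isEmpty()) {
            prefs.remove(userID);
          }
        }
      } else {
        // Any other value supersedes what was read before
        prefs.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, Float.parseFloat(value));
      }
    }
  }
}
```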
This class is not intended for use with very large amounts of data (over, say, tens of millions of rows). For that, a JDBC-backed DataModel and a database are more appropriate.
It is possible and likely useful to subclass this class and customize its behavior to accommodate application-specific needs and input formats. See processLine(String, FastByIDMap, FastByIDMap, boolean) and processLineWithoutID(String, FastByIDMap, FastByIDMap).
Modifier and Type | Field and Description |
---|---|
static long | DEFAULT_MIN_RELOAD_INTERVAL_MS |
Constructor and Description |
---|
FileDataModel(File dataFile) |
FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS) |
FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS, String delimiterRegex) |
FileDataModel(File dataFile, String delimiterRegex) |
Modifier and Type | Method and Description |
---|---|
protected DataModel | buildModel() |
static char | determineDelimiter(String line) |
File | getDataFile() |
LongPrimitiveIterator | getItemIDs() |
FastIDSet | getItemIDsFromUser(long userID) |
float | getMaxPreference() |
float | getMinPreference() |
int | getNumItems() |
int | getNumUsers() |
int | getNumUsersWithPreferenceFor(long itemID) |
int | getNumUsersWithPreferenceFor(long itemID1, long itemID2) |
PreferenceArray | getPreferencesForItem(long itemID) |
PreferenceArray | getPreferencesFromUser(long userID) |
Long | getPreferenceTime(long userID, long itemID): Retrieves the time at which a preference value from a user and item was set, if known. |
Float | getPreferenceValue(long userID, long itemID): Retrieves the preference value for a single user and item. |
LongPrimitiveIterator | getUserIDs() |
boolean | hasPreferenceValues() |
protected void | processFile(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData) |
protected void | processFileWithoutID(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps) |
protected void | processLine(String line, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData): Reads one line from the input file and adds the data to a FastByIDMap data structure which maps user IDs to preferences. |
protected void | processLineWithoutID(String line, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps) |
protected long | readItemIDFromString(String value): Subclasses may wish to override this if ID values in the file are not numeric. |
protected long | readTimestampFromString(String value): Subclasses may wish to override this to change how time values in the input file are parsed. |
protected long | readUserIDFromString(String value): Subclasses may wish to override this if ID values in the file are not numeric. |
void | refresh(Collection<Refreshable> alreadyRefreshed): Triggers "refresh" -- whatever that means -- of the implementation. |
protected void | reload() |
void | removePreference(long userID, long itemID): See the warning at setPreference(long, long, float). |
void | setPreference(long userID, long itemID, float value): Note that this method only updates the in-memory preference data that this FileDataModel maintains; it does not modify any data on disk. |
String | toString() |
Methods inherited from class AbstractDataModel: setMaxPreference, setMinPreference
public static final long DEFAULT_MIN_RELOAD_INTERVAL_MS
public FileDataModel(File dataFile) throws IOException
dataFile - file containing preferences data. If the file is compressed (and its name ends in .gz or .zip accordingly), it will be decompressed as it is read
FileNotFoundException - if dataFile does not exist
IOException - if the file can't be read
public FileDataModel(File dataFile, String delimiterRegex) throws IOException
delimiterRegex - if your data file doesn't use '\t' or ',' as its delimiter, you can specify a custom regex pattern
Throws: IOException
public FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS) throws IOException
transpose - transposes user IDs and item IDs -- convenient for 'flipping' the data model this way
minReloadIntervalMS - the minimum interval in milliseconds after which a full reload of the original data file is done when refresh() is called
Throws: IOException
See also: FileDataModel(File)
public FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS, String delimiterRegex) throws IOException
delimiterRegex - if your data file doesn't use '\t' or ',' as delimiters, you can specify your own using a regex pattern
Throws: IOException
public File getDataFile()
protected void reload()
protected DataModel buildModel() throws IOException
Throws: IOException
public static char determineDelimiter(String line)
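This page gives no description for determineDelimiter. Given that the class documentation names commas and tabs as the default delimiters, one plausible reading is that the method inspects a sample line and reports which of the two it uses. The sketch below shows that idea only; the class name DelimiterSketch, the choice to return the first delimiter found, and the exception thrown when neither appears are all assumptions, not the documented behavior.

```java
/** Hypothetical delimiter detection; an assumption about behavior, not Mahout's code. */
final class DelimiterSketch {

  /** Returns the first ',' or '\t' found in the line. */
  static char determineDelimiter(String line) {
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == ',' || c == '\t') {
        return c;
      }
    }
    // Assumed failure mode: neither default delimiter is present
    throw new IllegalArgumentException("No comma or tab delimiter found in line: " + line);
  }
}
```

A file that uses some other separator would need one of the constructors that accept a delimiterRegex.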
protected void processFile(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData)
protected void processLine(String line, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData)
Reads one line from the input file and adds the data to a FastByIDMap data structure which maps user IDs to preferences. This assumes that each line of the input file corresponds to one preference. After reading a line and determining which user and item the preference pertains to, the method should look to see if the data contains a mapping for the user ID already, and if not, add an empty data structure of preferences as appropriate to the data.
Note that if the line is empty or begins with '#' it will be ignored as a comment.
line - line from input data file
data - all data read so far, as a mapping from user IDs to preferences
fromPriorData - an implementation detail -- if true, data will map IDs to PreferenceArray since the framework is attempting to read and update raw data that is already in memory. Otherwise it maps to Collections of Preferences, since it's reading fresh data. Subclasses must be prepared to handle this wrinkle.
protected void processFileWithoutID(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps)
protected void processLineWithoutID(String line, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps)
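protected long readUserIDFromString(String value)
Subclasses may wish to override this if ID values in the file are not numeric, for example by using an IDMigrator to perform translation.
protected long readItemIDFromString(String value)
Subclasses may wish to override this if ID values in the file are not numeric, for example by using an IDMigrator to perform translation.
protected long readTimestampFromString(String value)
Subclasses may wish to override this to change how time values in the input file are parsed.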
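To make the readUserIDFromString, readItemIDFromString and readTimestampFromString hooks more concrete, the sketch below shows the kind of conversion an overriding subclass might delegate to. It is JDK-only and hypothetical: StringFieldSketch, toLongID and parseIsoTimestamp are illustrative names, hashing with MD5 is just one possible way to map a non-numeric ID to a long, and ISO-8601 is just one possible timestamp format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;

/** Hypothetical helpers that an overriding subclass could call; not Mahout code. */
final class StringFieldSketch {

  /** Maps an arbitrary string ID to a long using the first 8 bytes of its MD5 digest. */
  static long toLongID(String value) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(value.getBytes(StandardCharsets.UTF_8));
      long hash = 0L;
      for (int i = 0; i < 8; i++) {
        hash = (hash << 8) | (digest[i] & 0xFFL);
      }
      return hash;
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform is required to provide
      throw new IllegalStateException(e);
    }
  }

  /** Parses an ISO-8601 instant such as "2017-01-01T00:00:00Z" into epoch milliseconds. */
  static long parseIsoTimestamp(String value) {
    return Instant.parse(value).toEpochMilli();
  }
}
```

A hash like this is one-way, so an application that needs to turn recommended long IDs back into the original strings has to store that reverse mapping itself.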
public LongPrimitiveIterator getUserIDs() throws TasteException
TasteException - if an error occurs while accessing the data
public PreferenceArray getPreferencesFromUser(long userID) throws TasteException
userID - ID of user to get prefs for
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public FastIDSet getItemIDsFromUser(long userID) throws TasteException
userID - ID of user to get prefs for
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public LongPrimitiveIterator getItemIDs() throws TasteException
Returns: a LongPrimitiveIterator of all item IDs in the model, in order
TasteException - if an error occurs while accessing the data
public PreferenceArray getPreferencesForItem(long itemID) throws TasteException
itemID - item ID
Returns: Preferences expressed for that item, ordered by user ID, as an array
NoSuchItemException - if the item does not exist
TasteException - if an error occurs while accessing the data
public Float getPreferenceValue(long userID, long itemID) throws TasteException
Description copied from interface DataModel: Retrieves the preference value for a single user and item.
userID - user ID to get pref value from
itemID - item ID to get pref value for
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public Long getPreferenceTime(long userID, long itemID) throws TasteException
Description copied from interface DataModel: Retrieves the time at which a preference value from a user and item was set, if known.
userID - user ID for preference in question
itemID - item ID for preference in question
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public int getNumItems() throws TasteException
TasteException - if an error occurs while accessing the data
public int getNumUsers() throws TasteException
TasteException - if an error occurs while accessing the data
public int getNumUsersWithPreferenceFor(long itemID) throws TasteException
itemID - item ID to check for
TasteException - if an error occurs while accessing the data
public int getNumUsersWithPreferenceFor(long itemID1, long itemID2) throws TasteException
itemID1 - first item ID to check for
itemID2 - second item ID to check for
TasteException - if an error occurs while accessing the data
public void setPreference(long userID, long itemID, float value) throws TasteException
Note that this method only updates the in-memory preference data that this FileDataModel maintains; it does not modify any data on disk. Therefore any updates from this method are only temporary, and lost when data is reloaded from a file. This method should also be considered relatively slow.
userID - user to set preference for
itemID - item to set preference for
value - preference value
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public void removePreference(long userID, long itemID) throws TasteException
See the warning at setPreference(long, long, float).
userID - user from which to remove preference
itemID - item to remove preference for
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data
public void refresh(Collection<Refreshable> alreadyRefreshed)
Description copied from interface Refreshable: Triggers "refresh" -- whatever that means -- of the implementation. The general contract is that any Refreshable should always leave itself in a consistent, operational state, and that the refresh atomically updates internal state from old to new.
alreadyRefreshed - Refreshables that are known to have already been refreshed as a result of an initial call to a refresh(Collection) method on some object. This ensures that objects in a refresh dependency graph aren't refreshed twice needlessly.
public boolean hasPreferenceValues()
public float getMaxPreference()
Specified by: getMaxPreference in interface DataModel
Overrides: getMaxPreference in class AbstractDataModel
Returns: the maximum preference value that is possible in the current problem domain. For example, on a rating scale that tops out at 5 stars, while a Recommender may estimate a preference value above 5.0, it isn't "fair" to consider that the system is actually suggesting an impossible rating of, say, 5.4 stars. In practice the application would cap this estimate to 5.0. Since evaluators evaluate the difference between estimated and actual value, this at least prevents this effect from unfairly penalizing a Recommender.
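The capping described here amounts to clamping an estimate into the range between the minimum and maximum preference. A minimal sketch, assuming a hypothetical helper class PreferenceCapSketch whose bounds are supplied by the caller (for instance from getMinPreference() and getMaxPreference()):

```java
/** Hypothetical helper illustrating how an estimate can be capped; not Mahout code. */
final class PreferenceCapSketch {

  /** Clamps estimate into the closed range [minPreference, maxPreference]. */
  static float cap(float estimate, float minPreference, float maxPreference) {
    return Math.max(minPreference, Math.min(maxPreference, estimate));
  }
}
```

On a 1-to-5 scale, cap(5.4f, 1.0f, 5.0f) gives 5.0f, so an evaluator comparing estimated and actual values would not penalize the estimate for the impossible extra 0.4.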
public float getMinPreference()
Specified by: getMinPreference in interface DataModel
Overrides: getMinPreference in class AbstractDataModel
See also: DataModel.getMaxPreference()
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.