Wednesday, 8 April 2009

Jaikoz Future Roadmap

Okay, here are the priorities for the next few months.

1. Fix memory management so that memory consumption isn't tied to the number of songs loaded, by using a database. This should allow you to load your complete collection if you wish, AND subsequent restarts of Jaikoz should be able to use the cached data instead of re-reading a file that hasn't changed, which will greatly speed up the initial file loading.

2. Simplify Matching for better results. Jaikoz has many options to change how the matching works, but they are difficult to understand and some only make sense for individual files. I'm going to drop some of these options and replace them with ones that make more sense, such as 'Match the album that best matches existing metadata' OR 'Always prefer original albums even if there is a better match with a compilation'.

3. Improve results by improving Musicbrainz. I'm going to work on the Musicbrainz Search server, fixing issues and improving performance.

4. Simplify the Interface. New users find it difficult to understand the separate tag and analyse tasks and the user interface. I'm going to either simplify it further or provide a new Simple Mode, with the existing interface becoming Advanced Mode. I'm thinking along the lines of how Azureus retrofitted a new default interface when it became Vuze.

5. Implement more automated tests for the GUI. I have many for reading and writing files but not for the interface itself. I need to do this to prevent the occasional regression cropping up as it has in the past.
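
As a rough illustration of the caching idea in item 1, here is a minimal sketch of reusing cached metadata only when a file looks unchanged. Everything in it is an assumption for illustration: the class name `MetadataCache`, the use of last-modified time plus file length as the change test, and an in-memory map standing in for the database.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reuse cached metadata unless the file changed on disk. */
class MetadataCache {

    /** What was recorded about a file the last time it was read. */
    private static final class Entry {
        final long lastModified;
        final long length;
        final String metadata; // stand-in for the parsed tag data

        Entry(long lastModified, long length, String metadata) {
            this.lastModified = lastModified;
            this.length = length;
            this.metadata = metadata;
        }
    }

    // An in-memory map stands in for the database here (assumption).
    private final Map<String, Entry> entries = new HashMap<>();

    void put(String path, long lastModified, long length, String metadata) {
        entries.put(path, new Entry(lastModified, length, metadata));
    }

    /** Returns the cached metadata, or null if there is no entry or the file has changed. */
    String getIfFresh(String path, long lastModified, long length) {
        Entry e = entries.get(path);
        if (e == null) {
            return null;
        }
        boolean unchanged = e.lastModified == lastModified && e.length == length;
        return unchanged ? e.metadata : null;
    }

    /** Convenience overload that reads the current timestamp and size from disk. */
    String getIfFresh(File file) {
        return getIfFresh(file.getAbsolutePath(), file.lastModified(), file.length());
    }
}
```

A null result means the file has to be read again and the entry refreshed with `put`.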

Jaikoz 2.9.2 available

Fixed a couple of problems, including a regression in 2.9.0 whereby Jaikoz was incorrectly matching tracks with no album name against albums in the local cache, causing a slowdown for those of us with large caches.

Monday, 30 March 2009

Performance testing Musicbrainz Search Index

Musicbrainz have been looking at improving the performance of their search server. When search queries are sent to Musicbrainz they do not access the database. Instead they access a Search Index that is built from the database, contains the information to be searched, and allows full text search and other features not usually available from a database.

Currently Lucene is used, but it is accessed from Python via PyLucene. In an attempt to boost performance a simplified version of the search code was developed in pure Java, available from here, but this still didn't seem to give the required performance improvement.

I've tried some tuning mechanisms and code changes to see if I can make some improvements or at least get some benchmark figures.

The tests were performed on a MacBook Pro 2.66GHz Core 2 Duo with 4GB of 1067 DDR3 RAM running OS X 10.5, so it is a good laptop but doesn't compare so well with a desktop or server. (I used a MacBook because it is 64-bit, so I could use 64-bit Java to address more than 2GB of memory, whereas my Windows PC is not, and I have no native Linux desktop.)

Summary of results:
Track OR query, single-threaded test with insufficient memory, index on hard drive: 1.91 queries/sec
Track OR query, single-threaded test with enough memory, index on hard drive: 24.66 queries/sec
Track OR query, best multi-threaded test: 43.96 queries/sec
Track query, best multi-threaded test: 59.47 queries/sec
Release query, threaded test: 252.75 queries/sec

I created a test set of 10,011 track titles and 10,011 releases from the database, and I rebuilt the indexes with the StopWords Filter removed to solve the stop words bug. I then created a test program to fire requests at the Jetty servlet searcher from multiple threads, but I was getting inconsistent results, and since Jetty was not the thing being tested I removed it from the equation.

I restarted by creating a test program that loaded pairs of track/release records into a queue and then created a number of threads, each of which read the next pair off the queue and sent it directly to a Search class. This would perform the search, find the best hits and then return. The test results are for this setup. I found myself running a few tests, then making some more code changes to see if they made any difference. This is what I found:
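
The shape of that setup can be sketched roughly as follows. This is not the actual test program: the class name `SearchThroughputTest` is made up, the `search` callback is a stand-in for the Search class, and it is written for Java 8 or later.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

/** Hypothetical harness: worker threads drain a queue of track/release pairs. */
class SearchThroughputTest {

    /**
     * Sends every {track, release} pair to `search` using `threadCount` threads
     * and returns the overall throughput in queries per second.
     */
    static double run(List<String[]> pairs, int threadCount, BiConsumer<String, String> search) {
        BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(pairs);
        AtomicInteger completed = new AtomicInteger();
        Thread[] workers = new Thread[threadCount];

        long start = System.nanoTime();
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                String[] pair;
                // poll() returns null once the queue is empty, which ends this worker
                while ((pair = queue.poll()) != null) {
                    search.accept(pair[0], pair[1]);
                    completed.incrementAndGet();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait for every worker before stopping the clock
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return completed.get() / seconds;
    }
}
```

Calling `run` repeatedly with different values of `threadCount` is one way to look for the point where extra threads stop helping.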


Index Directory Location:
The MacBook has an SSD and an external hard drive, and of course the index could also be loaded into memory. I expected loading the index into memory would give vastly superior performance, but when I tried loading the Track Index (about 2.5 GB) the performance was terrible. I concluded that it must be swapping memory because of the memory required by the OS itself. I did some tests with the smaller Release Index as well, and here the RAM Directory performed as well as the others but not any better. The SSD performed much better than the hard drive when I ran tests with insufficient memory, but when enough memory was allocated there was little difference between the two.

Memory allocated to JVM: With the default (64MB) it performed very badly on the hard drive. When I increased it to 2GB there was a big improvement, but further adjustments (up and down) didn't make much difference. So it seems you need a reasonable but not ridiculous amount of memory for decent results.

Code Improvements:
I posted a few questions on the Lucene mailing list and was given some optimizations for how the index is opened
new IndexSearcher(IndexReader.open(new NIOFSDirectory(new File(indexDir + "track_Index"), null), true));
and for iterating through the results.
Opening the index using this new method doubled the number of queries that could be processed, by reducing contention on the index searcher. I haven't yet tried the improvements for iterating through the results.

No of Threads: I ran tests using just a single thread at first, then increased the number of threads to find the optimum throughput. With the code improvement in place I found you needed at least 30 threads for the best results, but additional threads didn't give further improvements. If the tests were performed on a quad-core CPU or better I expect more threads would give more gains.

Query type:
I tried an OR query and a simple query against the track index
type=track&query=track:"trackname" OR release:"releasename"
type=track&query=track:"trackname"
Of course the OR query was slower, but not exponentially so. It was about 30% slower.
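
For reference, the two query strings can be built like this. It is only a sketch: the class and method names are invented, it covers just the value of the query parameter, and escaping backslashes and double quotes with a backslash is my assumption about what a quoted phrase needs.

```java
/** Hypothetical helper that builds the two query forms compared above. */
class TrackQueries {

    /** Wraps a value in double quotes, escaping any backslashes and quotes inside it. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Simple query: matches on the track name only. */
    static String simple(String trackName) {
        return "track:" + quote(trackName);
    }

    /** OR query: matches on the track name or the release name. */
    static String withRelease(String trackName, String releaseName) {
        return simple(trackName) + " OR release:" + quote(releaseName);
    }
}
```

For example, `TrackQueries.withRelease("Last Nite", "Is This It")` gives `track:"Last Nite" OR release:"Is This It"`.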

I then tried querying the release index. Because it is much smaller the results were much better, with a 400% improvement in speed.

The full results are available here, and the amended zip of the code can be found here.

Wednesday, 25 March 2009

Unable to find 'is this it' album bug solved - almost

There is a longstanding bug in Musicbrainz that makes it difficult to find songs that contain a number of common stop words such as 'the', 'is', 'that' and 'a'. This is because these stop words have been removed from the search index, so they do not count towards a match. Albums such as 'is this it?' by the Strokes have a real problem because they contain ONLY stop words.
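
To make the problem concrete, here is a small sketch of what stop word removal does to a title. The stop word list is a short illustrative subset I have picked, not the exact list the search index uses, and the class is not part of the Musicbrainz code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo of how stop word filtering can leave nothing to search on. */
class StopWordDemo {

    // Illustrative subset only (assumption), not the real index configuration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "it", "that", "the", "this"));

    /** Lower-cases the title, splits it into words and drops the stop words. */
    static List<String> searchableTerms(String title) {
        List<String> terms = new ArrayList<>();
        for (String word : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                terms.add(word);
            }
        }
        return terms;
    }
}
```

With this list, `searchableTerms("is this it?")` comes back empty, so there is nothing left to match on, whereas `searchableTerms("The Modern Age")` still keeps `modern` and `age`.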

I suggested a fix some time ago which didn't get acted on. I've now implemented the fix successfully on a pure Java development server, with results at http://www.jthink.net/jaikoz/scratch/isthatitsearch.jpg. I need to reimplement it in the existing code base, then hopefully this will prove the fix and get it rolled out.

Jaikoz 2.9 released

The Export feature has been added, and I think this could be very useful for some of you power users. Remember you can use it for:
backing up your metadata
editing of metadata within a spreadsheet
moving metadata from one file to another
sharing your song list with other applications or users

I await your feedback on what you think of it and what uses you make of it.

I've also been trawling through the bugs list trying to solve some long-standing bugs that may have been forgotten. I'm happy with the results and aim to have Jaikoz essentially bug free within a couple of releases. Of course there is still the ever-increasing enhancements list....

Tuesday, 17 February 2009

Export Songs to Spreadsheet


I'm working on an Export feature which is simple and effective. It will allow you to export the details of your loaded Songs to a comma separated file. You can then use this file as an archive of your metadata AND you can also open and edit values within a proper spreadsheet application and then import the changes back into Jaikoz.

So it gives you tag backup and mega editing capabilities in one go, and you can also use the file created to share your songs list with friends or to create playlists.


There are a few decisions to be made yet though.
1. The export feature only works on the editable fields common to all formats, so fields not supported by Jaikoz or only supported in the ID3 tabs view are not exported.
2. Artwork is not exported. It's not appropriate to store large binary fields in this sort of file, but I know artwork is very important to people so maybe it can be shoehorned into this feature somehow. Or should I just have a separate 'Export Artwork' that could export artwork either on a folder by folder basis, so it's kept with the files, or all lumped into a single folder?
3. The export only supports single instances of fields, so for example it would only export one genre per song.
4. Because it is not desirable to load all songs into Jaikoz in one go, if you select a file that already exists Jaikoz should append the new entries, but it would need to overwrite an entry if one already existed for the same file.
5. The first column of the created file would be the full filename, so that Import can work by matching the filename with a file open in Jaikoz and then update accordingly.
6. You might have two versions of a song, a flac version and an mp3 version, and want to import metadata from the mp3 version to the flac file. You would just have to edit the filename in the csv file.
7. If when exporting you are replacing existing songs, should they be replaced in the same place, or afterwards? Would it be better to always sort the file alphabetically?
8. The data is encoded using utf8. This fully supports Unicode so all characters can be encoded, and it is also economical with memory - only one byte is used for ascii chars. The only problem is that it is only the default encoding on Linux, so it might not be the default choice when the csv file is opened with some applications. For example Open Office on Windows Vista assumes that the encoding is windows-1252, and you have to tell it to use utf8.
9. In the future I would also like to allow export to an xml format, but xml is not terribly useful for editing. This would also provide a solution for (3).
10. I could also create native spreadsheet formats such as .xls or .ods, which would be slightly more user friendly, but I don't think the extra effort involved is worth it at the moment.
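
A few of the points above (4, 5 and 8) can be sketched in code. This is only an illustration under my own assumptions, not the Jaikoz implementation: the class `SongCsvExport` and its methods are made up, and the quoting follows the usual csv convention of wrapping a field in double quotes and doubling any quotes inside it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch of writing songs to a utf8 csv file, filename first. */
class SongCsvExport {

    /** Quotes a field if it contains a comma, quote or line break, doubling embedded quotes. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Builds one row: the full filename in the first column, then the field values. */
    static String row(String filename, String... fields) {
        StringBuilder sb = new StringBuilder(escape(filename));
        for (String field : fields) {
            sb.append(',').append(escape(field));
        }
        return sb.toString();
    }

    /**
     * Encodes the rows as utf8 bytes. The map is keyed by filename, so putting a
     * row for a file that is already present overwrites the earlier entry.
     */
    static byte[] toUtf8(Map<String, String> rowsByFilename) {
        StringBuilder sb = new StringBuilder();
        for (String row : rowsByFilename.values()) {
            sb.append(row).append("\r\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

If the map is a `LinkedHashMap`, an overwritten entry keeps its original position, which is one possible answer to (7). A `TreeMap` would give the alphabetical ordering instead.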

Bugs, Enhancements or Testing

Jaikoz 2.8.4 is now out with a few enhancements and a host of bug fixes, some for bugs introduced in earlier versions, and therein lies the rub. I've been concentrating on bug fixes and small enhancements recently rather than new features, but the regressions came about due to not enough testing.

It is very difficult to get the balance right. I have automated tests that cover the reading and writing of metadata, but no automated tests for the Jaikoz GUI itself.

So do I spend my time writing more tests, fixing problems or adding new features?

I think the correct answer is to continue with all three, and to ensure I do beta releases of all major releases, but has anyone else got any other views?