Full data export

As promised a while ago, there are now full data exports available of the artist, label, and release data.

http://www.discogs.com/data/

(You must include the trailing slash after “data”.)
There are a few exports already available there, although the frequency was not regular while I was configuring this. New exports will now be available on the first of every month.

Enjoy!

33 Comments
  • Jul 14,2009 at 16:38

    [quote=makbo]July 2009 data dumps are shown as links, but several attempts to download them today yielded interrupted downloads before even 15% was received, and that’s the two smaller files.[/quote]

    FWIW, at approximately 16:35 PDT (UTC -0700) today (July 14) the downloads of the two smaller files are working well and at a speed that seems normal from here.

  • Jul 14,2009 at 13:38

    July 2009 data dumps are shown as links, but several attempts to download them today yielded interrupted downloads before even 15% was received, and that’s the two smaller files. I’m pretty sure the problem is not on my end, plus if the servers are in the Northwest U.S. (Integra Telecom?) then distance is not an issue either as I’m in San Francisco.

    I realize there have been some server problems lately, but on the other hand the social contract with the “vast, global community of volunteer contributors” is pretty important too. So I have some questions/suggestions:

    1) is there a preferred time of day when downloads won’t be competing with interactive traffic so much?

    2) please break the downloads into smaller pieces.

    3) consider mirroring the downloads on some of the available file sharing sites out there.

    4) while I can understand the decision to use XML, could you also consider providing a “native” database export (e.g. Oracle, MySQL)? It would be a smaller file and much easier to use, especially since both of the products I mentioned are generally available for free for non-commercial use. What does Discogs use for a database, anyway?

    5) also consider providing an abridged version of the data, without all the credits and notes. Just artist, title, label/cat#, format, year, genre, and flat tracklisting? This could be implemented pretty easily as one or two CSV files, even.

  • mjb
    Jun 11,2009 at 19:55

    In addition to the UTF-8 -> ISO-8859-1 character references bug mentioned above, the April dump contained some extra bytes and corrupted bytes. We were discussing this in the thread [t=189304] in the API forum.

  • Jun 11,2009 at 01:23

    Any word on when the next dump will land? There wasn’t one in May, and June’s is overdue…

  • May 24,2009 at 03:37

    i put in a request in the discogs development forum for the main ‘extraartists’ credits element to be included in the next data dump, link below:
    [url=http://www.discogs.com/help/forums/topic/185355]Please Include ‘extraartists’ element in next releases.xml.gz[/url]
    but i guess this is the right place to inform of its absence and ask about including it:
    here’s a copy of the request:
    [quote=Hyper-Trax]hi, This might just be an oversight but the main ‘extraartists’ credits element appears to be missing from the
    ‘discogs_20090401_releases.xml.gz’ dump file.
    (not the extraartists element which is inside ‘tracklist’ tree, which is present where data exists)
    I haven’t got previous dumps so i can’t say if it has been there before or if it was always absent.
    This element is included when downloading release data from ‘export your data’ as xml.zipped.

    Can it be included in next month’s dump?

    Many thanks regardless of the outcome,
    these dumps are a great source of ‘offline’ info and are brilliant even as is…[/quote]

  • May 13,2009 at 05:48
  • May 13,2009 at 04:30

    Is a new release of the data due?

  • Apr 30,2009 at 13:21

    [quote=kumar303]In case I wasn’t clear, this is precisely the problem with the XML file. Your list above mentions the possible ways entities can be encoded. Discogs’ method of encoding is *not* in your list of possibilities. In Discogs XML the character reference for ä is & #xc3; & #xa4; [without spaces] which is wrong. It should be a Unicode code point reference: & #xe4;[/quote]
    Sorry about the misunderstanding. I fully agree with you that this means that the XML file is not encoded correctly.

  • Apr 26,2009 at 09:40

    [quote=Ronaldvd]When character references are used, the numbers always refer to Unicode code points. It does not depend on how the file is encoded.

    So the ä can be represented in a number of ways.

    * \xe4 : 1-byte sequence when the XML file is encoded as ISO 8859-1.
    * \xc3\xa4 : 2-byte sequence when the XML file is encoded as UTF-8.
    * & #xe4; : 6-character sequence [without the space between the & and #] in any encoding[/quote]

    In case I wasn’t clear, this is precisely the problem with the XML file. Your list above mentions the possible ways entities can be encoded. Discogs’ method of encoding is *not* in your list of possibilities. In Discogs XML the character reference for ä is & #xc3; & #xa4; [without spaces] which is wrong. It should be a Unicode code point reference: & #xe4;

    My parser now handles this case, but it took a lot of extra work: there is nothing standard about encoding XML in this fashion, so nothing out of the box really supports it (except Firefox, which is designed to handle broken XML).

    Fixing this would speed up my parser (it takes about 14 hours to import the data on a fast machine, partly due to swapping out incorrect character refs) and lower the barrier to entry for other people trying to import the data. I’m just throwing this out there so that future XML dumps can be corrected.
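
The kind of repair described above can be sketched in Python. This is only a minimal sketch, not kumar303's actual parser code: the name `fix_byte_refs` is hypothetical, and it assumes the broken references appear as consecutive runs of numeric references (hex or decimal) that each hold one UTF-8 byte.

```python
import re

# One numeric character reference, hex (&#xc3;) or decimal (&#195;).
_REF = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")
# A run of one or more adjacent references.
_REF_RUN = re.compile(r"(?:&#(?:x[0-9a-fA-F]+|[0-9]+);)+")


def _value(token: str) -> int:
    """Numeric value of the part between '&#' and ';'."""
    return int(token[1:], 16) if token.startswith("x") else int(token)


def fix_byte_refs(text: str) -> str:
    """Rewrite runs of per-byte references as per-code-point references."""

    def repair(match: re.Match) -> str:
        values = [_value(tok) for tok in _REF.findall(match.group(0))]
        # Only touch runs that could be UTF-8 bytes: every value fits in
        # one byte and at least one value is non-ASCII.
        if max(values) > 0xFF or max(values) < 0x80:
            return match.group(0)
        try:
            decoded = bytes(values).decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8, leave untouched
        return "".join(f"&#x{ord(ch):x};" for ch in decoded)

    return _REF_RUN.sub(repair, text)


print(fix_byte_refs("Jesper Dahlb&#xc3;&#xa4;ck"))  # Jesper Dahlb&#xe4;ck
```

A run of references never contains a newline, so the function can be applied line by line while streaming a dump into a parser. A run that does not decode as UTF-8, such as a lone and already correct `&#xe4;`, is left alone. The heuristic cannot tell a broken `&#xc3;&#xa4;` from a genuinely intended "Ã¤", so it is only suitable for files known to have this bug.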

  • Apr 21,2009 at 01:57

    I had similar problems. See this page and the table about half way down for some useful info:

    http://www1.tip.nl/~t876506/utf8tbl.html

    The ‘a diaeresis’ (ä) character was getting garbled when we parsed it into XML, no matter how we did it. It comes out as Ã¤. As far as I can work out, this is because the character is HTML encoded – i.e. the UTF-8-encoded bytes appear as decimal numbers. This means that ä is #195 followed by #164, not a 2-byte UTF-8 sequence (\xc3 \xa4) or a single Unicode decimal number (#228) as it should be.

    We also had to write a wrapper to unscramble it.

  • Apr 20,2009 at 12:17

    [quote=kumar303]As mentioned in the spec http://www.w3.org/TR/REC-xml/#NT-CharRef encoded character references should be Unicode code points not byte representations of Unicode (e.g. UTF-8).[/quote]
    Kumar, I think you’re mixing up the encoding of the file itself with character references inside the file.
    [quote=kumar303]it is valid UTF-8, but that’s not valid XML[/quote]
    An XML file may be encoded as UTF-8, UTF-16, etc. According to the specification you linked to:
    [quote]All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.[/quote]
    When character references are used, the numbers always refer to Unicode code points. It does not depend on how the file is encoded.

    So the ä can be represented in a number of ways.

    * \xe4 : 1-byte sequence when the XML file is encoded as ISO 8859-1.
    * \xc3\xa4 : 2-byte sequence when the XML file is encoded as UTF-8.
    * & #xe4; : 6-character sequence [without the space between the & and #] in any encoding
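
The three representations listed above can be checked with a short Python sketch. It uses only the standard library; the element name `name` and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

CHAR = "\u00e4"  # ä, Unicode code point U+00E4

# The raw bytes depend on the encoding of the file.
latin1_bytes = CHAR.encode("iso-8859-1")  # b'\xe4'
utf8_bytes = CHAR.encode("utf-8")         # b'\xc3\xa4'

# A character reference names the code point, so the same reference
# yields the same character whichever encoding the file declares.
for encoding in ("iso-8859-1", "utf-8"):
    doc = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        "<name>Dahlb&#xe4;ck</name>"
    )
    parsed = ET.fromstring(doc.encode(encoding))
    assert parsed.text == "Dahlb" + CHAR + "ck"
```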

  • Apr 17,2009 at 14:16

    [Hm, somehow the forum ate my last post.]

    Ronaldvd, you are right, it is valid UTF-8, but that’s not valid XML ;) As mentioned in the spec http://www.w3.org/TR/REC-xml/#NT-CharRef encoded character references should be Unicode code points not byte representations of Unicode (e.g. UTF-8).

    I have a workaround, wrapping the file object stream, decoding and re-encoding, then feeding the XML parser, but it’s very hackish and makes XML processing slow (these files are big).

    Problem: When the discogs XML was generated, character references were encoded using a UTF-8 byte sequence when instead they should have been encoded using Unicode code points.
    Could you guys regenerate the XML using Unicode?

    Here is an illustration of the problem using Python:
    http://pastebin.com/m4a2415ca

    (in case the snippet is lost, the code shows using a Unicode string instead of a UTF-8 string to create the & #xe4; entity – a with an umlaut)

    If anyone wants the code needed to work around this, I can post it.
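
The difference described above can be reproduced with a few lines of Python. This is not the pastebin snippet, which may no longer be available; it is a reconstruction from the description in this comment, written for Python 3, and it only guesses at how the dump generator might have produced the bad references.

```python
import xml.etree.ElementTree as ET

name = "Jesper Dahlb\u00e4ck"  # Unicode string containing ä (U+00E4)

# Correct: escape the Unicode string, one reference per code point.
good = name.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Wrong: escape the UTF-8 byte string, one reference per byte.
bad = "".join(
    chr(b) if b < 0x80 else f"&#x{b:x};" for b in name.encode("utf-8")
)

print(good)  # Jesper Dahlb&#228;ck
print(bad)   # Jesper Dahlb&#xc3;&#xa4;ck

# A conforming parser resolves every reference to a code point, so the
# per-byte form comes back as two characters instead of one.
good_text = ET.fromstring(f"<name>{good}</name>").text
bad_text = ET.fromstring(f"<name>{bad}</name>").text
print(good_text == name)  # True
print(bad_text)           # Jesper DahlbÃ¤ck
```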

  • Apr 16,2009 at 05:47

    [quote=kumar303]The byte sequence in the XML is Jesper Dahlb\xc3\xa4ck and that shows up in this forum correctly (!) but does not seem to be the right UTF-8 encoding.[/quote]
    \xc3\xa4 is the correct UTF-8 sequence for ä.

    [quote=kumar303]http://www.tony-franks.co.uk/UTF-8.htm [/quote]
    This website is not about UTF-8 encoding (although it says so), but about using Unicode characters in HTML.

  • Apr 15,2009 at 18:00

    Hi. How is the XML encoded? It does not appear to be UTF-8 (or is it?). I’m very confused by, for example, Jesper Dahlbäck of The Persuader [1].

    The byte sequence in the XML is Jesper Dahlb\xc3\xa4ck and that shows up in this forum correctly (!) but does not seem to be the right UTF-8 encoding. By my calculation it should be Jesper Dahlb\xe4ck, in other words hex e4 / decimal 228 / a with umlaut (http://www.tony-franks.co.uk/UTF-8.htm).

    [1] http://www.discogs.com/artist/Persuader%2C+The

  • Mar 9,2009 at 14:24

    just a suggestion…

    see when you make the data available each month, would there also be any chance of posting just the differential between the current month and the last? that means you could keep up with it without having to re-download 400 MB of data every month.

  • Mar 8,2009 at 04:32

    i’m building my own music database that replicates a lot of discogs functionality, so i’m gonna put this data in my own database.

    i already have code that parses the data provided by the API, so this should be easy!

    amazing that this has been made available – many, many thanks!

  • Feb 19,2009 at 14:53

    I’ve used it successfully. I wrote a php script to parse the data for information on the releases I wanted (91-93 hardcore & breakbeat). It took some processing time I have to say! With the extracted data I created my own SQL tables so I could query the data at speed. I ended up with circa 9500 releases that interested me. With this I was able to create my most wanted & most owned lists using another script that parsed actual discogs pages using a socket connection. See http://www.iamdek.co.uk/2009/02/06/the-100-most-owned-uk-old-skool-hardcore-breakbeat-records/ for the results.
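
The same extract-then-query approach can be sketched in Python with the standard library. This is not the PHP script mentioned above. The element and attribute names (`release`, `id`, `title`, `released`) and the table layout are assumptions made for illustration, and the sketch assumes a well-formed input with a single root element.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_releases(xml_stream, db, year_from, year_to):
    """Stream releases into SQLite, keeping only the wanted years."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS releases "
        "(id INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
    )
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "release":  # hypothetical element name
            continue
        year = (elem.findtext("released") or "")[:4]
        if year.isdecimal() and year_from <= int(year) <= year_to:
            db.execute(
                "INSERT OR REPLACE INTO releases VALUES (?, ?, ?)",
                (int(elem.get("id")), elem.findtext("title"), int(year)),
            )
        elem.clear()  # drop the children that are no longer needed
    db.commit()


# Tiny made-up sample standing in for a real dump file.
sample = (
    b"<releases>"
    b"<release id='1'><title>A</title><released>1992-05-01</released></release>"
    b"<release id='2'><title>B</title><released>2001</released></release>"
    b"</releases>"
)
db = sqlite3.connect(":memory:")
load_releases(io.BytesIO(sample), db, 1991, 1993)
rows = db.execute("SELECT id, title, year FROM releases").fetchall()
print(rows)  # [(1, 'A', 1992)]
```

For a compressed dump, the stream could come from `gzip.open(path, "rb")` instead of the in-memory sample.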

  • Feb 12,2009 at 18:32

    [quote=jenko3000]Has anyone managed to successfully access and/or use this data yet?

    As far as I can tell, it’s an XML file with no root element, and it contains illegal characters.[/quote]
    …Yeah interested to know how people use and abuse these data dumps… ?

  • Feb 11,2009 at 09:44

    (not to say I’m not grateful for the data, by the way – big thanks to discogs for enabling these downloads!)

  • Feb 11,2009 at 09:43

    Has anyone managed to successfully access and/or use this data yet?

    As far as I can tell, it’s an XML file with no root element, and it contains illegal characters.
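
One common way to read a file with no root element is to feed the parser a synthetic root around the real content. The Python sketch below shows the idea; the function name `iter_rootless` and the wrapper tag `dump` are hypothetical, and it assumes the file is a plain sequence of sibling elements with no XML declaration at the top. It does nothing about illegal characters.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_rootless(stream, chunk_size=64 * 1024):
    """Yield each top-level element from a stream of sibling elements."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(b"<dump>")  # synthetic opening root tag
    chunks = iter(lambda: stream.read(chunk_size), b"")
    depth = 0
    root = None
    # Feed the real content, then the synthetic closing tag.
    for chunk in itertools.chain(chunks, [b"</dump>"]):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem
            else:
                depth -= 1
                if depth == 1:  # a direct child of the synthetic root closed
                    yield elem
                    root.remove(elem)  # keep memory use flat
    parser.close()


sample = io.BytesIO(
    b"<release id='1'><title>A</title></release>\n"
    b"<release id='2'><title>B</title></release>"
)
titles = [(e.get("id"), e.findtext("title")) for e in iter_rootless(sample)]
print(titles)  # [('1', 'A'), ('2', 'B')]
```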

  • teo
    Jan 3,2009 at 13:23

    mjb, sorry about that. The Jan 1 2009 export is there now.

    rassel, those smaller ones accidentally had tracklistings omitted.

  • Dec 24,2008 at 12:44

    Hmmmm
    discogs_20080309_releases.xml.gz 2008-Apr-01 10:41:17 [b]114.6M[/b]
    discogs_20080918_releases.xml.gz 2008-Sep-18 22:59:34 [b]139.0M[/b]
    discogs_20081014_releases.xml.gz 2008-Oct-14 23:40:40 [b]343.2M[/b]

    What is happening here? I won’t download it to check, but why has the size of this file nearly doubled within one month?

  • mjb
    Dec 17,2008 at 13:07

    These are supposed to be ‘monthly’ updates according to the data page, but so far we’ve only got March, September, and October 2008, and it’s now mid-December.

    Are these being produced manually? Can we get a more recent dump?

  • Nov 26,2008 at 08:53

    Thanks! IMHO, this is both a great idea and a real show of respect for Discogs’ users in particular and the public in general.

    Perhaps torrents for these archives might be helpful?

  • Nov 24,2008 at 17:30

    wow!!
    fantastic features
    thanks a lot!!

  • Nov 20,2008 at 00:55

    [quote=teo]The export from March 9 does contain pre-V4 data (discogs_20080309_*.txt.gz). [/quote]
    Oh indeed it does. Cool.

  • Nov 19,2008 at 23:26

    Thank you.

  • teo
    Nov 19,2008 at 09:12

    The export from March 9 does contain pre-V4 data (discogs_20080309_*.txt.gz).

    Some of the earlier release exports are missing tracklistings. If that one is missing them and you want tracklists, let me know and I’ll rebuild it.

  • Nov 19,2008 at 04:51

    Thanks, but this would have been better when data was still being moderated and somewhat correct.

    +1 for the new feature
    -1 for the timing
